This is the artifact repository for the NDSS'26 paper "Chasing Shadows: Pitfalls in LLM Security Research".
If you use our data or cite our work, please use the following BibTeX entry:
@inproceedings{evertz-26-chasing,
title = {Chasing Shadows: Pitfalls in LLM Security Research},
author = {Evertz, Jonathan and Risse, Niklas and Neuer, Nicolai and M{\"u}ller, Andreas and Normann, Philipp and Sapia, Gaetano and Gupta, Srishti and Pape, David and Shaw, Soumya and Srivastav, Devansh and Wressnegger, Christian and Quiring, Erwin and Eisenhofer, Thorsten and Arp, Daniel and Sch{\"o}nherr, Lea},
booktitle = {Network and Distributed System Security Symposium (NDSS)},
year = {2026}
}

The review data (with omitted paper titles) can be found here.
The project website with the living appendix can be found here.
Every experiment is contained in its own directory (e.g., experiment_a_1 for Experiment A.1). Each experiment has its own Dockerfile that contains all dependencies required to run the experiment. Simply follow the instructions below for each experiment to build the Docker image and run the experiment.
Most experiments require API keys. Insert/Update the necessary keys in the .env.template file in the respective experiment directory and rename it to .env.
Note
You do not necessarily need Docker. Alternatively, you can create a virtual environment, install all dependencies via pip install -r requirements.txt, and run the Python scripts directly.
- Request dataset access at https://huggingface.co/datasets/keyan96/New-Wave-Hate. Once access has been granted, open the "Files and Versions" tab (next to "Dataset card" and "Data Studio"), download data_ground_truth.csv, and save it in the experiment_a_1 directory.
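If you prefer to fetch the file from a script instead of the web UI, the following sketch uses the huggingface_hub library (an optional convenience, not part of the artifact; it assumes access has already been granted and a valid Hugging Face token is configured, e.g., via huggingface-cli login):

```python
# Optional: download the gated CSV programmatically once access is granted.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="keyan96/New-Wave-Hate",
    repo_type="dataset",
    filename="data_ground_truth.csv",
    local_dir="experiment_a_1",
)
print("Saved to", path)
```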
- Set up the .env file with your OpenAI API key:
cd experiment_a_1
cp .env.template .env
nano -w .env
Replace the placeholder OpenAI API key in .env with your own key.
- Build the Docker image and run the experiment:
cd experiment_a_1
docker build -t experiment_a_1 .
docker run -e QUICK_CHECK=true experiment_a_1

You have two options:
- QUICK_CHECK=True: runs the experiment on 10 samples (useful for a quick functionality check).
- QUICK_CHECK=False (default): runs the experiment on the full set of all samples (of the considered quarters).
- Expected output
- You will get a few debug outputs.
- The script runs three experiments, one for each category.
- The final output after running all experiments and aggregating the results should look like this (here, with QUICK_CHECK=True):
Found Results/results_vaccine_gpt-4-0613_gpt-4-0125-preview_gpt-4.1-2025-04-14.pck
Found Results/results_us_capitol_gpt-4-0613_gpt-4-0125-preview_gpt-4.1-2025-04-14.pck
Found Results/results_rus_ukr_gpt-4-0613_gpt-4-0125-preview_gpt-4.1-2025-04-14.pck
LLM:gpt-4-0613 & 96.67\% & 90.91\% & 100.0\% \\
LLM:gpt-4-0125-preview & 73.33\% & 100.0\% & 20.0\% \\
LLM:gpt-4.1-2025-04-14 & 86.67\% & 100.0\% & 60.0\% \\
The columns are: Accuracy, Precision, and Recall. Although we just used a small number of samples, we can already see a trend as reported in the paper. To see the numbers from the paper, you need to re-run the experiment with QUICK_CHECK=False. Note that the reported and observed numbers may slightly differ, as LLM outputs are not deterministic.
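For reference, the three metrics are derived from the binary predictions as in the following illustrative sketch (not the repository's evaluation code; the example labels are made up):

```python
# Illustrative: accuracy, precision, and recall from binary labels/predictions.
def metrics(y_true, y_pred):
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))  # true positives
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

print(metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (0.6, 0.666..., 0.666...)
```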
For the evaluation of the LLM robustness experiments (Experiment A.2), we used the LLM confidentiality framework.
- Add your OpenAI API key to the .env.template file.
- Build the Docker image and run the experiment with:
cd experiment_a_2
cp .env.template .env
nano -w .env # open and edit the file to add the API keys
docker build -t experiment_a_2 .
docker run --gpus all experiment_a_2 --llm_type <model_specifier> --strategy secret-key --attacks all --iterations 1000 --temperature 0.1 --device <cpu, cuda, or mps>

Replace the <model_specifier> with the model to evaluate. For our experiments, we used the following models:
| Model Name | Specifier |
|---|---|
| ChatGPT 3.5 Turbo | gpt-3.5-turbo-1106 and gpt-3.5-turbo-0125 |
| ChatGPT 4 | gpt-4-0314 and gpt-4-0613 |
| ChatGPT 4 Turbo | gpt-4-1106-preview, gpt-4-0125-preview, and gpt-4-turbo-2024-04-09 |
| ChatGPT 4o | gpt-4o-2024-05-13, gpt-4o-2024-08-06, and gpt-4o-2024-11-20 |
| CodeLlama 7b (Ollama) | codellama-7b-fp16 (no quant.) or codellama-7b-quant-<2,3,4,5,6,8>bit |
| CodeLlama 7b (TheBloke) | codellama-7b (no quant.) or codellama-7b-quant-<2,3,4,5,6,8>bit-bloke |
| Llama2 7b (Ollama) | llama2-7b-fp16 (no quant.) or llama2-7b-quant-<2,3,4,5,6,8>bit |
| Llama2 7b (TheBloke) | llama2-7b-fp16-bloke (no quant.) or llama2-7b-quant-<2,3,4,5,6,8>bit-bloke |
| Llama3.1 8b (Ollama) | llama3.1-8b-fp16 (no quant.) or llama3.1-8b-quant-<2,3,4,5,6,8>bit |
To keep the runtime reasonable, we recommend changing the --iterations parameter to a lower value (e.g., 100) when necessary.
The results will be printed to the console (stdout) after the experiment has finished. The results and logs are also saved to the logs/ directory: files ending in _logs.txt contain the detailed logs, while files ending in _results.txt contain the plain result metrics.
Note
Deprecation warnings and warnings about missing Hugging Face tokens can be ignored, as they do not affect the models used in our experiments.
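To skim all result files of a run at once, a small helper along these lines can be used (illustrative only; it merely relies on the logs/ naming convention described above):

```python
# Print all plain result metrics collected in the logs/ directory.
from pathlib import Path

for results_file in sorted(Path("logs").glob("*_results.txt")):
    print(f"=== {results_file.name} ===")
    print(results_file.read_text())
```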
- Experiment B.1 requires a Hugging Face token to download the datasets from Hugging Face (free of charge). Create an account here and generate an access token following this guide. The experiments will not incur any costs.
cd experiment_b_1
cp .env.template .env
nano -w .env
- Afterwards, build the Docker image and run the experiment with:
docker build -t experiment_b_1 .
docker run --rm -v "$(pwd)/../plots:/app/plots" experiment_b_1

This will run the data leakage lab-setting experiment for all three datasets (Devign, PrimeVul, DiverseVul) and all six leakage ratios (0%, 20%, 40%, 60%, 80%, 100%) presented in the paper. The resulting plot will be saved to the plots directory.
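For intuition, a minimal sketch of a controlled leakage ratio is shown below. This is not the repository's data pipeline; it assumes the ratio denotes the fraction of test samples that also appear in the training set:

```python
# Illustrative sketch of a train/test split with a controlled leakage ratio.
# Assumption (not taken from the repository): the ratio is the fraction of
# test samples that also occur in the training set.
import random

def split_with_leakage(samples, test_size=0.2, leakage_ratio=0.4, seed=0):
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    test, train = shuffled[:n_test], shuffled[n_test:]
    # Inject a controlled fraction of the test samples back into the training set.
    n_leaked = int(len(test) * leakage_ratio)
    train = train + test[:n_leaked]
    rng.shuffle(train)
    return train, test

train, test = split_with_leakage(list(range(1000)), leakage_ratio=0.4)
print(len(train), len(test), len(set(train) & set(test)))  # overlap == leaked samples
```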
Note
Since the experiment involves fine-tuning, it takes a while to run. On our system with an NVIDIA A100 GPU, the experiment took around 5-6 days.
- Scaled-down version: We have implemented a --fast option that executes a scaled-down version of the experiment (smaller model, only one dataset, fewer epochs, only 600 samples) and runs on a commodity desktop in 1-10 hours. Run it with:
docker build -t experiment_b_1 .
docker run --rm -v "$(pwd)/../plots:/app/plots" experiment_b_1 --fast

- It is also possible to run the script for one dataset and one leakage ratio only. In this case, no plot will be generated:
python run_experiment.py --dataset <Devign, DiverseVul, or PrimeVul> --ratio=<0, 0.2, 0.4, 0.6, 0.8, or 1.0>

This experiment (Experiment B.2) evaluates commit-message and code completion performance across commercial and local OpenAI-compatible LLM endpoints.
We evaluate the following models:
| Backend | Model Identifiers |
|---|---|
| OpenAI | gpt-3.5-turbo-0125, gpt-4o-2024-08-06 |
| Anthropic | claude-3-5-haiku-20241022 |
| DeepSeek | DeepSeek-V3-0324 |
| Local | meta-llama-3-8b-instruct, qwen3-14b, qwen2.5-coder-14b |
- The experiment requires API tokens for Hugging Face (free of charge), OpenAI, DeepSeek, and Anthropic. Create a Hugging Face account here and generate an access token following this guide. The experiments will not incur any costs on Hugging Face.
cd experiment_b_2
cp .env.template .env
nano -w .env
- For local models, we used LM Studio to host models with an OpenAI-compatible REST API. Download LM Studio here and set up the local API for the three local models according to this tutorial.
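Before starting the full run, you may want to verify that the local endpoint is reachable. A minimal sanity check against an OpenAI-compatible API could look like the sketch below (the base URL http://localhost:1234/v1 is LM Studio's usual default but is an assumption; adjust it and the model name to your setup):

```python
# Sanity check for a local OpenAI-compatible endpoint (e.g., LM Studio).
# The base_url is an assumption; the model identifier is taken from the table above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="meta-llama-3-8b-instruct",
    messages=[{"role": "user", "content": "Write a one-line commit message for a typo fix."}],
)
print(response.choices[0].message.content)
```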
- Afterwards, build the Docker image and run the experiment with:
docker build -t experiment_b_2 .
docker run experiment_b_2

To run the experiment for a specific model:
python run_experiment.py --backend <local, openai, anthropic, or deepseek> --model <one of the model identifiers from the table above>

- Experiment C requires a Hugging Face token to download the datasets from Hugging Face (free of charge). Create an account here and generate an access token following this guide. The experiments will not incur any costs.
cd experiment_c
cp .env.template .env
nano -w .env
- Afterwards, build the Docker image and run the experiment with:
docker build -t experiment_c .
docker run --rm -v "$(pwd)/../plots:/app/plots" experiment_c

The resulting LaTeX table (the exact LaTeX used in the paper) will be saved to the plots directory.
For Experiment D, build the Docker image and run the experiment with:
cd experiment_d
docker build -t experiment_d .
docker run --gpus all experiment_d --device <cpu, cuda, or mps> --num_generations <number_of_generations>

This will run the experiment on the specified device with the following default parameters:
- Model: unsloth/Qwen2.5-Coder-0.5B
- Dataset: bigcode/self-oss-instruct-sc2-exec-filter-50k
- Block Size: 128
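To illustrate what block-wise perplexity with a block size of 128 means, here is a minimal sketch (not the repository's evaluation code; it only reuses the default model and block size listed above, everything else is illustrative):

```python
# Minimal sketch: block-wise perplexity of a causal LM on a piece of text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "unsloth/Qwen2.5-Coder-0.5B"   # default model listed above
block_size = 128                             # default block size listed above

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def block_perplexities(text: str) -> list[float]:
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    ppls = []
    # Score non-overlapping blocks; a final shorter block is scored as well.
    for start in range(0, len(ids), block_size):
        block = ids[start:start + block_size].unsqueeze(0)
        if block.shape[1] < 2:  # need at least two tokens for a next-token loss
            continue
        with torch.no_grad():
            loss = model(block, labels=block).loss  # mean next-token cross-entropy
        ppls.append(torch.exp(loss).item())
    return ppls

print(block_perplexities("def add(a, b):\n    return a + b"))
```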
The resulting perplexity plots will be saved inside the Docker container (under /app/plots). You may need to copy them out of the container with:
docker cp <container_id>:/app/plots/perplexity_histogram_bs128_Qwen2.5-Coder-0.5B-Instruct.png <destination_path>

Note
Since the experiment involves fine-tuning, the full run (10 generations) takes a while. On our system with an NVIDIA L40S GPU, the experiment took around 4-5 days.
To run the experiment on a reduced set of generations, use the --num_generations argument. For example, to run the experiment with only 1 generation:
docker run --gpus all experiment_d --device <cpu, cuda, or mps> --num_generations 1

This will significantly reduce the runtime of the experiment.