- 11/2025: 💻 We release our full codebase.
- 09/2025: 🎉 Our paper has been accepted to NeurIPS 2025. See you at the conference!
Large language models (LLMs) are increasingly used in sensitive domains, where their ability to infer personal data from seemingly benign text introduces emerging privacy risks. While recent LLM-based anonymization methods help mitigate such risks, they often rely on proprietary models (e.g., GPT-4), raising concerns about cost and the potential exposure of sensitive data to untrusted external systems.
To address this, we introduce SElf-refining Anonymization with Language model (SEAL), a novel distillation framework for training small language models (SLMs) to perform effective anonymization without relying on external models at inference time. SEAL leverages adversarial interactions between an LLM anonymizer and an inference model to collect trajectories of anonymized texts and inferred attributes, which are then used to distill anonymization and critique capabilities into SLMs through supervised fine-tuning and preference learning. The resulting models learn both to anonymize text and to evaluate their outputs, enabling iterative improvement of anonymization quality via self-refinement. Experiments on SynthPAI, a dataset of synthetic personal profiles and text comments, demonstrate that SLMs trained with SEAL achieve substantial improvements in anonymization capabilities. Notably, 8B models attain a privacy-utility trade-off comparable to that of the GPT-4 anonymizer and, with self-refinement, even surpass it in terms of privacy protection. These results highlight the effectiveness of our adversarial distillation framework for training SLMs as efficient anonymizers.
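At inference time, the distilled model alternates between anonymizing and critiquing its own output. The sketch below illustrates this self-refinement loop conceptually; the function names and interfaces are illustrative and do not correspond to the repo's actual APIs.

```python
# Conceptual sketch of SEAL's self-refinement loop (illustrative only;
# the actual interfaces in this repo differ).
from typing import Callable, List

def self_refine(
    text: str,
    anonymize: Callable[[str, str], str],       # (text, feedback) -> anonymized text
    critique: Callable[[str, str], List[str]],  # (original, anonymized) -> inferable attributes
    num_rounds: int = 3,
) -> str:
    """Iteratively anonymize `text` with a model trained to both rewrite
    text and critique its own outputs."""
    feedback = ""
    anonymized = anonymize(text, feedback)
    for _ in range(num_rounds):
        # The critique step plays the adversary, inferring personal
        # attributes that still leak from the current anonymization.
        leaked = critique(text, anonymized)
        if not leaked:
            break  # nothing inferable remains
        feedback = "Still inferable: " + ", ".join(leaked)
        anonymized = anonymize(text, feedback)
    return anonymized
```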
The main Python training code lives in scripts/. Concrete training and evaluation examples for Llama-3-8B are provided as shell scripts in scripts/sh/. All training and evaluation datasets are stored in data/. The training set was generated using GPT-4o to anonymize samples from the SynthPAI dataset. Our training pipeline is based on the excellent alignment-handbook repository.
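Once the environment described below is set up, you can take a quick look at the training data with a generic loader along these lines, assuming the files under data/ are in JSON Lines format; the filename below is a placeholder, not an actual path in the repo.

```python
# Inspect a training split (the filename is a placeholder).
from datasets import load_dataset

ds = load_dataset("json", data_files={"train": "data/train.jsonl"})["train"]
print(ds)     # features and number of examples
print(ds[0])  # a single anonymization example
```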
To run the code in this repo, create a Python virtual environment, e.g., using Conda, as follows:
```bash
# Clone codebase
git clone git@github.com:kykim0/SEAL.git && cd SEAL

# Prepare environment
conda create -y -n seal python=3.10
conda activate seal

# Install dependencies
pip install -r requirements.txt
```

To push models to the Hugging Face Hub, log into your account:

```bash
huggingface-cli login
```

and install Git LFS:

```bash
sudo apt-get install git-lfs
```
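Optionally, verify that the core dependencies and a GPU are visible before launching training. This check is not part of the repo; it is a minimal sketch assuming a CUDA-capable setup.

```python
# Quick environment sanity check before training.
import torch
import transformers

print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```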
The following command trains the Llama-3-8B model on both the anonymization and critique tasks with SFT, using the uncertainty score as part of the privacy scoring. See a complete SFT example in scripts/sh/run_sft.sh.

```bash
ACCELERATE_LOG_LEVEL=info accelerate launch \
--config_file=recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 \
scripts/run_sft.py recipes/sft/config_lora.yaml \
--model_name_or_path=meta-llama/Llama-3.1-8B-Instruct \
--learning_rate=2.0e-04 \
--num_train_epochs=1 \
--eval_num_anons=5 \
--reasoning=false \
--critic_train=true \
--disable_feedback=false \
--certainty=true \
--seed=42 \
--output_dir=save/Llama-3.1-8B-Instruct
```
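The SFT recipe above uses a LoRA configuration (recipes/sft/config_lora.yaml), so the checkpoint saved to --output_dir is expected to contain a LoRA adapter rather than full model weights. If you want a standalone model for inference outside the training scripts, a merge along these lines should work; this is a sketch using peft, and the exact checkpoint layout may differ.

```python
# Merge a LoRA adapter into the base model for standalone use (sketch).
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_dir = "save/Llama-3.1-8B-Instruct"  # --output_dir of the SFT run
model = AutoPeftModelForCausalLM.from_pretrained(adapter_dir, torch_dtype="auto")
merged = model.merge_and_unload()  # fold the adapter weights into the base model
merged.save_pretrained(f"{adapter_dir}/merged")

tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
tokenizer.save_pretrained(f"{adapter_dir}/merged")
```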
To apply DPO on top of the SFT model, run the following command with --model_name_or_path set to the SFT model directory.

```bash
ACCELERATE_LOG_LEVEL=info accelerate launch \
--config_file=recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 \
scripts/run_dpo.py recipes/dpo/config_lora.yaml \
--model_name_or_path=save/Llama-3.1-8B-Instruct \
--learning_rate=5.0e-06 \
--reasoning=false \
--critic_train=true \
--disable_feedback=false \
--use_all_texts=true \
--certainty=true \
--eval_num_anons=5 \
--seed=42 \
--output_dir=save/Llama-3.1-8B-Instruct/dpo
```
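Before running the full evaluation, you can do a quick qualitative check of the trained anonymizer with plain transformers generation. This is an illustrative sketch, not the repo's own inference path: the model path is a placeholder (merge the LoRA adapter first if needed), and the exact anonymization prompt format used during training may differ.

```python
# Quick qualitative check of a trained anonymizer (illustrative sketch).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "save/Llama-3.1-8B-Instruct/dpo/merged"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto", device_map="auto")

comment = "Moved to the city for my residency last year; the hospital shifts are brutal."
messages = [{"role": "user", "content": f"Anonymize the following comment:\n\n{comment}"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))
```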
The anonymization outputs are stored under the eval subdirectory for both SFT and DPO. Use the following commands to first evaluate them using a GPT model judge, and then compute aggregate scores from the evaluation. See a complete evaluation example in scripts/sh/run_eval.sh.

```bash
python eval/evaluate.py \
--eval_model=${EVAL_MODEL} \
--api_key=${API_KEY} \
--infer_file=${INFER_FILE}

python eval/analyze.py --eval_file=${EVAL_FILE}
```

If our work is useful in your research, please consider citing it:
```bibtex
@article{kim2025self,
  title={Self-Refining Language Model Anonymizers via Adversarial Distillation},
  author={Kim, Kyuyoung and Jeon, Hyunjun and Shin, Jinwoo},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}
```