- 11/2025: 💻 We release our full codebase.
- 09/2025: 🎉 Our paper has been accepted to NeurIPS 2025. See you at the conference!
Large language models (LLMs) are increasingly used in sensitive domains, where their ability to infer personal data from seemingly benign text introduces emerging privacy risks. While recent LLM-based anonymization methods help mitigate such risks, they often rely on proprietary models (e.g., GPT-4), raising concerns about cost and the potential exposure of sensitive data to untrusted external systems.
To address this, we introduce SElf-refining Anonymization with Language model (SEAL), a novel distillation framework for training small language models (SLMs) to perform effective anonymization without relying on external models at inference time. SEAL leverages adversarial interactions between an LLM anonymizer and an inference model to collect trajectories of anonymized texts and inferred attributes, which are then used to distill anonymization and critique capabilities into SLMs through supervised fine-tuning and preference learning. The resulting models learn both to anonymize text and to evaluate their outputs, enabling iterative improvement of anonymization quality via self-refinement. Experiments on SynthPAI, a dataset of synthetic personal profiles and text comments, demonstrate that SLMs trained with SEAL achieve substantial improvements in anonymization capabilities. Notably, 8B models attain a privacy-utility trade-off comparable to that of the GPT-4 anonymizer and, with self-refinement, even surpass it in terms of privacy protection. These results highlight the effectiveness of our adversarial distillation framework for training SLMs as efficient anonymizers.
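At inference time, the distilled model alternates between anonymizing and critiquing its own output. The sketch below illustrates this self-refinement loop conceptually; the function names and interfaces are illustrative and do not correspond to the repo's actual APIs.

```python
# Conceptual sketch of SEAL's self-refinement loop (illustrative only;
# the actual interfaces in this repo differ).
from typing import Callable, List

def self_refine(
    text: str,
    anonymize: Callable[[str, str], str],       # (text, feedback) -> anonymized text
    critique: Callable[[str, str], List[str]],  # (original, anonymized) -> inferable attributes
    num_rounds: int = 3,
) -> str:
    """Iteratively anonymize `text` with a model trained to both rewrite
    text and critique its own outputs."""
    feedback = ""
    anonymized = anonymize(text, feedback)
    for _ in range(num_rounds):
        # The critique step plays the adversary, inferring personal
        # attributes that still leak from the current anonymization.
        leaked = critique(text, anonymized)
        if not leaked:
            break  # nothing inferable remains
        feedback = "Still inferable: " + ", ".join(leaked)
        anonymized = anonymize(text, feedback)
    return anonymized
```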
The main Python training code lives in scripts/. Concrete training and evaluation examples for Llama-3-8B are provided as shell scripts in scripts/sh/. All training and evaluation datasets are stored in data/. The training set was generated using GPT-4o to anonymize samples from the SynthPAI dataset. Our training pipeline is based on the excellent alignment-handbook repository.
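Once the environment described below is set up, you can take a quick look at the training data with a generic loader along these lines, assuming the files under data/ are in JSON Lines format; the filename below is a placeholder, not an actual path in the repo.

```python
# Inspect a training split (the filename is a placeholder).
from datasets import load_dataset

ds = load_dataset("json", data_files={"train": "data/train.jsonl"})["train"]
print(ds)     # features and number of examples
print(ds[0])  # a single anonymization example
```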
To run the code in this repo, create a Python virtual environment, e.g., using Conda, as follows:
```bash
# Clone codebase
git clone git@github.com:kykim0/SEAL.git && cd SEAL

# Prepare environment
conda create -y -n seal python=3.10
conda activate seal

# Install dependencies
pip install -r requirements.txt
```

To push models to the Hugging Face Hub, log into your account:

```bash
huggingface-cli login
```

and install Git LFS:

```bash
sudo apt-get install git-lfs
```
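Optionally, verify that the core dependencies and a GPU are visible before launching training. This check is not part of the repo; it is a minimal sketch assuming a CUDA-capable setup.

```python
# Quick environment sanity check before training.
import torch
import transformers

print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```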
The following command trains the Llama-3-8B model on both the anonymization and critique tasks with SFT, using the uncertainty score as part of the privacy scoring. See a complete SFT example in scripts/sh/run_sft.sh.

```bash
ACCELERATE_LOG_LEVEL=info accelerate launch \
--config_file=recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 \
scripts/run_sft.py recipes/sft/config_lora.yaml \
--model_name_or_path=meta-llama/Llama-3.1-8B-Instruct \
--learning_rate=2.0e-04 \
--num_train_epochs=1 \
--eval_num_anons=5 \
--reasoning=false \
--critic_train=true \
--disable_feedback=false \
--certainty=true \
--seed=42 \
--output_dir=save/Llama-3.1-8B-Instruct
```
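The SFT recipe above uses a LoRA configuration (recipes/sft/config_lora.yaml), so the checkpoint saved to --output_dir is expected to contain a LoRA adapter rather than full model weights. If you want a standalone model for inference outside the training scripts, a merge along these lines should work; this is a sketch using peft, and the exact checkpoint layout may differ.

```python
# Merge a LoRA adapter into the base model for standalone use (sketch).
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_dir = "save/Llama-3.1-8B-Instruct"  # --output_dir of the SFT run
model = AutoPeftModelForCausalLM.from_pretrained(adapter_dir, torch_dtype="auto")
merged = model.merge_and_unload()  # fold the adapter weights into the base model
merged.save_pretrained(f"{adapter_dir}/merged")

tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
tokenizer.save_pretrained(f"{adapter_dir}/merged")
```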
To apply DPO on top of the SFT model, run the following command with --model_name_or_path set to the SFT model directory.

```bash
ACCELERATE_LOG_LEVEL=info accelerate launch \
--config_file=recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 \
scripts/run_dpo.py recipes/dpo/config_lora.yaml \
--model_name_or_path=save/Llama-3.1-8B-Instruct \
--learning_rate=5.0e-06 \
--reasoning=false \
--critic_train=true \
--disable_feedback=false \
--use_all_texts=true \
--certainty=true \
--eval_num_anons=5 \
--seed=42 \
--output_dir=save/Llama-3.1-8B-Instruct/dpo
```
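Before running the full evaluation, you can do a quick qualitative check of the trained anonymizer with plain transformers generation. This is an illustrative sketch, not the repo's own inference path: the model path is a placeholder (merge the LoRA adapter first if needed), and the exact anonymization prompt format used during training may differ.

```python
# Quick qualitative check of a trained anonymizer (illustrative sketch).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "save/Llama-3.1-8B-Instruct/dpo/merged"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto", device_map="auto")

comment = "Moved to the city for my residency last year; the hospital shifts are brutal."
messages = [{"role": "user", "content": f"Anonymize the following comment:\n\n{comment}"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))
```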
The anonymization outputs are stored under the eval subdirectory for both SFT and DPO. Use the following commands to first evaluate them using a GPT model judge, and then compute aggregate scores from the evaluation. See a complete evaluation example in scripts/sh/run_eval.sh.

```bash
python eval/evaluate.py \
--eval_model=${EVAL_MODEL} \
--api_key=${API_KEY} \
--infer_file=${INFER_FILE}

python eval/analyze.py --eval_file=${EVAL_FILE}
```

If our work is useful in your research, please consider citing it:
```bibtex
@article{kim2025self,
  title={Self-Refining Language Model Anonymizers via Adversarial Distillation},
  author={Kim, Kyuyoung and Jeon, Hyunjun and Shin, Jinwoo},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}
```