This project demonstrates how to fine-tune a LLaMA-3 model using LoRA (Low-Rank Adaptation), merge the adapter back into the base model, apply GPTQ (4-bit quantization) for compression, and finally perform efficient inference using vLLM.
Hugging Face model page: rightpunch/Llama-3.2-3B-Instruct_PEFT_GPTQ
Pipeline Summary:
- Fine-tune a LLaMA-3 model using LoRA on WikiText-2.
- Merge LoRA weights into the base model.
- Quantize the merged model using GPTQ (4-bit).
- Load and infer with the quantized model via vLLM.
- Evaluate inference throughput and perplexity (PPL).
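For orientation, the LoRA stage wraps the frozen base model with small trainable low-rank adapters. A minimal sketch is shown below; the base model ID, rank, and target modules are illustrative assumptions, not necessarily the repo's exact settings:

```python
# Minimal LoRA fine-tuning setup (hyperparameters are illustrative assumptions).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.2-3B-Instruct"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)

# Attach low-rank adapters to the attention projections; base weights stay frozen.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable

# WikiText-2 is the fine-tuning corpus used in this project.
train_data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
```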
To set up the environment and reproduce the experiment, follow the steps below.
These instructions assume you are using a GPU-enabled system with at least 16GB of VRAM and CUDA support.
git clone https://github.com/rightpunchChen/edgeAI_final_report.git
conda create -n llama_env python=3.10
conda activate llama_env
cd edgeAI_final_report
pip install "huggingface-hub[cli]"
pip install transformers==4.51.1
pip install torch==2.6.0
pip install torchvision==0.21.0
pip install torchaudio==2.6.0
pip install triton==3.2.0
pip install timm==1.0.15
pip install datasets==3.5.0
pip install accelerate==1.6.0
pip install gemlite==0.4.4
pip install hqq==0.2.5
pip install vllm
pip install optimum
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
pip install peft
huggingface-cli login
python main.py

Note: Since gemlite requires torch to be installed first, installing everything from requirements.txt may fail; we therefore recommend installing the dependencies step by step as listed above.
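To try inference without training anything, the published quantized checkpoint linked above can be loaded directly with vLLM. A minimal sketch (the sampling settings are illustrative assumptions):

```python
# Load the published GPTQ checkpoint with vLLM and generate one response.
from vllm import LLM, SamplingParams

llm = LLM(model="rightpunch/Llama-3.2-3B-Instruct_PEFT_GPTQ", quantization="gptq")
params = SamplingParams(temperature=0.7, max_tokens=256)  # assumed settings

outputs = llm.generate(["How to learn a new language?"], params)
print(outputs[0].outputs[0].text)
```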
If you want to train a new model, run the full training pipeline:
cd lora_gptq
python train_lora_gptq_vllm.py

This will:
- Fine-tune a LLaMA-3 model using LoRA (if not already trained).
- Merge the LoRA adapter and quantize the model using GPTQ.
- Launch vLLM to generate responses and benchmark performance.
- Print throughput and perplexity.
- Save metrics into result.csv.
You can change the output directories for your models by modifying the following variables in the main() function in lora_gptq/train_lora_gptq_vllm.py:
peft_model_path = "your_custom_peft_path"
merged_path = "./your_custom_merged_path"
gptq_path = "./your_custom_gptq_path"

This allows you to organize multiple experiments or avoid overwriting results.
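For reference, the merge-and-quantize stage corresponds roughly to the sketch below, reusing those path variables. The 4-bit GPTQ setting and WikiText-2 calibration follow the pipeline description; everything else is an illustrative assumption:

```python
# Merge the LoRA adapter into the base weights, then quantize with GPTQ (sketch).
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

peft_model_path = "your_custom_peft_path"
merged_path = "./your_custom_merged_path"
gptq_path = "./your_custom_gptq_path"

# 1) Load base model + adapter, fold the low-rank updates into the dense weights.
model = AutoPeftModelForCausalLM.from_pretrained(
    peft_model_path, torch_dtype=torch.float16
)
merged = model.merge_and_unload()
merged.save_pretrained(merged_path)

# Assumes the tokenizer was saved alongside the adapter during training.
tokenizer = AutoTokenizer.from_pretrained(peft_model_path)
tokenizer.save_pretrained(merged_path)

# 2) Re-load the merged model with a 4-bit GPTQ config; the calibration
#    dataset is needed to estimate the quantization error layer by layer.
gptq_cfg = GPTQConfig(bits=4, dataset="wikitext2", tokenizer=tokenizer)
quantized = AutoModelForCausalLM.from_pretrained(
    merged_path, device_map="auto", quantization_config=gptq_cfg
)
quantized.save_pretrained(gptq_path)
tokenizer.save_pretrained(gptq_path)
```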
All experiments were performed on an NVIDIA T4 GPU.
Prompt: How to learn a new language?
Response: Learning a new language can be a challenging but rewarding experience. Here are some steps you can take to learn a new language:
1. Set your goals: Decide what you want to achieve with your language learning. Are you looking to travel to a foreign country, communicate with a foreign family member, or simply improve your language skills?
2. Choose your learning method: There are many ways to learn a new language, including:
   * Language classes: Enroll in a class at a language school or community college
   * Language exchange programs: Find a language partner to practice with
   * Language learning apps: Use apps like Duolingo, Babbel, or Rosetta Stone
   * Language learning software: Use software like Rosetta Stone or Pimsleur
   * Language learning books: Use books like "Language Hacking" or "Fluent Forever"
   * Language learning podcasts: Listen to podcasts like "Coffee Break" or "News in Slow"
   * Language learning YouTube channels: Watch YouTube channels like "EnglishClass101" or "English With Lucy"
3. Learn the basics: Start with the basics of the language, such as the alphabet, basic grammar rules, and common phrases.
4. Practice regularly: Practice speaking
Throughput: 87.0 toks/s
Perplexity (PPL): 9.75
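Here perplexity is the exponential of the average per-token cross-entropy on the WikiText-2 test split, so lower is better. A minimal sliding-window sketch of the measurement (the window size and model path are illustrative assumptions):

```python
# Sliding-window perplexity over WikiText-2 (window size is an assumption).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./your_custom_gptq_path"  # placeholder: the quantized model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto").eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

window, nlls = 2048, []
for start in range(0, ids.size(1), window):
    chunk = ids[:, start : start + window]
    if chunk.size(1) < 2:  # need at least one next-token target
        continue
    with torch.no_grad():
        # Labels equal inputs: Transformers shifts them internally for next-token loss.
        loss = model(chunk, labels=chunk).loss
    nlls.append(loss * chunk.size(1))

ppl = torch.exp(torch.stack(nlls).sum() / ids.size(1))
print(f"Perplexity (PPL): {ppl.item():.2f}")
```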
CSV format:
Id,value
0,9.75 # PPL
1,87.0 # Throughput
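The file itself can be produced with a few lines of standard-library code; a sketch matching the layout above:

```python
# Write result.csv in the Id,value layout shown above.
import csv

ppl, throughput = 9.75, 87.0  # values from the run above
with open("result.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "value"])
    writer.writerow([0, ppl])         # row 0: perplexity (PPL)
    writer.writerow([1, throughput])  # row 1: throughput (toks/s)
```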