Custom Vision Transformer (ViT) model fine-tuned for 5 specific classes using google/vit-base-patch16-224.
This project:
- Uses the pre-trained `google/vit-base-patch16-224` model
- Modifies it from 1000 classes to 5 custom classes: `my_cat`, `my_dog`, `my_car`, `my_house`, `my_phone`
- Trains it on your custom image dataset
- Tests it on your own photos
The base google/vit-base-patch16-224 model is pre-trained on ImageNet with 1000 general classes (like "cat", "dog", "car", etc.). This project customizes it for 5 specific classes:
Before (Base Model):
- 1000 output classes (ImageNet categories)
- Generic labels like "Egyptian cat", "golden retriever", "sports car"
- Classifier head: `Linear(768, 1000)`
After (Custom Model):
- 5 output classes: `my_cat`, `my_dog`, `my_car`, `my_house`, `my_phone`
- Personalized labels for your specific objects
- Classifier head: `Linear(768, 5)`
- Model Architecture Modification (`model_custom.py`):
  - Loads the pre-trained ViT model (weights preserved)
  - Replaces the final classification layer from 1000 → 5 outputs
  - Updates label mappings (`id2label`, `label2id`)
  - Keeps all pre-trained feature extraction layers (transfer learning)
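The head replacement described above can be sketched with the `transformers` API. This is a minimal offline sketch: it builds the same ViT-base architecture from a `ViTConfig`, whereas `model_custom.py` itself would load the pre-trained weights with `from_pretrained("google/vit-base-patch16-224", ...)`.

```python
from transformers import ViTConfig, ViTForImageClassification

labels = ["my_cat", "my_dog", "my_car", "my_house", "my_phone"]
id2label = {i: name for i, name in enumerate(labels)}
label2id = {name: i for i, name in enumerate(labels)}

# model_custom.py would call ViTForImageClassification.from_pretrained(...)
# to keep the pre-trained weights; building from a config keeps this sketch
# runnable without downloading the checkpoint.
config = ViTConfig(num_labels=len(labels), id2label=id2label, label2id=label2id)
model = ViTForImageClassification(config)

print(model.classifier)  # the new 5-way head: Linear(in_features=768, out_features=5, bias=True)
```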
- Fine-Tuning (`train.py`):
  - Freezes most layers, trains only the new classifier head
  - Uses your custom images to learn class-specific features
  - Adapts the model to recognize your specific objects
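The freeze-the-backbone step can be illustrated in a few lines. This is a sketch of the idea only, not the actual `train.py` code, and it builds a fresh model from a config so it runs standalone.

```python
from transformers import ViTConfig, ViTForImageClassification

# Sketch only: train.py would operate on the saved custom model instead.
model = ViTForImageClassification(ViTConfig(num_labels=5))

# Freeze every backbone parameter; only the classifier head stays trainable.
for param in model.vit.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)  # 768 * 5 weights + 5 biases = 3845
```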
Key Benefits:
- Leverages pre-trained features (no training from scratch)
- Fast training (only classifier head needs learning)
- Personalized for your specific objects
- Requires less data than training from scratch
- Base Model: Vision Transformer (ViT) with patch size 16×16, 224×224 input
- Feature Dimension: 768 (hidden size)
- Modification: Final linear layer changed from `768 → 1000` to `768 → 5`
- Training: Fine-tuning with a custom dataset using the Hugging Face `Trainer`
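The specs above fix the token layout the transformer sees; a quick sanity check of the arithmetic:

```python
# ViT-base/16 at 224x224 input, per the specs above
image_size, patch_size, hidden = 224, 16, 768

n_patches = (image_size // patch_size) ** 2  # 14 x 14 = 196 patches
seq_len = n_patches + 1                      # plus the [CLS] token -> 197
print(n_patches, seq_len)  # 196 197
```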
Install dependencies:

```bash
pip install -r requirements.txt
```

Run the script to create a custom model with your 5 classes:

```bash
python model_custom.py
```

This creates a custom model in `./custom_vit_model` with 5 classes instead of 1000.
Your classes:
- my_cat
- my_dog
- my_car
- my_house
- my_phone
Organize your images in this structure:
```
data/
  my_cat/
    image1.jpg
    image2.jpg
    ...
  my_dog/
    image1.jpg
    ...
  my_car/
    ...
  my_house/
    ...
  my_phone/
    ...
```
Important:
- Folder names must match the class names exactly: `my_cat`, `my_dog`, `my_car`, `my_house`, `my_phone`
- Use common formats: .jpg, .jpeg, .png, .bmp, .gif
- Aim for at least 50-100 images per class for best results
- The `data/` folder structure has been created for you - just add your images!
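The folder-to-label mapping implied by this layout can be sketched as follows. `collect_samples` is a hypothetical helper for illustration; the actual loading logic lives in `train.py`.

```python
from pathlib import Path

# Class names and extensions from the rules above.
CLASSES = ["my_cat", "my_dog", "my_car", "my_house", "my_phone"]
EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}

def collect_samples(data_dir):
    """Return (image_path, label_id) pairs from data_dir/<class_name>/."""
    samples = []
    for label_id, name in enumerate(CLASSES):
        class_dir = Path(data_dir) / name
        if not class_dir.is_dir():
            continue
        for path in sorted(class_dir.iterdir()):
            if path.suffix.lower() in EXTENSIONS:
                samples.append((path, label_id))
    return samples
```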
```bash
python train.py --data_dir ./data --epochs 5 --batch_size 8
```

Parameters:
- `--data_dir`: Directory with class subdirectories (default: `./data`)
- `--model_path`: Path to the custom model (default: `./custom_vit_model`)
- `--output_dir`: Where to save the trained model (default: `./trained_model`)
- `--epochs`: Number of training epochs (default: 5)
- `--batch_size`: Batch size (default: 8, reduce if out of memory)
- `--learning_rate`: Learning rate (default: 2e-5)

Example with custom parameters:

```bash
python train.py --data_dir ./data --epochs 10 --batch_size 16 --learning_rate 2e-5
```

Single image:

```bash
python test.py --image my_photo.jpg
```

All images in a directory:

```bash
python test.py --directory ./my_test_photos
```

Use a different model:
```bash
python test.py --image photo.jpg --model_path ./my_trained_model
```

Project structure:

```
huggingface-image-project/
├── requirements.txt       # Dependencies
├── model_custom.py        # Create custom 5-class model
├── train.py               # Training script
├── test.py                # Testing script
├── README.md              # This file
├── custom_vit_model/      # Created by model_custom.py
├── trained_model/         # Created by train.py
└── data/                  # Your training images
    ├── my_cat/
    ├── my_dog/
    ├── my_car/
    ├── my_house/
    └── my_phone/
```
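Under the hood, single-image prediction presumably looks something like the sketch below. The `predict` helper is hypothetical; `test.py` is the authoritative implementation.

```python
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

def predict(image_path, model_dir="./trained_model"):
    """Hypothetical sketch of single-image inference; see test.py for the real logic."""
    processor = ViTImageProcessor.from_pretrained(model_dir)
    model = ViTForImageClassification.from_pretrained(model_dir)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(-1))]
```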
Quick start:

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Create custom model
python model_custom.py

# 3. Add your images to data/ subdirectories
#    - data/my_cat/your_cat_images.jpg
#    - data/my_dog/your_dog_images.jpg
#    - data/my_car/your_car_images.jpg
#    - data/my_house/your_house_images.jpg
#    - data/my_phone/your_phone_images.jpg

# 4. Train the model
python train.py --data_dir ./data --epochs 5

# 5. Test your photos
python test.py --image my_photo.jpg
```

Tips:
- More data = better accuracy: Use at least 50-100 images per class
- Image quality: Use clear, well-lit images
- Variety: Include different angles, backgrounds, lighting conditions
- Batch size: Reduce if you run out of memory (try 4 or 8)
- Epochs: Start with 5, increase if validation accuracy is still improving
- Balance: Try to have roughly equal number of images per class
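The balance tip above is easy to check before training. `class_counts` is a hypothetical helper, not part of the project scripts:

```python
from pathlib import Path

def class_counts(data_dir="./data"):
    """Count files in each class subfolder to spot imbalance (hypothetical helper)."""
    return {
        sub.name: sum(1 for f in sub.iterdir() if f.is_file())
        for sub in sorted(Path(data_dir).iterdir())
        if sub.is_dir()
    }
```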
No images found:
- Check that your data directory structure matches the expected format
- Verify folder names match exactly: `my_cat`, `my_dog`, `my_car`, `my_house`, `my_phone`
- Ensure images have supported extensions (.jpg, .png, etc.)
Out of memory:
- Reduce `--batch_size` (try 4 or 8)
- Use smaller images or resize before training
Low accuracy:
- Add more training images per class
- Ensure images are clear and representative
- Try training for more epochs
- Check that test images are similar to training data
Model not found:
- Make sure you've run `python model_custom.py` before training
- Check that `./custom_vit_model` exists
- Python 3.8+
- PyTorch 2.0+
- Transformers 4.30+
- See `requirements.txt` for the full list
This project uses the google/vit-base-patch16-224 model from Hugging Face.