CLIP (Contrastive Language-Image Pre-training) is a vision-language model that connects images and text by learning a shared representation. The model is trained to match images with their corresponding textual descriptions without relying on traditional supervised learning over labeled datasets. Instead, it leverages a large number of image-text pairs from the web, which allows it to generalize well to new and unseen data.
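As a rough sketch of what that contrastive objective looks like (illustrative only, not the exact training code in this repo; the temperature value is an assumption), each batch of image-text pairs is pushed so that matching pairs score high and mismatched pairs score low:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # image_embeds, text_embeds: (batch, dim) outputs of the two projection heads
    image_embeds = F.normalize(image_embeds, p=2, dim=-1)
    text_embeds = F.normalize(text_embeds, p=2, dim=-1)

    # (batch, batch) similarity matrix; entry [i, j] compares caption i with image j
    logits = text_embeds @ image_embeds.T / temperature

    # the matching image for caption i sits on the diagonal, at column i
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_text = F.cross_entropy(logits, targets)     # caption -> image direction
    loss_image = F.cross_entropy(logits.T, targets)  # image -> caption direction
    return (loss_text + loss_image) / 2
```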
In this case we used two pre-trained models to encode the texts and the images (a minimal construction sketch follows the list):
- l3cube-pune/hindi-bert-v2 from Hugging Face as the text encoder
- resnet50 from timm as the image encoder
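The sketch below shows how the two backbones could be wired together with projection heads; the ProjectionHead class and the 256-dimensional projection size are assumptions for illustration, not necessarily this repo's exact implementation:

```python
import timm
import torch.nn as nn
from transformers import AutoModel

class ProjectionHead(nn.Module):
    """Projects an encoder's output into the shared embedding space."""
    def __init__(self, embedding_dim, projection_dim=256):
        super().__init__()
        self.projection = nn.Linear(embedding_dim, projection_dim)

    def forward(self, x):
        return self.projection(x)

# image encoder: resnet50 from timm, with the classification head removed
image_encoder = timm.create_model("resnet50", pretrained=True, num_classes=0, global_pool="avg")
image_projection = ProjectionHead(embedding_dim=2048)

# text encoder: Hindi BERT from Hugging Face
text_encoder = AutoModel.from_pretrained("l3cube-pune/hindi-bert-v2")
text_projection = ProjectionHead(embedding_dim=text_encoder.config.hidden_size)
```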
We also provide a notebook for fine-tuning the model.
The model takes a text prompt such as घास पर कुत्ता ("dog on the grass") together with a set of candidate images. The prompt is passed to the text encoder and the images to the image encoder, the similarities between the resulting embeddings are computed, and the image that best matches the prompt is returned.
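In rough pseudocode the retrieval flow looks like this; the model(...) call mirrors the full runnable example further down in this README, while rank_images itself is a hypothetical helper:

```python
import torch.nn.functional as F

def rank_images(model, tokenizer, prompt, image_tensors):
    # encode the prompt and the batch of preprocessed images
    text_inputs = tokenizer([prompt], padding=True, truncation=True, return_tensors="pt")
    text_inputs = text_inputs.to(image_tensors.device)
    image_features, text_features = model(
        image_tensors,
        text_input_ids=text_inputs["input_ids"],
        text_attention_mask=text_inputs["attention_mask"],
    )

    # cosine similarity between the prompt and every image
    image_features = F.normalize(image_features, p=2, dim=-1)
    text_features = F.normalize(text_features, p=2, dim=-1)
    similarities = (text_features @ image_features.T).squeeze(0)

    return similarities.argsort(descending=True)  # best-matching images first
```

The model was trained with the following hyperparameters: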
| Hyperparameter | Value |
|---|---|
| batch_size | 32 |
| num_workers | 2 |
| head_lr | 1e-3 |
| image_encoder_lr | 1e-4 |
| text_encoder_lr | 1e-5 |
| weight_decay | 1e-3 |
| patience | 1 |
| factor | 0.8 |
The optimizer used was AdamW, a variant of Adam that decouples weight decay from the gradient update. In this setup, the weight decay is set to 1e-3 (see the table above), which regularizes the projection layers and helps keep training stable.
For this CLIP model, each component was assigned its own learning rate, tailored to its role. The image encoder and text encoder get distinct learning rates, allowing finer-grained control over their training dynamics, while the projection layers, which align image and text features in the shared space, use a higher learning rate (coupled with weight decay to prevent overfitting) since they are trained from scratch. This way each part of the model receives an appropriate amount of optimization, improving overall training efficiency.
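A sketch of what those parameter groups could look like with the values from the table; attribute names such as model.image_encoder and model.image_projection are assumptions about the module layout:

```python
import itertools
import torch

params = [
    {"params": model.image_encoder.parameters(), "lr": 1e-4},   # image_encoder_lr
    {"params": model.text_encoder.parameters(), "lr": 1e-5},    # text_encoder_lr
    {"params": itertools.chain(
        model.image_projection.parameters(),
        model.text_projection.parameters(),
    ), "lr": 1e-3, "weight_decay": 1e-3},                        # head_lr + weight_decay
]
optimizer = torch.optim.AdamW(params, weight_decay=0.0)
```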
A learning rate scheduler, ReduceLROnPlateau, was employed to adjust the learning rate based on validation performance: whenever the validation loss plateaus for more than 1 epoch (the patience), the learning rate is multiplied by a factor of 0.8. This adaptive adjustment helps fine-tune the model further once progress slows down, leading to more stable training.
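Wiring that up looks roughly like this; train_one_epoch, evaluate, train_loader, and valid_loader are hypothetical stand-ins for the training loop in the notebook, and optimizer is the one defined above:

```python
import torch

lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", patience=1, factor=0.8
)

for epoch in range(10):  # the run reported below used 10 epochs
    train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, valid_loader)
    lr_scheduler.step(val_loss)  # step with the validation loss, not the epoch index
```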
- The training loss started at 18.4 and gradually decreased to 0.84 after 10 epochs.
- The validation loss decreased to about 1.202 by the end.
- Continued training did not decrease the loss any further, since the model was trained on a very small dataset.
First you will have to git clone this repo and install the requirements:

```python
!git clone https://github.com/dame-cell/clip-hindi.git
%cd clip-hindi
!pip install -r requirements.txt
```

Then you will have to download the PyTorch model from Hugging Face:
```python
from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="damerajee/clip-hindi", filename="model.pt", local_dir="model")
```

You can then load the model and the image as well:
```python
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import requests
from PIL import Image
from transformers import AutoTokenizer

from clip.modeling_clip import CLIPModel

# load the model weights
model = CLIPModel().to("cuda")
model.load_state_dict(torch.load("where you stored your model.pt", map_location="cuda"))

# tokenize the candidate captions
tokenizer = AutoTokenizer.from_pretrained("l3cube-pune/hindi-bert-v2")
text_inputs = tokenizer([
    "एक बिल्ली की लेटी हुई तस्वीर",  # "a picture of a cat lying down"
    "एक कुत्ते की लेटी हुई तस्वीर",  # "a picture of a dog lying down"
], padding=True, truncation=True, return_tensors="pt")

# download and preprocess the image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image_np = np.array(image)

preprocessor = model.preprocess()
preprocessed = preprocessor(image=image_np)['image']
preprocessed_tensor = torch.tensor(preprocessed).float()
preprocessed_tensor = preprocessed_tensor.permute(2, 0, 1)  # change shape to [C, H, W]
preprocessed_tensor = preprocessed_tensor.unsqueeze(0)      # add the batch dimension

processed_image = preprocessed_tensor.to("cuda")
text_inputs = text_inputs.to("cuda")
```

and then you can try it out:
```python
# encode the image and the candidate captions
image_features, text_features = model(
    processed_image,
    text_input_ids=text_inputs['input_ids'],
    text_attention_mask=text_inputs['attention_mask'],
)

# normalize the features
image_embeddings_n = F.normalize(image_features, p=2, dim=-1)
text_embeddings_n = F.normalize(text_features, p=2, dim=-1)

# calculate the similarities
dot_similarity = text_embeddings_n @ image_embeddings_n.T
print("dot_similarity", dot_similarity)
# output -> dot_similarity tensor([[0.0502], [0.0465]], grad_fn=<MmBackward0>)
```

Citation:

```bibtex
@software{Shariatnia_Simple_CLIP_2021,
author = {Shariatnia, M. Moein},
doi = {10.5281/zenodo.6845731},
month = {4},
title = {{Simple CLIP}},
version = {1.0.0},
year = {2021}
}
```
