Implementation of FILIP embedding model includes padding vectors in similarity computation #31

@hyojeongyunn

Description

Hello, and thank you for your work on this repository.

I have a question regarding the implementation of the FILIP embedding model in this repository.

The original FILIP paper mentions that padding vectors are excluded from the similarity computation to prevent performance degradation:

"Unlike Khattab & Zaharia (2020), we discard the padded tokens and use average instead summation of token-wise maximum similarities when computing the image-text alignment, which enhances the cross-modal representation learning and stabilizes training."

However, based on my reading of the code here, padding vectors appear to be included in the similarity calculation.
The implementation uses top-k selection in the `get_weighted_dense_logits` function of the FILIP model.
If the top-k value (an input argument of `get_weighted_dense_logits`) is larger than the number of real (non-padded) tokens in a given text/image sample, padding vectors can end up in the similarity calculation.
And in principle, selecting the top-k vectors is not the same as dropping the vectors for padded tokens, as the toy sketch below illustrates.

https://github.com/Sense-GVT/DeCLIP/blob/main/experiments/filip_experiments/yfcc15m/yfcc15m_vit_filip/config.yaml#L22
https://github.com/Sense-GVT/DeCLIP/blob/main/prototype/model/filip.py#L71-L106
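
To make the concern concrete, here is a small toy sketch (not the repository's code; the shapes, names, and masking convention are my own assumptions) showing that a top-k average over all text positions and an average over only the non-padded tokens generally give different scores:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
n_img, n_txt, n_real = 4, 6, 3            # 3 real text tokens, 3 padding tokens

img = F.normalize(torch.randn(n_img, d), dim=-1)   # image patch embeddings
txt = F.normalize(torch.randn(n_txt, d), dim=-1)   # text token embeddings (incl. padding)
txt_mask = torch.tensor([True] * n_real + [False] * (n_txt - n_real))

sim = txt @ img.t()                        # (n_txt, n_img) token-wise cosine similarity
per_token_max = sim.max(dim=1).values      # for each text token, max over image patches

# (a) top-k over all text positions: with top_k > n_real, padding rows get selected
top_k = 5
score_topk = per_token_max.topk(top_k).values.mean()

# (b) what the paper describes: drop padded tokens, then average
score_masked = per_token_max[txt_mask].mean()

print(score_topk.item(), score_masked.item())   # the two values generally differ
```

In this toy setup the top-k average mixes in two of the padding rows, which is exactly the behaviour I would like to confirm (or rule out) for the actual implementation.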

I would like to confirm whether my understanding is correct. If padding vectors are indeed included in the similarity computation, could you clarify the reason behind this design choice?

Thank you for your time and support!
