Minhash Deduplication Between Two Datasets #370

@yjha9649

Hello,

I am trying to perform Minhash-based deduplication between two datasets: an existing dataset and a new dataset. The goal is to remove documents from the new dataset if they are similar to those in the existing dataset.

Currently, I'm following the steps in this Colab notebook to perform the deduplication:
https://colab.research.google.com/drive/1_nNRm8lc7KjGfj5K4UWemkis8uKfjQcz?usp=sharing

Does this approach make sense for cross-dataset deduplication?
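To make the goal concrete, here is a minimal, library-independent sketch of what I mean by cross-dataset MinHash deduplication. It is not the notebook's pipeline and not any library's API: the function names, the 3-word shingles, and the 64 hashes split into 16 bands of 4 rows are all assumptions chosen for illustration. The existing dataset is indexed by its signature bands, and each new document is dropped if any of its bands hits the index, while the ID of the matching existing document is kept.

```python
import hashlib

# Illustrative parameters (assumptions, not taken from the notebook).
NUM_HASHES = 64
BANDS = 16
ROWS = NUM_HASHES // BANDS


def shingles(text, n=3):
    """Return the set of lowercase word n-grams of a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def signature(text):
    """MinHash signature: for each seed, the minimum salted hash over all shingles."""
    doc_shingles = shingles(text)
    sig = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in doc_shingles
        ))
    return tuple(sig)


def bands(sig):
    """Split a signature into (band_index, band_values) keys for LSH lookup."""
    for b in range(BANDS):
        yield b, sig[b * ROWS:(b + 1) * ROWS]


def build_index(existing_docs):
    """Map every band of every existing document to that document's ID."""
    index = {}
    for doc_id, text in existing_docs.items():
        for key in bands(signature(text)):
            index.setdefault(key, doc_id)
    return index


def find_cross_duplicates(index, new_docs):
    """Return {new_doc_id: existing_doc_id} for new documents that hit the index."""
    matches = {}
    for doc_id, text in new_docs.items():
        for key in bands(signature(text)):
            if key in index:
                matches[doc_id] = index[key]
                break
    return matches


# Hypothetical toy data.
existing = {
    "old-0": "the quick brown fox jumps over the lazy dog near the river bank",
    "old-1": "minhash signatures approximate jaccard similarity between large sets",
}
new = {
    "new-0": "the quick brown fox jumps over the lazy dog near the river bank",
    "new-1": "completely unrelated sentence about cooking pasta with fresh tomato sauce",
}

index = build_index(existing)
matches = find_cross_duplicates(index, new)
kept = [doc_id for doc_id in new if doc_id not in matches]
```

In this sketch the index stores the existing document's ID next to each band, which is exactly the information I am unable to recover from my current output.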

Additionally, when I examine the .dups files generated by the pipeline, I can identify the document IDs from the new dataset. However, the matching document IDs from the existing dataset always appear as 4294967295, which I believe is a sentinel value (0xFFFFFFFF). As a result, I cannot trace which document in the existing dataset was matched.

Is there a way to retrieve or output the actual document IDs from the existing dataset in the .dups file or elsewhere?
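For reference, this is roughly how I am interpreting the sentinel. The sketch below only confirms that 4294967295 is 0xFFFFFFFF, the largest unsigned 32-bit integer, and shows how a pair with that value in the "other document" slot would be read. The record layout (four little-endian unsigned 32-bit integers per pair) and the `decode_pairs` helper are my assumptions for illustration, not a confirmed description of the .dups format, so please check them against the library's source.

```python
import struct

SENTINEL = 0xFFFFFFFF  # 4294967295, the maximum unsigned 32-bit integer


def decode_pairs(raw):
    """Decode duplicate pairs from bytes.

    Assumed (hypothetical) layout: each record is four little-endian uint32
    values (file_a, doc_a, file_b, doc_b). A second document equal to
    (SENTINEL, SENTINEL) is reported as None, meaning "matched something
    whose ID was not recorded".
    """
    record = struct.Struct("<4I")
    for file_a, doc_a, file_b, doc_b in record.iter_unpack(raw):
        if (file_b, doc_b) == (SENTINEL, SENTINEL):
            other = None
        else:
            other = (file_b, doc_b)
        yield (file_a, doc_a), other


# Hypothetical in-memory example: one ordinary pair and one sentinel pair.
raw = struct.pack("<4I", 0, 7, 1, 3) + struct.pack("<4I", 2, 5, SENTINEL, SENTINEL)
pairs = list(decode_pairs(raw))
```

If this reading is right, the sentinel only tells me that a new document matched something in the existing dataset, not which document it matched.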

Any help or guidance would be greatly appreciated. Thank you!
