Minhash Deduplication Between Two Datasets

Hello,

I am trying to perform MinHash-based deduplication between two datasets: an existing dataset and a new dataset. The goal is to remove documents from the new dataset that are near-duplicates of documents in the existing dataset.
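To make the goal concrete, here is a minimal, library-independent sketch of what I mean by cross-dataset deduplication. It is not the pipeline from the notebook: the function names, the shingle size, the number of hashes, and the 0.8 threshold are all assumptions chosen for illustration, and it compares every pair by brute force instead of using LSH buckets.

```python
import hashlib

NUM_HASHES = 64    # signature length (assumed value)
SHINGLE_SIZE = 3   # words per shingle (assumed value)


def shingles(text, n=SHINGLE_SIZE):
    """Return the set of word n-grams of a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_hashes=NUM_HASHES):
    """One minimum per salted hash function over the document's shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def find_cross_duplicates(existing, new, threshold=0.8):
    """Map each new doc ID to the first existing doc ID it is a near-duplicate of.

    `existing` and `new` are dicts of doc ID -> text. New documents are only
    compared against existing ones, never against each other.
    """
    existing_sigs = {doc_id: minhash_signature(t) for doc_id, t in existing.items()}
    matches = {}
    for new_id, text in new.items():
        sig = minhash_signature(text)
        for old_id, old_sig in existing_sigs.items():
            if estimated_jaccard(sig, old_sig) >= threshold:
                matches[new_id] = old_id
                break
    return matches
```

The important part for my question is the return value: each dropped document in the new dataset is paired with the ID of the existing document it matched.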

Currently, I’m following the steps in this notebook to perform deduplication:
https://colab.research.google.com/drive/1_nNRm8lc7KjGfj5K4UWemkis8uKfjQcz?usp=sharing

Does this approach make sense for cross-dataset deduplication?

Additionally, when I examine the .dups files generated by the pipeline, I can identify the document IDs from the new dataset. However, the corresponding IDs for the existing dataset always appear as 4294967295 (0xFFFFFFFF), which I believe is a sentinel value. As a result, I cannot trace which document in the existing dataset was matched.
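For reference, this is roughly how I am inspecting the files. The sketch assumes that a .dups file is a flat sequence of records, each made of four little-endian unsigned 32-bit integers (a file index and a document index for each side of a pair). That layout is my assumption and may not match what the pipeline actually writes; `read_dups` and the field names are hypothetical.

```python
import struct

SENTINEL = 0xFFFFFFFF            # 4294967295, the value I see for the existing dataset
RECORD = struct.Struct("<4I")    # assumed layout: four little-endian uint32 per record


def read_dups(path):
    """Yield one dict per record in a .dups file (layout is an assumption)."""
    with open(path, "rb") as f:
        data = f.read()
    # Ignore any trailing bytes that do not form a complete record.
    usable = len(data) - len(data) % RECORD.size
    for a_file, a_doc, b_file, b_doc in RECORD.iter_unpack(data[:usable]):
        yield {
            "a_file": a_file,
            "a_doc": a_doc,
            "b_file": b_file,
            "b_doc": b_doc,
            # True when the first side carries only the sentinel, i.e. no traceable ID.
            "a_is_sentinel": a_file == SENTINEL and a_doc == SENTINEL,
        }
```

With this reading, every record that involves the existing dataset has `a_is_sentinel` set, so only the new-dataset side of the pair can be identified.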

Is there a way to retrieve or output the actual document IDs from the existing dataset in the .dups file or elsewhere?

Any help or guidance would be greatly appreciated. Thank you!



Minhash Deduplication Between Two Datasets #370
