-
Notifications
You must be signed in to change notification settings - Fork 0
Lars800/DuplicateDetection
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
This project implements a duplicate detection algorithm based on Locality Sensitive Hashing (LSH) to perform approximate matching. It then implements a version Multi Component Similarity Method+ (Hartveld et al., 2018) It was written for the Computer Sience for Business Analytics course at Erasmus School of Economics. To use the code one must first prepare the data as discribed in the accompanying paper (Recursive LSH for scalable duplicate detection). It is important maintain the order of the arguments in this dataframe. This should by Model_ID (for training only, empty otherwise), Title, any number of split features (in this case brand and size), the full feature list for a product (dictionary), Shop The cleaned data can then be passed to to the get_my_candidates method. To perform furhter cleaning, a version of the MCSM+ method is implemented in the generate_duplicates method. In the example main, the paper test case is setup. We run the algorithm for several band sizes, each outputting a different number of candidates. For robustness we repeat this step 5 times using a booststrap resample of our data.
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published