-
Install Django, Dedupe and other dependencies in a virtual-env.
pip install django
pip install dedupe
-
Clone the repo.
-
Start Apache and MySQL on localhost.
-
In dedupe/settings.py, set your database name, username and password. Then open localhost/phpmyadmin and create a new database with the same name.
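As a rough sketch, the database block in dedupe/settings.py might look like the following. The database name, username, password, host and port shown here are placeholders, not values from this repo. It also assumes Django's built-in MySQL backend, which needs a MySQL driver such as mysqlclient installed in the same virtual environment.

```python
# dedupe/settings.py (excerpt). Every value below is a placeholder: replace it
# with the database name, username and password you created in phpMyAdmin.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",  # Django's built-in MySQL backend
        "NAME": "dedupe_db",       # hypothetical database name
        "USER": "root",            # hypothetical MySQL username
        "PASSWORD": "",            # hypothetical MySQL password
        "HOST": "127.0.0.1",       # assumes MySQL runs on localhost
        "PORT": "3306",            # default MySQL port
    }
}
```

The value of NAME must match the database created in phpMyAdmin, otherwise the migrate step will fail to connect.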
-
Make migrations and run the server as:
python manage.py makemigrations
python manage.py migrate
python manage.py runserver
Caution: Do not refresh the page while deduping is in progress, to avoid data loss.
-
Upload your data file, give it a name and click Submit.
-
Select the Unique Identity Column of your data set and the columns on which the data should be deduped.
-
Train the system
| Option | Result |
| --- | --- |
| Yes | The machine treats all similar data pairs as the same |
| No | The machine treats all similar data pairs as different |
| Unsure | The machine skips that question during training |
| Finish | The machine stops training and starts deduping the data |

-
Wait until the system redirects you to a page with an option to download the output file, then click the file to download it.
-
The output file contains two new columns added at the front: Cluster ID and Confidence Score. The Cluster ID is the same for all data rows belonging to the same cluster.
-
Tip: Sort the output by Confidence Score in descending order and by Cluster ID to estimate the accuracy of the deduping.
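If you prefer to do this sorting in code rather than in a spreadsheet, a minimal sketch with Python's standard csv module could look like this. The sample rows, the extra `id` and `name` columns, and the helper name `sort_for_review` are made up for illustration. It also assumes the Cluster ID values are integers and that the output is a comma-separated file with the two column headers named above.

```python
import csv
import io

# Made-up sample standing in for the downloaded output file.
SAMPLE = """Cluster ID,Confidence Score,id,name
1,0.62,11,Jon Smith
0,0.97,7,Ann Lee
1,0.62,12,John Smith
0,0.97,8,Anne Lee
"""


def sort_for_review(rows):
    """Order rows by Confidence Score (descending), then by Cluster ID."""
    return sorted(
        rows,
        key=lambda r: (-float(r["Confidence Score"]), int(r["Cluster ID"])),
    )


# For a real file, replace io.StringIO(SAMPLE) with open("your_output.csv", newline="").
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
ordered = sort_for_review(rows)
for row in ordered:
    print(row["Cluster ID"], row["Confidence Score"], row["name"])
```

Rows of the same cluster end up next to each other, so you can read down the list and check whether each group really describes the same record.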
- Thanks to my team-mates Aarti Barai and Piyush Agarwal.
- Thanks to Lakshya Foundation and Innovation Garage for providing us an awesome platform, Makeathon - 6.0, to showcase our skills.
- Thanks to Almabase for providing the challenging Problem Statement and to our mentors Kalyan Verma and Vaibhav Awachat.
Drop a mail at ucssvamsi@gmail.com