This is our submitted solution for the SIGMOD 2022 Programming Contest.
Advisors:
- Yannis Foufoulas
- Theofilos Mailis
Results:
- 11th place out of 55 teams
- Total (average) recall score: 46.9% (1st place: 52.9%)
  - 71% on the D1 dataset
  - 22.7% on the D2 dataset
- < Runtime to be found > (1st place: 1914 secs)
The task is to perform blocking for Entity Resolution: quickly filter out non-matches (tuple pairs that are unlikely to represent the same real-world entity) within a limited time budget, producing a small candidate set of tuple pairs for matching.
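As a concrete illustration of what blocking does (a minimal sketch, not the submitted solution's algorithm), the following token-blocking example groups records by shared tokens and emits candidate pairs only within each block, so records with no token in common are never compared:

```python
from itertools import combinations

def token_blocking(records):
    """Group records by shared tokens, then emit candidate pairs
    only within each block (simplified illustration)."""
    blocks = {}
    for rid, text in records.items():
        for token in set(text.lower().split()):
            blocks.setdefault(token, set()).add(rid)
    candidates = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            candidates.add((a, b))  # canonical order: a < b
    return candidates

# Hypothetical sample records (id -> title)
records = {1: "acer aspire 5", 2: "acer aspire laptop", 3: "lenovo thinkpad"}
pairs = token_blocking(records)
# records 1 and 2 share the tokens "acer" and "aspire", so only
# the pair (1, 2) survives as a candidate
```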
Participants are asked to solve the task on two product datasets. Each dataset is made of a list of instances (rows) and a list of properties describing them (columns). We will refer to each of these datasets as Di.
For each dataset Di, participants are provided with the following resources:
- Xi : a subset of the instances in Di
- Yi : matching pairs in Xi x Xi. (The pairs not in Yi are non-matching pairs.)
- Blocking Requirements: the size of the generated candidate set (i.e., the number of tuple pairs in the candidate set)
Note that matching pairs in Yi are transitively closed (i.e., if A matches with B and B matches with C, then A matches with C). For a matching pair id1 and id2 with id1 < id2, Yi only includes (id1, id2) and doesn't include (id2, id1).
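The transitive-closure property above can be checked with a small union-find pass over the pairs (a hedged sketch; the contest does not prescribe this check):

```python
from itertools import combinations

def is_transitively_closed(pairs):
    """Check that a set of canonical (id1 < id2) matching pairs is
    transitively closed, using a simple union-find over the ids."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    # group ids by their cluster root
    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)

    # every within-cluster pair (in canonical order) must be present
    expected = {p for ids in clusters.values()
                for p in combinations(sorted(ids), 2)}
    return expected == set(pairs)
```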
The goal is to write a program that generates, for each Xi dataset, a candidate set of tuple pairs for matching Xi with Xi. The output must be stored in a CSV file named "output.csv" containing the ids of the tuple pairs in the candidate set. The CSV file must have two columns, "left_instance_id" and "right_instance_id", and the separator must be the comma. Note that we do not consider the trivial equi-joins (tuple pairs with left_instance_id = right_instance_id) as true matches. For a pair id1 and id2 (assume id1 < id2), we only include (id1, id2), not (id2, id1), in "output.csv".
Solutions are evaluated over the complete dataset Di. Note that the instances in Di (except the sample Xi) are not provided to the participants. More details are available in the Evaluation Process section.
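Since the scores above are recall scores, the evaluation can be pictured as the fraction of true matching pairs (over the hidden Di) that survive in the candidate set. A minimal sketch, assuming pairs are kept as canonical (id1 < id2) tuples:

```python
def recall(candidates, true_matches):
    """Recall of a candidate set: the fraction of true matching pairs
    that the candidate set retains (pairs in canonical id1 < id2 order)."""
    if not true_matches:
        return 0.0
    return len(candidates & true_matches) / len(true_matches)

# hypothetical example: 2 of the 3 true matches are retained
r = recall({(1, 2), (1, 3), (5, 9)}, {(1, 2), (1, 3), (2, 3)})
```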
Both Xi and Yi are in CSV format.
Example of dataset Xi
| instance_id | attr_name_1 | attr_name_2 | ... | attr_name_k |
|---|---|---|---|---|
| 00001 | value_1 | null | ... | value_k |
| 00002 | null | value_2 | ... | value_k |
| ... | ... | ... | ... | ... |
Example of dataset Yi
| left_instance_id | right_instance_id |
|---|---|
| 00001 | 00002 |
| 00001 | 00003 |
| ... | ... |
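Since the pairs not in Yi are non-matching by definition, a labeled training set can be derived by enumerating all canonical pairs over Xi's ids and looking them up in Yi. A sketch with pandas, using small in-memory stand-ins (the values are hypothetical, not from the real datasets):

```python
import io
from itertools import combinations

import pandas as pd

# Hypothetical stand-ins for X1.csv and Y1.csv
x1 = pd.read_csv(io.StringIO("instance_id,title\n1,a\n2,b\n3,c\n4,d\n"))
y1 = pd.read_csv(io.StringIO(
    "left_instance_id,right_instance_id\n1,2\n1,3\n2,3\n"))

matches = set(zip(y1["left_instance_id"], y1["right_instance_id"]))

# Every canonical pair (id1 < id2) not listed in Y1 is a non-match.
labeled = {(a, b): (a, b) in matches
           for a, b in combinations(sorted(x1["instance_id"]), 2)}
```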
More details about the datasets can be found in the dedicated Datasets section.
Example of output.csv
| left_instance_id | right_instance_id |
|---|---|
| 00001 | 00002 |
| 00001 | 00004 |
| ... | ... |
output.csv format: The evaluation process expects "output.csv" to have 3,000,000 tuple pairs. The first 1,000,000 tuple pairs are for dataset X1 and the remaining 2,000,000 are for dataset X2, so "output.csv" must be formatted accordingly. You can check out the provided baseline solution for how to produce a valid "output.csv".
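One way to produce a correctly sized file is to truncate or pad each dataset's pair list to its fixed quota before concatenating. The sketch below assumes padding with (0, 0) dummy rows for any shortfall; that padding scheme is an assumption of this example, not the official one:

```python
import pandas as pd

def format_submission(x1_pairs, x2_pairs,
                      x1_size=1_000_000, x2_size=2_000_000):
    """Build the 3,000,000-row output.csv expected by the evaluator:
    the first 1,000,000 pairs for X1, then 2,000,000 pairs for X2.
    Padding with (0, 0) dummy rows is an assumption of this sketch."""
    def fit(pairs, size):
        pairs = list(pairs)[:size]               # truncate any excess
        pairs += [(0, 0)] * (size - len(pairs))  # pad any shortfall
        return pairs

    rows = fit(x1_pairs, x1_size) + fit(x2_pairs, x2_size)
    df = pd.DataFrame(rows,
                      columns=["left_instance_id", "right_instance_id"])
    df.to_csv("output.csv", index=False)
    return df
```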
Requirements:
- Python 3.8 or newer
- `pandas`
- `frozendict`
- ReproZip was used for packing and executing the submitted solutions, but is not required.
- Python versions:
  - Python 3.8.10
  - PyPy 7.3.9 (Python 3.9.2)
- OS:
  - WSL Ubuntu 20.04
Repository structure:
- `baseline` directory:
  - `blocking.py`: The provided baseline solution
- `datasets` directory:
  - `X1.csv` (X1 dataset) & `Y1.csv` (matching pairs for X1)
  - `X2.csv` (X2 dataset) & `Y2.csv` (matching pairs for X2)
- `output_misc` directory: To store secondary `.csv` files, used for analyzing the main `output.csv` file (see below)
- `src` directory:
  - Submitted files:
    - `run.py`: Starting point of the solution
    - `x1_blocking.py`: X1-specific solution logic, definitions & routines
    - `x2_blocking.py`: X2-specific solution logic, definitions & routines
    - `utils.py`: General definitions used by both solutions
    - `output.csv`: Non-formatted output for the given X1 dataset
  - Scripts for quick usage of ReproZip:
    - `traceAndPack.sh`: Run `run.py` and pack the execution in `submission.rpz`
    - `cleanReprozip.sh`: Clean all files and directories generated by ReproZip (including `submission.rpz`)
  - Scripts for analyzing the solution performance & `output.csv`:
    - `compare.py`: Find correct, missed & false positive pairs in `output.csv` and store them (with titles) in corresponding `.csv` files in the `output_misc` directory. Also display the number of pairs in each category, as well as the recall score.
    - Bash scripts for separating the `.csv` files generated by `compare.py` by brand, and storing the brand-specific `.csv`'s in `output_misc/false`, `output_misc/missed` and `output_misc/common`.
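The three categories that `compare.py` reports can be sketched with plain set operations (a hypothetical simplification, not the repository's actual script):

```python
def categorize(output_pairs, true_pairs):
    """Split candidate pairs the way a compare script might
    (hypothetical sketch):
    correct  = candidate pairs that are true matches
    missed   = true matches absent from the candidate set
    false_p  = candidate pairs that are not true matches"""
    correct = output_pairs & true_pairs
    missed = true_pairs - output_pairs
    false_p = output_pairs - true_pairs
    return correct, missed, false_p
```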
In the `src` directory:

- Simple Execution:
  1. Choose the desired dataset to run the experiment on: in `utils.py`, set `TARGET_DATASET` accordingly.
  2. If you wish to format `output.csv` to have precisely 3,000,000 rows, set `SUBMISSION_MODE` to `True`. To skip the solution for a dataset, set `IGNORE_DATASET` to `'1'` or `'2'` (`''` to not skip).
  3. Run the solution: `python3 run.py`
  4. To see the stats for the answer generated by the solution: `python3 compare.py`
- Execute & Pack with ReproZip:
  1. Select the desired dataset & parameters as above.
  2. Run the solution & pack it in `submission.rpz`: `./traceAndPack`
  3. To clean up the generated files, including `submission.rpz`: `./cleanReprozip`
...