This is our submitted solution for the SIGMOD 2022 Programming Contest.
Advisors:
- Yannis Foufoulas
- Theofilos Mailis
Results:
- 11th place out of 55 teams
- Total (average) recall score: 46.9% (1st place: 52.9%)
  - 71% on the D1 dataset
  - 22.7% on the D2 dataset
- < Runtime to be found > (1st place: 1914 secs)
The task is to perform blocking for Entity Resolution: quickly filter out non-matches (tuple pairs that are unlikely to represent the same real-world entity) within a limited time budget, producing a small candidate set of tuple pairs for matching.
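As a concrete illustration of what blocking does (a minimal sketch, not the submitted solution's algorithm), the following token-blocking example groups records by shared tokens and emits candidate pairs only within each block, so records with no token in common are never compared:

```python
from itertools import combinations

def token_blocking(records):
    """Group records by shared tokens, then emit candidate pairs
    only within each block (simplified illustration)."""
    blocks = {}
    for rid, text in records.items():
        for token in set(text.lower().split()):
            blocks.setdefault(token, set()).add(rid)
    candidates = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            candidates.add((a, b))  # canonical order: a < b
    return candidates

# Hypothetical sample records (id -> title)
records = {1: "acer aspire 5", 2: "acer aspire laptop", 3: "lenovo thinkpad"}
pairs = token_blocking(records)
# records 1 and 2 share the tokens "acer" and "aspire", so only
# the pair (1, 2) survives as a candidate
```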
Participants are asked to solve the task on two product datasets. Each dataset is made of a list of instances (rows) and a list of properties describing them (columns). We will refer to each of these datasets as Di.
For each dataset Di, participants are provided with the following resources:
- Xi : a subset of the instances in Di
- Yi : matching pairs in Xi x Xi. (The pairs not in Yi are non-matching pairs.)
- Blocking Requirements: the size of the generated candidate set (i.e., the number of tuple pairs in the candidate set)
Note that matching pairs in Yi are transitively closed (i.e., if A matches with B and B matches with C, then A matches with C). For a matching pair id1 and id2 with id1 < id2, Yi only includes (id1, id2) and doesn't include (id2, id1).
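The transitive-closure property above can be checked with a small union-find pass over the pairs (a hedged sketch; the contest does not prescribe this check):

```python
from itertools import combinations

def is_transitively_closed(pairs):
    """Check that a set of canonical (id1 < id2) matching pairs is
    transitively closed, using a simple union-find over the ids."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    # group ids by their cluster root
    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)

    # every within-cluster pair (in canonical order) must be present
    expected = {p for ids in clusters.values()
                for p in combinations(sorted(ids), 2)}
    return expected == set(pairs)
```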
The goal is to write a program that generates, for each Xi dataset, a candidate set of tuple pairs for matching Xi with Xi. The output must be stored in a CSV file named "output.csv" containing the ids of the tuple pairs in the candidate set. The CSV file must have two columns, "left_instance_id" and "right_instance_id", and the separator must be the comma. Note that we do not consider the trivial equi-joins (tuple pairs with left_instance_id = right_instance_id) as true matches. For a pair id1 and id2 (assume id1 < id2), we only include (id1, id2), not (id2, id1), in "output.csv".
Solutions are evaluated over the complete dataset Di. Note that the instances in Di (except the sample Xi) are not provided to the participants. More details are available in the Evaluation Process section.
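Since the scores above are recall scores, the evaluation can be pictured as the fraction of true matching pairs (over the hidden Di) that survive in the candidate set. A minimal sketch, assuming pairs are kept as canonical (id1 < id2) tuples:

```python
def recall(candidates, true_matches):
    """Recall of a candidate set: the fraction of true matching pairs
    that the candidate set retains (pairs in canonical id1 < id2 order)."""
    if not true_matches:
        return 0.0
    return len(candidates & true_matches) / len(true_matches)

# hypothetical example: 2 of the 3 true matches are retained
r = recall({(1, 2), (1, 3), (5, 9)}, {(1, 2), (1, 3), (2, 3)})
```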
Both Xi and Yi are in CSV format.
Example of dataset Xi
| instance_id | attr_name_1 | attr_name_2 | ... | attr_name_k |
|---|---|---|---|---|
| 00001 | value_1 | null | ... | value_k |
| 00002 | null | value_2 | ... | value_k |
| ... | ... | ... | ... | ... |
Example of dataset Yi
| left_instance_id | right_instance_id |
|---|---|
| 00001 | 00002 |
| 00001 | 00003 |
| ... | ... |
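Since the pairs not in Yi are non-matching by definition, a labeled training set can be derived by enumerating all canonical pairs over Xi's ids and looking them up in Yi. A sketch with pandas, using small in-memory stand-ins (the values are hypothetical, not from the real datasets):

```python
import io
from itertools import combinations

import pandas as pd

# Hypothetical stand-ins for X1.csv and Y1.csv
x1 = pd.read_csv(io.StringIO("instance_id,title\n1,a\n2,b\n3,c\n4,d\n"))
y1 = pd.read_csv(io.StringIO(
    "left_instance_id,right_instance_id\n1,2\n1,3\n2,3\n"))

matches = set(zip(y1["left_instance_id"], y1["right_instance_id"]))

# Every canonical pair (id1 < id2) not listed in Y1 is a non-match.
labeled = {(a, b): (a, b) in matches
           for a, b in combinations(sorted(x1["instance_id"]), 2)}
```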
More details about the datasets can be found in the dedicated Datasets section.
Example of output.csv
| left_instance_id | right_instance_id |
|---|---|
| 00001 | 00002 |
| 00001 | 00004 |
| ... | ... |
output.csv format: The evaluation process expects "output.csv" to have 3,000,000 tuple pairs. The first 1,000,000 tuple pairs are for dataset X1 and the remaining 2,000,000 are for dataset X2, so "output.csv" must be formatted accordingly. You can check out the provided baseline solution for how to produce a valid "output.csv".
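One way to produce a correctly sized file is to truncate or pad each dataset's pair list to its fixed quota before concatenating. The sketch below assumes padding with (0, 0) dummy rows for any shortfall; that padding scheme is an assumption of this example, not the official one:

```python
import pandas as pd

def format_submission(x1_pairs, x2_pairs,
                      x1_size=1_000_000, x2_size=2_000_000):
    """Build the 3,000,000-row output.csv expected by the evaluator:
    the first 1,000,000 pairs for X1, then 2,000,000 pairs for X2.
    Padding with (0, 0) dummy rows is an assumption of this sketch."""
    def fit(pairs, size):
        pairs = list(pairs)[:size]               # truncate any excess
        pairs += [(0, 0)] * (size - len(pairs))  # pad any shortfall
        return pairs

    rows = fit(x1_pairs, x1_size) + fit(x2_pairs, x2_size)
    df = pd.DataFrame(rows,
                      columns=["left_instance_id", "right_instance_id"])
    df.to_csv("output.csv", index=False)
    return df
```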
Requirements:
- Python 3.8 or newer
- `pandas`
- `frozendict`
- ReproZip was used for packing and executing the submitted solutions, but is not required.
- Python versions:
  - Python 3.8.10
  - PyPy 7.3.9 (Python 3.9.2)
- OS:
  - WSL Ubuntu 20.04
Repository structure:
- `baseline` directory:
  - `blocking.py`: The provided baseline solution
- `datasets` directory:
  - `X1.csv` (X1 dataset) & `Y1.csv` (matching pairs for X1)
  - `X2.csv` (X2 dataset) & `Y2.csv` (matching pairs for X2)
- `output_misc` directory: To store secondary `.csv` files, used for analyzing the main `output.csv` file (see below)
- `src` directory:
  - Submitted files:
    - `run.py`: Starting point of the solution
    - `x1_blocking.py`: X1-specific solution logic, definitions & routines
    - `x2_blocking.py`: X2-specific solution logic, definitions & routines
    - `utils.py`: General definitions used by both solutions
    - `output.csv`: Non-formatted output for the given X1 dataset
  - Scripts for quick usage of ReproZip:
    - `traceAndPack.sh`: Run `run.py` and pack the execution in `submission.rpz`
    - `cleanReprozip.sh`: Clean all files and directories generated by ReproZip (including `submission.rpz`)
  - Scripts for analyzing the solution performance & `output.csv`:
    - `compare.py`: Find correct, missed & false positive pairs in `output.csv` and store them (with titles) in corresponding `.csv` files in the `output_misc` directory. Also display the number of pairs in each category, as well as the recall score.
    - Bash scripts for separating the `.csv` files generated by `compare.py` by brand, and storing the brand-specific `.csv`'s in `output_misc/false`, `output_misc/missed` and `output_misc/common`.
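The three categories that `compare.py` reports can be sketched with plain set operations (a hypothetical simplification, not the repository's actual script):

```python
def categorize(output_pairs, true_pairs):
    """Split candidate pairs the way a compare script might
    (hypothetical sketch):
    correct  = candidate pairs that are true matches
    missed   = true matches absent from the candidate set
    false_p  = candidate pairs that are not true matches"""
    correct = output_pairs & true_pairs
    missed = true_pairs - output_pairs
    false_p = output_pairs - true_pairs
    return correct, missed, false_p
```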
In the `src` directory:

- Simple Execution:
  1. Choose the desired dataset to run the experiment on: in `utils.py`, set `TARGET_DATASET` accordingly.
  2. If you wish to format `output.csv` to have precisely 3,000,000 rows, set `SUBMISSION_MODE` to `True`. To skip the solution for a dataset, set `IGNORE_DATASET` to `'1'` or `'2'` (`''` to not skip).
  3. Run the solution: `python3 run.py`
  4. To see the stats for the answer generated by the solution: `python3 compare.py`
- Execute & Pack with ReproZip:
  1. Select the desired dataset & parameters as above.
  2. Run the solution & pack it in `submission.rpz`: `./traceAndPack`
  3. To clean up the generated files, including `submission.rpz`: `./cleanReprozip`
...