This is our submitted solution for the SIGMOD 2022 Programming Contest.
Advisors:
- Yannis Foufoulas
- Theofilos Mailis
- 11th place out of 55 teams
- Total (average) Recall Score: 46.9% (1st place: 52.9%)
- 71% on D1 dataset
- 22.7% on D2 dataset
- < Runtime to be found > (1st place: 1914 secs)
The task is to perform blocking for Entity Resolution, i.e., quickly filter out non-matches (tuple pairs that are unlikely to represent the same real-world entity) in a limited time to generate a small candidate set that contains a limited number of tuple pairs for matching.
Participants are asked to solve the task on two product datasets. Each dataset is made of a list of instances (rows) and a list of properties describing them (columns). We will refer to each of these datasets as Di.
For each dataset Di, participants are provided with the following resources:
- Xi : a subset of the instances in Di
- Yi : matching pairs in Xi x Xi. (The pairs not in Yi are non-matching pairs.)
- Blocking Requirements: the size of the generated candidate set (i.e., the number of tuple pairs in the candidate set)
Note that matching pairs in Yi are transitively closed (i.e., if A matches with B and B matches with C, then A matches with C). For a matching pair id1 and id2 with id1 < id2, Yi only includes (id1, id2) and doesn't include (id2, id1).
The goal is to write a program that generates, for each Xi dataset, a candidate set of tuple pairs for matching Xi with Xi. The output must be stored in a CSV file containing the ids of tuple pairs in the candidate set. The CSV file must have two columns: "left;_instance_id" and "right;_instance_id" and the output file must be named "output.csv;".; The separator must be the comma. Note that we do not consider the trivial equi-joins (tuple pairs with left_instance_id = right_instance_id) as true matches. For a pair id1 and id2 (assume id1 < id2), we only include (id1, id2) and don't include (id2, id1) in "output.csv".
Solutions are evaluated over the complete dataset Di. Note that the instances in Di (except the sample Xi) are not provided to the participants. More details are available in the Evaluation Process section.
Both Xi and Yi are in CSV format.
Example of dataset Xi
instance_id | attr_name_1 | attr_name_2 | ... | attr_name_k |
---|---|---|---|---|
00001 | value_1 | null | ... | value_k |
00002 | null | value_2 | ... | value_k |
... | ... | ... | ... | ... |
Example of dataset Yi
left_instance_id | right_instance_id |
---|---|
00001 | 00002 |
00001 | 00003 |
... | ... |
More details about the datasets can be found in the dedicated Datasets section.
Example of output.csv
left_instance_id | right_instance_id |
---|---|
00001 | 00002 |
00001 | 00004 |
... | ... |
Output.csv format: The evaluation process expects "output.csv" to have 3000000 tuple pairs. The first 1000000 tuple pairs are for dataset X1 and the remaining pairs are for datasets X2. As a result, "output.csv" is formatted accordingly. You can check out the provided baseline solution on how to produce a valid "ouput.csv".
- Python 3.8 or newer
pandas
frozendict
- ReproZip was used for packing the solution and executing the submitted solutions, but is not required.
- Python Versions:
- Python 3.8.10
- PyPy 7.3.9 (Python 3.9.2)
- OS:
- WSL Ubuntu 20.04
baseline
directory:blocking.py
: The provided baseline solution
datasets
directory:X1.csv
(X1 dataset) &Y1.csv
(matching pairs for X1)X2.csv
(X2 dataset) &Y2.csv
(matching pairs for X2)
output_misc
directory: To store secondary.csv
files, used for analyzing the mainoutput.csv
file (see below)src
directory:- Submitted files:
run.py
: Starting point of the solutionx1_blocking.py
: X1-specific solution logic, definitions & routinesx2_blocking.py
: X2-specific solution logic, definitions & routinesutils.py
: General definitions used by both solutions
output.csv
: Non-formatted output for the given X1 dataset- Scripts for quick usage of ReproZip:
traceAndPack.sh
: Runrun.py
and pack the execution insubmission.rpz
cleanReprozip.sh
: Clean all files and directories generated by ReproZip (includingsubmission.rpz
)
- Scripts for analyzing the solution performance &
output.csv
compare.py
: Find correct, missed & false positive pairs inoutput.csv
and store them (with titles) in corresponding.csv
files, in theoutput_misc
directory. Also display the number of pairs in each category, as well as the Recall score.- Bash scripts for separating the
.csv
files generated bycompare.py
by brand, and storing the brand-specific.csv
's inoutput_misc/false
,output_misc/missed
andoutput_misc/common
.
- Submitted files:
In src
directory:
-
Simple Execution:
-
Choose the desired Dataset to run the experiment on: In
utils.py
, setTARGET_DATASET
accordingly. -
If you wish to format
output.csv
to have precisely 3,000,000 rows, setSUBMISSION_MODE
toTrue
. To skip the solution for a dataset, setIGNORE_DATASET
to'1'
or'2'
(''
to not skip). -
Run the solution:
python3 run.py
-
To see the stats for the answer generated by the solution:
python3 compare.py
-
-
Execute & Pack with ReproZip
-
Select the desired dataset & parameters as above.
-
Run the solution & pack in
submission.rpz
:./traceAndPack
-
To clean-up the generated files, including
submission.rpz
:./cleanReprozip
-
...