-
Notifications
You must be signed in to change notification settings - Fork 10
Notes on the recordlinkage Python library
Marco Fossati edited this page Dec 20, 2018
·
9 revisions
- uses
pandasfor data structures, typically theDataFrame,Series, andMultiIndexclasses; - https://pandas.pydata.org/pandas-docs/stable/dsintro.html;
- https://pandas.pydata.org/pandas-docs/stable/advanced.html;
- uses
jellyfishunder the hood for edit distances and phonetic algorithms.
- https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe;
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html;
- uses
pandas.DataFrameto represent datasets. It's basically a table with column headers; - conversion from a
dictis easy: key = column header, value = cell; - a value is a list, so
defaultdict(list)is helpful;
dataset = pandas.DataFrame(
{
'catalog_id': [666, 777, 888],
'name': ['huey', 'dewey', 'louie'],
...
}
)- remember the order of values, i.e.,
666->'huey';
- AKA pre-processing AKA normalization AKA standardization;
- https://recordlinkage.readthedocs.io/en/latest/ref-preprocessing.html;
- uses
pandas.Series, a list-like object; - the
cleanfunction seems interesting at a first glimpse; - by default, it removes text inside brackets. Might be useful, trivial to re-implement;
- terrible default regex, removes everything that is not an ASCII letter! Non-ASCII strings are just deleted! Use a custom regex or
Noneinreplace_by_none=kwarg to avoid this; - nice ASCII folding via
strip_accents='ascii', not done by default; -
strip_accents='unicode'keeps intact some Unicode chars, e.g.,œ; - non-latin scripts are just not handled;
- the
phoneticfunction has the same problems as injellyfish, see #79.
from recordlinkage.preprocessing import clean
names = pandas.Series(
[
'хартшорн, чарльз',
'charles hartshorne',
'チャールズ・ハートショーン',
'تشارلز هارتشورن',
'찰스 하츠혼',
àáâäæãåāèéêëēėęîïíīįìôöòóœøōõûüùúū'
]
)clean(names)Output:
0
1 charles hartshorne
2
3
4
5
dtype: object
clean(names, replace_by_none=None, strip_accents='ascii')Output:
0 ,
1 charles hartshorne
2
3
4
5 aaaaaaaeeeeeeeiiiiiioooooouuuuu
dtype: object
- AKA blocking AKA candidate acquisition;
- https://recordlinkage.readthedocs.io/en/latest/ref-index.html;
- make pairs of records to reduce the space complexity (quadratic);
- a simple call to the
Index.block(FIELD)function is not enough for names, as it makes pairs that exactly agree, i.e., like an exact match;
index = recordlinkage.Index()
index.block('name')
candidate_pairs = index.index(source_dataset, target_dataset)- we could inject the MariaDB full-text index #126 as a user-defined algorithm;
- https://recordlinkage.readthedocs.io/en/latest/ref-index.html#user-defined-algorithms;
- https://recordlinkage.readthedocs.io/en/latest/ref-index.html#examples.
- https://recordlinkage.readthedocs.io/en/latest/ref-compare.html;
- can be seen as feature extraction;
- probably useful for #143;
- the
Compare.datefunction can be useful for dates: https://recordlinkage.readthedocs.io/en/latest/ref-compare.html#recordlinkage.compare.Date; - the
Compare.stringfunction implementsjellyfishstring edit distances + others: https://recordlinkage.readthedocs.io/en/latest/ref-compare.html#recordlinkage.compare.String; - the string edit distance feature is binary, not scalar:
feature_vectors.sum(1).value_counts()below shows that; - the
thresholdkwarg gives a binary score for pairs above or below its value, i.e.,1or0. It's not really a threshold; - not clear how the feature is fired by default, i.e.,
threshold=None; - better always use the
thresholdkwarg then, typically3for Levenshtein and0.85for Jaro-Winkler;
comp = recordlinkage.Compare()
comp.string('name', 'label', threshold=3)
feature_vectors = comp.compute(candidate_pairs, source_dataset, target_dataset)
print(feature_vectors.sum(1).value_counts())- train with
fit(training_feature_vectors, match_index); - classify with
predict(classification_feature_vectors); - we could give SVM a try: https://recordlinkage.readthedocs.io/en/latest/notebooks/classifiers.html#Support-Vector-Machines;
- very recent docs update (Dec 17 2018!) on adapters: https://github.com/J535D165/recordlinkage/blob/master/docs/ref-classifiers.rst#adapters;
-
it seems possible to inject a neural network with
keras; - remember to set comparison of fields with missing values to
0, i.e., pair disagreement:- Most classifiers can not handle comparison vectors with missing values.;
- no worries,
compare.stringdoes that by default.
INPUT = training set = existing QIDs with target IDs = dict { QID: target_ID };
- get the QID statements from Wikidata;
- query MariaDB for target ID data;
- load both into 2
pandas.DataFrame; - pre-process;
- make the index with blocking ->
match_indexarg; - feature extraction with comparison ->
training_feature_vectorsarg.
- https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html#recordlinkage.NaiveBayesClassifier;
- https://recordlinkage.readthedocs.io/en/latest/notebooks/classifiers.html;
- Code example: https://github.com/J535D165/recordlinkage/blob/master/examples/supervised_learning_prob.py
-
recordlinkage.NaiveBayesClassifierclass; - works with binary features, also explains why the edit distance feature is binary;
-
not sure what the
binarizeargument means. The docs say: Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to consist of multilevel vectors; - the code example uses
binary_vectorsand sets toymanduprobabilities:- are comparison vectors (point 6 of the training workflow) the expected input?
- should we compute
manduon our own as well?