This repository provides a tool to turn DWUG EN: Diachronic Word Usage Graphs for English (Schlechtweg et al. 2024) into a knowledge graph to visualize and query the variation of the annotations in the dataset.
The required packages are listed in environment.yml. Create a conda environment with the following command:
conda env create -f environment.yml
This folder contains the dataset DWUG EN: Diachronic Word Usage Graphs for English (Schlechtweg et al. 2024). Please click on the link to see the documentation of the dataset.
This folder contains the Turtle files of the created knowledge graphs.
The full knowledge graph of the dataset DWUG EN: Diachronic Word Usage Graphs for English (Schlechtweg et al. 2024).
A small sample of three words from the dataset DWUG EN: Diachronic Word Usage Graphs for English (Schlechtweg et al. 2024) in Turtle format for testing purposes.
This folder contains the results for the executed SPARQL-SELECT queries.
This folder contains the annotator queries.
This folder contains the variation queries.
This folder contains the assigned positions and colors of the nodes as JSON files. Please note that this folder is not pushed to GitHub due to file size limitations.
This file contains the positions of the nodes for the full knowledge graph.
This file contains the colors of the nodes for the full knowledge graph. The colors depend on the color_mode. If the mode is distinct, a color encodes the number of distinct categories. If the mode is range, a color encodes the range spanned by the distinct categories.
This folder contains the created visualizations.
This folder contains the created visualizations per annotator.
This folder contains the created visualizations for the full graph. The positions of the nodes are re-scaled. Each ring of nodes represents one category.
first ring (= the closest to the center): category 0 = Undecidable = No annotation possible
second ring: category 1 = Unrelated = Homonymy
third ring: category 2 = Distantly Related = Polysemy
fourth ring: category 3 = Closely Related = Context Variance
fifth ring: category 4 = Identical = Identity
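The ring layout described above can be sketched as follows. This is only an illustration, not the repository's code: the function name ring_positions, the parameter ring_spacing, and the example node ids are hypothetical, and the actual scripts may compute and re-scale positions differently.

```python
import math

# Category labels as listed above; the ring order follows the category number.
CATEGORY_LABELS = {
    0: "Undecidable",
    1: "Unrelated",
    2: "Distantly Related",
    3: "Closely Related",
    4: "Identical",
}


def ring_positions(nodes_by_category, ring_spacing=1.0):
    """Place each node on the ring of its category, evenly spaced by angle.

    nodes_by_category maps a category (0-4) to a list of node ids.
    ring_spacing (hypothetical) is the distance between neighboring rings.
    """
    positions = {}
    for category, nodes in nodes_by_category.items():
        # Category 0 is the innermost ring, category 4 the outermost.
        radius = (category + 1) * ring_spacing
        for i, node in enumerate(nodes):
            angle = 2 * math.pi * i / len(nodes)
            positions[node] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions


# Hypothetical annotation nodes grouped by their category.
example = {0: ["a1"], 1: ["a2", "a3"], 4: ["a4", "a5", "a6", "a7"]}
positions = ring_positions(example)
```

A dictionary of this shape (node id mapped to an x/y pair) is the form NetworkX drawing functions accept as the pos argument.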
This folder contains the created visualizations for one word pair. It resembles the structure of the knowledge graph.
This script creates the RDF graph from the CSV files in the dataset DWUG EN: Diachronic Word Usage Graphs for English (Schlechtweg et al. 2024).
The knowledge graph has the following nodes and relations:
dataset: The dataset node which collects meta information about the dataset and connects all words to each other.
word: The word nodes which collect information about the token occurrence. Each word is connected to its reference sentence and the annotation it occurs in.
sentence: The sentence nodes which collect information about the occurrence. Each sentence is connected to a word.
annotation: The annotation nodes which collect information about the annotation. Each annotation is connected to two annotated words and its annotator.
annotator: The annotator nodes which collect information about the annotators. Each annotator is connected to the annotations they have annotated.
The knowledge graph relies mainly on the classes and properties of the NIF 2.0 Core Ontology, which was built for NLP tools, resources and annotations. The RDA namespace is used for properties and classes that are missing from NIF (e.g. annotators). The dataset node is defined as a Dataset object of schema.org.
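To make the node and relation structure above concrete, here is a minimal sketch that writes a handful of triples as Turtle using only the Python standard library. It is not the repository's implementation. The terms nif:Word, nif:Sentence, nif:anchorOf, nif:sentence and schema:Dataset come from NIF 2.0 Core and schema.org, but whether the scripts use exactly these terms is an assumption. The ex: namespace, every ex: term, the use of schema:hasPart, and the function to_turtle are placeholders invented for this example; the repository uses the RDA namespace where NIF has no suitable term.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "nif": "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#",
    "schema": "https://schema.org/",
    "ex": "http://example.org/dwug/",  # hypothetical placeholder namespace
}

# One annotation that links two word occurrences and one annotator.
TRIPLES = [
    ("ex:dataset", "rdf:type", "schema:Dataset"),
    ("ex:dataset", "schema:hasPart", "ex:word_1"),       # assumed linking property
    ("ex:dataset", "schema:hasPart", "ex:word_2"),
    ("ex:word_1", "rdf:type", "nif:Word"),
    ("ex:word_1", "nif:anchorOf", '"plane"'),
    ("ex:word_1", "nif:sentence", "ex:sentence_1"),
    ("ex:sentence_1", "rdf:type", "nif:Sentence"),
    ("ex:word_2", "rdf:type", "nif:Word"),
    ("ex:word_2", "nif:anchorOf", '"plane"'),
    ("ex:word_2", "nif:sentence", "ex:sentence_2"),
    ("ex:sentence_2", "rdf:type", "nif:Sentence"),
    ("ex:annotation_1", "rdf:type", "ex:Annotation"),    # hypothetical class
    ("ex:annotation_1", "ex:annotates", "ex:word_1"),    # hypothetical property
    ("ex:annotation_1", "ex:annotates", "ex:word_2"),
    ("ex:annotation_1", "ex:annotatedBy", "ex:annotator_1"),
    ("ex:annotation_1", "ex:judgment", '"4"'),
]


def to_turtle(triples, prefixes):
    """Serialize prefixed-name triples as a simple Turtle document."""
    lines = [f"@prefix {name}: <{uri}> ." for name, uri in prefixes.items()]
    lines.append("")
    lines.extend(f"{s} {p} {o} ." for s, p, o in triples)
    return "\n".join(lines)


turtle_text = to_turtle(TRIPLES, PREFIXES)
print(turtle_text)
```

A library such as rdflib would normally handle serialization; plain strings are used here only to keep the sketch free of dependencies.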
This script contains functions that query the original dataset, e.g. extracting the words from the dataset that are annotated by all annotators.
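As an illustration of the example query mentioned above, the following sketch finds the words that every annotator has judged. It works on in-memory rows instead of the dataset files. The field names lemma and annotator, the sample rows, and the function name words_annotated_by_all are assumptions made for this example and may not match the repository's functions.

```python
from collections import defaultdict

# Hypothetical judgment rows; each row stands for one annotated word pair.
rows = [
    {"lemma": "plane", "annotator": "annotator1"},
    {"lemma": "plane", "annotator": "annotator2"},
    {"lemma": "plane", "annotator": "annotator3"},
    {"lemma": "graft", "annotator": "annotator1"},
    {"lemma": "graft", "annotator": "annotator3"},
]


def words_annotated_by_all(rows):
    """Return the words for which every annotator in rows gave a judgment."""
    annotators_per_word = defaultdict(set)
    for row in rows:
        annotators_per_word[row["lemma"]].add(row["annotator"])
    # The full annotator set is the union over all words.
    all_annotators = set().union(*annotators_per_word.values())
    return sorted(
        word
        for word, annotators in annotators_per_word.items()
        if annotators == all_annotators
    )


shared_words = words_annotated_by_all(rows)
```

With the sample rows, only "plane" is kept, because annotator2 never judged "graft".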
This script creates the data stored in the folders graphs, query_results, and visualizations. It can be seen as an example pipeline for the provided scripts.
This script contains the SPARQL queries used to query the RDF graph. Available queries are:
category_stats: How often has a label been annotated?
annotations_per_annotator: Which annotations has an annotator made?
num_labels: How many distinct labels does an annotation have, and how much do they differ from each other? This query is used to create the annotator and full graph visualizations.
filter_variation: A more refined version of num_labels that lets you choose how large the range of the distinct labels should be.
get_pos_tags: Which POS-tags are used in the dataset?
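The logic behind category_stats, num_labels and filter_variation can be illustrated in plain Python. This is not the SPARQL used in the script; it is a sketch on hypothetical in-memory judgments. Treating "how much do they differ" as the difference between the highest and lowest label, and the parameter name min_range, are assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical judgments: (word pair id, annotator, label 0-4).
judgments = [
    ("pair1", "annotator1", 4),
    ("pair1", "annotator2", 4),
    ("pair2", "annotator1", 1),
    ("pair2", "annotator2", 4),
    ("pair2", "annotator3", 3),
    ("pair3", "annotator1", 2),
    ("pair3", "annotator2", 3),
]


def category_stats(judgments):
    """Count how often each label was assigned."""
    return Counter(label for _, _, label in judgments)


def num_labels(judgments):
    """Map each pair to (number of distinct labels, max label - min label)."""
    labels_per_pair = defaultdict(set)
    for pair, _, label in judgments:
        labels_per_pair[pair].add(label)
    return {
        pair: (len(labels), max(labels) - min(labels))
        for pair, labels in labels_per_pair.items()
    }


def filter_variation(judgments, min_range):
    """Keep the pairs whose label range is at least min_range."""
    return sorted(
        pair
        for pair, (_, label_range) in num_labels(judgments).items()
        if label_range >= min_range
    )


stats = category_stats(judgments)
variation = num_labels(judgments)
disputed = filter_variation(judgments, min_range=2)
```

In the sample data, pair2 is the only pair whose labels span a range of at least 2, so it is the only one that filter_variation keeps.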
This script creates the visualizations of the RDF-graph on three different levels. Available visualizations are:
instance: A visualization of the annotations of a word pair via RDF Grapher
annotator: A visualization of the annotations of one annotator via NetworkX
full: A visualization of all annotations in the graph via NetworkX
Schlechtweg, D., Dubossarsky, H., Hengchen, S., McGillivray, B., and Tahmasebi, N. 2024. DWUG EN: Diachronic Word Usage Graphs for English (3.0.0).