LapDevelopment_ToE
Download the English ToE Graph in Turtle format (June 3, 2015), a file of 421 megabytes, which we version as english.ttl in the public section of the LAP SVN repository.
To simplify parallelization, we extract the transcribed text of each speech into a separate file, expanding escaped newlines (‘\n’), replacing non-breaking spaces with regular spaces, and stripping leading and trailing whitespace (‘split.py’):
  python split.py english.ttl /work/users/oe/toe/txt
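The text clean-up described above can be sketched as follows. This is a minimal illustration, not the actual split.py: the function name `normalize_speech_text` is hypothetical, and the sketch assumes the speech text has already been pulled out of its Turtle literal as a Python string in which newlines still appear as the two characters `\n`.

```python
def normalize_speech_text(raw: str) -> str:
    """Clean one speech literal (hypothetical helper, not the real split.py)."""
    # Expand escaped newlines: backslash followed by 'n' becomes a real newline.
    text = raw.replace("\\n", "\n")
    # Replace non-breaking spaces (U+00A0) with regular spaces.
    text = text.replace("\u00a0", " ")
    # Strip leading and trailing whitespace.
    return text.strip()


if __name__ == "__main__":
    sample = "  Thank\u00a0you.\\nHowever, ...  "
    print(repr(normalize_speech_text(sample)))
```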
The June 2015 version of the data comprises 292,379 speeches (for some 63 million whitespace-separated tokens). We create groups of 300 speeches (files) each:
  for i in txt/*.txt; do basename "${i%.txt}"; done \
  | split -l 300 -a 3 -d - batch.0
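As a sanity check on the batching, the sketch below computes how many batch files the `split` call should produce and what they are called. It assumes that every speech yields exactly one .txt file and that `split -d` numbers its suffixes from zero; the helper name `batch_names` is made up for this illustration.

```python
import math


def batch_names(n_items, size=300, prefix="batch.0", width=3):
    """List the file names expected from `split -l size -a width -d - prefix`."""
    n_batches = math.ceil(n_items / size)
    if n_batches > 10 ** width:
        raise ValueError("numeric suffix too short for this many batches")
    # Suffixes are zero-padded decimal counters starting at 0.
    return [f"{prefix}{i:0{width}d}" for i in range(n_batches)]


if __name__ == "__main__":
    names = batch_names(292_379)
    print(len(names), names[0], names[-1])
    # Number of speeches left for the final, partially filled batch.
    print(292_379 - 300 * (len(names) - 1))
```

Under these assumptions the 292,379 speeches give 975 batches, batch.0000 through batch.0974, with 179 speeches in the last one; a three-digit suffix is therefore sufficient.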
Each of the groups can be processed sequentially (one speech at a time) using the ToE-specific ‘parse’ workflow, which comprises the following analysis steps:
- CIS Tokenizer for sentence segmentation;
- REPP for PTB-style tokenization;
- the B&N parser for PoS tagging and dependency parsing;
- export to RDF, using the LAP ontology (see below).
Assuming the LOGON ‘trickle’ script, parsing jobs can be submitted for processing on ABEL as follows:
  for i in /work/users/oe/toe/batch.0???; do \
    echo sbatch ~/src/lap/public/toe/parse "${i}"; 
  done > ~/src/lap/public/toe/toe.0.job
  cd log;
  $LOGONROOT/uio/titan/trickle --start --limit 400 ../toe.0.job
  while true; do $LOGONROOT/uio/titan/trickle --limit 400 ../toe.0.job; sleep 60; done
After watching the first several hours of parsing, it appears that processing the complete corpus will take about 170,000 CPU hours and will give rise to approximately 2.7 billion triples. The CPU estimate seems unreasonably high, even for the relatively costly B&N parser and our wasteful strategy of re-initializing the parser for each speech.
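To put these estimates in perspective, the following back-of-the-envelope sketch breaks them down per speech and per token. It uses only the rounded figures quoted on this page, so the results are rough: on the order of 35 CPU minutes per speech, about 10 CPU seconds per token, and about 43 triples per token.

```python
CPU_HOURS = 170_000        # estimated total, from the text
SPEECHES = 292_379         # June 2015 version of the data
TOKENS = 63_000_000        # "some 63 million", approximate
TRIPLES = 2_700_000_000    # "approximately 2.7 billion"


def per_unit_costs():
    """Derive average per-speech and per-token figures from the totals."""
    cpu_seconds = CPU_HOURS * 3600
    return {
        "cpu_minutes_per_speech": cpu_seconds / SPEECHES / 60,
        "cpu_seconds_per_token": cpu_seconds / TOKENS,
        "tokens_per_speech": TOKENS / SPEECHES,
        "triples_per_token": TRIPLES / TOKENS,
    }


if __name__ == "__main__":
    for name, value in per_unit_costs().items():
        print(f"{name}: {value:.1f}")
```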
Linguistic annotations of the text data available at http://linkedpolitics.ops.few.vu.nl/ are offered in the form of RDF named graphs. Each graph is linked to the text content of one speech on a given day in the ToE data; an example file containing the full graph presented in this article is available at http://svn.emmtee.net/lap/snug/toe14/rdf/ns1:Speech_6.trig. Annotations from the tools in LAP are related to each other using the LAF data model. The information is encoded in RDF triples defined by a LAP ontology that can be viewed in full at http://svn.emmtee.net/lap/snug/toe14/rdf/lap.ttl.
The example contains sentence, token, part-of-speech and dependency annotations for ns1:Speech_6 (the 6th plenary speech held on 20-07-1999). The named graph ns1:Graph_Speech_6 is linked to ns1:Speech_6 via the predicate lap:hasLapAnnotations:
ns1:Speech_6 lap:hasLapAnnotations ns1:Graph_Speech_6.
The graph itself, with its annotations, is serialized in TriG syntax:
ns1:Graph_Speech_6 {
lap:bn_N29 a lap:Dependency;
  dependency:type "SBJ".
lap:bn_N30 a lap:Dependency;
  dependency:type "ROOT".
lap:bn_N31 a lap:Dependency;
  dependency:type "OBJ".
...
}
Given a SPARQL database loaded with the ToE data, the LAP ontology, and the named graph described above, all tokens for ns1:Speech_6 can be retrieved with the following query:
BASE <http://purl.org/linkedpolitics/English>
PREFIX lpv: <vocabulary/>
PREFIX ns1: <eu/plenary/1999-07-20/>
PREFIX lap: <http://lap.uio.no/rdf/lap/>
PREFIX token: <http://lap.uio.no/rdf/lap/token_>
SELECT ?lapid ?token
WHERE {
    ns1:Speech_6 lap:hasLapAnnotations ?b .
    GRAPH ?b {
        ?lapid a lap:Token .
        ?lapid token:surface ?token
    }
}
Output from the SPARQL engine:
-------------------------------
| lapid        | token        |
===============================
| lap:repp_N1  | "–"          |
| lap:repp_N2  | "Thank"      |
| lap:repp_N3  | "you"        |
| lap:repp_N4  | "for"        |
| lap:repp_N5  | "clarifying" |
| lap:repp_N6  | "that"       |
| lap:repp_N7  | "point"      |
| lap:repp_N8  | "."          |
| lap:repp_N9  | "However"    |
| lap:repp_N10 | ","          |
| lap:repp_N11 | "you"        |
| lap:repp_N12 | "don”t"      |
| lap:repp_N13 | "know"       |
| lap:repp_N14 | "how"        |
| lap:repp_N15 | "long"       |
| lap:repp_N16 | "this"       |
| lap:repp_N17 | "would"      |
| lap:repp_N18 | "have"       |
| lap:repp_N19 | "lasted"     |
| lap:repp_N20 | "had"        |
| lap:repp_N21 | "it"         |
| lap:repp_N22 | "been"       |
| lap:repp_N23 | "an"         |
| lap:repp_N24 | "address"    |
| lap:repp_N25 | "!"          |
| lap:repp_N26 | "("          |
| lap:repp_N27 | "Applause"   |
| lap:repp_N28 | ")"          |
-------------------------------
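To retrieve the tokens of other speeches, the query text can be generated from a date and a speech number. The sketch below rests on two assumptions that this page does not confirm: that speeches on other days follow the same IRI pattern as ns1:Speech_6 (eu/plenary/YYYY-MM-DD/Speech_N), and that the linking predicate is lap:hasLapAnnotations, as in the example triple above. The function name `token_query` is made up for this illustration. The query binds each token identifier to ?lapid and its surface form to ?token, matching the two output columns.

```python
import re
from string import Template

# Query text with placeholders for the plenary date and the speech number.
QUERY_TEMPLATE = Template("""\
BASE <http://purl.org/linkedpolitics/English>
PREFIX ns1: <eu/plenary/$date/>
PREFIX lap: <http://lap.uio.no/rdf/lap/>
PREFIX token: <http://lap.uio.no/rdf/lap/token_>
SELECT ?lapid ?token
WHERE {
    ns1:Speech_$number lap:hasLapAnnotations ?b .
    GRAPH ?b {
        ?lapid a lap:Token .
        ?lapid token:surface ?token
    }
}
""")


def token_query(date: str, number: int) -> str:
    """Build the token query for one speech (hypothetical helper)."""
    # Accept only ISO-style dates so nothing unexpected ends up in the IRI.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        raise ValueError(f"expected YYYY-MM-DD, got {date!r}")
    return QUERY_TEMPLATE.substitute(date=date, number=int(number))


if __name__ == "__main__":
    print(token_query("1999-07-20", 6))
```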