Regular expression NER #118
Franco Luque (francolq) wants to merge 6 commits into machinalis:develop
Conversation
iepy/preprocess/ner/regexp.py
Outdated
| try:
|     m = next(i)
|     start, end = m.span()
|     # FIXME: do not count from the beginning
What should we do with this FIXME note? Is it already fixed and you just forgot to remove the comment?
Sorry, my bad. This is not a FIXME, it is at most a TODO, because a small optimization can be done here. I think the comment can be safely removed.
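For the record, the optimization hinted at (not rescanning self._raw[:start] on every match) could be sketched like this. This is a simplified stand-in class (hypothetical name, and it skips the '<'/'>' escaping of the real implementation): separator positions are precomputed once, and each match maps back to token indices with a binary search instead of a full count.

```python
import re
from bisect import bisect_left


class TokenSearcherSketch:
    """Sketch: search a regexp over '><'-joined tokens, mapping string
    offsets back to token indices without rescanning the raw string."""

    def __init__(self, tokens):
        self._raw = '><'.join(tokens)
        # precompute the position of every '><' separator once
        self._seps = [i for i in range(len(self._raw))
                      if self._raw.startswith('><', i)]

    def finditer(self, pattern):
        for m in re.finditer(pattern, self._raw):
            start, end = m.span()
            # separators fully contained in raw[:pos] are those starting
            # at p <= pos - 2, i.e. p < pos - 1
            token_start = bisect_left(self._seps, start - 1)
            token_end = bisect_left(self._seps, end - 1)
            yield m, token_start, token_end
```

This turns the per-match cost from O(len(raw)) into O(log n) over the number of separators.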
iepy/preprocess/ner/regexp.py
Outdated
| # preprocess the regular expression
| regexp = re.sub(r'\s', '', regexp)
| # replace < and > only if not double (<< or >>):
| # FIXME: avoid matching \< and \>.
Same question about the FIXME here.
Can't it be solved before merging into develop?
Almost the same answer here. It was my mistake to call this a FIXME: it would be an enhancement to allow escaping '<' and '>'. The comment can be removed.
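If that enhancement were implemented, one way to skip escaped brackets is a negative lookbehind/lookahead: replace only a single '<' or '>' that is neither preceded by a backslash nor doubled. A hedged sketch (function and replacement strings are hypothetical; the real substitutions depend on the searcher's internals):

```python
import re

# a single '<' that is neither escaped with '\' nor part of '<<'
SINGLE_LT = re.compile(r'(?<![\\<])<(?!<)')
# a single '>' that is neither escaped with '\' nor part of '>>'
SINGLE_GT = re.compile(r'(?<![\\>])>(?!>)')


def mark_token_boundaries(regexp, lt_repl, gt_repl):
    """Replace only single, unescaped angle brackets; '\<', '\>',
    '<<' and '>>' are left untouched."""
    return SINGLE_GT.sub(gt_repl, SINGLE_LT.sub(lt_repl, regexp))
```

Note this sketch does not handle a doubly-escaped backslash before a bracket ('\\<'), which would need a smarter tokenizer.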
…nts (as discussed in the pull request).
| def __init__(self, tokens):
|     # replace < and > inside tokens with \< and \>
|     _raw = '><'.join(w.replace('<', '\<').replace('>', '\>') for w in tokens)
I'm not completely sure, but would it be a problem if there is a token with \< (or \>) inside it?
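To make the concern concrete, a minimal illustration (not from the PR):

```python
token = 'a\\<b'                      # a token that already contains '\<'
escaped = token.replace('<', '\\<')  # the escaping done in __init__
# 'escaped' is now 'a\\\\<b' (two backslashes, then '<'): a regexp engine
# reading the joined string may interpret that as an escaped backslash
# followed by a *bare* '<', which is then ambiguous with a boundary marker
```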
| def run_ner(self, doc):
|     entities = []
|     tokens = doc.tokens
|     searcher = TokenSearcher(tokens)
TokenSearcher is implemented as a class, but it is stateless in practice and is used more like a function than a class.
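A hypothetical refactor along these lines would expose a module-level function instead of instantiating a throwaway object per document (sketch only; it ignores the '<'/'>' escaping of the real implementation):

```python
import re


def search_tokens(tokens, pattern):
    """Functional alternative to a stateless searcher class: join the
    tokens with '><', run the regexp, and map each match span back to
    (token_start, token_end) indices in one call."""
    raw = '><'.join(tokens)
    for m in re.finditer(pattern, raw):
        start, end = m.span()
        yield raw[:start].count('><'), raw[:end].count('><')
```

run_ner would then call search_tokens(doc.tokens, pattern) directly, with no intermediate object to construct.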
| import re
| import codecs
|
| from nltk.text import TokenSearcher as NLTKTokenSearcher
This import looks unused since (apparently) no method from NLTKTokenSearcher is used.
| token_start = self._raw[:start].count('><')
| token_end = self._raw[:end].count('><')
| yield MatchObject(m, token_start, token_end)
| except:
This try...except is dangerous because it silently hides any error that happens inside the loop.
Why not use a plain for m in i: loop instead?
Described here:
https://groups.google.com/forum/?hl=es-419#!topic/iepy/NqIP0nb0-ic
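The suggested rewrite would look roughly like this (a sketch using the names from the diff; a for loop stops cleanly on StopIteration without swallowing unrelated exceptions the way a bare except: does):

```python
import re


def iter_matches(raw, pattern):
    """Iterate regexp matches over a '><'-joined token string with a
    plain for loop instead of try/except around next(i)."""
    i = re.finditer(pattern, raw)
    for m in i:
        start, end = m.span()
        token_start = raw[:start].count('><')
        token_end = raw[:end].count('><')
        yield m, token_start, token_end
```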