
Conversation

@remusao commented Nov 12, 2013

Hi,

Since " and ' are considered punctuation in English, I thought it would be a good idea to add this characters in the function strip_punctuation! in the preprocessing module. I don't know if there is a reason for not including them in the regex, but I needed them in a project of mine, so here is a patch if you think it could be useful for others too.

Best,
Remusao

@johnmyleswhite (Collaborator)

This is tricky. Unlike other punctuation, single quote marks often occur within tokens, so stripping them causes a lot of problems. We should see what other systems do.
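For example (illustrative only, not library code), blanket removal of apostrophes collapses distinct tokens:

strip_all(s) = replace(s, r"['\"]" => "")

strip_all("it's")    # => "its"   -- now indistinguishable from the possessive "its"
strip_all("we'll")   # => "well"  -- collides with the adverb "well"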

@remusao (Author) commented Nov 12, 2013

I agree. Why not let the user choose? Or simply strip ' and " at the beginning and end of the string instead of everywhere? That would preserve tokens containing these symbols. In my case I mainly wanted to avoid tokens like "toto
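A rough sketch of that boundary-only option (the function name is made up), applied to a token like "toto where the quote is glued to the word:

strip_outer_quotes(tok) = replace(tok, r"^[\"']+|[\"']+$" => "")

strip_outer_quotes("\"toto")   # => "toto"  (leading quote removed)
strip_outer_quotes("isn't")    # => "isn't" (internal apostrophe kept)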

@johnmyleswhite (Collaborator)

Let's see what R's tm and Python's NLTK do, then make a decision.

@karl-kurzke

And is it possible to add "[" and "]" to exactly this regex?
I had some problems with the remove_words! function because there were such brackets in my corpus and the closing ] was missed.
But perhaps it would be cleaner to update the remove_words! function and strip the regex syntax out of the word.
Something like:

regexSigns = "[]{}*()"            # regex metacharacters that broke remove_words!
for sign in regexSigns            # iterate over the characters
    # prefix each one with a backslash so the word is matched literally
    word = replace(word, sign => string("\\", sign))
end
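For instance, a word like foo[1] (made-up example) would come out of that loop as foo\[1\], so the bracket is no longer read as the start of a character class when the word ends up inside the regular expression that remove_words! builds from it.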

