
Conversation

@remusao commented Nov 12, 2013

Hi,

Since " and ' are considered punctuation in English, I thought it would be a good idea to add this characters in the function strip_punctuation! in the preprocessing module. I don't know if there is a reason for not including them in the regex, but I needed them in a project of mine, so here is a patch if you think it could be useful for others too.

Best,
Remusao

@johnmyleswhite (Collaborator)

This is tricky. Unlike other punctuation, single quote marks often occur within tokens, so stripping them causes a lot of problems. We should see what other systems do.
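For example (illustrative only, not library code), blanket removal of apostrophes collapses distinct tokens:

strip_all(s) = replace(s, r"['\"]" => "")

strip_all("it's")    # => "its"   -- now indistinguishable from the possessive "its"
strip_all("we'll")   # => "well"  -- collides with the adverb "well"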

@remusao (Author) commented Nov 12, 2013

I agree. Why not let the user choose? Or simply strip ' and " at the beginning and end of the string instead of everywhere? That would preserve tokens containing these symbols. In my case I mainly wanted to avoid tokens like "toto
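A rough sketch of that boundary-only option (the function name is made up), applied to a token like "toto where the quote is glued to the word:

strip_outer_quotes(tok) = replace(tok, r"^[\"']+|[\"']+$" => "")

strip_outer_quotes("\"toto")   # => "toto"  (leading quote removed)
strip_outer_quotes("isn't")    # => "isn't" (internal apostrophe kept)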

@johnmyleswhite (Collaborator)

Let's see what R's tm and Python's NLTK do, then make a decision.

@karl-kurzke

And is it possible to add "[" and "]" to exactly this regex?
I had some problems with the remove_words! function because there were such brackets in my corpus and the closing ] was missed.
But perhaps it would be cleaner to update the remove_words! function and strip the regex syntax out of the word.
Something like:

regexSigns = "[]{}*()"            # regex metacharacters that broke remove_words!
for sign in regexSigns            # iterate over the characters
    # prefix each one with a backslash so the word is matched literally
    word = replace(word, sign => string("\\", sign))
end
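For instance, a word like foo[1] (made-up example) would come out of that loop as foo\[1\], so the bracket is no longer read as the start of a character class when the word ends up inside the regular expression that remove_words! builds from it.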

