GitHub - deguodedongxi/data-augmentation-nlp: Generates paraphrases for given English sentences.

Data Augmentation for Text - Paraphrase Generation

This project attempts to generate sensible text data from a given seed set using natural language processing (NLP) techniques.

The overall workflow is as follows:

A set of seed sentences is fed to the module as a CSV file.
Each line contains a single sentence in English.
Each sentence is processed with the spaCy nlp pipeline to get the part-of-speech (POS) tag of each token.
For each token in one of the categories below, synonyms are generated using WordNet.
- Nouns that are not named entities
- Adjectives
- Verbs
Once the synonyms are created for each of the tokens, the list is filtered to retain the most sensible synonyms for the given context.
spaCy's token-to-token similarity score is used to weight each synonym. Any synonym whose similarity score with the original token falls below a preset threshold is removed from the synonym list.
After filtering, the remaining synonyms are used to generate new sentences.
Finally, the augmented data set is stored to disk as a CSV file.
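
The steps above can be sketched roughly as follows. This is an illustrative sketch, not the code in generate_paraphrases.py. The function names, the 0.6 threshold, and the choice to combine substitutions across all eligible tokens are assumptions made for the example; the README does not state the threshold value or how substitutions are combined. The filtering and generation helpers are plain Python, and the NLTK import is deferred so they can be tried without the language model installed.

```python
from itertools import product
from typing import Callable, Dict, List

# POS tags eligible for substitution, matching the three categories above.
REPLACEABLE_POS = {"NOUN", "ADJ", "VERB"}


def filter_synonyms(
    word: str,
    candidates: List[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.6,  # assumed value; the README only says "preset"
) -> List[str]:
    """Keep candidates whose similarity to `word` is at least `threshold`."""
    return [c for c in candidates if c != word and similarity(word, c) >= threshold]


def generate_sentences(tokens: List[str], synonyms: Dict[int, List[str]]) -> List[str]:
    """Build every sentence obtainable by swapping tokens for kept synonyms.

    `synonyms` maps a token index to its filtered synonym list. Tokens are
    joined with single spaces, so punctuation spacing is not restored.
    """
    options = [[tok] + synonyms.get(i, []) for i, tok in enumerate(tokens)]
    variants = [" ".join(choice) for choice in product(*options)]
    return variants[1:]  # the first combination is the unchanged sentence


def paraphrase(sentence: str, nlp, threshold: float = 0.6) -> List[str]:
    """Run the whole pipeline on one sentence with spaCy and NLTK WordNet."""
    from nltk.corpus import wordnet  # deferred: requires the WordNet corpus

    wn_pos = {"NOUN": wordnet.NOUN, "ADJ": wordnet.ADJ, "VERB": wordnet.VERB}

    def sim(a: str, b: str) -> float:
        # Vector similarity between two vocabulary entries.
        return nlp.vocab[a].similarity(nlp.vocab[b])

    doc = nlp(sentence)
    synonyms: Dict[int, List[str]] = {}
    for tok in doc:
        if tok.pos_ not in REPLACEABLE_POS:
            continue
        if tok.pos_ == "NOUN" and tok.ent_type_:
            continue  # skip nouns that belong to a named entity
        candidates = {
            lemma.name()
            for synset in wordnet.synsets(tok.text, pos=wn_pos[tok.pos_])
            for lemma in synset.lemmas()
            if "_" not in lemma.name()  # drop multi-word lemmas
        }
        synonyms[tok.i] = filter_synonyms(tok.text, sorted(candidates), sim, threshold)
    return generate_sentences([t.text for t in doc], synonyms)
```

With the large model installed, a call would look like paraphrase("The quick fox jumps", spacy.load("en_core_web_lg")). Because every combination is generated, the number of output sentences grows multiplicatively with the number of replaceable tokens, so a stricter threshold or a cap on synonyms per token may be needed for long sentences.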

Evaluation:

A sample data set is provided as a seed set to check and evaluate the efficiency and usability of the approach. On manual inspection, the generated sentences look natural and grammatically valid in almost all cases.

Sample Use Case:

To try the approach, run the script generate_paraphrases.py. It takes the supplied sample data (input_data.csv) as input and generates paraphrases, which are stored to disk as augmented_dataset.csv.
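
For readers who want to wire their own paraphrasing function into the same input and output files, the file handling might look like the sketch below. It is an assumption-laden illustration, not the script's actual code: it treats the input as plain text with one unquoted sentence per line and no header, writes one paraphrase per row, and uses the standard csv module rather than Pandas. The names read_seed_sentences and augment_file are hypothetical.

```python
import csv
from typing import Callable, List


def read_seed_sentences(path: str) -> List[str]:
    """Read one sentence per line, skipping blank lines (assumes no header)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def augment_file(
    in_path: str,
    out_path: str,
    paraphrase: Callable[[str], List[str]],
) -> int:
    """Write every paraphrase of every seed sentence; return the row count."""
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)  # quotes sentences that contain commas
        for seed in read_seed_sentences(in_path):
            for variant in paraphrase(seed):
                writer.writerow([variant])
                count += 1
    return count
```

A call such as augment_file("input_data.csv", "augmented_dataset.csv", my_paraphraser) would then produce the augmented file, where my_paraphraser is any function that maps a sentence to a list of paraphrases.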

Dependencies:

Python 3.6 or higher
spaCy
spaCy language model (large), e.g. en_core_web_lg-2.0.0
NLTK WordNet
Pandas (optional, for reading/writing files)

