These programs compute document similarity using word embeddings and cosine similarity. Word embeddings are obtained with TF-IDF, a co-occurrence matrix, Word2Vec, FastText, and GloVe. The repository also contains a program that checks a PDF file for plagiarism against a local corpus. Programs that use GloVe need the glove.6B.50d.txt file, which is not provided in the repository, downloaded into the working directory.

The programs are organized by level. The basic-level programs check the similarity of short sentences. The medium-level programs check the similarity of two PDF files. The advanced-level programs check the similarity of documents in the 20 Newsgroups dataset.
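To illustrate the basic-level idea of comparing short sentences, here is a minimal sketch of TF-IDF weighting followed by cosine similarity. It is written in plain Python with no third-party dependencies and is not taken from the repository: the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`), the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vector = {}
        for term, count in counts.items():
            # Smoothed IDF (an assumption of this sketch) keeps every weight positive,
            # even for terms that occur in all documents.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            vector[term] = (count / total) * idf
        vectors.append(vector)
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    sentences = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "stock prices fell sharply today",
    ]
    vecs = tfidf_vectors(sentences)
    # Compare the first sentence with each of the others.
    for other in (1, 2):
        score = cosine_similarity(vecs[0], vecs[other])
        print(f"similarity(0, {other}) = {score:.3f}")
```

Sentences that share no terms score exactly 0 under this scheme, which is one reason dense embeddings such as Word2Vec, FastText, or GloVe are useful: they can assign non-zero similarity to sentences that use different but related words.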
Reginasabs/NLP-DocSimilarity