Scrape, Clean, and Mine NIH Research Abstracts and Determine Research Similarities

by MAS 2018

Introduction

The wordfreq2.py script scrapes the NIH database of published research abstracts and identifies keywords that define a researcher's corpus of published data. Keywords are identified by comparing term frequency (TF) values.

Once TFs are known, you can run the cosine_calc.py script to determine the cosine similarity among all researchers in a network.

There are two scripts included in this repo:

wordfreq2.py : scrapes and mines the NIH NCBI PubMed database
cosine_calc.py : calculates cosine similarity for network analysis
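
To make the similarity step concrete, here is a minimal sketch of cosine similarity between two authors' TF values. It is not taken from cosine_calc.py: the function names and the dictionary layout (term mapped to TF) are assumptions made for illustration, and the script itself may organize its data differently.

```python
import math


def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two sparse TF dicts (term -> TF value)."""
    # Only terms present in both dicts contribute to the dot product.
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An author with no terms is treated as dissimilar to everyone.
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(tf_by_author):
    """Pairwise similarities for a whole network (hypothetical helper)."""
    authors = sorted(tf_by_author)
    matrix = [
        [cosine_similarity(tf_by_author[a], tf_by_author[b]) for b in authors]
        for a in authors
    ]
    return authors, matrix
```

A value near 1 means two authors weight the same keywords in similar proportions; a value of 0 means they share no keywords.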

Requirements

You need the following libraries:

biopython
elementtree
numpy
pylab/matplotlib
seaborn

Execution

The key input from the user is a .txt file that lists the names of the researchers you wish to analyze in NCBI format. This should look like:

Smith J

Nguyen C

Lee JY

Jones MM

...

Chen Z

You can also include additional NCBI search terms, for example:

Stetz MA AND University of Pennsylvania[ad]
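
The sketch below shows one way such an input file could be read and turned into PubMed searches with Biopython. It is an illustration, not the code in wordfreq2.py: `read_queries` and `search_pubmed` are hypothetical names, each non-blank line is assumed to be passed to PubMed unchanged, and whether the script adds a field tag or other options is not stated here.

```python
from pathlib import Path


def read_queries(path):
    """Return one query string per non-blank line of the input .txt file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def search_pubmed(query, email, retmax=100):
    """Return PubMed IDs matching a query (requires network access)."""
    # Imported here so read_queries works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.esearch(db="pubmed", term=query, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return record["IdList"]
```

For example, `read_queries("authors.txt")` on a file like the one above would return strings such as `"Smith J"` and `"Stetz MA AND University of Pennsylvania[ad]"`; the file name `authors.txt` is a placeholder.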

There is also a list of trivial words that must be included, trivial.txt. This file is similar to the English "stop words" used in scikit-learn. I wrote all of the cleaning/munging functions for determining TFs myself, so I included my own list of words to exclude.
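
As a rough picture of how a trivial-word list feeds into the TF calculation, the sketch below tokenizes text, drops trivial words, and returns the highest TFs. The tokenization rule, the definition of TF as count divided by the number of kept words, and the function name are all assumptions; the cleaning functions in wordfreq2.py may work differently.

```python
import re
from collections import Counter


def term_frequencies(text, trivial_words, top_n=10):
    """Return the top_n (term, TF) pairs after removing trivial words."""
    # Assumed tokenization: lowercase runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in trivial_words]
    counts = Counter(kept)
    total = sum(counts.values())
    if total == 0:
        return []
    # TF here is the share of kept words that a term accounts for.
    return [(term, n / total) for term, n in counts.most_common(top_n)]
```

The `trivial_words` argument would be a set built from trivial.txt, for instance one word per line if the file is laid out that way.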

First, fill in the user-specified data directories in wordfreq2.py, then run that script. After that, run cosine_calc.py (you do not need to specify anything).

Example Results

Two files are generated by wordfreq2.py for each author analyzed:

A bar plot showing the author's highest TFs (default is 10 highest TFs). An example is shown below:

An example correlation matrix for an entire network of scientists is shown below (network is the Biophysics Department at Johns Hopkins):

The entire output from the Johns Hopkins example is provided in the JHU_TEST directory.
