A Nextflow pipeline that retrieves and analyzes PubMed publication counts for human protein-coding genes, demonstrating parallelized processing across thousands of genes using Nextflow's workflow orchestration.

The pipeline scrapes PubMed to count the publications associated with each human protein-coding gene, providing insight into research activity across the genome.
The pipeline consists of three sequential processes:
The first process, `prepare`, downloads and processes the NCBI Homo sapiens gene information database:
- Fetches gene data from NCBI FTP server
- Filters for protein-coding genes only
- Removes duplicate gene symbols
- Outputs:
  - `genes.csv`: list of unique gene IDs for processing
  - `genes_hash_table.csv`: mapping of gene IDs to symbols
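The filtering and deduplication above can be sketched with pandas. This is a minimal sketch, not the actual `bin/prepare.py`; it assumes the NCBI `gene_info` column names `GeneID`, `Symbol`, and `type_of_gene`:

```python
import pandas as pd

def prepare_gene_tables(gene_info: pd.DataFrame):
    """Filter NCBI gene_info rows to unique protein-coding genes."""
    # Keep protein-coding genes only
    coding = gene_info[gene_info["type_of_gene"] == "protein-coding"]
    # Drop duplicate gene symbols
    coding = coding.drop_duplicates(subset="Symbol")
    genes = coding[["GeneID"]]                 # would become genes.csv
    hash_table = coding[["GeneID", "Symbol"]]  # would become genes_hash_table.csv
    return genes, hash_table

# Toy input frame standing in for the downloaded gene_info file
df = pd.DataFrame({
    "GeneID": [7157, 7157, 672, 100],
    "Symbol": ["TP53", "TP53", "BRCA1", "ADA-AS"],
    "type_of_gene": ["protein-coding", "protein-coding", "protein-coding", "ncRNA"],
})
genes, table = prepare_gene_tables(df)
# genes keeps 7157 and 672; the duplicate TP53 row and the ncRNA row are dropped
```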
The second process, `analyze`, web-scrapes PubMed for each gene ID in parallel:
- Queries PubMed database for human-specific publications (1900-2030)
- Extracts publication counts from search results
- Configured with `maxForks 5` to respect NCBI rate limits
- Outputs: individual CSV files per gene with publication counts
- Error handling: Creates error logs for failed requests
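A sketch of the per-gene step: build a search term and extract the hit count from an E-utilities `esearch` response. The query shape is hypothetical (the real one lives in `bin/analyze.py`), and the standard-library XML parser is used here for brevity even though the container also ships beautifulsoup4:

```python
import xml.etree.ElementTree as ET

def build_query(gene_id: int) -> str:
    # Hypothetical term: restrict to human and the 1900-2030 publication window
    return f"{gene_id} AND Homo sapiens[Organism] AND 1900:2030[PDAT]"

def parse_count(esearch_xml: str) -> int:
    """Extract the publication count from an esearch XML response."""
    root = ET.fromstring(esearch_xml)
    count = root.findtext("Count")
    if count is None:
        # In the pipeline, a failure like this would be written to an error log
        raise ValueError("no <Count> element in response")
    return int(count)

sample = "<eSearchResult><Count>42</Count></eSearchResult>"
parse_count(sample)  # -> 42
```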
The third process, `summarize`, aggregates results across all genes:
- Combines individual gene results
- Maps gene IDs to gene symbols
- Sorts by gene symbol
- Outputs: `summary.csv` with gene symbols and publication counts
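The aggregation step amounts to a join against the hash table followed by a sort. A minimal sketch, assuming `GeneID`, `Symbol`, and `count` column names (not necessarily those in `bin/summarize.py`):

```python
import pandas as pd

def summarize(results: pd.DataFrame, hash_table: pd.DataFrame) -> pd.DataFrame:
    """Map gene IDs to symbols and sort by symbol."""
    merged = results.merge(hash_table, on="GeneID", how="left")
    return merged.sort_values("Symbol").reset_index(drop=True)[["Symbol", "count"]]

# Toy stand-ins for the combined per-gene results and genes_hash_table.csv
results = pd.DataFrame({"GeneID": [672, 7157], "count": [9000, 12000]})
table = pd.DataFrame({"GeneID": [7157, 672], "Symbol": ["TP53", "BRCA1"]})
summary = summarize(results, table)  # BRCA1 row first, then TP53
```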
- Nextflow (≥21.04)
- Docker or container runtime
- AWS credentials (for S3 output, optional)
# Clone the repository
git clone https://github.com/alethiotx/pubmed.git
cd pubmed
# Run with Docker (local profile)
nextflow run main.nf -profile local
# Test mode (processes only 20 genes)
nextflow run main.nf -profile local --env test

# Run with S3 output and full gene set
nextflow run main.nf -profile seqera --outdir s3://your-bucket/path

Profiles:
- `local`: Docker-based execution with local output directory
- `seqera`: cloud execution with container orchestration
Parameters:
- `params.outdir`: output directory path (default: S3 bucket in `nextflow.config`)
- `params.env`: environment mode (`prod` or `test`). Test mode processes only 20 genes.
The NCBI E-utilities allow:
- 3 requests/second without API key
- 10 requests/second with API key
The pipeline uses `maxForks 5` in the `analyze` process to stay within rate limits and avoid blocking.
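`maxForks` caps how many analyze tasks run concurrently, but it does not pace requests within a task. A simple client-side limiter like the sketch below (not part of the pipeline) is one way to stay under the 3 requests/second keyless limit:

```python
import time

class Throttle:
    """Minimal limiter: allow at most `rate` requests per second."""
    def __init__(self, rate: float):
        self.min_interval = 1.0 / rate
        self.last = 0.0

    def wait(self):
        # Sleep just long enough to keep min_interval between requests
        now = time.monotonic()
        sleep_for = self.last + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last = time.monotonic()

throttle = Throttle(rate=3)  # 3 requests/second without an API key
# call throttle.wait() before each E-utilities request
```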
Public ECR image: public.ecr.aws/alethiotx/pubmed:latest
Built from Ubuntu 25.10 with:
- Python 3 virtual environment
- Required packages: biopython, pandas, beautifulsoup4, urllib3
GitHub Actions workflow automatically builds and pushes Docker images to Amazon ECR Public on every push. See .github/workflows/docker-deploy.yaml for details.
Terraform configuration in terraform/ manages:
- ECR Public repository
- Public image pull policy
- IAM permissions for GitHub Actions OIDC
output/
├── prepare/
│ ├── genes.csv
│ └── genes_hash_table.csv
└── summarize/
└── summary.csv
- `bin/prepare.py`: gene list preparation from NCBI data
- `bin/analyze.py`: PubMed scraping for individual genes
- `bin/summarize.py`: result aggregation and formatting
Run in test mode to validate changes:
nextflow run main.nf -profile local --env test

See LICENSE for details.
Data sources:
- NCBI Gene database
- PubMed/NCBI E-utilities
