- Clone the repository.
- Because GitHub doesn't allow files larger than 100 MB, please download the model file here and place it in the "model" folder.
- Set up the environment and run `pip install -r requirements.txt` in the terminal.
- Run `main.py` for a demo. To change the query, edit the string on line 9.
See the project report below.
This project is an AI-powered search system designed to retrieve and synthesize relevant biomedical research papers from scientific sources. The system reranks and summarizes scientific literature to help researchers quickly find and understand key findings across multiple studies. By fine-tuning a BioBERT-based cross-encoder on biomedical question-answer pairs and integrating a BART summarization model, this project aims to deliver concise, relevant answers to complex research queries.
Biomedical researchers face an overwhelming volume of scientific literature, with millions of papers published annually on PubMed alone. Traditional keyword-based search often returns hundreds of results, requiring researchers to manually read through abstracts to identify relevant studies. This process is time-consuming and inefficient, especially when trying to synthesize findings across multiple papers or answer specific research questions.
Efficient literature search is critical for advancing scientific research and medicine. Researchers need to quickly identify relevant prior work to avoid duplicating studies, build on existing knowledge, and make informed decisions. In clinical settings, healthcare professionals require rapid access to the latest research to inform treatment decisions. A more intelligent search system can accelerate scientific discovery, improve research quality, and ultimately contribute to better health outcomes.
This project employs a three-stage pipeline:
- Initial retrieval using PubMed's search API.
- Reranking using a fine-tuned BioBERT cross-encoder to score relevance.
- Summarization of the top papers using DistilBART.
BioBERT's pretraining on medical literature makes it better suited to biomedical search tasks than general-purpose language models. Cross-encoders jointly process query-document pairs to produce relevance scores, offering stronger semantic matching than bi-encoders. I chose DistilBART for summarization because its balance of speed and quality matches the scope of this project.
The system consists of the following components:
- A PubMed knowledge base interface that retrieves initial candidates.
- A BioBERT-based reranker fine-tuned on BioASQ question-answer pairs.
- A DistilBART summarizer that presents brief summaries of the top 5 papers.
- An integration layer that connects the retrieval-rerank-summarize pipeline.
The output is a list of top papers, where each paper includes the following (a sketch of one record appears after the list):
- Title
- Relevance score
- Source ID
- Brief summary (for top 5)
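For illustration, here is a minimal sketch of what a single result record might look like as a Python dict. The field names and values are hypothetical and may differ from the actual output format:

```python
# Hypothetical shape of one result record (field names are illustrative).
record = {
    "title": "Sleep disturbances among young adult dual users of cigarettes and e-cigarettes",
    "relevance_score": 0.97,        # cross-encoder output; higher = more relevant
    "source_id": "PMID:00000000",   # placeholder PubMed identifier
    "summary": "A 2-3 sentence DistilBART summary (top 5 papers only).",
}
```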
The overall methodology follows a retrieve-rerank-summarize design. First, candidate papers are retrieved from PubMed. These candidates are then passed to the trained reranker, which computes semantic relevance scores between the query and each paper's title and abstract. Finally, the top 5 papers are summarized individually by the summarization model, and the results are presented to the user in ranked order.
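A minimal sketch of this flow is shown below. The PubMed call uses the public E-utilities `esearch` endpoint; `fetch_titles_and_abstracts`, `rerank`, and `summarize` are hypothetical helper names standing in for the components described in the following paragraphs:

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def retrieve_pmids(query: str, n: int = 50) -> list[str]:
    """Stage 1: retrieve candidate paper IDs from PubMed's esearch API."""
    resp = requests.get(
        f"{EUTILS}/esearch.fcgi",
        params={"db": "pubmed", "term": query, "retmax": n, "retmode": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

def search(query: str) -> list[dict]:
    pmids = retrieve_pmids(query)                 # Stage 1: initial retrieval
    papers = fetch_titles_and_abstracts(pmids)    # hypothetical efetch wrapper
    ranked = rerank(query, papers)                # Stage 2: cross-encoder scoring
    for paper in ranked[:5]:                      # Stage 3: summarize top 5
        paper["summary"] = summarize(paper["abstract"])
    return ranked
```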
Reranker: A BioBERT-based cross-encoder (dmis-lab/biobert-base-cased-v1.1) was fine-tuned on the BioASQ 12B dataset. The model takes [query, answer] pairs as input and outputs a relevance score. Training uses binary cross-entropy loss, with positive pairs drawn from question-answer matches and hard negatives mined automatically: TF-IDF similarity is computed between each query and the corpus, and high-similarity but non-relevant snippets are selected as negatives (a mining sketch follows). This forces the model to learn to distinguish relevant text from superficially similar text.
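As a rough sketch of the mining step described above, assuming scikit-learn is available (the report does not specify the implementation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mine_hard_negatives(query: str, corpus: list[str],
                        relevant: set[str], k: int = 10) -> list[str]:
    """Return snippets with high TF-IDF similarity to the query that are
    NOT among its gold-relevant passages, for use as hard negatives."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([query] + corpus)
    # Cosine similarity between the query row and every corpus row.
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    ranked = sorted(zip(corpus, sims), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked if text not in relevant][:k]
```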
Summarizer: DistilBART-CNN-12-6, a distilled version of BART that trades a small amount of quality for faster inference, was employed to produce concise 2-3 sentence summaries for each paper.
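A minimal usage sketch with the Hugging Face transformers library; the length limits below are illustrative choices, not values taken from the project:

```python
from transformers import pipeline

# Off-the-shelf checkpoint; no biomedical fine-tuning is applied.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

abstract = "Background: ... Methods: ... Results: ... Conclusions: ..."  # a paper's abstract
result = summarizer(abstract, max_length=80, min_length=30, do_sample=False)
print(result[0]["summary_text"])  # concise 2-3 sentence summary
```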
Assumptions:
- PubMed's initial retrieval returns relevant papers among its candidates
- Title and abstract contain sufficient information for relevance scoring without full text
- BioASQ training data generalizes well to diverse biomedical queries
- Summarizing the top 5 papers provides a sufficiently comprehensive overview
Choices:
- BioBERT over general BERT - domain-specific pretraining improves biomedical understanding
- Cross-encoder over bi-encoder - accuracy prioritized over inference speed
- Hard negative mining via TF-IDF - forces model to distinguish semantic relevance from keyword overlap
- Individual paper summaries over multi-document summarization - reduces hallucinations
- DistilBART over full BART - faster inference with minimal quality loss
- Due to time constraints, the system searches only PubMed and does not include papers from other databases, although extending it with that functionality would not be difficult.
- Result quality partly depends on how effectively the user phrases the query.
- The reranker does not consider the full text of articles, due to this project's computational time constraints.
- The summarizer was trained on large general-domain corpora, not scientific literature, so it may not capture technical nuances.
Training Data: BioASQ 12B question-answer-passages dataset, containing biomedical questions paired with relevant text snippets from PubMed abstracts.
Corpus Statistics:
- Training set: 5,049 questions from BioASQ 12B dev split
- Evaluation set: 340 questions from BioASQ 12B eval split
- Positive pairs: ~60,000
- Negative pairs: ~60,000
- Total training samples: ~120,000 pairs
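A sketch of how positive pairs might be extracted, assuming the standard BioASQ JSON layout ({"questions": [{"body": ..., "snippets": [{"text": ...}, ...]}]}); the filename is a placeholder:

```python
import json

with open("training12b.json") as f:  # placeholder path to the BioASQ 12B file
    data = json.load(f)

pairs = []  # (question, snippet, label) triples
for question in data["questions"]:
    for snippet in question.get("snippets", []):
        pairs.append((question["body"], snippet["text"], 1.0))  # positive pair
# Negative pairs (label 0.0) come from the TF-IDF hard-negative mining step above.
```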
Models:
- Reranker: BioBERT cross-encoder (dmis-lab/biobert-base-cased-v1.1)
- Summarizer: DistilBART-CNN-12-6 (sshleifer/distilbart-cnn-12-6), used off the shelf without fine-tuning
Parameters For Cross-Encoder:
- Batch size: 1024
- Epochs: 2
- Warmup steps: 100
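One way to wire these hyperparameters together is the sentence-transformers CrossEncoder API; the report does not name the training library, so treat this as a sketch. With num_labels=1 and float labels, CrossEncoder defaults to a binary cross-entropy loss, matching the training setup described above:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# Single-logit head on top of BioBERT; trained with BCE loss by default.
model = CrossEncoder("dmis-lab/biobert-base-cased-v1.1", num_labels=1)

train_samples = [
    InputExample(texts=["What causes X?", "Relevant snippet ..."], label=1.0),
    InputExample(texts=["What causes X?", "Hard-negative snippet ..."], label=0.0),
    # ... ~120k pairs in practice
]
loader = DataLoader(train_samples, shuffle=True, batch_size=1024)

model.fit(train_dataloader=loader, epochs=2, warmup_steps=100)
model.save("model/biobert-cross-encoder")  # placeholder output path
```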
Computing Environment:
- Platform: Google Colab Pro (Student License)
- Hardware: A100 GPU + High-RAM
Cross-Encoder Reranker:
- Base model: BioBERT (BERT-base architecture with 12 layers, 768 hidden dimensions, 12 attention heads)
- Pre-trained on: PubMed abstracts and PMC full-text articles
- Input: Query and document joined together with special separator tokens
- Output: Single relevance score via classification head
- Parameters: ~110M
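A minimal scoring sketch with transformers, assuming the fine-tuned checkpoint has been saved locally (the path is a placeholder):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("model/biobert-cross-encoder")
model = AutoModelForSequenceClassification.from_pretrained("model/biobert-cross-encoder")
model.eval()

query = "What are the effects of e-cigarettes on sleep quality?"
doc = "Title of a candidate paper. Its abstract text ..."

# The pair is packed into one sequence: [CLS] query [SEP] document [SEP]
inputs = tokenizer(query, doc, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # single relevance score
print(score)
```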
Summarizer:
- Base model: DistilBART-CNN-12-6 (distilled BART with 12 encoder layers, 6 decoder layers)
- Pre-trained on: Large general domain corpora
- Fine-tuned on: CNN/DailyMail dataset
- Input: Abstract text
- Output: Summary
- Parameters: ~306M (distilled from 406M)
The fine-tuned BioBERT cross-encoder demonstrates strong performance on the BioASQ 12B evaluation set:
- F1 score: 95%
- Accuracy: 92%
- Precision: 97%
- Recall: 93%
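For reference, these metrics can be computed from thresholded cross-encoder scores with scikit-learn; the labels, scores, and 0.5 threshold below are illustrative, not the project's actual evaluation data:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0]                  # illustrative gold relevance labels
scores = [0.93, 0.12, 0.88, 0.97, 0.55]   # illustrative cross-encoder outputs
y_pred = [int(s >= 0.5) for s in scores]  # threshold choice is an assumption

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"Acc={accuracy_score(y_true, y_pred):.2f}  P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")
```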
- Batch size: 1024
  - Chosen to maximize GPU utilization on the A100
  - Smaller batches increased training time without performance gains
- Epochs: 2
  - The model converged after 2 epochs based on training loss
  - Training time: ~30 minutes
- Warmup steps: 100
  - Prevents early training instability and improves convergence
  - Standard practice for transformer fine-tuning
The model captures the semantic meaning of a query better than PubMed's Advanced Search engine.
Example query: "What are the effects of e-cigarettes on sleep quality?"
PubMed's Advanced Search ranking:
1. "2019 ACC/AHA Guideline on the Primary Prevention of Cardiovascular Disease: Executive Summary: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines."
2. "Contemporary Concise Review 2024: Chronic Obstructive Pulmonary Disease."
3. "On the potential harmful effects of E-Cigarettes (EC) on the developing brain: The relationship between vaping-induced oxidative stress and adolescent/young adults social maladjustment."
4. "Deleterious Association of Inhalant Use on Sleep Quality during the COVID-19 Pandemic."
5. "Comparative effects of e-cigarette and conventional cigarette smoke on in vitro bronchial epithelial cell responses."
6. "The Effect of Cigarette Use and Dual-Use on Depression and Sleep Quality."
Model's ranking:
1. "The Lifestyle of Saudi Medical Students." (Does not seem related at first glance, but the abstract reveals that the paper reports e-cigarette usage and sleep data. PubMed ranks this paper as #9.)
2. "Bidirectional Relationships Between Sleep Quality and Nicotine Vaping: Studying Young Adult e-cigarette Users in Real Time and Real Life."
3. "Main and Interactive Effects of Nicotine Product Type on Sleep Health Among Dual Combustible and E-Cigarette Users."
4. "Sleep disturbances among young adult dual users of cigarettes and e-cigarettes: Analysis of the 2020 National Health Interview Survey."
5. "Deleterious Association of Inhalant Use on Sleep Quality during the COVID-19 Pandemic."
6. "Dual use of e-cigarettes with conventional tobacco is associated with increased sleep latency in cross-sectional Study."
PubMed's top 3 results do not appear relevant to the query, while all six of the model's top papers do.
- Find and include the full article text when reranking.
- Instead of calling PubMed's API, download its full dataset with a mechanism to update it daily.
- Incorporate citation relationships to boost papers that are highly cited by other relevant papers.
- Extend beyond PubMed to include other databases (arXiv, bioRxiv, medRxiv).
This project demonstrates that combining domain-specific language models with neural reranking and automatic summarization can improve biomedical literature search. The fine-tuned BioBERT cross-encoder can successfully capture semantic relationships that traditional keyword-based search systems miss. By understanding biomedical terminology and concepts beyond exact keyword matches, the system ranks papers by true relevance rather than superficial term overlap. As demonstrated, semantic understanding of biomedical concepts can lead to more relevant results, particularly for complex or specialized queries.
This work contributes to a broader objective of accelerating scientific discovery by making literature search more intelligent, efficient, and accessible. By reducing the time researchers spend searching and reading abstracts, systems like this can help accelerate the pace of biomedical research and ultimately improve health outcomes.
- Lee, J., et al. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234-1240. [Paper] [Code]
- Lewis, M., et al. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. [Paper]
- Krithara, A., et al. (2022). BioASQ-QA: A manually curated corpus for Biomedical Question Answering. bioRxiv. [Paper]
- Wolf, T., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. EMNLP 2020: System Demonstrations. [Paper] [Code]
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP-IJCNLP 2019. [Paper] [Code]

