jen-cui/scholarsearch

Instructions

  1. Clone the repository.
  2. Because GitHub doesn't allow files larger than 100 MB, download the model file here and place it in the "model" folder.
  3. Set up the environment and run pip install -r requirements.txt in the terminal.
  4. Run main.py for a demo. To change the query, edit the string on line 9 (see the sketch below).
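As a purely hypothetical illustration of step 4 (the actual contents of main.py may differ), the query is a plain string assignment that you can replace with your own question:

    # Hypothetical shape of the query assignment in main.py; the actual
    # variable name and surrounding code may differ.
    query = "What are the effects of e-cigarettes on sleep quality?"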

See the project report below.

Abstract

This project is an AI-powered search system designed to retrieve and synthesize relevant biomedical research papers from scientific sources. The system reranks and summarizes scientific literature to help researchers quickly find and understand key findings across multiple studies. By fine-tuning a BioBERT-based cross-encoder on biomedical question-answer pairs and integrating a BART summarization model, this project aims to deliver concise, relevant answers to complex research queries.

Overview

The Problem

Biomedical researchers face an overwhelming volume of scientific literature, with millions of papers published annually on PubMed alone. Traditional keyword-based search often returns hundreds of results, requiring researchers to manually read through abstracts to identify relevant studies. This process is time-consuming and inefficient, especially when trying to synthesize findings across multiple papers or answer specific research questions.

Why This Problem Matters

Efficient literature search is critical for advancing scientific research and medicine. Researchers need to quickly identify relevant prior work to avoid duplicating studies, build on existing knowledge, and make informed decisions. In clinical settings, healthcare professionals require rapid access to the latest research to inform treatment decisions. A more intelligent search system can accelerate scientific discovery, improve research quality, and ultimately contribute to better health outcomes.

Proposed Approach

This project employs a three-stage pipeline:

  1. Initial retrieval using PubMed's search API (see the sketch after this list).

  2. Reranking using a fine-tuned BioBERT cross-encoder to score relevance.

  3. Summarization of top papers using DistilBART.
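For concreteness, stage 1 can be implemented against NCBI's public E-utilities endpoints. The sketch below is an illustration of that API, not necessarily the repository's actual retrieval code:

    import requests

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

    def search_pubmed(query, retmax=50):
        # esearch returns the PubMed IDs (PMIDs) matching the query
        r = requests.get(f"{EUTILS}/esearch.fcgi",
                         params={"db": "pubmed", "term": query,
                                 "retmax": retmax, "retmode": "json"})
        return r.json()["esearchresult"]["idlist"]

    def fetch_abstracts(pmids):
        # efetch returns the article records; XML parsing omitted for brevity
        r = requests.get(f"{EUTILS}/efetch.fcgi",
                         params={"db": "pubmed", "id": ",".join(pmids),
                                 "rettype": "abstract", "retmode": "xml"})
        return r.text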

Rationale

BioBERT's pretraining on medical literature makes it well-suited for biomedical search tasks compared to general-purpose language models. Cross-encoders jointly process query-document pairs to produce relevance scores, offering stronger semantic matching than bi-encoders. I chose DistilBART for summarization for its balance of speed and quality, which suits the scope of this project.

Key Components

The system comprises the following components:

  1. A PubMed knowledge base interface that retrieves initial candidates.

  2. A BioBERT-based reranker fine-tuned on BioASQ question-answer pairs.

  3. A DistilBART summarizer that presents brief summaries of the top 5 papers.

  4. An integration layer that connects the retrieval-rerank-summarize pipeline.

Result Components

The output is a ranked list of top papers, where each paper has:

  • Title
  • Relevance score
  • Source ID
  • Brief summary (for top 5)

Approach

Methodology

The overall methodology follows a retrieve, rerank, and summarize pipeline. Candidates from the initial retrieval are passed to the trained reranker, which computes semantic similarity scores between the query and each paper's title and abstract. Finally, the top 5 papers are summarized individually by a summarization model, and results are presented to the user in ranked order.
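The glue between the stages is thin. Here is a minimal sketch, assuming hypothetical helper functions and dictionary keys (the repository's actual module boundaries may differ):

    # Hypothetical glue for the retrieve -> rerank -> summarize pipeline.
    # `retrieve`, `score`, and `summarize` are placeholders for the
    # project's actual modules; the dict keys are illustrative.
    def answer_query(query, retrieve, score, summarize, k=5):
        papers = retrieve(query)                      # stage 1: PubMed candidates
        for p in papers:                              # stage 2: cross-encoder scores
            p["score"] = score(query, p["title"] + " " + p["abstract"])
        papers.sort(key=lambda p: p["score"], reverse=True)
        for p in papers[:k]:                          # stage 3: summarize the top k
            p["summary"] = summarize(p["abstract"])
        return papers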

Models

Reranker: A BioBERT-based cross-encoder (dmis-lab/biobert-base-cased-v1.1) was fine-tuned on the BioASQ 12B dataset. The model takes [query, answer] pairs as input and outputs a relevance score. Training uses binary cross-entropy loss, with positive pairs drawn from question-answer matches and hard negatives mined from the corpus. Hard negatives are generated by computing TF-IDF similarity between each query and the corpus and selecting high-similarity but non-relevant snippets (sketched below). This forces the model to distinguish genuine relevance from superficial textual similarity.
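A minimal sketch of that mining step, assuming questions and snippets are lists of strings and relevant[i] holds the indices of the snippets relevant to question i (the names and cut-off are illustrative):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def mine_hard_negatives(questions, snippets, relevant, per_question=10):
        # TF-IDF vectors for the snippet corpus and the questions
        vec = TfidfVectorizer()
        snippet_tfidf = vec.fit_transform(snippets)
        question_tfidf = vec.transform(questions)
        sims = cosine_similarity(question_tfidf, snippet_tfidf)
        negatives = []
        for i, row in enumerate(sims):
            # most similar snippets that are NOT relevant:
            # lexically close to the query, but the wrong answer
            order = np.argsort(row)[::-1]
            hard = [j for j in order if j not in relevant[i]][:per_question]
            negatives.append(hard)
        return negatives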

Summarizer: DistilBART-CNN-12-6, a distilled version of BART that trades a small amount of quality for faster inference, was employed to produce concise 2-3 sentence summaries for each paper.
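Since the checkpoint is public, the summarization step reduces to a few lines with the Hugging Face pipeline API; the generation parameters below are illustrative, not the project's exact settings:

    from transformers import pipeline

    # checkpoint named in this report; length limits are illustrative
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

    def summarize(abstract):
        out = summarizer(abstract, max_length=80, min_length=25, truncation=True)
        return out[0]["summary_text"]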

Assumptions and Design Choices

Assumptions:

  • PubMed's initial retrieval surfaces the relevant papers
  • Title and abstract contain sufficient information for relevance scoring without full text
  • BioASQ training data generalizes well to diverse biomedical queries
  • Summaries of the top 5 papers together provide a sufficiently comprehensive overview

Choices:

  • BioBERT over general BERT - domain-specific pretraining improves biomedical understanding
  • Cross-encoder over bi-encoder - better accuracy prioritized
  • Hard negative mining via TF-IDF - forces model to distinguish semantic relevance from keyword overlap
  • Individual paper summaries over multi-document summarization - reduces hallucinations
  • DistilBART over full BART - faster inference with minimal quality loss

Limitations

  • Due to time constraints, the system searches only PubMed; extending it to other databases would be straightforward.
  • Result quality partly depends on how well the user phrases the query.
  • The reranker scores titles and abstracts only, not full text, to keep computation tractable within this project's constraints.
  • The summarizer was trained on large general-domain corpora rather than scientific literature, so it may miss technical nuances.

Experiments

Dataset

Training Data: BioASQ 12B question-answer-passages dataset, containing biomedical questions paired with relevant text snippets from PubMed abstracts.

Corpus Statistics:

  • Training set: 5,049 questions from BioASQ 12B dev split
  • Evaluation set: 340 questions from BioASQ 12B eval split
  • Positive pairs: ~60,000
  • Negative pairs: ~60,000
  • Total training samples: ~120,000 pairs

Implementation

Models:

  • Reranker: BioBERT cross-encoder (dmis-lab/biobert-base-cased-v1.1)
  • Summarizer: DistilBART-CNN-12-6 (sshleifer/distilbart-cnn-12-6), used off the shelf without further fine-tuning

Parameters For Cross-Encoder:

  • Batch size: 1024
  • Epochs: 2
  • Warmup steps: 100
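Plugged into the classic sentence-transformers CrossEncoder API, these settings might look like the following; pairs is an assumed list of (question, snippet, label) tuples, and the save path is illustrative:

    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample
    from sentence_transformers.cross_encoder import CrossEncoder

    # pairs: assumed list of (question, snippet, label) with label 1.0 / 0.0
    examples = [InputExample(texts=[q, s], label=y) for q, s, y in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=1024)

    model = CrossEncoder("dmis-lab/biobert-base-cased-v1.1", num_labels=1)
    # with num_labels=1, CrossEncoder defaults to a BCE-with-logits loss
    model.fit(train_dataloader=loader, epochs=2, warmup_steps=100)
    model.save("model/biobert-reranker")  # illustrative path

At inference time, model.predict over (query, document) pairs returns the relevance scores used for ranking.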

Computing Environment:

  • Platform: Google Colab Pro (Student License)
  • Hardware: A100 GPU + High-RAM

Model Architecture

Cross-Encoder Reranker:

  • Base model: BioBERT (BERT-base architecture with 12 layers, 768 hidden dimensions, 12 attention heads)
  • Pre-trained on: PubMed abstracts and PMC full-text articles
  • Input: Query and document joined into one sequence with special separator tokens (illustrated after this list)
  • Output: Single relevance score via classification head
  • Parameters: ~110M
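A quick way to see that input format, using the Hugging Face tokenizer for the same checkpoint (the example strings are arbitrary):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
    # query and candidate document packed into one sequence:
    # [CLS] query [SEP] title + abstract [SEP]
    enc = tok("What are the effects of e-cigarettes on sleep quality?",
              "Example title. Example abstract text of a candidate paper.",
              truncation=True, max_length=512)
    print(tok.decode(enc["input_ids"]))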

Summarizer:

  • Base model: DistilBART-CNN-12-6 (distilled BART with 12 encoder layers, 6 decoder layers)
  • Pre-trained on: Large general domain corpora
  • Fine-tuned on: CNN/DailyMail dataset
  • Input: Abstract text
  • Output: Summary
  • Parameters: ~306M (distilled from 406M)

Results

The fine-tuned BioBERT cross-encoder demonstrates strong performance on the BioASQ 12B evaluation set:

  • F1 score: 95%
  • Accuracy: 92%
  • Precision: 97%
  • Recall: 93%
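These are standard binary-classification metrics over query-snippet pairs. A sketch of how they can be computed, assuming y_true holds the gold labels and scores holds the reranker's outputs (the 0.5 threshold is illustrative):

    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

    # y_true: assumed gold labels; scores: assumed reranker outputs in [0, 1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print("F1:       ", f1_score(y_true, y_pred))
    print("Accuracy: ", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall:   ", recall_score(y_true, y_pred))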


Parameter Choices

(Figure: training loss vs. step.)

  1. Batch Size: 1024

    • Chosen to maximize GPU utilization on A100
    • Smaller batches increased training time without performance gains
  2. Epochs: 2

    • Model converged after 2 epochs based on training loss
    • Training time: ~30 minutes
  3. Warmup Steps: 100

    • Prevents early training instability and improves convergence
    • Standard practice for transformer fine-tuning

Discussion

Comparison with Existing Approaches

The fine-tuned model captures the semantic meaning of a query better than PubMed's Advanced Search engine.

Example prompt: "What are the effects of e-cigarettes on sleep quality?"

PubMed's Advanced Search ranking:

  1. "2019 ACC/AHA Guideline on the Primary Prevention of Cardiovascular Disease: Executive Summary: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines."

  2. "Contemporary Concise Review 2024: Chronic Obstructive Pulmonary Disease."

  3. "On the potential harmful effects of E-Cigarettes (EC) on the developing brain: The relationship between vaping-induced oxidative stress and adolescent/young adults social maladjustment."

  4. "Deleterious Association of Inhalant Use on Sleep Quality during the COVID-19 Pandemic."

  5. "Comparative effects of e-cigarette and conventional cigarette smoke on in vitro bronchial epithelial cell responses."

  6. "The Effect of Cigarette Use and Dual-Use on Depression and Sleep Quality."

Model's ranking:

  1. "The Lifestyle of Saudi Medical Students." (Does not seem related at first glance, but the abstract reveals that the paper reports e-cigarette usage and sleep data. PubMed ranks this paper as #9.)

  2. "Bidirectional Relationships Between Sleep Quality and Nicotine Vaping: Studying Young Adult e-cigarette Users in Real Time and Real Life."

  3. "Main and Interactive Effects of Nicotine Product Type on Sleep Health Among Dual Combustible and E-Cigarette Users."

  4. "Sleep disturbances among young adult dual users of cigarettes and e-cigarettes: Analysis of the 2020 National Health Interview Survey."

  5. "Deleterious Association of Inhalant Use on Sleep Quality during the COVID-19 Pandemic."

  6. "Dual use of e-cigarettes with conventional tobacco is associated with increased sleep latency in cross-sectional Study."

PubMed's top 3 results do not appear relevant, while all 6 of the model's top papers do.

Future Directions

  1. Find and include the full article text when reranking.

  2. Instead of calling PubMed's API, download its full data with a mechanism to update it daily.

  3. Incorporate citation relationships to boost papers that are highly cited by other relevant papers.

  4. Extend beyond PubMed to include other databases (arXiv, bioRxiv, medRxiv).

Conclusion

This project demonstrates that combining domain-specific language models with neural reranking and automatic summarization can improve biomedical literature search. The fine-tuned BioBERT cross-encoder can successfully capture semantic relationships that traditional keyword-based search systems miss. By understanding biomedical terminology and concepts beyond exact keyword matches, the system ranks papers by true relevance rather than superficial term overlap. As demonstrated, semantic understanding of biomedical concepts can lead to more relevant results, particularly for complex or specialized queries.

This work contributes to a broader objective of accelerating scientific discovery by making literature search more intelligent, efficient, and accessible. By reducing the time researchers spend searching and reading abstracts, systems like this can help accelerate the pace of biomedical research and ultimately improve health outcomes.

References

  • Lee, J., et al. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234-1240. [Paper] [Code]

  • Lewis, M., et al. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. ACL 2020. [Paper]

  • Krithara, A., et al. (2022). BioASQ-QA: A manually curated corpus for Biomedical Question Answering. bioRxiv. [Paper]

  • Wolf, T., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. EMNLP 2020: System Demonstrations. [Paper] [Code]

  • Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP-IJCNLP 2019. [Paper] [Code]
