The Caduceus Project is an initiative aimed at improving the conversion of complex scientific and medical PDF files to well-structured markdown format. By utilizing the power of OpenAI's GPT-4o model, this project aims to enhance the accessibility and usability of scientific and medical information. The PDF files used were taken from protocols.io, an open source repository of medical and scientific protocols. You can find the resulting dataset of this project here.
The dataset is created through the following steps:
-
PDF to Text Conversion: The
pdf2txt.pyscript is used to convert PDF files in a specified input folder to plain text format. It extracts the text content from each page of the PDF files using thePyPDF2library and saves the extracted text as individual text files in a specified output folder. -
Text Analysis: The
densitymap.pyscript processes the converted text files to calculate word counts, the occurrence of specific units of measurement, and the ratio of measurements to word count for each file. It saves this data to a CSV file namedword_counts_and_measurements.csvand creates density maps for word counts and measurements ratios using matplotlib. -
Data Filtering: The
scatterplot.pyscript reads the data from theword_counts_and_measurements.csvfile and creates an interactive scatter plot using Plotly. This plot visualizes the relationship between "Measurements Ratio" and "Word Count" for the dataset. The script filters the data based on thresholds for these two metrics, allowing for the selection of high-quality content. -
PDF to Markdown Conversion: The
pdf2md.pyscript converts the filtered PDF files to markdown format. It uses thepdf2imagelibrary to convert PDF pages to images and sends these images along with a prompt to the OpenAI API for conversion to markdown. The generated markdown content is saved in a JSONL file and optionally as individual markdown files. -
Dataset Cleaning: The
move.pyscript is used to clean the dataset by moving PDF files based on the "Measurements Ratio" and "Word Count" thresholds. Files meeting the thresholds are moved to a subset folder, while files not meeting the thresholds are moved to an unused folder.
This code can be modified in order to convert any type of PDF files to markdown format.
To use the Caduceus Project scripts, follow these steps:
-
Install the required dependencies by running the following command:
pip install -r requirements.txt
-
Update the file paths and API key in the scripts:
- In
pdf2txt.py:- Set
INPUT_FOLDERto the directory containing the PDF files you want to convert. - Set
OUTPUT_FOLDERto the directory where you want to save the converted text files.
- Set
- In
densitymap.py:- Set
TXT_FOLDERto the directory containing the converted text files.
- Set
- In
scatterplot.py:- Set
CSV_FILEto the path of theword_counts_and_measurements.csvfile generated bydensitymap.py. - Adjust
ratio_thresholdandword_count_thresholdvariables according to your desired filtering criteria.
- Set
- In
pdf2md.py:- Set
pdf_folderto the directory containing the filtered PDF files you want to convert to markdown. - Set
output_fileto the path where you want to save the JSONL file containing the converted markdown content. - If you want to save individual markdown files, provide the
markdown_folderwhen prompted.
- Set
- In
move.py:- Set
CSV_FILEto the path of theword_counts_and_measurements.csvfile. - Set
PDF_FOLDERto the directory containing the original PDF files. - Set
SUBSET_FOLDERto the directory where you want to move the PDF files meeting the thresholds. - Set
UNUSED_FOLDERto the directory where you want to move the PDF files not meeting the thresholds. - Adjust
ratio_thresholdandword_count_thresholdvariables to match the values used inscatterplot.py.
- Set
- In
-
Run the scripts in the following order:
-
pdf2txt.py: Converts the PDF files in theINPUT_FOLDERto text format and saves the converted files in theOUTPUT_FOLDER.python pdf2txt.py
-
densitymap.py: Analyzes the converted text files in theTXT_FOLDER, calculates word counts, unit counts, and measurements ratios, and creates density maps. Saves the data to theword_counts_and_measurements.csvfile.python densitymap.py
-
scatterplot.py: Reads the data from theword_counts_and_measurements.csvfile and creates an interactive scatter plot for data filtering. Filters the data based on the specifiedratio_thresholdandword_count_thresholdand saves the filtered data tointeractive_scatter_plot_subset.html.python scatterplot.py
-
move.py: Cleans the dataset by moving the PDF files based on the thresholds specified inscatterplot.py. Moves the PDF files meeting the thresholds to theSUBSET_FOLDERand the files not meeting the thresholds to theUNUSED_FOLDER.python move.py
-
pdf2md.py: Converts the filtered PDF files in theSUBSET_FOLDERto markdown format using the OpenAI API. Saves the converted markdown content to the specifiedoutput_filein JSONL format. If prompted, provide themarkdown_folderto save individual markdown files.python pdf2md.py
-
-
After running the scripts, you will have a cleaned dataset of PDF files in the
SUBSET_FOLDERand their corresponding markdown conversions in theoutput_fileand optionally in themarkdown_folder. -
Use the generated dataset for model fine-tuning or further analysis as needed.
Remember to replace the placeholders (e.g., INPUT_FOLDER, OUTPUT_FOLDER, TXT_FOLDER, etc.) with the actual paths on your system and set the appropriate thresholds in scatterplot.py and move.py based on your requirements.
Contributions to the Caduceus Project are welcome! If you have any ideas, suggestions, or improvements, please feel free to open an issue or submit a pull request on the project's GitHub repository.
The Caduceus Project is open-source and available under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.
Under this license, you are free to:
- Share: Copy and redistribute the material in any medium or format.
- Adapt: Remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
Please see the Creative Commons Attribution 4.0 International (CC BY 4.0) License for more details.
