Joss paper #199 (Draft)

manthey wants to merge 12 commits into master from joss-paper.
Commits:
- d029423 Draft of JOSS paper (manthey)
- 6a3a221 Add more figures (manthey)
- 1215873 Add another orcid. (manthey)
- 11fb2a3 Update paper/paper.md (manthey)
- 2354768 Add another orcid (manthey)
- 3c6077d Say more about feature shapes, SLIC, and huggingface UNI
- 10ecf15 Add DOI to some citations. Fix two-word last name
- 05a49a1 Merge branch 'master' into joss-paper (manthey)
- e0557e0 Update for repo name change (manthey)
- 6a243b0 Summary and background changes (cooperlab)
- 2ac5678 Update about client/server architecture (manthey)
- 101935a Merge branch 'master' into joss-paper (manthey)
.github/workflows/draft-pdf.yml:

```yaml
name: Draft PDF
on:
  push:
    paths:
      - paper/**
      - .github/workflows/draft-pdf.yml

jobs:
  paper:
    runs-on: ubuntu-latest
    name: Paper Draft
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Build draft PDF
        uses: openjournals/openjournals-draft-action@master
        with:
          journal: joss
          # This should be the path to the paper within your repo.
          paper-path: paper/paper.md
      - name: Upload
        uses: actions/upload-artifact@v4
        with:
          name: paper
          # This is the output path where Pandoc will write the compiled
          # PDF. Note, this should be the same directory as the input
          # paper.md
          path: paper/paper.pdf
```
paper/paper.bib:

```bibtex
@software{histomicsui,
  title  = {HistomicsUI: Organize, visualize, annotate, and analyze histology images},
  author = {{Kitware, Inc}},
  year   = {2025},
  note   = {Package version 1.7.0},
  url    = {https://github.com/DigitalSlideArchive/HistomicsUI},
  doi    = {10.5281/zenodo.5474914}
}

@software{histomicstk,
  title  = {HistomicsTK: a Python package for the analysis of digital pathology images},
  author = {{Kitware, Inc}},
  year   = {2025},
  note   = {Package version 1.4.0},
  url    = {https://github.com/DigitalSlideArchive/HistomicsTK},
  doi    = {10.5281/zenodo.14833780}
}

@software{digitalslidearchive,
  title  = {Digital Slide Archive: a system for working with large microscopy images},
  author = {{Kitware, Inc}},
  year   = {2025},
  note   = {Commit 2da1bfc7365dd72011854b5aebf4a744cfcf98a1; Accessed: 2025-04-30},
  url    = {https://github.com/DigitalSlideArchive/digital_slide_archive}
}

@article{batchbald2019,
  author     = {Kirsch, Andreas and van Amersfoort, Joost and Gal, Yarin},
  title      = {BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning},
  journal    = {CoRR},
  volume     = {abs/1906.08158},
  year       = {2019},
  url        = {http://arxiv.org/abs/1906.08158},
  eprinttype = {arXiv},
  eprint     = {1906.08158}
}

@article{Gutman2017,
  title     = {The Digital Slide Archive: A Software Platform for Management, Integration, and Analysis of Histology for Cancer Research},
  author    = {Gutman, David A and Khalilia, Mohammed and Lee, Sanghoon and Nalisnik, Michael and Mullen, Zach and Beezley, Jonathan and Chittajallu, Deepak R and Manthey, David and Cooper, Lee A D},
  journal   = {Cancer Research},
  publisher = {American Association for Cancer Research (AACR)},
  volume    = {77},
  number    = {21},
  pages     = {e75--e78},
  year      = {2017},
  month     = {Oct},
  issn      = {1538-7445},
  doi       = {10.1158/0008-5472.can-17-0629},
  url       = {http://dx.doi.org/10.1158/0008-5472.can-17-0629}
}

@misc{TCGAData,
  author = {{National Cancer Institute} and {National Human Genome Research Institute}},
  title  = {The Cancer Genome Atlas (TCGA) Program},
  year   = {2022},
  url    = {https://www.cancer.gov/tcga},
  note   = {Accessed: 2022-11-10}
}

@article{SLIC2012,
  author  = {Achanta, Radhakrishna and Shaji, Appu and Smith, Kevin and Lucchi, Aurelien and Fua, Pascal and S\"usstrunk, Sabine},
  title   = {SLIC superpixels compared to state-of-the-art superpixel methods},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year    = {2012},
  volume  = {34},
  number  = {11},
  pages   = {2274--2282},
  doi     = {10.1109/TPAMI.2012.120}
}

@article{huggingface2024uni,
  author  = {Chen, Richard J and Ding, Tong and Lu, Ming Y and Williamson, Drew F K and Jaume, Guillaume and Song, Andrew H and Chen, Bowen and Zhang, Andrew and Shao, Daniel and Shaban, Muhammad and Williams, Mane and Oldenburg, Lukas and Weishaupt, Luca L and Wang, Judy J and Vaidya, Anurag and Le, Long Phi and Gerber, Georg and Sahai, Sharifa and Williams, Walt and Mahmood, Faisal},
  title   = {Towards a general-purpose foundation model for computational pathology},
  journal = {Nature Medicine},
  year    = {2024},
  volume  = {30},
  number  = {3},
  pages   = {850--862},
  month   = {Mar},
  doi     = {10.1038/s41591-024-02857-3},
  url     = {https://huggingface.co/MahmoodLab/UNI}
}
```
paper/paper.md:

---
title: 'Histomics Label'
tags:
  - Python
  - histology
  - bioimage informatics
  - whole slide annotation
  - whole slide images
  - guided labeling
authors:
  - name: Brianna Major
    affiliation: 1
    orcid: 0000-0003-4968-5701
  - name: Jeffery A. Goldstein
    affiliation: 2
    orcid: 0000-0002-4086-057X
  - name: Michael Nagler
    affiliation: 1
    orcid: 0000-0003-3531-6630
  - name: Lee A. Newberg
    affiliation: 1
    orcid: 0000-0003-4644-8874
  - name: Abhishek Sharma
    affiliation: 2
    orcid: 0000-0001-6666-2179
  - name: Anders Sildnes
    affiliation: 2
    orcid: 0009-0003-0141-6112
  - name: Faiza Ahmed
    affiliation: 1
    orcid: 0000-0001-6687-9941
  - name: Jeff Baumes
    affiliation: 1
    orcid: 0000-0002-4719-3490
  - name: Lee A.D. Cooper
    affiliation: 2
    orcid: 0000-0002-3504-4965
  - name: David Manthey
    affiliation: 1
    orcid: 0000-0002-4580-8770
affiliations:
  - index: 1
    name: Kitware, Inc., New York, United States
  - index: 2
    name: Northwestern University Feinberg School of Medicine, Illinois, United States
date: 30 April 2025
bibliography: paper.bib
---
# Summary

`Histomics Label` is a software tool for the interactive development of machine-learning classifiers for whole slide pathology images. It is deployed as part of the Digital Slide Archive [@Gutman2017; @digitalslidearchive], a web-based data management system for whole slide image datasets, is built on top of HistomicsUI [@histomicsui], and uses the HistomicsTK [@histomicstk] image analysis toolkit.

Users label image regions or tissue structures to provide training data for classifiers, and they iteratively improve those classifiers by reviewing their output and providing additional labels. The interface uses heuristics to guide users to the most impactful examples to label, and it supports bulk labeling of samples and collaborative review of labeled examples. An example data generation pipeline is included that segments a whole slide image into superpixels and generates feature embeddings for the segmented regions using a foundation model.
# Statement of need

One of the chief limitations in developing classification models is the need for labeled data. In pathology and other medical fields, the expertise required for labeling and the busy schedules of medical experts make labeling particularly challenging. For whole slide images, where each image can contain several billion pixels, navigating vast datasets in search of possibly rare tissue states can be inefficient and frustrating. Software interfaces need to be optimized for the user experience to make the most of an expert's time and energy.

Other issues in labeling include the volume and accessibility of data. Software that must run local to the data requires that all data be copied and correctly versioned for each project. A web client and server model with an appropriate permission system only requires that the data reside on a centrally managed server. This provides a single, coordinated source of data for a project and reduces the requirements on individual users to a web browser and an ordinary internet connection, enabling collaboration between multiple experts and allowing experts to review the work of their trainees.
`Histomics Label` uses a technique called active learning to identify the unlabeled examples that can provide the most benefit to classifier performance, and it provides an intuitive workflow for presenting these examples to experts for efficient labeling. Data can be generated using a built-in pipeline that partitions whole slide images into superpixels, or users can provide their own data from external cell or tissue segmentation algorithms. Users specify the categories that can be labeled, assign display properties such as color, and can exclude categories from classifier training (for instance, for regions whose categories cannot be accurately determined). After a few initial example regions are labeled, a classifier is trained and used both to predict the category of every region and to identify the unlabeled regions whose labels would most improve the classifier. The user can retrain the classifier at any time and can review the classifier's predictions and the labels from other users. Labeling can also be performed by painting directly on the whole slide image with a brush tool.
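As a generic illustration of how uncertainty metrics rank unlabeled examples in active learning (a minimal sketch with hypothetical names, not the package's actual implementation), each segment's predicted class probabilities can be reduced to a score, and the lowest-scoring segments suggested first:

```python
import numpy as np

def acquisition_scores(probs, metric="margin"):
    """Score segments from an (n_segments, n_classes) array of predicted
    class probabilities; lower scores mean more uncertainty, so those
    segments are suggested for labeling first."""
    if metric == "confidence":
        # Probability of the top predicted class.
        return probs.max(axis=1)
    if metric == "margin":
        # Gap between the top two class probabilities.
        ordered = np.sort(probs, axis=1)
        return ordered[:, -1] - ordered[:, -2]
    if metric == "negative_entropy":
        # Most negative when the prediction is most uncertain.
        return np.sum(probs * np.log(probs + 1e-12), axis=1)
    raise ValueError(f"unknown metric: {metric}")

probs = np.array([[0.90, 0.05, 0.05],   # confident segment
                  [0.40, 0.35, 0.25]])  # uncertain segment
order = np.argsort(acquisition_scores(probs, "margin"))
# order[0] == 1: the uncertain segment is suggested first.
```

All three metrics agree on this toy input; they differ on real data because margin only considers the top two classes while entropy considers the full distribution.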
For development, the initial segmentation uses superpixels generated with the SLIC [@SLIC2012] algorithm. These are computed on whole slide images in a tiled manner so that they work on arbitrarily large images, with tile boundaries handled so that superpixels merge seamlessly. Once generated, segments are represented in one of two ways: as two-dimensional patches, each centered in a fixed-size square of masked pixels, or as one-dimensional feature embeddings, such as those generated by the Hugging Face UNI [@huggingface2024uni] foundation model. One of two basic models is trained depending on the segment representation: for two-dimensional patches, a small-scale CNN implemented in TensorFlow/Keras or PyTorch; for one-dimensional vectors, a single-layer linear classifier. The certainty criterion used to decide which segments should be labeled next can also be selected from confidence, margin, negative entropy, and the BatchBALD [@batchbald2019] algorithm.
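The tiled computation can be sketched in a few lines. This is an illustrative toy with hypothetical names, not the pipeline's actual code: it runs a per-tile labeling function and offsets each tile's label ids so they remain globally unique, while the real pipeline additionally merges superpixels that straddle tile boundaries (elided here).

```python
import numpy as np

def label_tiles(image, tile_size, segment_tile):
    """Run a per-tile segmentation function over a large image, offsetting
    each tile's label ids so they are globally unique across tiles."""
    h, w = image.shape[:2]
    labels = np.zeros((h, w), dtype=np.int64)
    offset = 0
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tile = image[y:y + tile_size, x:x + tile_size]
            tile_labels = segment_tile(tile)   # small integer ids per tile
            labels[y:y + tile_size, x:x + tile_size] = tile_labels + offset
            offset += tile_labels.max() + 1    # reserve this tile's id range
    return labels

def fake_segment(tile):
    """Stand-in segmenter: splits a tile into 2x2 blocks (ids 0-3).
    Real code would call a SLIC implementation here."""
    th, tw = tile.shape[:2]
    ys = (np.arange(th)[:, None] * 2) // th
    xs = (np.arange(tw)[None, :] * 2) // tw
    return ys * 2 + xs

img = np.zeros((64, 64))
lab = label_tiles(img, 32, fake_segment)
# Four 32x32 tiles, four segments each: 16 unique ids, max id 15.
```

Because each tile only needs to be in memory while it is processed, this pattern scales to images far larger than RAM.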
A placental pathologist provided feedback to validate the efficiency of the user interface and the utility of the process.
# Basic Workflow

When starting a new labeling project, the user selects how superpixels are generated, which certainty metric is used to determine the optimal labeling order, and which features are used for model training. The labeling mode allows defining project labels and performing initial labeling. This mode can also be used to add new label categories or to combine two categories that should not have been distinct. Label categories can additionally be marked as excluded, which removes them from training and ensures that superpixels with those labels are no longer suggested for labeling.
![Labeling mode](labeling.png)
Once some segments have been labeled and an initial training pass has been performed, additional segments are shown with their predicted labels. The user can confirm or correct labels with keyboard shortcuts or the mouse. Segments are presented in the order that maximizes the expected improvement to the model based on the originally selected certainty metric.
![Guided labeling](guided.png)
To check overall behavior or correct mistakes, a review mode shows all labeled segments with various filtering and sorting options. This can be used to check agreement between pathologists or to determine how well the model agrees with the manually labeled data.
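Agreement between two annotators over the same segments can be summarized, for example, with raw percent agreement or Cohen's kappa. The sketch below uses made-up labels and is not the review mode's actual code:

```python
import numpy as np

def cohens_kappa(a, b):
    """Chance-corrected agreement between two label vectors."""
    classes = np.union1d(a, b)
    p_o = np.mean(a == b)                                   # observed agreement
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in classes)
    return (p_o - p_e) / (1 - p_e)                          # correct for chance

# Hypothetical labels from two pathologists over six segments.
a = np.array([0, 1, 1, 2, 2, 2])
b = np.array([0, 1, 2, 2, 2, 1])
agreement = np.mean(a == b)      # raw agreement: 4 of 6 segments match
kappa = cohens_kappa(a, b)       # lower than raw agreement after correction
```

Kappa is useful here because, with few categories, two annotators will agree on some segments by chance alone.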
![Review mode](review.png)
The whole slide image data in these figures are from data generated by the TCGA Research Network [@TCGAData].
# Acknowledgements

This work has been funded in part by National Library of Medicine grant 5R01LM013523, entitled "Guiding humans to create better labeled datasets for machine learning in biomedical research".
# References