From adbe1771af00373d9a1691285b0d74683b392b3f Mon Sep 17 00:00:00 2001 From: David-Araripe Date: Mon, 10 Feb 2025 15:18:17 +0100 Subject: [PATCH 1/3] update field_reference page on the docs --- docs/source/field_reference.rst | 66 ++++++++++++++++++++------------- 1 file changed, 40 insertions(+), 26 deletions(-) diff --git a/docs/source/field_reference.rst b/docs/source/field_reference.rst index 1419cc3..b9e2607 100644 --- a/docs/source/field_reference.rst +++ b/docs/source/field_reference.rst @@ -1,51 +1,65 @@ Return Fields Reference ======================= -UniProtMapper supports a wide range of return fields from UniProt. These fields are organized by their types and can be used to specify which data you want to retrieve. +UniProtMapper supports a wide range of return fields from UniProt. These fields are organized by their types and can be used to specify the data you want to get for the proteins retrieved by your query. -Field Categories ----------------- - -The return fields are organized into several categories: +UniProt return fields are organized into the following categories: -1. Names & Taxonomy - - Basic identification and taxonomic information - - Gene names, organism details, etc. +1. **Names & Taxonomy**: Basic identification and taxonomic information. *E.g.:* Gene names, organism details, etc. -2. Sequences - - Sequence-related information - - Length, mass, variants, etc. +2. **Sequences**: Sequence-related information. *E.g.:* Length, mass, variants, etc. -3. Function - - Functional annotations - - Activity, pathways, binding sites, etc. +3. **Function**: Functional annotations. *E.g.:* Activity, pathways, binding sites, etc. -4. Structure - - Structural information - - 3D structure, secondary structure elements, etc. +4. **Structure**: Structural information. *E.g.:* 3D structure, secondary structure elements, etc. -5. Cross-references - - Links to external databases - - Organized by database type (e.g., genomic, proteomic, etc.) +5.
**Cross-references**: Links to external databases, subdivided into different categories according to the database being cross-referenced. *E.g.:* `Chemistry` for databases like `DrugBank`, `Genome annotation` for `Ensembl`, etc. Supported fields ---------------- +The supported return fields are listed below. The columns contain different information about the fields: + +- **label**: The label used by UniProt to represent this field. Also used as the column name on the `pd.DataFrame` returned from `get` methods implemented on both APIs. +- **returned_field**: Name used to specify which information to retrieve by the APIs. For examples, check below. +- **field_type**: The category of the field, as listed in the categories above. Note that for `type=='cross_reference'`, the field_type is the category of the cross-referenced database. +- **has_full_version**: Always `yes` for `type=='uniprot_field'`. Used by UniProt to indicate whether a cross-referenced database is fully integrated. +- **type**: Either "uniprot_field" or "cross_reference". The former indicates a field that is directly related to the protein, while the latter indicates a field that is a cross-reference to another database and not native to UniProt. + +For more up-to-date information on `has_full_version` of cross-referenced fields, check the official UniProt documentation: `Return Fields `_ + ..
csv-table:: Supported Return Fields :header-rows: 1 :file: _static/uniprot_return_fields.csv -Usage Example -------------- +Specify Return Fields with ID Mapping API +----------------------------------------- -To specify which fields to retrieve:: +Specify which fields to retrieve in an ID mapping request:: from UniProtMapper import ProtMapper mapper = ProtMapper() - # Get specific fields + fields = ["accession", "gene_names", "organism_name"] result, failed = mapper.get( ["P30542"], - fields=["accession", "gene_names", "organism_name"] - ) \ No newline at end of file + fields=fields, + ) + +Specify Return Fields with UniProtKB API +---------------------------------------- + +Specify which fields to retrieve in a field-based query:: + + from UniProtMapper import ProtKB + from UniProtMapper.uniprotkb_fields import accession + + protkb = ProtKB() + + query = accession("P30542") + fields = ["accession", "gene_names", "organism_name"] + result, failed = protkb.get( + query, + fields=fields, + ) From 297c5a94a679515ddc797464cfab7123a4b293d8 Mon Sep 17 00:00:00 2001 From: David-Araripe Date: Mon, 10 Feb 2025 16:21:41 +0100 Subject: [PATCH 2/3] =?UTF-8?q?Update=20the=20docs=20=F0=9F=93=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/source/api/UniProtMapper.rst | 2 + docs/source/field_reference.rst | 1 + docs/source/tutorials/field_querying.rst | 65 +++++++++++++++++++----- docs/source/tutorials/mapping.rst | 23 ++++++++- docs/source/tutorials/retrieving.rst | 53 ------------------- 5 files changed, 76 insertions(+), 68 deletions(-) delete mode 100644 docs/source/tutorials/retrieving.rst diff --git a/docs/source/api/UniProtMapper.rst b/docs/source/api/UniProtMapper.rst index cb5f5cc..ba4eb61 100644 --- a/docs/source/api/UniProtMapper.rst +++ b/docs/source/api/UniProtMapper.rst @@ -9,6 +9,8 @@ Main Module :undoc-members: :show-inheritance: +..
_field_querying: + Field Querying -------------- diff --git a/docs/source/field_reference.rst b/docs/source/field_reference.rst index b9e2607..66ed74f 100644 --- a/docs/source/field_reference.rst +++ b/docs/source/field_reference.rst @@ -15,6 +15,7 @@ UniProt return fields are organized into the categories: 5. **Cross-references**: Links to external databases, subdivided into different categories according to the database being cross-referenced. *E.g.:* `Chemistry` for datasets like `DrugBank`, `Genome annotation` for `Ensembl`, etc. +.. _supported_fields: Supported fields ---------------- diff --git a/docs/source/tutorials/field_querying.rst b/docs/source/tutorials/field_querying.rst index 782af09..2977a13 100644 --- a/docs/source/tutorials/field_querying.rst +++ b/docs/source/tutorials/field_querying.rst @@ -6,7 +6,7 @@ This tutorial demonstrates how to use UniProtMapper's field-based querying funct Basic Field Queries ------------------- -Here's a simple example using boolean fields:: +A simple example on querying UniProtKB through field search:: from UniProtMapper import ProtKB from UniProtMapper.uniprotkb_fields import reviewed, organism_name @@ -17,37 +17,76 @@ Here's a simple example using boolean fields:: query = reviewed(True) & organism_name("human") result, failed = protkb.get(query) +.. note:: + + Running this code will take some time as it retrieves all reviewed human proteins! Each iteration of the displayed progress bar represents 500 entries fetched from UniProtKB. 
+ Complex Queries --------------- -You can combine multiple fields with boolean operators:: +You can combine multiple fields with boolean operators, illustrated by the following examples: + +Example 1:: + from UniProtMapper import ProtKB from UniProtMapper.uniprotkb_fields import ( + organism_name, length, mass, date_modified, - gene_exact, - xref_count, ) + + protkb = ProtKB() # Find human proteins: - # - modified since 2024 + # - NOT modified since 2023-01-01 (in UniProtKB) # - between 200-300 amino acids # - mass < 50kDa - # - 5 or more deposited PDB structures query = ( organism_name("human") & - date_modified("2024-01-01", "*") & length(200, 300) & mass("*", 50000) & - xref_count("pdb", 5, "*") + (~ date_modified("2023-01-01", "*")) + ) + result = protkb.get(query) + +Example 2:: + + from UniProtMapper import ProtKB + from UniProtMapper.uniprotkb_fields import ( + xref_count, + organism_id, + reviewed, + fragment, + length, + ) + + protkb = ProtKB() + + # Find human proteins: + # - with 2 or more deposited PDB structures + # - not fragments + # - reviewed + # - length < 750 amino acids + query = ( + xref_count("pdb", 2, "*") + & organism_id(9606) + & reviewed(True) + & fragment(False) + & length("*", 750) ) result = protkb.get(query) +.. note:: + + The ``fields`` parameter is also supported by the ``ProtKB`` API. For a full list of the supported fields, check the :ref:`supported_fields` section of the docs. + Field Types ----------- -UniProtMapper supports several types of fields: +UniProtMapper supports several types of fields. For full documentation on the fields implemented in the package, check :ref:`field_querying`. + +See below for examples of different field types implemented in UniProtMapper.
Boolean Fields ~~~~~~~~~~~~~~ @@ -55,7 +94,7 @@ Boolean Fields from UniProtMapper.uniprotkb_fields import reviewed, fragment, is_isoform - # Get reviewed entries that are not fragments + # Example: Get reviewed entries that are not fragments query = reviewed(True) & ~fragment(True) Range Fields @@ -64,7 +103,7 @@ Range Fields from UniProtMapper.uniprotkb_fields import length, mass - # Proteins between 200-300 amino acids + # Example: Proteins between 200-300 amino acids query = length(200, 300) Date Range Fields @@ -73,7 +112,7 @@ Date Range Fields from UniProtMapper.uniprotkb_fields import date_created, date_modified - # Entries created in 2023 + # Example: Entries created in 2023 query = date_created("2023-01-01", "2023-12-31") Text-Based Fields @@ -82,5 +121,5 @@ Text-Based Fields from UniProtMapper.uniprotkb_fields import gene_exact, keyword, family - # Proteins in kinase family with ATP-binding + # Example: Proteins in kinase family with ATP-binding query = family("Kinase*") & keyword("ATP-binding") diff --git a/docs/source/tutorials/mapping.rst b/docs/source/tutorials/mapping.rst index 798c2af..4dbfdc3 100644 --- a/docs/source/tutorials/mapping.rst +++ b/docs/source/tutorials/mapping.rst @@ -18,12 +18,31 @@ Here's a simple example of mapping UniProt accession IDs to Ensembl IDs:: to_db="Ensembl" ) -The result is a pandas DataFrame containing the mapped IDs, and failed is a list of IDs that couldn't be mapped. +The ``result`` is a `pandas.DataFrame` containing the query and mapped IDs (column names `From` and `To`, respectively), while ``failed`` is a list of IDs that couldn't be mapped. + +Mapping Through Cross-Referenced Fields +--------------------------------------- + +Ensembl is also cross-referenced in UniProt entries. 
In case you're interested in checking all cross-referenced Ensembl IDs for a given UniProt entry, you can do so by:: + + from UniProtMapper import ProtMapper + + mapper = ProtMapper() + + fields = ["xref_ensembl"] + result, failed = mapper.get( + ids=["P30542", "Q16678", "Q02880"], + fields=fields, + ) + +.. note:: + + For a full list of the supported fields, check the :ref:`supported_fields` section of the docs. Here, result is again a `pandas.DataFrame` containing the query and mapped IDs (column names `From` and `Ensembl`, following the `label` column in the reference table). Available Databases ------------------- -UniProtMapper supports mapping between numerous databases. You can view the complete list of supported databases in the mapping_dbs.json file or check UniProt's documentation. +UniProtMapper supports mapping between numerous databases. You can view the complete list of supported databases in ``ProtMapper()._supported_dbs`` or check UniProt's documentation. Handling Failed Mappings ------------------------ diff --git a/docs/source/tutorials/retrieving.rst b/docs/source/tutorials/retrieving.rst deleted file mode 100644 index b50fca1..0000000 --- a/docs/source/tutorials/retrieving.rst +++ /dev/null @@ -1,53 +0,0 @@ -Retrieving Information Tutorial -=============================== - -This tutorial shows how to retrieve information about proteins from UniProt. 
- -Basic Retrieval ---------------- - -Here's how to retrieve information about a protein:: - - from UniProtMapper import ProtMapper - - mapper = ProtMapper() - - # Get information using default fields - result, failed = mapper.get(["Q02880"]) - -Customizing Return Fields -------------------------- - -You can specify which fields to retrieve:: - - # List available fields - fields_df = mapper.fields_table - print(fields_df.head()) - - # Get specific fields - fields = ["accession", "organism_name", "structure_3d"] - result, failed = mapper.get(["Q02880"], fields=fields) - -Default Fields --------------- - -UniProtMapper comes with a set of default fields, but you can override them:: - - # Check default fields - print(mapper.default_fields) - - # Use custom fields instead - result, failed = mapper.get( - ["Q02880"], - fields=["accession", "gene_names", "length"] - ) - -Handling Multiple Entries -------------------------- - -You can retrieve information for multiple proteins at once:: - - ids = ["P30542", "Q16678", "Q02880"] - result, failed = mapper.get(ids) - -The results will be returned as a pandas DataFrame with one row per protein. From 32115326fe84687416792551435eed52de8496be Mon Sep 17 00:00:00 2001 From: David-Araripe Date: Mon, 10 Feb 2025 17:00:14 +0100 Subject: [PATCH 3/3] =?UTF-8?q?Update=20README=20=F0=9F=93=9D=20to=20refle?= =?UTF-8?q?ct=20latest=20changes?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- README.md | 36 ++++++++++++++++++++++-------------- 1 file changed, 22 insertions(+), 14 deletions(-) diff --git a/README.md b/README.md index 7ae4812..4008a49 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ # UniProtMapper -Easily retrieve UniProt data and map protein identifiers using this Python package for UniProt's [Retrieve/ID Mapping](https://www.uniprot.org/id-mapping) RESTful API. 
+Easily retrieve UniProt data and map protein identifiers using this Python package for UniProt's Retrieve & ID Mapping RESTful APIs. [Read the full documentation](https://david-araripe.github.io/UniProtMapper/stable/index.html). ## 📚 Table of Contents @@ -17,6 +17,7 @@ Easily retrieve UniProt data and map protein identifiers using this Python packa - [Mapping IDs](#mapping-ids) - [Retrieving Information](#retrieving-information) - [Field-based Querying](#field-based-querying) +- [📖 Documentation](#-documentation) - [💻 Command Line Interface (CLI)](#-command-line-interface-cli) - [👏🏼 Credits](#-credits) @@ -24,31 +25,32 @@ Easily retrieve UniProt data and map protein identifiers using this Python packa UniProtMapper is a tool for bioinformatics and proteomics research that supports: 1. Mapping any UniProt [cross-referenced IDs](https://github.com/David-Araripe/UniProtMapper/blob/master/src/UniProtMapper/resources/uniprot_mapping_dbs.json) to other identifiers & vice-versa; -2. Programmatically retrieving any of the supported [return](https://www.uniprot.org/help/return_fields) and [cross-reference fields](https://www.uniprot.org/help/return_fields_databases) from both UniProt-SwissProt and UniProt-TrEMBL (unreviewed) databases; +2. Programmatically retrieving any of the supported [return](https://www.uniprot.org/help/return_fields) and [cross-reference fields](https://www.uniprot.org/help/return_fields_databases) from both UniProt-SwissProt and UniProt-TrEMBL (unreviewed) databases. For a full table containing all the supported resources, refer to the [supported fields](https://david-araripe.github.io/UniProtMapper/stable/field_reference.html#supported-fields) in the docs; 3. Querying UniProtKB entries using complex field-based queries with boolean operators `~` (NOT), `|` (OR), `&` (AND). For the first two functionalities, check the examples [Mapping IDs](#mapping-ids) and [Retrieving Information](#retrieving-information) below. 
The third, see [Field-based Querying](#field-based-querying). -All functionalities can also be accessed through the CLI. For more information, check [CLI](#-command-line-interface-cli). +The ID mapping API can also be accessed through the CLI. For more information, check [CLI](#-command-line-interface-cli). ## 📦 Installation ### From PyPI (recommended): -``` Shell +```shell python -m pip install uniprot-id-mapper ``` ### Directly from GitHub: -``` Shell +```shell python -m pip install git+https://github.com/David-Araripe/UniProtMapper.git ``` ### From source: -``` Shell +```shell git clone https://github.com/David-Araripe/UniProtMapper cd UniProtMapper python -m pip install . ``` + # 🛠️ Usage ## Mapping IDs @@ -71,9 +73,9 @@ The `result` is a pandas DataFrame containing the mapped IDs (see below), while | 1 | Q16678 | ENSG00000138061.12 | | 2 | Q02880 | ENSG00000077097.17 | -## Retrieving information +## Retrieving Information -The supported [return](https://www.uniprot.org/help/return_fields) and [cross-reference fields](https://www.uniprot.org/help/return_fields_databases) are both accessible through UniProt's website or by the attribute `ProtMapper.fields_table`: +All [supported return fields](https://david-araripe.github.io/UniProtMapper/stable/field_reference.html#supported-fields) are accessible through the attribute `ProtMapper.fields_table`: ```Python from UniProtMapper import ProtMapper @@ -90,7 +92,7 @@ df.head() | 3 | Gene Names (primary) | gene_primary | Names & Taxonomy | yes | uniprot_field | | 4 | Gene Names (synonym) | gene_synonym | Names & Taxonomy | yes | uniprot_field | -All values in `returned_field` are supported in the database's API.
Access UniProt data fields programmatically: +From the DataFrame, all `returned_field` entries can be used to access UniProt data programmatically: ```Python # To retrieve the default fields: @@ -105,9 +107,9 @@ result, failed = mapper.get(["Q02880"], fields=fields) ## Field-based Querying -UniProtMapper supports complex field-based queries using boolean operators (AND, OR, NOT) through the `uniprotkb_fields` module. This allows you to create sophisticated searches combining multiple criteria. For example: +UniProtMapper supports complex field-based protein queries using boolean operators (AND, OR, NOT) through the `uniprotkb_fields` module. This allows you to create sophisticated searches combining multiple criteria. For example: -```Python +```python from UniProtMapper import ProtKB from UniProtMapper.uniprotkb_fields import ( organism_name, @@ -128,10 +130,16 @@ query = ( protkb = ProtKB() result = protkb.get(query) ``` +For a list of all fields and their descriptions, check the API reference for the [uniprotkb_fields](https://david-araripe.github.io/UniProtMapper/stable/api/UniProtMapper.html#module-UniProtMapper.uniprotkb_fields) module. + +## 📖 Documentation + +- [Stable Branch Documentation](https://david-araripe.github.io/UniProtMapper/stable/index.html) (master branch) +- [Development Documentation](https://david-araripe.github.io/UniProtMapper/dev/index.html) (dev branch) # 💻 Command Line Interface (CLI) -UniProtMapper provides a CLI for easy integration into bioinformatics workflows. Here is a list of the available arguments, shown by `protmap -h`: +UniProtMapper provides a CLI for the ID Mapping class, `ProtMapper`, for easy access to lookups and data retrieval. Here is a list of the available arguments, shown by `protmap -h`: ```text usage: UniProtMapper [-h] -i [IDS ...]
[-r [RETURN_FIELDS ...]] [--default-fields] [-o OUTPUT] @@ -164,14 +172,14 @@ optional arguments: references, see: /resources/uniprot_mapping_dbs.json -over, --overwrite If desired to overwrite an existing file when using -o/--output -pf, --print-fields Prints the available return fields and exits the program. - ``` +``` Usage example, retrieving default fields from `/resources/cli_return_fields.txt`:

Image displaying the output of UniProtMapper's CLI, protmap

-# 👏🏼 Credits: +## 👏🏼 Credits - [UniProt](https://www.uniprot.org/) for providing the API and the amazing database; - [Andrew White and the University of Rochester](https://github.com/whitead/protein-emoji) for the protein emoji;