36 changes: 22 additions & 14 deletions README.md
Expand Up @@ -7,7 +7,7 @@

# UniProtMapper <img align="left" width="40" height="40" src="https://raw.githubusercontent.com/whitead/protein-emoji/main/src/protein-72-color.svg">

Easily retrieve UniProt data and map protein identifiers using this Python package for UniProt's [Retrieve/ID Mapping](https://www.uniprot.org/id-mapping) RESTful API.
Easily retrieve UniProt data and map protein identifiers using this Python package for UniProt's Retrieve & ID Mapping RESTful APIs. [Read the full documentation](https://david-araripe.github.io/UniProtMapper/stable/index.html).

## 📚 Table of Contents

Expand All @@ -17,38 +17,40 @@ Easily retrieve UniProt data and map protein identifiers using this Python packa
- [Mapping IDs](#mapping-ids)
- [Retrieving Information](#retrieving-information)
- [Field-based Querying](#field-based-querying)
- [📖 Documentation](#-documentation)
- [💻 Command Line Interface (CLI)](#-command-line-interface-cli)
- [👏🏼 Credits](#-credits)

## ⛏️ Features
UniProtMapper is a tool for bioinformatics and proteomics research that supports:

1. Mapping any UniProt [cross-referenced IDs](https://github.com/David-Araripe/UniProtMapper/blob/master/src/UniProtMapper/resources/uniprot_mapping_dbs.json) to other identifiers & vice-versa;
2. Programmatically retrieving any of the supported [return](https://www.uniprot.org/help/return_fields) and [cross-reference fields](https://www.uniprot.org/help/return_fields_databases) from both UniProt-SwissProt and UniProt-TrEMBL (unreviewed) databases;
2. Programmatically retrieving any of the supported [return](https://www.uniprot.org/help/return_fields) and [cross-reference fields](https://www.uniprot.org/help/return_fields_databases) from both UniProt-SwissProt and UniProt-TrEMBL (unreviewed) databases. For a full table containing all the supported resources, refer to the [supported fields](https://david-araripe.github.io/UniProtMapper/stable/field_reference.html#supported-fields) in the docs;
3. Querying UniProtKB entries using complex field-based queries with boolean operators `~` (NOT), `|` (OR), `&` (AND).

For the first two functionalities, check the examples [Mapping IDs](#mapping-ids) and [Retrieving Information](#retrieving-information) below. For the third, see [Field-based Querying](#field-based-querying).

All functionalities can also be accessed through the CLI. For more information, check [CLI](#-command-line-interface-cli).
The ID mapping API can also be accessed through the CLI. For more information, check [CLI](#-command-line-interface-cli).

## 📦 Installation

### From PyPI (recommended):
``` Shell
```shell
python -m pip install uniprot-id-mapper
```

### Directly from GitHub:
``` Shell
```shell
python -m pip install git+https://github.com/David-Araripe/UniProtMapper.git
```

### From source:
``` Shell
```shell
git clone https://github.com/David-Araripe/UniProtMapper
cd UniProtMapper
python -m pip install .
```

# 🛠️ Usage

## Mapping IDs
Expand All @@ -71,9 +73,9 @@ The `result` is a pandas DataFrame containing the mapped IDs (see below), while
| 1 | Q16678 | ENSG00000138061.12 |
| 2 | Q02880 | ENSG00000077097.17 |
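
Because one source ID can map to several target IDs, it is often convenient to group the `From`/`To` pairs per query ID and to list which requested IDs came back without a mapping. The snippet below is a minimal, standard-library sketch of that bookkeeping, not part of UniProtMapper: `group_mappings`, `unmapped`, and the `NOT_A_REAL_ID` placeholder are hypothetical names introduced for illustration, and the two rows are copied from the table above.

```python
from collections import defaultdict


def group_mappings(rows):
    """Group (source_id, target_id) pairs into {source_id: [target_id, ...]}."""
    grouped = defaultdict(list)
    for source_id, target_id in rows:
        grouped[source_id].append(target_id)
    return dict(grouped)


def unmapped(requested, grouped):
    """Return the requested IDs that have no mapping, preserving input order."""
    return [query_id for query_id in requested if query_id not in grouped]


# Two (From, To) rows taken from the table above.
rows = [
    ("Q16678", "ENSG00000138061.12"),
    ("Q02880", "ENSG00000077097.17"),
]
# "NOT_A_REAL_ID" is a made-up placeholder standing in for an ID that failed.
requested = ["Q16678", "Q02880", "NOT_A_REAL_ID"]

grouped = group_mappings(rows)
missing = unmapped(requested, grouped)
print(grouped)
print(missing)
```

With the DataFrame itself, a similar grouping can likely be obtained with `result.groupby("From")["To"].apply(list)`, assuming the column names are `From` and `To`.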

## Retrieving information
## Retrieving Information

The supported [return](https://www.uniprot.org/help/return_fields) and [cross-reference fields](https://www.uniprot.org/help/return_fields_databases) are both accessible through UniProt's website or by the attribute `ProtMapper.fields_table`:
All [supported return fields](https://david-araripe.github.io/UniProtMapper/stable/field_reference.html#supported-fields) are accessible through the attribute `ProtMapper.fields_table`:

```Python
from UniProtMapper import ProtMapper
Expand All @@ -90,7 +92,7 @@ df.head()
| 3 | Gene Names (primary) | gene_primary | Names & Taxonomy | yes | uniprot_field |
| 4 | Gene Names (synonym) | gene_synonym | Names & Taxonomy | yes | uniprot_field |

All values in `returned_field` are supported in the database's API. Access UniProt data fields programmatically:
From the DataFrame, all `returned_field` entries can be used to access UniProt data programmatically:

```Python
# To retrieve the default fields:
Expand All @@ -105,9 +107,9 @@ result, failed = mapper.get(["Q02880"], fields=fields)

## Field-based Querying

UniProtMapper supports complex field-based queries using boolean operators (AND, OR, NOT) through the `uniprotkb_fields` module. This allows you to create sophisticated searches combining multiple criteria. For example:
UniProtMapper supports complex field-based protein queries using boolean operators (AND, OR, NOT) through the `uniprotkb_fields` module. This allows you to create sophisticated searches combining multiple criteria. For example:

```Python
```python
from UniProtMapper import ProtKB
from UniProtMapper.uniprotkb_fields import (
organism_name,
Expand All @@ -128,10 +130,16 @@ query = (
protkb = ProtKB()
result = protkb.get(query)
```
For a list of all fields and their descriptions, check the API reference for the [uniprotkb_fields](https://david-araripe.github.io/UniProtMapper/stable/api/UniProtMapper.html#module-UniProtMapper.uniprotkb_fields) module.
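
To make the `~`, `|` and `&` composition more concrete, here is a minimal sketch of how operator overloading can turn Python expressions into a boolean query string. This is not UniProtMapper's implementation: the `Query` class and the `field` and `range_field` helpers are hypothetical, and the output format only assumes UniProt's general `AND`/`OR`/`NOT` and `[low TO high]` search syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical query node; NOT UniProtMapper's actual class."""

    text: str

    def __and__(self, other: "Query") -> "Query":
        # `a & b` becomes a parenthesized AND expression.
        return Query(f"({self.text} AND {other.text})")

    def __or__(self, other: "Query") -> "Query":
        # `a | b` becomes a parenthesized OR expression.
        return Query(f"({self.text} OR {other.text})")

    def __invert__(self) -> "Query":
        # `~a` becomes a parenthesized NOT expression.
        return Query(f"(NOT {self.text})")


def field(name: str, value: object) -> Query:
    """Build a single `name:value` term (illustrative helper)."""
    return Query(f"{name}:{value}")


def range_field(name: str, low: object, high: object) -> Query:
    """Build a range term; "*" leaves a bound open."""
    return Query(f"{name}:[{low} TO {high}]")


# Python evaluates `~` before `&`, and `&` before `|`,
# so add parentheses whenever the intended grouping is not obvious.
query = (
    field("organism_id", 9606)
    & ~field("fragment", "true")
    & range_field("length", "*", 750)
)
print(query.text)
```

Since the grouping follows Python's operator precedence rather than reading order, an expression such as `a | b & c` is evaluated as `a | (b & c)`; wrapping sub-expressions in parentheses keeps the intent explicit.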

## 📖 Documentation

- [Stable Branch Documentation](https://david-araripe.github.io/UniProtMapper/stable/index.html) (master branch)
- [Development Documentation](https://david-araripe.github.io/UniProtMapper/dev/index.html) (dev branch)

# 💻 Command Line Interface (CLI)

UniProtMapper provides a CLI for easy integration into bioinformatics workflows. Here is a list of the available arguments, shown by `protmap -h`:
UniProtMapper provides a CLI for the ID Mapping class, `ProtMapper`, for easy access to lookups and data retrieval. Here is a list of the available arguments, shown by `protmap -h`:

```text
usage: UniProtMapper [-h] -i [IDS ...] [-r [RETURN_FIELDS ...]] [--default-fields] [-o OUTPUT]
Expand Down Expand Up @@ -164,14 +172,14 @@ optional arguments:
references, see: <pkg_path>/resources/uniprot_mapping_dbs.json
-over, --overwrite If desired to overwrite an existing file when using -o/--output
-pf, --print-fields Prints the available return fields and exits the program.
```
```

Usage example, retrieving default fields from `<pkg_path>/resources/cli_return_fields.txt`:
<p align="center">
<img src="https://github.com/David-Araripe/UniProtMapper/blob/master/figures/cli_example_fig.png?raw=true" alt="Image displaying the output of UniProtMapper's CLI, protmap"/>
</p>

# 👏🏼 Credits:
## 👏🏼 Credits

- [UniProt](https://www.uniprot.org/) for providing the API and the amazing database;
- [Andrew White and the University of Rochester](https://github.com/whitead/protein-emoji) for the protein emoji;
Expand Down
2 changes: 2 additions & 0 deletions docs/source/api/UniProtMapper.rst
Expand Up @@ -9,6 +9,8 @@ Main Module
:undoc-members:
:show-inheritance:

.. _field_querying:

Field Querying
--------------

Expand Down
67 changes: 41 additions & 26 deletions docs/source/field_reference.rst
@@ -1,51 +1,66 @@
Return Fields Reference
=======================

UniProtMapper supports a wide range of return fields from UniProt. These fields are organized by their types and can be used to specify which data you want to retrieve.
UniProtMapper supports a wide range of return fields from UniProt. These fields are organized by their types and can be used to specify the data you want to retrieve for the proteins returned by your query.

Field Categories
----------------

The return fields are organized into several categories:
UniProt return fields are organized into the following categories:

1. Names & Taxonomy
- Basic identification and taxonomic information
- Gene names, organism details, etc.
1. **Names & Taxonomy**: Basic identification and taxonomic information. *E.g.:* Gene names, organism details, etc.

2. Sequences
- Sequence-related information
- Length, mass, variants, etc.
2. **Sequences**: Sequence-related information. *E.g.:* Length, mass, variants, etc.

3. Function
- Functional annotations
- Activity, pathways, binding sites, etc.
3. **Function**: Functional annotations. *E.g.:* Activity, pathways, binding sites, etc.

4. Structure
- Structural information
- 3D structure, secondary structure elements, etc.
4. **Structure**: Structural information. *E.g.:* 3D structure, secondary structure elements, etc.

5. Cross-references
- Links to external databases
- Organized by database type (e.g., genomic, proteomic, etc.)
5. **Cross-references**: Links to external databases, subdivided into categories according to the database being cross-referenced. *E.g.:* `Chemistry` for databases like `DrugBank`, `Genome annotation` for `Ensembl`, etc.

.. _supported_fields:

Supported fields
----------------

The supported return fields are listed below. The columns contain different information about the fields:

- **label**: The label used by UniProt to represent this field. Also used as column names on the `pd.DataFrame` returned from `get` methods implemented on both APIs.
- **returned_field**: Name used to specify which information to retrieve by the APIs. For examples, check below.
- **field_type**: The category of the field, as listed above under `Field Categories`. Note that for `type=='cross_reference'`, the field_type is the category of the cross-referenced database.
- **has_full_version**: Always `yes` for `type=='uniprot_field'`. Is used by UniProt to indicate whether a cross-referenced database is fully integrated.
- **type**: Either "uniprot_field" or "cross_reference". The former indicates a field that is directly related to the protein, while the latter indicates a field that is a cross-reference to another database and not native to UniProt.

For more up-to-date information on `has_full_version` of cross-referenced fields, check the official UniProt documentation: `Return Fields <https://www.uniprot.org/help/return_fields_databases>`_

.. csv-table:: Supported Return Fields
:header-rows: 1
:file: _static/uniprot_return_fields.csv

Usage Example
-------------
Specify Return Fields with ID Mapping API
-----------------------------------------

To specify which fields to retrieve::
Specify which fields to retrieve in an ID mapping request::

from UniProtMapper import ProtMapper

mapper = ProtMapper()

# Get specific fields
fields = ["accession", "gene_names", "organism_name"]
result, failed = mapper.get(
["P30542"],
fields=["accession", "gene_names", "organism_name"]
)
fields=fields,
)

Specify Return Fields with UniProtKB API
----------------------------------------

Specify which fields to retrieve in a field-based query::

from UniProtMapper import ProtKB
from UniProtMapper.uniprotkb_fields import accession

protkb = ProtKB()

query = accession("P30542")
fields = ["accession", "gene_names", "organism_name"]
result, failed = protkb.get(
query,
fields=fields,
)
65 changes: 52 additions & 13 deletions docs/source/tutorials/field_querying.rst
Expand Up @@ -6,7 +6,7 @@ This tutorial demonstrates how to use UniProtMapper's field-based querying funct
Basic Field Queries
-------------------

Here's a simple example using boolean fields::
A simple example of querying UniProtKB through a field search::

from UniProtMapper import ProtKB
from UniProtMapper.uniprotkb_fields import reviewed, organism_name
Expand All @@ -17,45 +17,84 @@ Here's a simple example using boolean fields::
query = reviewed(True) & organism_name("human")
result, failed = protkb.get(query)

.. note::

Running this code will take some time as it retrieves all reviewed human proteins! Each iteration of the displayed progress bar represents 500 entries fetched from UniProtKB.

Complex Queries
---------------

You can combine multiple fields with boolean operators::
You can combine multiple fields with boolean operators, illustrated by the following examples:

Example 1::

from UniProtMapper import ProtKB
from UniProtMapper.uniprotkb_fields import (
organism_name,
length,
mass,
date_modified,
gene_exact,
xref_count,
)

protkb = ProtKB()

# Find human proteins:
# - modified since 2024
# - NOT modified since 2023-01-01 (in UniProtKB)
# - between 200-300 amino acids
# - mass < 50kDa
# - 5 or more deposited PDB structures
query = (
organism_name("human") &
date_modified("2024-01-01", "*") &
length(200, 300) &
mass("*", 50000) &
xref_count("pdb", 5, "*")
(~ date_modified("2023-01-01", "*"))
)
result = protkb.get(query)

Example 2::

from UniProtMapper import ProtKB
from UniProtMapper.uniprotkb_fields import (
xref_count,
organism_id,
reviewed,
fragment,
length,
)

protkb = ProtKB()

# Find human proteins:
# - with 2 or more deposited PDB structures
# - not fragments
# - reviewed
# - length < 750 amino acids
query = (
xref_count("pdb", 2, "*")
& organism_id(9606)
& reviewed(True)
& fragment(False)
& length("*", 750)
)
result = protkb.get(query)

.. note::

The ``fields`` parameter is also supported by the ``ProtKB`` API. For a full list of the supported fields, check the :ref:`supported_fields` section of the docs.

Field Types
-----------

UniProtMapper supports several types of fields:
UniProtMapper supports several types of fields. For full documentation on the fields implemented in the package, check :ref:`field_querying`.

See below for examples of different field types implemented in UniProtMapper.

Boolean Fields
~~~~~~~~~~~~~~
::

from UniProtMapper.uniprotkb_fields import reviewed, fragment, is_isoform

# Get reviewed entries that are not fragments
# Example: Get reviewed entries that are not fragments
query = reviewed(True) & ~fragment(True)

Range Fields
Expand All @@ -64,7 +103,7 @@ Range Fields

from UniProtMapper.uniprotkb_fields import length, mass

# Proteins between 200-300 amino acids
# Example: Proteins between 200-300 amino acids
query = length(200, 300)

Date Range Fields
Expand All @@ -73,7 +112,7 @@ Date Range Fields

from UniProtMapper.uniprotkb_fields import date_created, date_modified

# Entries created in 2023
# Example: Entries created in 2023
query = date_created("2023-01-01", "2023-12-31")

Text-Based Fields
Expand All @@ -82,5 +121,5 @@ Text-Based Fields

from UniProtMapper.uniprotkb_fields import gene_exact, keyword, family

# Proteins in kinase family with ATP-binding
# Example: Proteins in kinase family with ATP-binding
query = family("Kinase*") & keyword("ATP-binding")
23 changes: 21 additions & 2 deletions docs/source/tutorials/mapping.rst
Expand Up @@ -18,12 +18,31 @@ Here's a simple example of mapping UniProt accession IDs to Ensembl IDs::
to_db="Ensembl"
)

The result is a pandas DataFrame containing the mapped IDs, and failed is a list of IDs that couldn't be mapped.
The ``result`` is a `pandas.DataFrame` containing the query and mapped IDs (column names `From` and `To`, respectively), while ``failed`` is a list of IDs that couldn't be mapped.

Mapping Through Cross-Referenced Fields
---------------------------------------

Ensembl is also cross-referenced in UniProt entries. If you want to check all cross-referenced Ensembl IDs for a given UniProt entry, you can do so as follows::

from UniProtMapper import ProtMapper

mapper = ProtMapper()

fields = ["xref_ensembl"]
result, failed = mapper.get(
ids=["P30542", "Q16678", "Q02880"],
fields=fields,
)

.. note::

For a full list of the supported fields, check the :ref:`supported_fields` section of the docs. Here, result is again a `pandas.DataFrame` containing the query and mapped IDs (column names `From` and `Ensembl`, following the `label` column in the reference table).

Available Databases
-------------------

UniProtMapper supports mapping between numerous databases. You can view the complete list of supported databases in the mapping_dbs.json file or check UniProt's documentation.
UniProtMapper supports mapping between numerous databases. You can view the complete list of supported databases in ``ProtMapper()._supported_dbs`` or check UniProt's documentation.

Handling Failed Mappings
------------------------
Expand Down