NCI-GDC/esbuild

ESBuild

Note

The code in this repository has been made public as-is for informational purposes. The repository may use private resources for the building and execution of the code. For example, private registries may be used for dependency resolution.

The documentation may refer to restricted URLs.

Repository for building the GDC Elasticsearch indices.

Running

compare_indices.py

This script compares two indices, for example to see what changed between a new index and an old one. It generates an output file named counts_<index_name>_vs_<another_index_name>.json or compared_<index_name>_vs_<another_index_name>.json.

flags

--test-type

Choose between compare-counts and full-compare

  • compare-counts will show the count difference for indices.
  • full-compare will show detailed doc difference for indices.

--true-index and --test-index

The base names of the two Elasticsearch indices being tested. All gdc_from_graph subtypes (annotation, case, file, project) will be tested.

Usage

compare_indices.py --true-index dr33_active_merged --test-index dr33_active_merged_v2 --test-type compare-counts
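As a rough illustration of what a compare-counts style report contains, the sketch below diffs per-doc-type counts for two indices and derives an output file name in the format described above. The function names and the shape of the count dictionaries are assumptions made for illustration; they are not the script's actual internals.

```python
def diff_counts(true_counts, test_counts):
    """Return {doc_type: test - true} for every doc type whose counts differ."""
    diffs = {}
    for doc_type in sorted(set(true_counts) | set(test_counts)):
        # A doc type missing from one side is treated as a count of zero.
        delta = test_counts.get(doc_type, 0) - true_counts.get(doc_type, 0)
        if delta != 0:
            diffs[doc_type] = delta
    return diffs


def counts_output_name(true_index, test_index):
    """Build the counts_<index_name>_vs_<another_index_name>.json file name."""
    return f"counts_{true_index}_vs_{test_index}.json"


# Made-up counts, for illustration only.
true_counts = {"annotation": 40, "case": 100, "file": 2500, "project": 5}
test_counts = {"annotation": 38, "case": 100, "file": 2510, "project": 5}
diffs = diff_counts(true_counts, test_counts)
```

Only the doc types whose counts differ appear in the result, which keeps the report short when the two indices mostly agree.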

Architecture

Build and Upload Process

The JSON index is built in the following steps:

  1. Cache the graph from the database in memory
  2. Filter out nodes from the graph that should be excluded
  3. Cache commonly used information, e.g. case to file relationships
  4. Visit each case, producing a case doc and all annotation and file docs for that case
  5. Merge file docs together (to allow for multiple cases in a file doc)
  6. Produce project summary docs
  7. Validate produced docs

The index is uploaded to Elasticsearch in the following steps:

  1. Create a new index following the naming scheme
  2. Upload each doc type to index
  3. On success, update the alias to point to new index
  4. Cleanup/close old indexes
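The alias update in step 3 can be expressed as a single request body for Elasticsearch's `_aliases` endpoint, which applies its remove and add actions atomically. The sketch below only builds that body; the helper name and the index names are hypothetical, and how esbuild actually issues the request is not shown here.

```python
def build_alias_swap(alias, new_index, old_indices):
    """Build an _aliases request body that repoints `alias` at `new_index`."""
    # Detach the alias from every index it used to point at...
    actions = [{"remove": {"index": old, "alias": alias}} for old in old_indices]
    # ...and attach it to the freshly uploaded index in the same request.
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


# Hypothetical index and alias names, for illustration only.
body = build_alias_swap("example_case", "example_case_v2", ["example_case_v1"])
```

Because both actions travel in one request, readers of the alias never see a moment where it points at no index.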

Builders and Mappers

Esbuild consists of a builder and a mapper for each of the four indices it produces: project, annotation, file, and case.

  • The mapper produces the Elasticsearch mapping and contains basic traversals
  • The builder produces JSON documents using the mapper

Most of the business logic for building indices is contained in the graph.common.builder.GraphIndexBuilder class, which is inherited by graph.common.builder.ActiveGraphIndexBuilder to implement the actual build of the indices.

Builders exclude nodes that shouldn't be in the index (and therefore public) during the filtering step, based on the is_node_indexed() function.

Each builder has a set of class variables that alter build behavior, including but not limited to mapper, file_mapping, case_to_file_paths, unindexed_by_property, and hidden_properties. Those that must be overridden are listed in required_attrs.
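One way a required_attrs list can be enforced is sketched below. BaseBuilder and ExampleBuilder are hypothetical stand-ins written for this example, not the real GraphIndexBuilder, and the check shown is an assumption about how such a list might be used rather than a description of esbuild's code.

```python
class BaseBuilder:
    """Hypothetical stand-in for a builder base class."""

    # Names that every concrete builder must override.
    required_attrs = ("mapper", "file_mapping")

    mapper = None
    file_mapping = None
    hidden_properties = ()  # optional: the default is usable as-is

    def __init__(self):
        # Collect every required attribute still left at its None placeholder.
        missing = [
            name for name in self.required_attrs
            if getattr(self, name, None) is None
        ]
        if missing:
            raise NotImplementedError(
                f"{type(self).__name__} must override: {', '.join(missing)}"
            )


class ExampleBuilder(BaseBuilder):
    # Placeholder values; a real builder would supply a mapper object and mapping.
    mapper = object()
    file_mapping = {"properties": {}}
```

Instantiating ExampleBuilder succeeds, while instantiating BaseBuilder directly raises NotImplementedError naming the attributes that were not overridden.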

Each case document contains all of the files derived from it. Each file document in that case document contains a pruned version of the parent case document that keeps only the direct ancestors of the file, i.e. only those nodes found along a path from the file to the case. This re-nesting is done to allow additional filtering in Elasticsearch.
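The pruning idea can be sketched on a toy tree. The node shape used here (an "id" plus a "children" list) and the function name are assumptions chosen for illustration; the real documents are richer and the real traversal is driven by the mapper.

```python
def prune_to_target(node, target_id):
    """Return a copy of `node` keeping only branches that lead to `target_id`.

    Returns None when the target is not found anywhere below `node`.
    """
    if node["id"] == target_id:
        # Reached the target itself: nothing below it is needed.
        return {"id": node["id"], "children": []}
    kept = []
    for child in node.get("children", []):
        pruned = prune_to_target(child, target_id)
        if pruned is not None:
            kept.append(pruned)
    if not kept:
        return None
    return {"id": node["id"], "children": kept}


# Toy case tree: file_a sits under sample_1/aliquot_1, file_b under sample_2.
case_tree = {
    "id": "case",
    "children": [
        {"id": "sample_1", "children": [
            {"id": "aliquot_1", "children": [{"id": "file_a", "children": []}]},
        ]},
        {"id": "sample_2", "children": [{"id": "file_b", "children": []}]},
    ],
}
pruned_for_file_a = prune_to_target(case_tree, "file_a")
```

In this toy tree, the pruned result for file_a keeps sample_1 and aliquot_1 and drops sample_2 entirely, since sample_2 is not an ancestor of that file.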

Because file documents are produced by visiting a single case, the traversal from that case to each file will only contain one case. This means that file documents must be merged together in order to contain all the cases they are derived from. This is done as follows:

  1. For each file doc returned by denormalize_case() -> (case_docs, file_docs, annotation_docs):
  2. If there is an existing doc for this file and the newly produced doc contains a subtree for a case that is not in the old doc, append the case subtree from the new doc to that of the old doc.
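The merge described above can be sketched as follows. The field names ("file_id", "cases", "case_id") and the function name are assumptions made for this example and may not match the real document layout.

```python
def merge_file_doc(merged, new_doc):
    """Fold `new_doc` into `merged`, a dict of file docs keyed by file id."""
    file_id = new_doc["file_id"]
    existing = merged.get(file_id)
    if existing is None:
        # First time this file is seen: store a copy with its own cases list.
        merged[file_id] = {**new_doc, "cases": list(new_doc["cases"])}
        return
    known = {case["case_id"] for case in existing["cases"]}
    for case in new_doc["cases"]:
        # Append only case subtrees that the old doc does not already have.
        if case["case_id"] not in known:
            existing["cases"].append(case)
            known.add(case["case_id"])


merged = {}
# The same file produced while visiting two different cases.
merge_file_doc(merged, {"file_id": "f1", "cases": [{"case_id": "c1"}]})
merge_file_doc(merged, {"file_id": "f1", "cases": [{"case_id": "c2"}]})
# Visiting c1 again must not duplicate its subtree.
merge_file_doc(merged, {"file_id": "f1", "cases": [{"case_id": "c1"}]})
```

After these three calls, the single doc for f1 lists c1 and c2 once each.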

Mappers

The mappers (see the graph.common.mappings module) produce the mapping (or schema) for Elasticsearch. The common mapper is further extended by the active mapper, which produces the main mappings for the indices that the builder will build.

The mappers have four main functions to produce mappings: get_{project,case,annotation,file}_es_mapping.

The properties of each Entity (Node class) are dynamically added to the mapping based on the GDC Dictionary.

The traversal tree in the active mappings is dynamic and based on the graph structure.
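To give a feel for deriving mapping properties from a dictionary schema, the sketch below converts JSON-schema-style property definitions into an Elasticsearch "properties" block. The type table, the fallback to keyword, and the function name are assumptions for illustration; the real mappers may choose different field types and handle many more cases.

```python
# Assumed translation from schema types to Elasticsearch field types.
SCHEMA_TYPE_TO_ES = {
    "string": "keyword",
    "integer": "long",
    "number": "double",
    "boolean": "boolean",
}


def properties_to_mapping(schema_properties):
    """Build an Elasticsearch 'properties' block from schema property specs."""
    fields = {}
    for name, spec in sorted(schema_properties.items()):
        # Unknown or missing types fall back to keyword in this sketch.
        es_type = SCHEMA_TYPE_TO_ES.get(spec.get("type"), "keyword")
        fields[name] = {"type": es_type}
    return {"properties": fields}


# Made-up property definitions, for illustration only.
example_schema = {
    "submitter_id": {"type": "string"},
    "days_to_index": {"type": "integer"},
    "is_example": {"type": "boolean"},
}
example_mapping = properties_to_mapping(example_schema)
```

Generating the block from the schema means a property added to the dictionary shows up in the mapping without a hand-written entry.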

Troubleshooting

The active module contains an ActiveMimic builder which can be used to test functionality against nodes, e.g. calling builder.is_node_indexed(node) to troubleshoot nodes that are not showing up in the index.

>>> from esbuild.graph.active.mimic import ActiveMimic
>>> from gdcdatamodel.models import Case
>>> mimic = ActiveMimic(None)
>>> mimic.is_node_indexed(Case())
[...][graph_index][   INFO] not indexed (unsubmitted state: <Case(None)>): None
False

Installation

Before continuing you must have the following programs installed:

Pip

Project dependencies are managed using pip. You can install dependencies via

pip install -r requirements.txt

And optionally any dev requirements via

pip install '.[dev]'

Project Dependencies

If you are building to a local Elasticsearch installation, you will need to install Elasticsearch manually. On macOS you can install it via brew install elasticsearch.

Development

Tests

Use tox to run tests:

pip install tox
tox

If graphviz is installed, the test suite data is visualized in a PDF whenever the tests are run.

Parallel testing

The tests can be run in parallel using indexd_test_utils2, pytest-postgresql, pytest-elasticsearch and pytest-xdist. To run the tests in parallel, use the following command:

USE_RUNNING_ES=false USE_RUNNING_PG=false pytest tests -n auto

The PostgreSQL binaries must be on your PATH so that pg_config is available to pytest-postgresql.

If Elasticsearch is not installed in the default location, or you are using OpenSearch, set the following environment variable:

ES_EXECUTABLE=/opt/homebrew/opt/opensearch/bin/opensearch

On macOS, you also need to set:

OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

Set up pre-commit hook to check for secrets

We use pre-commit to set up pre-commit hooks for this repo, and detect-secrets to search for secrets being committed into the repo.

To install the pre-commit hook, run

pre-commit install

To update the .secrets.baseline file run

detect-secrets scan --update .secrets.baseline

.secrets.baseline contains all the strings that were caught by detect-secrets but are not stored in plain text. Audit the baseline to view the secrets:

detect-secrets audit .secrets.baseline

Production

This library was written with the intention of deploying via SaltStack and tungsten.

Contributing

Read how to contribute here.
