Note
The code in this repository has been made public as-is for informational purposes. The repository may use private resources for the building and execution of the code. For example, private registries may be used for dependency resolution.
The documentation may refer to restricted URLs.
Repository for building the GDC Elasticsearch indices.
This script compares two indices, e.g. to see the changes between new and old indices. It generates an output file named `counts_<index_name>_vs_<another_index_name>.json` or `compared_<index_name>_vs_<another_index_name>.json`.
Choose between `compare-counts` and `full-compare`:
- `compare-counts` shows the count difference between the indices.
- `full-compare` shows the detailed document differences between the indices.
The base names of the two Elasticsearch indices being compared. All `gdc_from_graph` subtypes (annotation, case, file, project) will be tested.
compare_indices.py --true-index dr33_active_merged --test_index dr33_active_merged_v2 --test-type compare_counts
The JSON index is built in the following steps:
- Cache the graph from the database in memory
- Filter out nodes from the graph that should be excluded
- Cache commonly used information, e.g. case to file relationships
- Visit each case, producing a case doc and all annotation and file docs for that case
- Merge file docs together (to allow for multiple cases in a file doc)
- Produce project summary docs
- Validate produced docs
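The build steps above can be illustrated with a toy in-memory graph. All names and data shapes here are hypothetical; the real logic lives in `graph.common.builder.GraphIndexBuilder`.

```python
def is_node_indexed(node):
    # Filtering step: exclude nodes that should not be public.
    # The real predicate is more involved; this is a stand-in.
    return node.get("state") == "submitted"


def build_docs(graph):
    # Cache the graph in memory and filter out excluded nodes.
    nodes = [n for n in graph if is_node_indexed(n)]

    # Cache commonly used information, e.g. case -> file relationships.
    case_to_files = {}
    for n in nodes:
        if n["type"] == "file":
            case_to_files.setdefault(n["case_id"], []).append(n)

    # Visit each case, producing a case doc and all file docs for that case.
    case_docs, file_docs = [], []
    for case in (n for n in nodes if n["type"] == "case"):
        files = case_to_files.get(case["id"], [])
        case_docs.append({**case, "files": files})
        file_docs.extend({**f, "cases": [case]} for f in files)

    return case_docs, file_docs


graph = [
    {"type": "case", "id": "c1", "state": "submitted"},
    {"type": "file", "id": "f1", "case_id": "c1", "state": "submitted"},
    {"type": "file", "id": "f2", "case_id": "c1", "state": "redacted"},
]
case_docs, file_docs = build_docs(graph)
```

In this sketch the redacted file `f2` is dropped during filtering, so only `f1` appears in the resulting case and file docs.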
The index is uploaded to elasticsearch in the following steps:
- Create a new index following the naming scheme
- Upload each doc type to the new index
- On success, update the alias to point to the new index
- Cleanup/close old indexes
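The alias swap in the upload steps can be sketched as a pure data payload, assuming the `elasticsearch-py` client. The index and alias names below are illustrative, not the repository's actual naming scheme.

```python
def alias_swap_actions(alias, old_index, new_index):
    # Remove and add in one update_aliases call so the alias is
    # repointed atomically on the Elasticsearch side.
    return {
        "actions": [
            {"remove": {"alias": alias, "index": old_index}},
            {"add": {"alias": alias, "index": new_index}},
        ]
    }


# With a live client, the upload steps would look roughly like:
#   es.indices.create(index=new_index, body=mapping)
#   helpers.bulk(es, doc_actions)   # upload each doc type
#   es.indices.update_aliases(body=alias_swap_actions(alias, old, new))
#   es.indices.close(index=old_index)   # cleanup/close old indexes
```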
Esbuild consists of a builder and a mapper for the four indices it produces: project, annotation, file, and case.
- The `mapper` produces the Elasticsearch mapping and contains basic traversals.
- The `builder` produces JSON documents using the `mapper`.
Most of the business logic for building indices is contained in the `graph.common.builder.GraphIndexBuilder` class, which is inherited by `graph.common.builder.ActiveGraphIndexBuilder` to implement the actual build of the indices.
Builders exclude nodes that shouldn't be in the index (and therefore public) during the filtering step, via the `is_node_indexed()` function.
Each builder has a set of class variables that alter build behavior, including but not limited to `mapper`, `file_mapping`, `case_to_file_paths`, `unindexed_by_property`, and `hidden_properties`. Those that must be overloaded are listed in `required_attrs`.
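The repository's actual enforcement of `required_attrs` isn't shown here; the following is a hypothetical sketch of how such a check can work, with stand-in attribute values.

```python
class GraphIndexBuilder:
    # Class variables that subclasses must define before instantiation.
    required_attrs = ("mapper", "file_mapping")

    def __init__(self):
        missing = [a for a in self.required_attrs if not hasattr(self, a)]
        if missing:
            raise NotImplementedError(f"subclass must define: {missing}")


class ActiveGraphIndexBuilder(GraphIndexBuilder):
    mapper = object()   # stand-in for the real mapper
    file_mapping = {}   # stand-in for the real file mapping
```

Instantiating the base class directly raises `NotImplementedError`, while the subclass that defines both attributes constructs cleanly.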
Each case document contains all of the files derived from it. Each file document in that case document contains a pruned version of the parent case document, keeping only the direct ancestors of the file, i.e. only those nodes found along all paths from the file to the case. This re-nesting allows additional filtering in Elasticsearch.
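A toy sketch of this re-nesting: each file doc embeds a pruned copy of its case that keeps only the ancestors on the file's own path (e.g. the sample chain it came from). The field names here are invented for illustration.

```python
def prune_case_for_file(case_doc, file_ancestor_ids):
    # Copy the case doc, replacing its children with only those
    # nodes that lie on a path from the file up to the case.
    pruned = {k: v for k, v in case_doc.items() if k != "samples"}
    pruned["samples"] = [
        s for s in case_doc.get("samples", []) if s["id"] in file_ancestor_ids
    ]
    return pruned


case = {"id": "c1", "samples": [{"id": "s1"}, {"id": "s2"}]}
# The file derives only from sample s1, so s2 is pruned away.
file_doc = {"id": "f1", "case": prune_case_for_file(case, {"s1"})}
```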
Because file documents are produced by visiting a single case, the traversal from that case to each file will only contain one case. This means that file documents must be merged together in order to contain all the cases they are derived from. This is done as follows:
- For each file doc from `denormalize_case() -> (case_docs, file_docs, annotation_docs)`:
- If there is an existing doc for this file and the newly produced document contains a subtree for a case that is not in the old doc, append the case subtree from the new doc to that of the old doc.
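The merge described above can be sketched as follows; the function and field names are hypothetical, but the logic mirrors the description: docs for the same file produced from different case visits are combined so the final doc lists every case it derives from.

```python
def merge_file_docs(existing, new_docs):
    # existing: {file_id: merged file doc so far}
    for doc in new_docs:
        old = existing.get(doc["id"])
        if old is None:
            existing[doc["id"]] = doc
            continue
        # Append only the case subtrees the old doc does not have yet.
        seen = {c["id"] for c in old["cases"]}
        old["cases"].extend(c for c in doc["cases"] if c["id"] not in seen)
    return existing


merged = merge_file_docs(
    {"f1": {"id": "f1", "cases": [{"id": "c1"}]}},
    [{"id": "f1", "cases": [{"id": "c2"}]}],
)
```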
The mappers (see the `graph.common.mappings` module) produce the mapping (or schema) for Elasticsearch. The common mappers are further extended in the active mapper, which produces the main mappings for the indices built in the builder.
The mappers' main functions for producing mappings are `get_{project,case,annotation,file}_es_mapping`.
The properties of each Entity (Node class) are dynamically added to the mapping based on the GDC Dictionary.
The traversal tree in the active mappings is dynamic and based on the graph structure.
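For illustration, a `get_case_es_mapping`-style function returns a nested mapping dict of roughly this shape. The fields here are invented; in the real mappers the properties come from the GDC Dictionary and the traversal tree from the graph structure.

```python
def get_case_es_mapping():
    # Illustrative Elasticsearch mapping for a case index.
    return {
        "properties": {
            "case_id": {"type": "keyword"},
            "primary_site": {"type": "keyword"},
            # Nested subtree produced by traversing the graph
            # from case to its files.
            "files": {
                "type": "nested",
                "properties": {"file_id": {"type": "keyword"}},
            },
        }
    }
```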
The active module contains an `ActiveMimic` builder which can be used to test functionality against nodes, e.g. calling `builder.is_node_indexed(node)` to troubleshoot nodes that are not showing up in the index.
>>> from esbuild.graph.active.mimic import ActiveMimic
>>> from gdcdatamodel.models import Case
>>> mimic = ActiveMimic(None)
>>> mimic.is_node_indexed(Case())
[...][graph_index][ INFO] not indexed (unsubmitted state: <Case(None)>): None
False

Before continuing, you must have the following programs installed:
Project dependencies are managed using pip. You can install dependencies via `pip install -r requirements.txt`, and optionally any dev requirements via `pip install '.[dev]'`.
If you are building against a local Elasticsearch installation, you will need to install it manually. On macOS you can install Elasticsearch via `brew install elasticsearch`.
Use tox to run tests:
pip install tox
tox
If you have graphviz installed, the test suite data is visualized in a PDF whenever the tests are run.
The tests can now be run in parallel, using `indexd_test_utils2`, `pytest-postgresql`, `pytest-elasticsearch`, and `pytest-xdist`.
To run the tests in parallel, use the following command:
USE_RUNNING_ES=false USE_RUNNING_PG=false pytest tests -n auto
You must have the Postgres binaries on your PATH so that `pg_config` is available for pytest-postgresql.
If your Elasticsearch is not installed in the default location, or you are using OpenSearch, set the following environment variable:
ES_EXECUTABLE=/opt/homebrew/opt/opensearch/bin/opensearch
If you are using a Mac, you also need to set:
OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
We use pre-commit to set up pre-commit hooks for this repo, and detect-secrets to search for secrets being committed into the repo.
To install the pre-commit hook, run `pre-commit install`.
To update the `.secrets.baseline` file, run `detect-secrets scan --update .secrets.baseline`. `.secrets.baseline` contains all the strings that were caught by detect-secrets but are not stored in plain text. Audit the baseline to view the secrets:
detect-secrets audit .secrets.baseline
This library was written with the intention of deploying via SaltStack and tungsten.
Read how to contribute here.