A zero-dependency set of Bash scripts to interact with Datashare's index and queue.
Setup | Scripts | Test | Cookbook
Running these scripts only requires access to the ElasticSearch URL, which must be stored in an
environment variable called ELASTICSEARCH_URL. The same logic applies to REDIS_URL. To avoid setting
these variables every time you use the scripts, you can store them in a .env file at the root of this directory:
ELASTICSEARCH_URL=http://localhost:9200
REDIS_URL=redis://redis

Here are the main scripts available in this repository:
.
├── elasticsearch
│   │
│   ├── document
│   │   ├── agg
│   │   │   ├── avg.sh                # Average of a field's values
│   │   │   ├── count.sh              # Count of non-null field values
│   │   │   ├── max.sh                # Maximum value of a field
│   │   │   ├── min.sh                # Minimum value of a field
│   │   │   └── sum.sh                # Sum of a field's values
│   │   ├── count.sh                  # Count documents under a given path
│   │   ├── delete.sh                 # Delete documents under a given path
│   │   ├── move.sh                   # Move documents from a directory to another
│   │   └── reindex.sh                # Reindex documents from a given index and under a specific directory
│   │
│   ├── duplicate
│   │   ├── count.sh                  # Count duplicates
│   │   └── reindex.sh                # Reindex duplicates from a given index
│   │
│   ├── index
│   │   ├── clone.sh                  # Clone a given index into another
│   │   ├── create.sh                 # Create an index using default Datashare settings
│   │   ├── delete.sh                 # Delete an index
│   │   ├── list.sh                   # Get all indices
│   │   ├── number_of_replicas.sh     # Get or change number of replicas for a given index
│   │   ├── refresh.sh                # Refresh a given index
│   │   ├── refresh_interval.sh       # Get or change refresh interval for a given index
│   │   ├── reindex.sh                # Reindex everything from a given index
│   │   ├── replace.sh                # Replace an index by another one
│   │   └── safe_reindex.sh           # Safely reindex an index with backup and verification
│   │
│   ├── named_entity
│   │   ├── count.sh                  # Count named entities
│   │   └── reindex.sh                # Reindex named entities from a given index
│   │
│   └── task
│       ├── cancel.sh                 # Cancel a given task
│       ├── get.sh                    # Get a given task status
│       ├── list.sh                   # Get all tasks
│       └── watch.sh                  # Watch a given task status
│
├── redis
│   │
│   ├── queue
│   │   └── rpush.sh                  # Insert stdin rows to a given queue
│   │
│   └── report
│       ├── hdel.sh                   # Remove stdin rows from a given report map
│       └── hset.sh                   # Insert stdin rows to a given report map
│
└── lib
    ├── cli.sh                        # Main CLI library (sources all other libs)
    ├── colors.sh                     # ANSI color definitions
    ├── format.sh                     # Text formatting (truncate, duration, status, draw_line)
    ├── logging.sh                    # Logging functions (spinners, titles, log levels)
    ├── progress.sh                   # Progress bar functions
    ├── prompt.sh                     # User prompt functions
    ├── table.sh                      # Table formatting (header, row)
    └── sync.sh                       # Sync this directory with another location with rsync

Developers can run tests using bats:
export ELASTICSEARCH_URL=http://localhost:9200 # Change this with the URL of ElasticSearch
make tests

This cookbook lists real-life examples of how to use those scripts.
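
Since every recipe depends on ELASTICSEARCH_URL or REDIS_URL being set, a small pre-flight check can save a confusing failure halfway through. This is a minimal sketch, not something shipped with these scripts: the helper name require_env is hypothetical, and it only checks that the variables are non-empty, not that the services are reachable.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of this repository): return non-zero if any of
# the named environment variables is unset or empty, and report each one.
require_env() {
  local name missing=0
  for name in "$@"; do
    # ${!name} is Bash indirect expansion: the value of the variable whose
    # name is stored in $name. The ":-" default keeps "set -u" from aborting.
    if [ -z "${!name:-}" ]; then
      echo "Missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For instance, `require_env ELASTICSEARCH_URL REDIS_URL || exit 1` at the top of a wrapper script would stop before any recipe step runs.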
An example showing how to copy documents from the kimchi index to the miso index while updating their path.

1. Create a clone of the "miso" index to avoid messing with the original data:
./elasticsearch/index/clone.sh miso miso-tmp

2. Reindex documents from kimchi under the folder /disk/kimchi/tofu onto miso-tmp:
./elasticsearch/index/reindex.sh kimchi miso-tmp /disk/kimchi/tofu

3. While the reindex is running, watch its progress using the task ID from the last command:
./elasticsearch/task/watch.sh 8UnTR-67T8y0idkyndf77Q:36041259

4. The documents moved to miso-tmp still use the wrong path, so we update it as well:
./elasticsearch/document/move.sh miso-tmp /disk/kimchi/tofu /disk/miso/tofu

5. Finally, after checking everything is fine, we substitute the miso index by miso-tmp:
./elasticsearch/index/replace.sh miso-tmp miso

This operation might be useful if the mapping or settings of the index changed.
1. Create an empty ricecake-tmp index:
./elasticsearch/index/create.sh ricecake-tmp

1'. Alternatively, you can create an empty ricecake-tmp index with the mappings/settings of the desired version:
./elasticsearch/index/create.sh ricecake-tmp 17.1.1

2. Reindex all documents (under the "/" path) from ricecake to ricecake-tmp:
./elasticsearch/document/reindex.sh ricecake ricecake-tmp /

3. Replace the old ricecake by the new one:
./elasticsearch/index/replace.sh ricecake-tmp ricecake

This will get files from find and store them in the extract:queue list:
find /home/foo/bar -type f | ./redis/queue/rpush.sh extract:queue

Or, to filter that list with a filtered.txt file:
find ~+ -type f | grep -vFf filtered.txt | ./redis/queue/rpush.sh extract:queue

This can also be done with a single file:
echo "/file/to/index.pdf" | ./redis/queue/rpush.sh extract:report

Report maps are used to store errors and skip already indexed files.
find /home/foo/bar -type f | ./redis/report/hset.sh extract:report

Removing files from the report map can be useful to force a reindex of certain files:
cat to-reindex.txt | ./redis/report/hdel.sh extract:report

This can also be done with a single file:
echo "/file/to/reindex.pdf" | ./redis/report/hdel.sh extract:report
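
The exclusion step used when queueing files (grep -vFf filtered.txt) can be previewed on its own before anything is pushed to Redis. The following is a minimal sketch under that assumption: the helper name filter_paths is hypothetical and not part of this repository, and it only wraps the same grep flags shown above.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of this repository): read paths from stdin and
# print only those that contain none of the fixed strings listed, one per
# line, in the exclusion file given as first argument.
filter_paths() {
  local exclude_file="$1"
  # -v inverts the match, -F treats each pattern as a fixed string,
  # -f reads the patterns from a file.
  # grep exits with status 1 when it selects no line at all; that is not an
  # error here, so only statuses above 1 are propagated as failures.
  grep -vFf "$exclude_file" || [ "$?" -eq 1 ]
}
```

Running `find /home/foo/bar -type f | filter_paths filtered.txt` prints the paths that would be queued; once the list looks right, the same pipeline can be piped into ./redis/queue/rpush.sh extract:queue.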