> 🔥 This application is created and maintained for internal LYRASIS staff use only, and many of its design decisions are based on that context. This means neither CollectionSpace (the organization) nor LYRASIS offers support for this application. However, we have made this code available in the spirit of open source and transparency, in case any of it might be informative for CS institutions/users who wish to build their own tooling for working with CS data at scale.
See `doc/decisions.adoc` for more info/background on some of the decisions made.
- The Ruby version indicated in the `.ruby-version` file
- Docker and docker-compose
  - Running the required docker-compose command will by default set up two instances of Redis on ports 6380 and 6381. The port numbers used are configurable as described below.
- Set up the GitHub CLI as detailed in the tech setup instructions in our team technical documentation, *and be authenticated via the CLI* as instructed there.
- Set up AWS credentials and CLI and pass the setup check as detailed in the tech setup instructions in our team technical documentation.
- Set up bastion access as detailed in the tech setup instructions in our team technical documentation.
- To run batch ingests from CSV data:
  - There must be a Fast Import Bucket set up for the site you are ingesting into. The command to list existing Fast Import Buckets can be found in our tech docs.
- To avoid having to prepend `thor`, `rspec`, and other commands with `bundle exec`, add this repo's `./bin` to your `PATH`. If you skip this step and a command you see in the documentation fails, try prepending `bundle exec ` to the command.
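In bash or zsh, that might look like the line below, added to your shell config (the path shown is an example; use wherever you actually cloned the repo):

```shell
# Put this repo's ./bin ahead of the rest of PATH so thor, rspec, etc.
# run without `bundle exec`. Adjust the path to your actual clone location.
export PATH="$HOME/code/collectionspace_migration_tools/bin:$PATH"
```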
This should "just work" without you having to do anything, but you might want to change it if the ports being used for Redis conflict with something you use for other work.

If you want to change the Redis ports, you need to update them in two places:

- the `./docker-compose.yml` file (which builds the Redis instances and makes them locally accessible via the given ports)
- the `./redis.yml` file (which tells the application which port/Redis instance to use for storing RefNames vs. CSIDs)

Nothing in `./redis.yml` is sensitive, as it's all just on your local machine.
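As a rough sketch of where the port numbers live, a `docker-compose.yml` port mapping typically looks like the excerpt below. This is hypothetical — the real service names and keys are in the actual `./docker-compose.yml` in this repo and may differ:

```yaml
# Hypothetical docker-compose.yml excerpt. The host-side port
# (left of the colon) is the number you would change; the
# container-side 6379 is Redis's standard internal port.
services:
  redis_refnames:
    image: redis
    ports:
      - "6380:6379"
  redis_csids:
    image: redis
    ports:
      - "6381:6379"
```

A matching change would then go in `./redis.yml` so the application connects to the new port.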
Clone the repository from https://github.com/collectionspace/cspace-config-untangler.
You will need the path to the local copy of this repository for setting up your system config in the next step.
Use the commented `sample_system_config.yml` as an example or starting point.

You have three options for where to put your system config file. These will be checked in the following order, and the first one found will be used:

- Custom filename and location indicated in the `COLLECTIONSPACE_MIGRATION_TOOLS_SYSTEM_CONFIG` environment variable. You can set this environment variable per-session or permanently in your shell/terminal config. Recommended only for development and testing of this code.
- `~/.config/collectionspace_migration_tools/system_config.yml`. This is the recommended location if you actively use this code to perform migrations work. Kristina is working to convert all tooling to store personal config in `~/.config/tool_name` because it's easier to back up and migrate your config that way.
- `system_config.yml` file in the base directory of this repository. Not recommended for any purpose except initial testing during first setup.
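For the first option, setting the environment variable per-session might look like this (the path shown is a made-up example):

```shell
# Point CMT at a custom system config location for this shell session only.
# The path is illustrative; use wherever you actually keep the file.
export COLLECTIONSPACE_MIGRATION_TOOLS_SYSTEM_CONFIG="$HOME/cmt/system_config.yml"
```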
The settings are explained in comments in `sample_system_config.yml`. The following section(s) provide additional info that is easier to format here than in the config file comments.
Where possible, CMT leverages parallel processing to speed up our work.
The parallel gem is used to handle parallel processing.
Threads are generally used where completion of a job is slowed down by something that does not require your computer’s processing power, such as uploading files or getting a response from an API.
Processes are generally used where the completion of a job is slowed down by processing work your computer is doing, such as converting a CSV row to XML.
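The `parallel` gem exposes `in_threads:` and `in_processes:` options for exactly this distinction. As a standard-library-only sketch of why threads help IO-bound work (the names below are illustrative, not CMT code):

```ruby
# Stdlib-only illustration of the thread/process distinction above.
# (CMT uses the `parallel` gem, whose Parallel.map accepts `in_threads:`
# for IO-bound jobs and `in_processes:` for CPU-bound jobs.)

# IO-bound work: the thread mostly waits (network, disk), so Ruby
# releases the GVL and several threads make real progress at once.
def fake_upload(id)
  sleep 0.05 # stands in for network latency
  "uploaded-#{id}"
end

ids = (1..8).to_a

serial_start = Time.now
serial = ids.map { |id| fake_upload(id) }
serial_time = Time.now - serial_start

threaded_start = Time.now
threaded = ids.map { |id| Thread.new { fake_upload(id) } }.map(&:value)
threaded_time = Time.now - threaded_start

# threaded_time is far below serial_time because the sleeps overlap
```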
> 💡 You can find what uses threads vs. processes by searching this codebase for `CMT.config.system.max_threads` and `CMT.config.system.max_processes`.
The use/purpose of reading CSVs in chunks is explained in Faster Parsing CSV With Parallel Processing. Each chunk is sent to a parallel worker for processing. A chunk with more rows will take longer to process, but will require fewer threads/processes to complete the entire job. I have not investigated the tradeoff between queueing up/passing on more chunks vs. larger chunks.
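As a minimal stdlib-only sketch of the chunk-per-worker idea (chunk size and names here are illustrative, not CMT's actual code):

```ruby
require "csv"

CHUNK_SIZE = 2 # illustrative; a bigger chunk means fewer, longer-running workers

data = "id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n"
rows = CSV.parse(data, headers: true)

# Split the parsed rows into chunks and hand each chunk to its own
# worker thread; each worker converts its rows (here, to toy XML).
xml_rows = rows.each_slice(CHUNK_SIZE).map { |chunk|
  Thread.new { chunk.map { |row| %(<row id="#{row["id"]}"/>) } }
}.flat_map(&:value)
```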
> ℹ️ The default settings seem to work OK for not-gigantic migration projects on my DTS-issued MacBook Pro, but I have not yet done much testing to figure out optimal values. If things are running super slowly, try upping `max_threads`/`max_processes`. If your system is too strained, lower them. I confess I'm not entirely sure whether thread vs. process makes a difference in terms of system resource usage, but it seemed like a good idea to separate them in case this mattered.
> 💡 If you are doing breakpoint-based debugging on any parallelized code, set both `max_threads` and `max_processes` to 1.
You will create per-instance client/site `.yml` configuration files in the directory you set as the `client_config_dir` setting in your system config file. See the client config management documentation for more details.
Once you have done the one-time config and set up at least one instance config, you can verify that your AWS access works by doing the following in this repo's base directory.

First, `cd` into the base directory of this application, e.g. `~/code/collectionspace_migration_tools` or whatever you named the folder you cloned this repo into.

The following assumes the instance/client config file you created in your `client_config_dir` is named `myclient.yml`:

`thor config switch myclient`

If you get an error for that command, it most likely indicates some problem with your config file(s) that needs to be addressed.

- If the error starts with `CollectionspaceMigrationTools::Config::System validation error(s)`, the problem is in your system config.
- If the error starts with `CollectionspaceMigrationTools::Config::Client validation error(s)`, the problem is in your client/instance config.

Once you are able to switch to your config, try:

`bin/console`

`CMT::Build::S3Client.call`

If you get `Success(#<Aws::S3::Client>)`, good. If you get a `Failure`, something is not right.
See the following locations for more information, depending on whether your System or Client config is getting an error:
- Comments in `sample_system_config.yml`
- Comments in `sample_client_config.yml`
- If you can't figure it out, DM Kristina for assistance
1. Ensure desired config is in place (see the One-time setup and Per-instance setup sections above)
2. `cd` into the repository root
3. `docker-compose up -d` (starts the Redis instances; the `-d` puts docker-compose into the background, so you can use the terminal for other things)
4. `thor list` (to see available commands)
5. Run available commands as necessary.
> ❗ Most of the commands for routine workflow usage are under `thor batch` and `thor batches`. See the workflow overview documentation for details.
`docker-compose down` (stops and closes the Redis containers. The Redis volumes are NOT removed, so your cached data should still be available next time you run `docker-compose up -d`.)
You can also use the IRB console for direct access to all objects:

`bin/console`

> 💡 If you make changes to code while you are in the console, running `CMT.reload!` will reload the application without you needing to exit and restart the console. This doesn't always work to pick up all changes, but it saves a lot of time anyway.
To test, run:

`rspec`

At least initially, a lot of the functionality around database connections, querying, and anything that relies on a database call is not covered in automated tests. This is mainly because I did not have time to figure out how to test that stuff in a meaningful way without exposing data that needs to be kept private.
- Built by Kristina Spurgin, with design/infrastructure input from Mark Cooper
- Project scaffold built with Rubysmith.