Provided Software
This page describes how to use the two scripts generate_data.py and evaluate.py provided in this repository. It does not describe any auxiliary code or the details of the implementation. If you encounter any problems, find any bugs, or need help, please contact us by email at mlgwsc@aei.mpg.de or via Slack. For more details see our support page.
To run the code in this repository, a working installation of Python 3.7 or higher and a sufficiently recent version of PyCBC are required. If you need to install Python 3.7, make sure to also install the matching Python development libraries. On Ubuntu the commands would be
sudo apt-get install python3.7
sudo apt-get install python3.7-dev
We recommend using a virtual environment for this mock data challenge. To create and activate one, you can use virtualenv
virtualenv -p python3.7 <env-name>
source <env-name>/bin/activate
To install an appropriate version of PyCBC, install the requirements after cloning the repository:
pip install --upgrade pip setuptools
pip install -r requirements.txt
To download the code you can simply clone this repository to a suitable location on your machine by executing the command below in the desired directory.
git clone https://github.com/gwastro/ml-mock-data-challenge-1.git
The script generate_data.py contains the code to generate mock data for testing. It accepts multiple options. An example call specifying the most common options would be
./generate_data.py \
--dataset 1 \
--output-injection-file injections.hdf \
--output-foreground-file foreground.hdf \
--output-background-file background.hdf \
--seed 42 \
--start-offset 0 \
--duration 32000 \
--verbose
- The --dataset option specifies how the noise is generated and which injections are made. For details please refer to this page.
- All options prefixed by output specify where files generated by the code will be stored. The --output-injection-file contains the parameters of the injected signals. --output-background-file contains the pure detector noise. --output-foreground-file contains the same noise with signals injected into it. For details on the structure of the foreground and background files please refer to this page.
- The --seed is used to make the noise and signal generation reproducible. Two calls to the script with the same --dataset, --seed, --start-offset, and --duration will yield identical results. If the seed is not specified it will default to 0 and not be random! To use a random seed on each invocation of the program, use a negative number as seed.
- The --start-offset specifies at which time to start generating noise. It must be greater than or equal to zero. All noise starts to be generated at the same reference time. You may want to alter this value if you want to produce a large amount of noise on multiple machines in parallel. The --start-offset for the second call to the script would in that case be the value given to the --duration of the first call. In other words, this option tells the code how much data to skip at the beginning and where to start generating.
- The --duration specifies how much data is generated (in seconds). Note that in total only 7111579 seconds of data are available and for technical reasons no more than 7024699 should be requested. We recommend staying well below these limits.
- The option --verbose prints status updates to the screen.
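The parallel use of --start-offset and --duration described above can be sketched in Python. This is only an illustration, not part of the repository: the helpers plan_chunks and build_command and the per-part file names are hypothetical, and the check against 7024699 seconds assumes that the limit applies to the total amount of data requested across all parts. Each part starts where the previous one ends, so the --start-offset of a part equals the sum of the durations before it.

```python
import shlex

# Upper limit on the requested duration quoted in the option list (seconds).
MAX_DURATION = 7024699


def plan_chunks(total_duration, n_chunks):
    """Split [0, total_duration) into contiguous (start_offset, duration) pairs."""
    if total_duration <= 0 or n_chunks <= 0:
        raise ValueError("total_duration and n_chunks must be positive")
    if total_duration > MAX_DURATION:
        raise ValueError("requested more data than recommended")
    base, rest = divmod(total_duration, n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # Spread the remainder over the first `rest` chunks.
        duration = base + (1 if i < rest else 0)
        if duration == 0:
            continue
        chunks.append((start, duration))
        start += duration
    return chunks


def build_command(index, start, duration, dataset=1, seed=42):
    """Build one generate_data.py call (file names are hypothetical)."""
    args = [
        "./generate_data.py",
        "--dataset", str(dataset),
        "--output-injection-file", "injections_{}.hdf".format(index),
        "--output-foreground-file", "foreground_{}.hdf".format(index),
        "--output-background-file", "background_{}.hdf".format(index),
        "--seed", str(seed),
        "--start-offset", str(start),
        "--duration", str(duration),
    ]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Print one command per machine for 32000 s split into 4 parts.
    for i, (start, duration) in enumerate(plan_chunks(32000, 4)):
        print(build_command(i, start, duration))
```

The printed commands can then be run on separate machines. Whether the injections produced this way match those of a single long run is not stated on this page, so treat the per-part injection files as belonging to their own part.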
Additionally, you may want to only generate injections once and use them for multiple data sets. In this case you can omit the option --output-injection-file and instead set the option --injection-file. Pass the path to the injection file you want to use. The --injection-file is expected to be of the format output by --output-injection-file.
For further options and a description of them please refer to
./generate_data.py -h
Note that the code will download a file called segments.csv. This file contains information on which GPS times to use for data generation. Irrespective of the data set specified by --dataset data will be generated in these segments. ATTENTION! If you specify --dataset 4 the code will start to download a large (~94 GB) file containing real noise downsampled to 2 kHz. You can interrupt this download at any time and the function will pick up where it left off. However, the code is not able to generate any data for data set 4 before this file is downloaded completely. You can also download the file directly via
python -c "from generate_data import download_data; download_data()"
or from the URL https://www.atlas.aei.uni-hannover.de/work/marlin.schaefer/MDC/real_noise_file.hdf.
For more control over the data generating process the functions from the script can be called directly. We consider this advanced usage and do not document it beyond the comments in the code.
The script evaluate.py computes the false-alarm rate (FAR) as well as the sensitivity of the search algorithm. As input it requires the file containing the injections, the file containing the foreground input data, as well as the event files returned by the search algorithm applied to the foreground and background data. It writes an HDF5 file containing many different datasets. The most important of these are labeled far and sensitive-distance. They have the same length, and each value of sensitive-distance corresponds to the far value at the same index. To plot them, they have to be sorted by the far values. An example call to the script would be
./evaluate.py \
--injection-file injections.hdf \
--foreground-events <path to output of algorithm on foreground data> \
--foreground-files foreground.hdf \
--background-events <path to output of algorithm on background data> \
--output-file eval-output.hdf \
--verbose
The options mean the following:
- The option --injection-file specifies the injections that were used to create the foreground data. It corresponds to the output of generate_data.py --output-injection-file or the path given to generate_data.py --injection-file.
- The option --foreground-events specifies the output of the search algorithm that was obtained using the foreground file returned by generate_data.py --output-foreground-file. For details on the structure of these files please refer to this page. Multiple paths may be provided if the input data was split into multiple parts.
- The option --foreground-files specifies the foreground data that was used as input to the algorithm. This file is only used to determine which injections were actually contained in the foreground data and how much data was analyzed. It has to be the file created by generate_data.py --output-foreground-file. Multiple paths may be provided if the input data was split into multiple parts.
- The option --background-events specifies the output of the search algorithm that was obtained using the background file returned by generate_data.py --output-background-file. For details on the structure of these files please refer to this page. Multiple paths may be provided if the input data was split into multiple parts.
- The option --output-file specifies where the analysis output should be stored.
- The option --verbose tells the script to print status updates.
- An option --force exists to allow the code to overwrite existing files.
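Because the far and sensitive-distance datasets have to be sorted by far before plotting, a minimal sketch of that step may be useful. It is not part of the repository. The helper names sort_by_far and load_far_and_distance are hypothetical, and the sketch assumes that both datasets sit at the top level of the output file and that the third-party package h5py is installed.

```python
def sort_by_far(far, sensitive_distance):
    """Return both sequences reordered by ascending FAR, keeping pairs aligned."""
    if len(far) != len(sensitive_distance):
        raise ValueError("far and sensitive-distance must have the same length")
    pairs = sorted(zip(far, sensitive_distance), key=lambda pair: pair[0])
    far_sorted = [pair[0] for pair in pairs]
    distance_sorted = [pair[1] for pair in pairs]
    return far_sorted, distance_sorted


def load_far_and_distance(path):
    """Read both datasets from the evaluate.py output and sort them by FAR.

    Assumption: the datasets are stored at the top level of the file.
    """
    import h5py  # third-party; imported here so sort_by_far works without it

    with h5py.File(path, "r") as handle:
        far = handle["far"][()].tolist()
        distance = handle["sensitive-distance"][()].tolist()
    return sort_by_far(far, distance)


if __name__ == "__main__":
    # Small in-memory example; real values would come from eval-output.hdf.
    far, distance = sort_by_far([3.0, 1.0, 2.0], [30.0, 10.0, 20.0])
    print(far, distance)
```

The two sorted lists can be passed directly to a plotting library, with far on the horizontal axis and sensitive-distance on the vertical axis.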