My code wraps GrammarViz. To use it, you'll first need to run:
git clone https://github.com/GrammarViz2/grammarviz2_src.git
cd grammarviz2_src
mvn package -Psingle
...to build the jar. I wrote a wrapper that implements the scikit-learn
BaseEstimator interface so that I could use scikit-learn's Pipeline and
GridSearchCV to run a cross-validated grid search over the parameters. To let
the wrapper find the GrammarViz code, you have two options. You can either set
an environment variable, like so:
export GRAMMARVIZ_HOME="/path/to/grammarviz2_src/target/"
...or you can pass that path using the --gviz-home flag.
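How the wrapper chooses between the two when both are set is not spelled out here. The following minimal sketch assumes the flag takes precedence over the environment variable. The function name `resolve_gviz_home` and its `environ` parameter are hypothetical and exist only for illustration.

```python
import os


def resolve_gviz_home(cli_value=None, environ=None):
    """Return the GrammarViz directory (hypothetical helper, not the wrapper's code).

    Assumption: a value passed via --gviz-home wins over GRAMMARVIZ_HOME.
    """
    environ = os.environ if environ is None else environ
    path = cli_value or environ.get("GRAMMARVIZ_HOME")
    if not path:
        raise ValueError("Set GRAMMARVIZ_HOME or pass --gviz-home")
    return path
```

Passing `environ` explicitly keeps the helper easy to test without touching the real environment.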
The other dependencies are listed in requirements.txt. Note that I've
bundled one of the requirements (pysax), as its author recommends. All
custom code is pure Python, so there is nothing else to build.
The custom Python code is contained in four modules:
grammar_parser.py -- Parse the text output of the GrammarViz CLI into a `Grammar` object.
motif_tagger.py -- SAX-discretize a time series and use a `Grammar` object to tag it with motifs.
motif_finder.py -- Provide a convenient way to run the GrammarViz code and parse its output
    directly into Python, using temporary files for IPC. The code spawns
    a subprocess shell to run the JVM in, then parses the results back
    into the Python `Grammar` object. This module also contains the
    `sklearn.base.BaseEstimator` implementation, called `MotifFinder`.
grid_search.py -- Using `MotifFinder` for feature selection and a
    `sklearn.ensemble.RandomForestClassifier` for classification, conduct a
    grid search over the SAX parameters. The scores are written to a csv
    file and the accuracy of the best estimator is printed to stdout.
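The temporary-file IPC pattern used by motif_finder.py can be sketched roughly as follows. This is not the module's actual code: `run_and_read` and `fake_tool` are hypothetical names, the child process is a Python one-liner standing in for the JVM, and the sketch passes an argument list rather than going through a shell.

```python
import os
import subprocess
import sys
import tempfile


def run_and_read(make_command):
    """Run an external command that writes to a temp file and return the file's text.

    make_command: callable taking the output path and returning an argv list.
    """
    fd, out_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)  # the child process opens the file by path
    try:
        subprocess.run(make_command(out_path), check=True)
        with open(out_path, encoding="utf-8") as handle:
            return handle.read()
    finally:
        os.remove(out_path)  # clean up even if the command fails


def fake_tool(out_path):
    """Stand-in for the JVM call: a child Python process that writes placeholder text."""
    code = "import sys, pathlib; pathlib.Path(sys.argv[1]).write_text('placeholder output')"
    return [sys.executable, "-c", code, out_path]


result = run_and_read(fake_tool)
```

In the real wrapper, the text read back from the file would be handed to the grammar parser instead of being returned as-is.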
Both motif_finder.py and grid_search.py share a common base CLI.
You can see this by running:
python motif_finder.py -h
I've included the output here for reference:
usage: Get stats from time series dataset files [-h] [-tr TRAIN] [-te TEST]
                                                [-v VERBOSITY]
                                                [-w WINDOW_SIZE] [-p PAA_SIZE]
                                                [-a ALPHABET_SIZE]
                                                [-g GVIZ_HOME]

Motif finding using sequitur

optional arguments:
  -h, --help            show this help message and exit
  -tr TRAIN, --train TRAIN
                        path to training data file
  -te TEST, --test TEST
                        path to test data file
  -v VERBOSITY, --verbosity VERBOSITY
                        verbosity level for logging; default=1 (INFO)
  -w WINDOW_SIZE, --window-size WINDOW_SIZE
                        window size
  -p PAA_SIZE, --paa-size PAA_SIZE
                        number of letters in a word; SAX word size
  -a ALPHABET_SIZE, --alphabet-size ALPHABET_SIZE
                        size of alphabet; number of bins for SAX
  -g GVIZ_HOME, --gviz-home GVIZ_HOME
                        path to directory with gviz jar
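To make the three SAX options (-w, -p, -a) concrete, here is a minimal textbook-style sketch of how they interact. It is not the bundled pysax code or GrammarViz's implementation: the function names are hypothetical, it assumes the window length is divisible by the PAA size, and it skips refinements such as numerosity reduction.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def sax_word(window, paa_size, alphabet_size):
    """Turn one window of values into a SAX word of paa_size letters."""
    mu, sigma = fmean(window), pstdev(window)
    # z-normalize; a flat window maps to all zeros
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in window]
    # PAA: mean of each of paa_size equal chunks (trailing points are dropped
    # if the window length is not divisible by paa_size)
    step = len(z) // paa_size
    paa = [fmean(z[i * step:(i + 1) * step]) for i in range(paa_size)]
    # breakpoints that split the standard normal into equiprobable bins
    cuts = [NormalDist().inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]
    return "".join(chr(ord("a") + bisect_right(cuts, v)) for v in paa)


def sax_words(series, window_size, paa_size, alphabet_size):
    """Slide a window over the series and emit one SAX word per position."""
    return [
        sax_word(series[i:i + window_size], paa_size, alphabet_size)
        for i in range(len(series) - window_size + 1)
    ]


print(sax_word(list(range(1, 9)), 4, 4))  # abcd
```

A larger window sees longer patterns, a larger PAA size keeps more detail per window, and a larger alphabet splits the value range into finer bins.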
grid_search.py also accepts an option for where to write the results:
  -o OUTPUT_PATH, --output-path OUTPUT_PATH
                        path to write CV scores to
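One common way to share a base CLI between two scripts is argparse's `parents` mechanism. The sketch below assumes that approach; the actual modules may be organized differently. The function names and the `type=int` conversions are assumptions, and only a subset of the options is reproduced.

```python
import argparse


def base_parser():
    """Options common to both scripts (a subset, for illustration)."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-tr", "--train", help="path to training data file")
    parser.add_argument("-te", "--test", help="path to test data file")
    parser.add_argument("-w", "--window-size", type=int, help="window size")
    parser.add_argument("-p", "--paa-size", type=int, help="SAX word size")
    parser.add_argument("-a", "--alphabet-size", type=int, help="size of alphabet")
    parser.add_argument("-g", "--gviz-home", help="path to directory with gviz jar")
    return parser


def grid_search_parser():
    """Base options plus the extra output option of grid_search.py."""
    parser = argparse.ArgumentParser(
        description="Motif finding using sequitur",
        parents=[base_parser()],
    )
    parser.add_argument("-o", "--output-path", help="path to write CV scores to")
    return parser


args = grid_search_parser().parse_args(["-tr", "train.txt", "-o", "scores.csv"])
```

The base parser is created with `add_help=False` so that the child parser can add its own `-h` without a conflict.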
Finally, I've included two bash scripts and the sample data that I used to run
some experiments with this code. The scripts assume the data is laid out as
follows (the same layout as in the zip archive):
├── dataset1
│   ├── test.txt
│   └── train.txt
├── dataset2
│   ├── test.txt
│   └── train.txt
├── dataset3
│   ├── test.txt
│   └── train.txt
├── dataset4
│   ├── test.txt
│   └── train.txt
└── dataset5
    ├── test.txt
    └── train.txt
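If you would rather drive experiments from Python than from bash, a rough sketch of walking this layout is shown below. The helper `find_datasets` is hypothetical and is not part of the included scripts. The demo builds a throwaway directory tree so the snippet runs on its own.

```python
import tempfile
from pathlib import Path


def find_datasets(root):
    """Yield (name, train_path, test_path) for each subdirectory holding both files."""
    for directory in sorted(Path(root).iterdir()):
        train, test = directory / "train.txt", directory / "test.txt"
        if directory.is_dir() and train.is_file() and test.is_file():
            yield directory.name, train, test


# Demo on a temporary tree; "incomplete" lacks test.txt and is skipped.
with tempfile.TemporaryDirectory() as root:
    for name in ("dataset1", "dataset2"):
        (Path(root) / name).mkdir()
        (Path(root) / name / "train.txt").write_text("")
        (Path(root) / name / "test.txt").write_text("")
    (Path(root) / "incomplete").mkdir()
    (Path(root) / "incomplete" / "train.txt").write_text("")
    names = [name for name, _, _ in find_datasets(root)]
```

Each yielded pair of paths could then be passed as the --train and --test arguments.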
To run the grid search on all datasets, use grid_search.sh. To compute the
accuracies using the best parameters I found in my experiments, run
compute_accuracies.sh.