This benchmark measures the disassembly accuracy (i.e., code inference) of ddisasm. It defines a GitLab CI pipeline that generates Precision and Recall metrics for instruction recovery.
These setup instructions have been tested on Ubuntu 20.
To run the benchmark locally, a few Python packages are required. Install them with pip:
pip3 install .
The benchmark also requires some other dependencies:
- binutils-arm-linux-gnueabihf - utilities for ARM binaries
- ddisasm: The benchmark expects to find ddisasm in PATH; whichever version is installed will be evaluated (a quick check is sketched below). See the installation instructions for ddisasm at https://grammatech.github.io/ddisasm/GENERAL/1-Installation.html
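Before running anything, it can help to confirm which ddisasm installation the benchmark will actually pick up. Below is a minimal sketch of such a check (not part of the benchmark itself; the --version flag is assumed to be supported by your build):

```python
# Report which ddisasm binary is first in PATH and therefore will be evaluated.
import shutil
import subprocess

path = shutil.which("ddisasm")
if path is None:
    raise SystemExit("ddisasm was not found in PATH")
print("ddisasm found at:", path)
# --version is assumed here; adjust if your build reports its version differently.
print(subprocess.run([path, "--version"], capture_output=True, text=True).stdout.strip())
```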
The dataset is committed as a compressed tar file, so it must be extracted for
local use. It is compressed with zstd,
which can be installed on Ubuntu with apt install zstd.
On Ubuntu 20+, it can be simply extracted with:
tar -xf arm32-dataset.tar.zst

On Ubuntu 18 the necessary command is:

tar --use-compress-program=unzstd -xvf arm32-dataset.tar.zst

The compressed file dataset-gt.zip contains ground truth extracted from
marker symbols (using --truth-source elf) and extended to include GT
information from interworking ARM veneers (it was extended using
disasm_benchmark.adjust_gt.py).
You can extract it with the following command:

unzip dataset-gt.zip

To run the entire ARM benchmark locally:
python3 -u -m disasm_benchmark.driver ./dataset/ | tee results.txt
Bin TP FP FN Precision Recall Runtime
libstagefright_foundation.so 23211 73 139 0.99686 0.99405 6.673
libnl.so 17788 135 138 0.99247 0.9923 5.033
libjni_jpegstream.so 93920 508 3421 0.99462 0.96486 35.41
...

- tee writes output to stdout and results.txt, ensuring the results are saved while also providing live status information.
- Providing the -u argument to the Python interpreter ensures output is flushed immediately, even when writing to a pipe.
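If you want aggregate numbers computed directly from a saved results.txt, the per-binary TP/FP/FN columns can be summed and the usual definitions applied: precision = TP / (TP + FP) and recall = TP / (TP + FN). A minimal sketch, assuming the table layout shown above:

```python
# Sketch: aggregate precision/recall from results.txt.
# Assumes per-binary rows of the form:
#   <binary> <TP> <FP> <FN> <precision> <recall> <runtime>
def aggregate(path="results.txt"):
    tp = fp = fn = 0
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 7 or parts[0] == "Bin":
                continue  # skip the header and any non-table lines
            try:
                row_tp, row_fp, row_fn = (int(x) for x in parts[1:4])
            except ValueError:
                continue  # seven tokens, but not a per-binary result row
            tp, fp, fn = tp + row_tp, fp + row_fp, fn + row_fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(aggregate())
```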
The results for a single binary can also be analyzed. This outputs the address of each instruction for which an error occurred, followed by summary information:
python3 -m disasm_benchmark.driver ./dataset/android/daemon/bzip2
False positive addrs (Default):
0x6308
0x630c
0x6318
0x631c
...
False positive addrs (Thumb):
0x2e94
0x2f62
0x2f64
0x2fa6
0x2fa8
...
False negative addrs (Default):
0x13c0
0x4e84
0x4e88
...
False negative addrs (Thumb):
0x16a2
0x16bc
0x16be
...
True positive: 5380
False positive: 163
False negative: 122
Precision: 0.97059
Recall: 0.97783

By default, ground truth is collected from mapping symbols present in the ELF binary.
This can be changed by specifying --truth-source.
- elf collects ground truth from mapping symbols (only possible for ARM binaries).
- yaml collects ground truth for a binary [BINARY] from a file [BINARY].truth.yaml located next to the binary (see creating baselines).
- pdb collects ground truth for a binary [BINARY] from a PDB file [BINARY].pdb located next to the binary (only applicable to PE binaries). PDB files are analyzed with the pdb-markers application (see the pdb directory).
- panginedb collects ground truth for a binary [BINARY] from a sqlite database [BINARY].sqlite located next to the binary. The format of the SQL database is the one defined in https://github.com/pangine/disasm-benchmark?tab=readme-ov-file#using-our-disassembly-ground-truth
- sok collects ground truth by tracing the compilation process. It requires additional dependencies:

pip install .[sok]

The command-line argument --disasm can be used to choose a disassembler from ddisasm, darm, or the various disassemblers supported by SOK (ddisasm by default).
- darm: DARM
pip3 install .[darm]
- disassemblers supported by SOK: SOK
pip3 install .[sok]
Detailed results in JSON format can be generated with the --json option.
Overall metrics can be generated with the --metrics METRICS option.
The disasm_benchmark.driver can optionally check against an expected set of metrics
and fail if those metrics are not met, via --expected-metrics (the
process will not fail if the actual metrics are better than expected). The format
of the metrics file is as follows:
disasm_bench_precision 0.9
disasm_bench_recall 0.8
disasm_bench_tp 145461
disasm_bench_fp 0
disasm_bench_fn 23
disasm_bench_failures 0
The expected metrics file does not need to be complete. One can check against only some of the metrics, e.g.:
disasm_bench_failures 0
will make the driver fail if there are benchmark failures.
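To illustrate the comparison rule (a sketch of the documented behavior only, not the driver's actual implementation): expected values act as thresholds, so precision, recall, and TP may exceed their expected values, and FP, FN, and failures may fall below theirs, without causing a failure.

```python
# Sketch of the documented pass/fail rule for an expected-metrics file.
# Each line of the file is "<metric_name> <value>", as in the example above.
HIGHER_IS_BETTER = {"disasm_bench_precision", "disasm_bench_recall", "disasm_bench_tp"}
LOWER_IS_BETTER = {"disasm_bench_fp", "disasm_bench_fn", "disasm_bench_failures"}

def meets_expectations(actual, expected_path):
    """Return True if every metric listed in the file is met or exceeded."""
    ok = True
    with open(expected_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            name, value = parts[0], float(parts[1])
            if name in HIGHER_IS_BETTER and actual[name] < value:
                ok = False
            elif name in LOWER_IS_BETTER and actual[name] > value:
                ok = False
    return ok

# Hypothetical usage with a file that only constrains benchmark failures.
print(meets_expectations({"disasm_bench_failures": 0}, "expected-metrics.txt"))
```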
We can create .truth.yaml ground truth files for a dataset automatically
using disasm_benchmark.baseline:
python3 -m disasm_benchmark.baseline ./dataset/
This script also accepts a --truth-source option:
- elf: create a yaml using ARM mapping symbols in the ELF file.
- gtirb: create a yaml using the current results of Ddisasm.
Below is an example of a ground truth file:
.plt:
- 0x400de0-0x400dec $a
- 0x400dec-0x400df0 $d
- 0x400df0-0x401040 $a
.plt.got:
- 0x401040-0x401046 $a
- 0x401046-0x401048 $d
.text:
- 0x401050-0x4010a7 $a
- 0x4010a7-0x4010b0 $d
- 0x4010b0-0x4010b2 $a
- Address ranges are grouped by section (corresponding to the sections of the binary).
- Within each section, address ranges are sorted.
- The marker at the end of the range specifies whether the range is:
  - $a: Code
  - $t: Thumb code
  - $d: Data
  - $i: Ignored
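These files are also easy to consume from scripts. The sketch below (an illustration only, not part of disasm_benchmark) flattens a .truth.yaml file into (section, start, end, marker) tuples, assuming exactly the layout shown above:

```python
# Sketch: flatten a .truth.yaml ground truth file.
# Assumes entries of the form "0xSTART-0xEND $MARKER", as in the example.
import yaml  # PyYAML

def load_truth(path):
    with open(path) as f:
        truth = yaml.safe_load(f)
    ranges = []
    for section, entries in truth.items():
        for entry in entries:
            span, marker = entry.split()
            start, end = (int(x, 16) for x in span.split("-"))
            ranges.append((section, start, end, marker))
    return ranges

# Hypothetical file name for illustration.
for section, start, end, marker in load_truth("example.truth.yaml"):
    print(f"{section}: {start:#x}-{end:#x} {marker}")
```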
Triggering the benchmark from CI looks something like this:
trigger:
  stage: trigger
  variables:
    ARTIFACT_URL: ${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/jobs/${JOB_ID_DEBIAN_INSTALLER_UBUNTU20}/artifacts
  trigger:
    project: rewriting/disasm-benchmark
    branch: master
    strategy: depend

results:
  image: $DOCKER_REGISTRY/rewriting/disasm-benchmark
  stage: results
  needs:
    - trigger
  script:
    - curl --location --output artifacts.zip "${CI_API_V4_URL}/projects/rewriting%2Fdisasm-benchmark/jobs/artifacts/master/download?job=merge-metrics&job_token=$CI_JOB_TOKEN"
    - unzip artifacts.zip
  artifacts:
    reports:
      metrics: metrics.txt

The trigger job starts the pipeline in the disasm-benchmark repository
and waits for it to complete, mirroring its success/failure status. After
completion, the results job downloads the metrics artifact from the pipeline
and re-uploads it as a metrics report in the source pipeline.
The trigger job passes the PARENT_PIPELINE_ID environment variable so that the
benchmark can download the ddisasm package from the pipeline that triggered it.
The script disasm_benchmark/annotate is provided to annotate a GTIRB file with the results
of evaluating against ground truth.
Comments will be added for the different kinds of false positives, false negatives, and
address ranges that were ignored with respect to ground truth.
These comments can be seen using gtirb-pprinter's --listing=debug mode.
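The annotations can also be inspected programmatically. The sketch below assumes they are stored in the standard GTIRB "comments" AuxData table (which is what the debug listing mode displays); if the annotate script uses a different table, adjust the name accordingly:

```python
# Hedged sketch: list comment annotations from an annotated GTIRB file.
import gtirb

ir = gtirb.IR.load_protobuf("bzip2.gtirb")  # hypothetical annotated file
for module in ir.modules:
    table = module.aux_data.get("comments")
    if table is None:
        continue
    # The "comments" table maps an Offset (block, byte displacement) to a string.
    for offset, text in table.data.items():
        print(offset.element_id, offset.displacement, text)
```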
The ARM dataset is based on the paper "An Empirical Study on ARM Disassembly Tools"; however, no code from it is reused. The paper and associated resources can be found at the links below: