
Commit b39ad64

Author: fuzz-evaluator
switch to action items
1 parent eba4d6c commit b39ad64


README.md

Lines changed: 120 additions & 75 deletions
@@ -1,76 +1,121 @@
## Fuzzing Checklist

The following checklist is intended for authors and reviewers of fuzzing papers. It lists a number of recommendations for a fair and reproducible evaluation that should be followed in general, even though we stress that individual points may not apply in specific scenarios, requiring human consideration on a case-by-case basis. This checklist is not exhaustive but aims to point out common pitfalls with the goal of avoiding them.

More details regarding each category of checkmarks are available in Section 5 of our [paper](http://filled-in-after-paper-is-published).

- Artifact
  - Documentation
    - [ ] Is the process of building/preparing the artifact documented?
    - [ ] Is the interface used for interaction with the fuzzer documented?
    - [ ] If the approach extends another fuzzer, are the interfaces (i.e., hooked functions, added compiler passes) documented?
    - [ ] Are the individual experiments documented, including how to compare their results against the paper?
    - [ ] If the new approach is based on another fuzzer, is the version of the baseline documented?
    - [ ] Are the versions of the targets and other fuzzers used during the evaluation specified?
  - Completeness
    - [ ] Are the other fuzzers, or instructions on setting them up, provided?
    - [ ] Are all experiments required to back the paper's main claims included?
  - Reusability
    - [ ] Can the fuzzer be executed independently of the underlying system, e.g., through virtualization or container engines?
    - [ ] Are there external dependencies (e.g., tarballs downloaded via HTTPS) that may be unavailable in the future?
    - [ ] Is the commit history of projects the fuzzer is based on available and not squashed?
- Targets used for the evaluation
  - [ ] Are the targets suitable to show the strengths of the approach? For example, are parsers targeted if the approach is related to grammar fuzzing?
  - [ ] Is it documented how the targets need to be prepared for fuzzing?
  - [ ] Are all modifications (e.g., patches, changes to the runtime environment) applied to targets documented?
  - [ ] Are targets used that are also tested by related work (to allow comparability)?
  - [ ] If applicable for the approach, are benchmarks such as FuzzBench used?
  - [ ] If using benchmarks, are benchmarks that inject artificial bugs avoided?

- Competitors used for comparison
  - [ ] Is the approach compared against state-of-the-art tools in the respective field?
  - [ ] Is the baseline, i.e., the fuzzer the new approach is based on (if any), compared against?
  - [ ] If some of the state-of-the-art fuzzers failed on some targets, are the reasons sufficiently documented?

- Setup used for evaluation
  - [ ] Is the used hardware (e.g., CPU, RAM) documented?
  - [ ] Is a sufficiently long (>= 24h) runtime used for comparison?
  - [ ] Is the number of repetitions documented and sufficiently large (>= 10)?
  - [ ] Did all fuzzers have access to the same amount of computation time? This requires particular thought if a tool requires precomputation(s).
  - [ ] Is the runtime sufficient such that all fuzzers are flatlining towards the end of the fuzzing runs?
  - [ ] Is it documented how many instances of each fuzzer have been run in parallel?
  - [ ] Is it documented how many CPUs have been available to each fuzzing process (e.g., via CPU pinning)?
  - [ ] Is the setup (e.g., available cores, hardware) suitable for all fuzzers compared in the evaluation, according to their requirements?
  - [ ] Seeds:
    - [ ] Are the used seeds documented?
    - [ ] Are the used seeds uninformed?
    - [ ] Are the seeds publicly available?
    - [ ] Were all fuzzers provided with the same set of seeds?
    - [ ] If informed seeds are used, is the initial coverage achieved by those seeds visible from, e.g., plots?
- Evaluation Metrics
  - [ ] Are standard metrics, such as coverage over time or the number of (deduplicated) bugs, used?
  - [ ] Is it specified how coverage was collected?
  - [ ] Is a collision-free encoding used for coverage collection?
  - [ ] If collecting coverage for JIT-based emulation targets, are basic blocks reported instead of translation blocks (jitted blocks)?
  - [ ] Is the evaluation free of metrics that the new approach is naturally optimized for?
  - [ ] Bug finding:
    - [ ] Are targets used that are relevant (have a sufficient user base, forks, GitHub stars, ...)?
    - [ ] Are the targets not deprecated and still receiving updates?
    - [ ] Are the targets known for having unfixed bugs?
    - [ ] If reproducing existing CVEs, are the CVEs themselves valid and not duplicates or rejected?
    - [ ] Is an uninstrumented binary used for crash deduplication and reproduction?
    - [ ] Is the process of deduplicating and triaging crashes described?
    - [ ] Are CVE IDs provided (if possible, anonymously)?
    - [ ] Are the CVEs referencing different bugs and not duplicates?

- Statistical Evaluation
  - [ ] Are measures of uncertainty, such as intervals in plots, used?
  - [ ] Is a sufficient number of trials used (>= 10)?
  - [ ] Are reasonable statistical tests used to check for the significance of the results?
  - [ ] Are reasonable statistical tests used to study the effect size?

- Other
  - [ ] Are threats to validity considered and discussed?
# Fuzzing Evaluation Guidelines

Current version: 1.0.0

Proposals for changes are welcome (please open an issue for discussion or a pull request for changes).

DISCLAIMER: These items represent a best-effort attempt at capturing action items to follow during the evaluation of a scientific paper that focuses on fuzzing. **They do not apply universally to all fuzzing methods - in certain scenarios, techniques may deviate from these guidelines for good reason. In any case, a case-by-case judgment is necessary.**

The guidelines do not discuss the many malicious choices that immediately negate any chance of a fair evaluation, such as giving your fuzzer an unfair advantage (e.g., by fine-tuning the fuzzer or its targets) or putting other fuzzers at a disadvantage.

A. Preparation for Evaluation
1) Find relevant tools and baselines to compare against
    - 1.1 Include state-of-the-art techniques from both academia and industry
    - 1.2 If your fuzzer is based on an existing fuzzer, include the baseline (to measure the delta of your changes, which allows attributing improvements to your technique)
    - 1.3 Use recent versions of fuzzers
    - 1.4 If applicable, derive a baseline variant of your technique that replaces core contributions with alternatives. For example, consider using a variant that replaces an informed algorithm with randomness.
    - 1.5 If using AFL-style fuzzers, do not use afl-gcc but afl-clang-fast or afl-clang-lto.

2) Identify suitable targets for the evaluation
    - 2.1 If applicable, consider using evaluation benchmarks, such as FuzzBench (this allows testing many fuzzers under standardized conditions)
    - 2.2 Select a representative set of programs from the target domain
    - 2.3 Include targets used by related work (for comparability reasons)
    - 2.4 Do not cherry-pick targets based on preliminary results
    - 2.5 Do not pick multiple targets that share a considerable amount of code (e.g., two wrappers for the same library)
    - 2.6 Do not use artificial programs or programs with artificially injected bugs

3) Derive suitable experiments to evaluate your approach
    - 3.1 Evaluate on found bugs (if applicable)
        - 3.1.1 If using *new* bugs,
            - 3.1.1.1 include whether other fuzzers find the bug as well (so you can attribute finding this bug to your technique rather than to being the first to fuzz this target); other fuzzers must have had the same computing resources
            - 3.1.1.2 deduplicate crashing inputs to derive the true bug count
                - 3.1.1.2.1 If possible, use vendor confirmation to identify true bugs
                - 3.1.1.2.2 Otherwise, use manual triaging (consider automated deduplication as a pre-step to reduce the number of findings)
            - 3.1.1.3 do not fuzz unsuitable programs for the sake of finding bugs (e.g., small hobby projects that are no longer maintained are not suitable)
            - 3.1.1.4 do not search for bugs in unstable, fast-moving development branches, but prefer stable/release versions
        - 3.1.2 If using *known* bugs,
            - 3.1.2.1 use the known bugs as ground truth
            - 3.1.2.2 take into account that known bugs may not have been deduplicated
            - 3.1.2.3 do not evaluate on artificial bugs
        - 3.1.3 Do not use the number of (unique) crashing inputs as the bug count

    - 3.2 Evaluate code coverage over time (if applicable)
        - 3.2.1 If possible, use source code-based coverage (e.g., llvm-cov or lcov)
        - 3.2.2 Otherwise, use a collision-free encoding (see the sketch after this list)
        - 3.2.3 Measure coverage on a neutral binary; this binary should include only the instrumentation needed to measure coverage, but no sanitizers or fuzzer-specific instrumentation
        - 3.2.4 If using dynamic binary translation, the coverage measurement should be independent of the translation (e.g., emulators may split a basic block into multiple translation blocks, disturbing measurements)

    - 3.3 If applicable, evaluate domain-specific aspects of your fuzzer

    - 3.4 If applicable, conduct ablation studies to measure individual design choices

    - 3.5 If applicable, evaluate the influence of hyperparameters on your design

    - 3.6 If doing experiments using custom metrics,
        - 3.6.1 take special care to ensure a fair comparison to existing work.
        - 3.6.2 In particular, avoid queue survivor bias (i.e., the queue only contains inputs fulfilling specific criteria), as it may favor your fuzzer. For example, your fuzzer, optimizing towards the new, custom metric, may keep inputs in the queue that others discard (even though they find the input at runtime); evaluating only inputs on the queue thus gives your fuzzer an unfair advantage.

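The collision problem mentioned in 3.2.2 can be made concrete with a small, self-contained illustration. This is not part of the guidelines themselves; the map size, block IDs, and the simplified AFL-style hash below are made up for the example:

```python
# Illustrative sketch: why a hashed, fixed-size coverage map can under-count
# edges, and what a collision-free encoding looks like. Block IDs and map
# size are invented for demonstration purposes only.
from itertools import product

MAP_SIZE = 64  # deliberately tiny so collisions become visible

def hashed_index(prev_block: int, cur_block: int) -> int:
    """Simplified AFL-style edge index: hash of (prev, cur) folded into a
    small map. Distinct edges can map to the same slot (collision)."""
    return ((prev_block >> 1) ^ cur_block) % MAP_SIZE

def collision_free_key(prev_block: int, cur_block: int) -> tuple[int, int]:
    """Collision-free encoding: keep the edge itself as the key."""
    return (prev_block, cur_block)

edges = list(product(range(32), repeat=2))  # 1024 hypothetical edges
hashed = {hashed_index(p, c) for p, c in edges}
exact = {collision_free_key(p, c) for p, c in edges}

print(f"distinct edges:            {len(edges)}")
print(f"hashed map slots hit:      {len(hashed)}  (collisions hide edges)")
print(f"collision-free edge count: {len(exact)}")
```
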
B. Documenting the Evaluation
1) Describe the setup, including
    - 1.1 hardware used (such as CPU and RAM)
    - 1.2 how many cores have been available to each fuzzing campaign (e.g., via CPU pinning)
    - 1.3 technologies used, such as Docker or virtualization
2) Choose and document experiment parameters, including
    - 2.1 a sufficiently long runtime (if possible, >= 24h)
    - 2.2 a sufficient number of repetitions/trials to account for randomness and enable a robust statistical evaluation (if possible, >= 10 trials)
    - 2.3 fairness of computing resource allocation, i.e., all fuzzers have access to the same amount of computational resources. This requires particular consideration if a tool requires precomputation(s).
    - 2.4 suitable seeds:
        - 2.4.1 If possible, use uninformed seeds for coverage evaluation (for bug experiments, informed seeds may be beneficial)
        - 2.4.2 Otherwise, identify the coverage achieved by the initial seed set
        - 2.4.3 Provide all fuzzers with the same set of seeds
        - 2.4.4 Publish the used set of seeds
    - 2.5 targets:
        - 2.5.1 Use recent versions
        - 2.5.2 If applicable, explain modifications to the programs or runtime environment (e.g., when you patch the program or set a lower stack size)
    - 2.6 other tools/fuzzers:
        - 2.6.1 Use recent versions
        - 2.6.2 If your fuzzer is based on another one, make sure the version you base your tool on and the one used in the evaluation are the same

C. Experiment Postprocessing
1) Data Analysis
    - 1.1 Run a robust statistical evaluation to measure significance, such as the Mann-Whitney U test or bootstrap-based methods
    - 1.2 Measure effect size, for example using the Vargha-Delaney Â₁₂ statistic (see the sketch below)
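For illustration, a minimal sketch of such an analysis for two fuzzers' per-trial coverage results could look as follows. It assumes SciPy is installed, and all coverage numbers are invented placeholders:

```python
# Minimal sketch: significance (Mann-Whitney U) and effect size (Vargha-Delaney A12)
# for the per-trial final coverage of two fuzzers. The numbers are placeholders.
from scipy.stats import mannwhitneyu

fuzzer_a = [1412, 1388, 1450, 1401, 1433, 1420, 1395, 1441, 1418, 1407]  # branches per trial
fuzzer_b = [1320, 1377, 1305, 1398, 1350, 1341, 1362, 1333, 1319, 1371]

# Two-sided Mann-Whitney U test for statistical significance.
u_stat, p_value = mannwhitneyu(fuzzer_a, fuzzer_b, alternative="two-sided")

def a12(xs, ys):
    """Vargha-Delaney A12: probability that a random trial of A beats a random
    trial of B (0.5 = no effect; ties count half)."""
    greater = sum(x > y for x in xs for y in ys)
    ties = sum(x == y for x in xs for y in ys)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))

print(f"Mann-Whitney U: U={u_stat:.1f}, p={p_value:.4f}")
print(f"Vargha-Delaney A12 = {a12(fuzzer_a, fuzzer_b):.2f}")
```
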
2) Data Visualization
    - 2.1 If applicable, plot absolute values (such as for coverage over time)
    - 2.2 Show uncertainty, for example using the standard deviation or (confidence) intervals in plots (see the sketch below)
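As one possible way to visualize this, the following sketch plots the median coverage over time across repeated trials together with a percentile band. The data is synthetic, and NumPy and Matplotlib are assumed to be available:

```python
# Sketch: coverage over time across repeated trials, plotted as the median with
# an uncertainty band (here: 25th-75th percentile). All data is synthetic.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
hours = np.linspace(0, 24, 97)                      # 24h runtime, sampled every 15 minutes
trials = np.array([                                 # 10 synthetic coverage curves
    1500 * (1 - np.exp(-hours / rng.uniform(3, 6))) + rng.normal(0, 10, hours.size)
    for _ in range(10)
])

median = np.median(trials, axis=0)
low, high = np.percentile(trials, [25, 75], axis=0)

plt.plot(hours, median, label="fuzzer A (median of 10 trials)")
plt.fill_between(hours, low, high, alpha=0.3, label="25th-75th percentile")
plt.xlabel("time [h]")
plt.ylabel("branches covered")
plt.legend()
plt.savefig("coverage_over_time.png")
```
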
3) Bug Handling
    - 3.1 Deduplicate and triage crashing inputs (a minimal sketch follows below)
    - 3.2 Report new bugs
        - 3.2.1 Follow responsible disclosure guidelines
        - 3.2.2 If possible, minimize samples before reporting
        - 3.2.3 If possible, attach available information, such as the precise environment (OS, compilation flags, command-line arguments, ...), ASAN reports, and the (minimized) crashing input
        - 3.2.4 Consider reporting the bug with an anonymous identity and linking to it in the paper during submission, such that reviewers can assess the bug and its impact themselves
    - 3.3 CVEs
        - 3.3.1 CVEs should be requested by the maintainers
        - 3.3.2 If the maintainers do not request a CVE, link to the bug tracker instead of requesting a CVE yourself

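The deduplication step in 3.1 is highly target- and tool-specific. As a rough, hypothetical illustration only, one common heuristic is to bucket crashing inputs by the top frames of their already extracted stack traces and then manually inspect one representative per bucket; the input format and example data below are made up:

```python
# Sketch: bucket crashing inputs by the top-N frames of their stack traces as a
# pre-step to manual triaging. The mapping (input name -> list of frames) is
# assumed to come from an earlier, tool-specific report-parsing step.
from collections import defaultdict

def bucket_crashes(stack_traces: dict[str, list[str]], top_n: int = 3) -> dict[tuple, list[str]]:
    """Group crashing inputs whose top-N stack frames match."""
    buckets: dict[tuple, list[str]] = defaultdict(list)
    for input_name, frames in stack_traces.items():
        buckets[tuple(frames[:top_n])].append(input_name)
    return buckets

# Hand-written example data: two inputs share the same top frames.
traces = {
    "crash-0001": ["png_read_chunk", "png_handle_iCCP", "process_file"],
    "crash-0002": ["png_read_chunk", "png_handle_iCCP", "process_file"],
    "crash-0003": ["inflate", "png_decompress", "png_handle_zTXt"],
}

for frames, inputs in bucket_crashes(traces).items():
    print(f"{len(inputs)} input(s) -> candidate bug at {' <- '.join(frames)}")
```
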
D. Artifact Release
1) Artifact Contents
    - 1.1 Publish your code on a platform ensuring long-term availability, such as Zenodo or GitHub
    - 1.2 Publish modifications of other tools
        - 1.2.1 If modifying other tools, publish any modifications
        - 1.2.2 Publish your integration of other tools
    - 1.3 If possible, publish experiment data
2) Artifact Documentation
    - 2.1 Document how to build your fuzzer
    - 2.2 Document how to interact with your fuzzer
    - 2.3 Document the source code
    - 2.4 Document modifications/extensions to other tools and their integration
    - 2.5 Document how to run and reproduce the experiments described in the paper
3) Artifact Reusability
    - 3.1 Specify the versions of all tools used
    - 3.2 If possible, enable execution of your fuzzer independent of the underlying system, e.g., through virtualization or container engines
    - 3.3 Avoid external dependencies that may be unavailable in the future, such as tarball downloads via HTTPS
    - 3.4 Pin the versions of dependencies
    - 3.5 If applicable, maintain the commit history of underlying tools instead of squashing it
    - 3.6 Double-check that your code is complete and reusable
