Dataset creation for backout commits #4159
suhaibmujahid
left a comment
Thank you, @benjaminmah! Please see my comments. Also, please fix the linting errors (you may want to consider installing pre-commit).
scripts/backout_data_collection.py
Outdated
def main():
    download_databases()

    commit_dict, bug_to_commit_dict, bug_dict = preprocess_commits_and_bugs()
We may want to consider the space complexity when iterating over the whole dataset.
Removed unused keys when constructing the dictionaries and added a cache that saves the generated dictionaries as JSON files, so later runs can reuse them instead of rebuilding. Let me know if this needs additional changes or fixes!
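For readers following the thread, the JSON caching described in the reply could look roughly like the sketch below. This is not the code from the PR: the helper name `load_or_build` and the cache path are hypothetical, and the real script may organise its cache differently.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_build(cache_path: Path, build: Callable[[], dict]) -> dict:
    """Return the dictionary cached at cache_path, building and saving it if absent.

    Hypothetical helper for illustration; not taken from the PR.
    """
    if cache_path.exists():
        # A previous run already produced this dictionary, so reuse it.
        with cache_path.open("r", encoding="utf-8") as f:
            return json.load(f)

    data = build()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w", encoding="utf-8") as f:
        json.dump(data, f)
    return data
```

One thing to keep in mind with this approach: JSON object keys are always strings, so a dictionary keyed by integer bug IDs comes back with string keys after a round trip through the cache.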
…gzilla.get_bugs`, removed a few tqdm lines
… found, and number of commits with multiple non backed out commits following it
… out commits is <= 2
…he dataset, separated by filename and split into `added_lines` and `removed_line`.
…he fix commit to extract the exact fix.
…t 2 years. Added batch file writing to reduce memory load.
…iff from fix commit
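The commit messages above mention storing diffs separated by filename and split into added and removed lines. A minimal sketch of that idea for a git-style unified diff is shown here. The function name `split_diff` and the `removed_lines` key are assumptions made for illustration, and the script in the PR may obtain and parse diffs in a different way.

```python
from collections import defaultdict


def split_diff(diff_text: str) -> dict:
    """Group the changed lines of a unified diff by file name.

    Hypothetical helper; assumes git-style "--- a/path" / "+++ b/path" headers.
    """
    result = defaultdict(lambda: {"added_lines": [], "removed_lines": []})
    current_file = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # "+++ b/path" names the new side of the file; drop the "b/" prefix.
            path = line[4:].strip()
            current_file = path[2:] if path.startswith("b/") else path
        elif line.startswith(("--- ", "diff ", "@@")):
            # File headers and hunk headers carry no changed content.
            continue
        elif current_file is not None and line.startswith("+"):
            result[current_file]["added_lines"].append(line[1:])
        elif current_file is not None and line.startswith("-"):
            result[current_file]["removed_lines"].append(line[1:])
    return dict(result)
```

This sketch has known limits: a removed line whose content itself begins with `-- ` would be mistaken for a file header, and deleted files appear under `/dev/null`. A dedicated diff parser would be a safer choice for production use.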
Example diffs extracted:
Script to generate dataset of bug-inducing commits, backout commits, and the subsequent fix commit.
Intended to include:
pushdate,desc).