Git and the Clean Slate Project

Jeremy Lang edited this page Jan 20, 2021 · 1 revision

The CfB Clean Slate project uses Git and GitHub to share code and collaborate with other team members. This repository contains the code and data associated with the data science branch of the project. Because of limitations in the DocAssemble framework used to build the apps that generate record sealing and expungement forms, code for our apps must live in separate repositories.

Repository Norms

  • Top-level folders must have a README.
  • Scripts should have either a documentation header or be documented in their containing folder’s README.
  • Folder and file names should be lowercase, with underscores ( _ ) for spaces.
  • .gitignore will be edited to allow Excel (.xlsx) data files, with sensitive data files blocked individually.
    • Excel files are allowed over CSVs because scrapers are less likely to generate them, so a contributor is less likely to commit one accidentally after forgetting that a script created it as output.
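
The naming and README norms above could be checked automatically. The following is a minimal sketch, not an existing project script: the function names, the `EXEMPT_NAMES` set, and the choice to treat any file whose name starts with "readme" as a README are all assumptions made for illustration.

```python
from pathlib import Path
from typing import List

# Assumption: conventional uppercase names like README are exempt from the
# lowercase rule, since the norms themselves refer to "README".
EXEMPT_NAMES = {"README", "README.md", "LICENSE"}


def name_follows_norm(name: str) -> bool:
    """Return True if a file or folder name is lowercase and has no spaces."""
    if name in EXEMPT_NAMES:
        return True
    return name == name.lower() and " " not in name


def folders_missing_readme(repo_root: Path) -> List[str]:
    """List top-level folders under repo_root that contain no README file."""
    missing = []
    for child in sorted(repo_root.iterdir()):
        # Skip plain files and hidden folders such as .git
        if not child.is_dir() or child.name.startswith("."):
            continue
        has_readme = any(
            p.is_file() and p.name.lower().startswith("readme")
            for p in child.iterdir()
        )
        if not has_readme:
            missing.append(child.name)
    return missing
```

Calling `folders_missing_readme(Path("."))` from the repository root would return the names of any top-level folders that still need a README.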

Repository Structure

We have decided that the following repository structure makes sense for our project. Each bolded name represents a folder:

  • data (for all data files -- data files should not live in analyses subfolders)
    • raw (the first data file we obtain)
    • cleaned (raw datasets that have been cleaned, plus the cleaning scripts, which take raw data as input and return the cleaned dataset)
    • processed (the data file used for a specific analysis, post-wrangling)
    • Data files must be documented in the correct folder’s README with source identified.
  • analyses
    • notebooks (for single-file projects that don’t justify their own folder)
    • This top-level folder should contain a folder for each analysis, which (outside of the data and utils folders) should be mostly self-contained
  • scrapers
    • This folder will contain subfolders for each data scraper we’ve used in this project, including those which will be folded in from other repos
  • utils (for code snippets and functions shared between scripts)
  • docs (for project-level documentation; documentation of code should be inside the script or the readme of the folder the script is in)
    • external_info (for documentation gathered from external sources, including PDFs from advocacy groups and government agencies and links to relevant sources)
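
The layout above could be scaffolded with a short script. This is a sketch under stated assumptions: the `scaffold` function and `FOLDERS` list are hypothetical names, and the placeholder `README.md` files are an illustrative way to satisfy the top-level README norm, not something the project prescribes.

```python
from pathlib import Path
from typing import List

# Folder paths taken from the structure described above.
FOLDERS = [
    "data/raw",
    "data/cleaned",
    "data/processed",
    "analyses/notebooks",
    "scrapers",
    "utils",
    "docs/external_info",
]


def scaffold(repo_root: Path) -> List[str]:
    """Create any missing folders and return the relative paths created."""
    created = []
    for rel in FOLDERS:
        folder = repo_root / rel
        if not folder.exists():
            folder.mkdir(parents=True)
            created.append(rel)
    # Assumption: each top-level folder gets a placeholder README.md
    for top in sorted({rel.split("/")[0] for rel in FOLDERS}):
        readme = repo_root / top / "README.md"
        if not readme.exists():
            readme.write_text(f"# {top}\n")
    return created
```

The function is safe to re-run: existing folders and READMEs are left untouched, so a second call returns an empty list.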

https://github.com/CLSPhila/DocketScraperAPI

Pull Request Guidelines

This repo has branch protection turned on, meaning that any commit needs at least one review from a collaborator before it can be merged into master. Certain commits also require group review by the Clean Slate team, following our five-finger consensus voting guidelines. In general, PRs can be merged with only the one review if they:

  • Are purely additive in nature (net-new documentation, new analysis folders, or scripts, for instance)
  • Change files that are clearly primarily developed or maintained by the contributor making the change
  • Make edits to documentation that are minor in nature -- on the order of typo or bug fixes.

The following types of change will probably need review from the group:

  • Adding top-level folders not defined in this proposal
  • Editing the .gitignore in any way
  • Deleting files or folders
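
As a rough illustration of these rules, the sketch below flags changes that would probably need group review. It is not part of the project's tooling: the function name is hypothetical, the input is assumed to be (status, path) pairs in the style of `git diff --name-status` output (A for added, M for modified, D for deleted), renames are ignored, and `KNOWN_TOP_LEVEL` assumes the only top-level folders are the ones listed in the repository structure.

```python
from typing import List, Tuple

# Assumption: the top-level folders defined in the repository structure.
KNOWN_TOP_LEVEL = {"data", "analyses", "scrapers", "utils", "docs"}


def reasons_for_group_review(changes: List[Tuple[str, str]]) -> List[str]:
    """Return the reasons a set of changes would probably need group review."""
    reasons = []
    for status, path in changes:
        parts = path.split("/")
        if path == ".gitignore":
            reasons.append("edits .gitignore")
        if status == "D":
            reasons.append(f"deletes {path}")
        # A new file inside an unknown top-level folder implies a new folder
        if status == "A" and len(parts) > 1 and parts[0] not in KNOWN_TOP_LEVEL:
            reason = f"adds new top-level folder {parts[0]}"
            if reason not in reasons:
                reasons.append(reason)
    return reasons
```

An empty result suggests the change fits the single-review cases, though a reviewer would still need to judge whether the files are primarily maintained by the contributor.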

Tools Guidelines

  • GitHub issues and project boards are for tracking and documenting progress on an issue
  • Slack is good for quick communication, but discussions should be summarized on a GitHub issue
  • Try to avoid the Google Drive
    • MA Expungement Analysis contains active documents related to MA expungement analysis
    • Old files are to be placed in the “deprecated files” folder to avoid confusion
    • Ideally, any document that goes into Google Drive also has a long-term home elsewhere, such as the GitHub Wiki for long-term reference information or the repo for official data
  • Use the Wiki for team-level documentation and long-term reference material
