Add helpers to extract crawl metrics / data verification #12

@motin

Description

Currently, after each crawl, we run data verification via a rather manual process that requires quite a lot of notebook copying/cloning.

Ideally, it should be enough to run something like crawl_metrics(s3_bucket, crawl_directory) to get the relevant metrics, including those from https://github.com/citp/openwpm-data-release/blob/master/Crawl-Data-Metrics.ipynb and those in the notebook linked in openwpm/openwpm-crawler#30 (comment).

A companion crawl_metrics_summary(crawl_metrics) method could be included to print out the most relevant metrics in human-readable form.
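A minimal sketch of what the proposed interface could look like, assuming the crawl data is stored as parquet tables under s3://<s3_bucket>/<crawl_directory>/ (the table names, columns, and storage layout below are assumptions, not the actual crawl schema):

```python
# Sketch only: table names/columns are hypothetical and may not match the real crawl output.
import pandas as pd  # reading s3:// paths additionally requires s3fs


def crawl_metrics(s3_bucket: str, crawl_directory: str) -> dict:
    """Collect basic metrics for a single crawl dataset."""
    base = f"s3://{s3_bucket}/{crawl_directory}"

    # Assumed table layout; adapt to the actual crawl data release format.
    site_visits = pd.read_parquet(f"{base}/site_visits")
    http_requests = pd.read_parquet(f"{base}/http_requests")
    crawl_history = pd.read_parquet(f"{base}/crawl_history")

    return {
        "num_site_visits": len(site_visits),
        "num_http_requests": len(http_requests),
        "num_failed_commands": int((crawl_history["command_status"] != "ok").sum()),
        "requests_per_visit": len(http_requests) / max(len(site_visits), 1),
    }


def crawl_metrics_summary(metrics: dict) -> None:
    """Print the most relevant metrics in human-readable form."""
    for name, value in metrics.items():
        print(f"{name}: {value}")
```

In a notebook this would reduce to something like crawl_metrics_summary(crawl_metrics(bucket, directory)) at the top of the analysis (bucket/directory names being whatever the crawl was uploaded under).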

Use cases:

  • Include at the top of every crawl-analysis notebook to understand the nature of the gathered crawl dataset
  • To easily set up notebooks that analyze crawl datasets longitudinally and/or compare individual crawl datasets
  • Include in OpenWPM CI to spot regressions in crawl performance/health (related: https://github.com/mozilla/OpenWPM/issues/479)
