Revise Regression Tests and have Relative Reporting in CI #2160

@SethFalco

Description

We have the regression tests, which are great for catching when a change breaks SVGs. However, they are not good at measuring whether a change was actually meaningful.

For example:

  • After a bug fix, do the metrics reflect what was expected?
  • If we have a proposed or revised optimization, is it actually reducing bytes?
  • How long did it take to run the regression test pipeline?
  • Is a change doing anything else to the output of the SVGs? (Change in formatting, attribute order, etc.)

Before fixing more bugs/optimizations, let's finally address this so we have more insight.

Scope

Allow Test Failures ✅

The regression tests should test all SVGs that they can, including ones that may fail. We should instead define three lists:

  • expect-error: SVGs that we know are broken by SVGO. The build will fail if, during the regression pipeline, we determine that one of them is actually not broken. This is for when we've intentionally or incidentally fixed an SVG.
  • ignore: Ignore the results of these SVGs as they are finicky. They sometimes pass, sometimes fail; we'll figure out why eventually. We'll report their status regardless, but the result has no effect on the pipeline status.
  • skip: SVGs that we shouldn't extract. We only have one SVG in this scenario right now, which takes too long to optimize to be practical for CI environments.

Instead of being defined in code, let's make these separate config files that are read from the regression tests directory. They'll be standalone text files, one file path per line, with support for comments. A rough sketch of how they could be parsed is below.
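
This is only a sketch, assuming the files live next to the regression tests and use # for comments; the exact file names and comment syntax are still to be decided:

```js
// Sketch only: read one of the list files (e.g. a hypothetical
// test/regression/expect-error.txt), where each non-empty line that doesn't
// start with "#" is a path to an SVG in the test suite.
import fs from 'node:fs/promises';

export const readListFile = async (filePath) => {
  const text = await fs.readFile(filePath, 'utf8');
  return new Set(
    text
      .split('\n')
      .map((line) => line.trim())
      .filter((line) => line !== '' && !line.startsWith('#')),
  );
};
```

Returning a Set keeps the per-file lookups during the run cheap no matter how long the lists get.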

Metrics ✅

The regression tests must output the following metrics:

  • How many tests passed or failed, so we know the progress towards acing the regression tests.
  • The total number of bytes saved, so we can confirm changes are actually optimizing SVGs, and that fixes aren't undoing optimizations more than necessary.
  • The hash of every individual SVG, so we can keep track of which files are affected by changes in SVGO.
  • The maximum memory footprint the process had during the run.

The results will be written to STDOUT and a file.
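
For the duration and memory numbers, a minimal sketch, assuming the bulk optimization runs in the process being measured (variable names are placeholders):

```js
// Sketch: capture time taken and peak memory for the run.
// process.resourceUsage().maxRSS is the peak resident set size in kilobytes.
const startedAt = process.hrtime.bigint();

// ... run the bulk optimization, accumulating bytesSaved per file ...

const elapsedMs = Number(process.hrtime.bigint() - startedAt) / 1e6;
const peakMemoryBytes = process.resourceUsage().maxRSS * 1024;

console.log(`Time Taken: ${(elapsedMs / 1000).toFixed(0)}s`);
console.log(`Peak Memory Alloc: ${(peakMemoryBytes / 1e9).toFixed(3)} GB`);
```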

STDOUT

The resulting output should look something like this, but I'm happy to consider alternative presentations. The data contained in the example is fictional.

Normal Run:

SVGO Test Suite Version: 30ba6479ef7570d3748a30bbbd39f8c7

▶ Test Results
              Match: 4,799 / 4,799
  Expected Mismatch: 159 / 159 
            Ignored: 1 / 1
            Skipped: 1

▶ Metrics
        Bytes Saved: 40.4 MiB
         Time Taken: 23m56s
  Peak Memory Alloc: 2.305 GB

▶ Relative to svg/svgo.git#main
       Files Affected: 277 / 4,959
        Bytes Saved Δ: +0.4 MiB
         Time Taken Δ: +20s
  Peak Memory Alloc Δ: +0.001 GB

If there is no previous report to compare results to:

…

▶ Relative to svg/svgo.git#main
  Previous test report not available.

If the SVGO Test Suite version is different from the version specified in the last test report:

…

▶ Relative to svg/svgo.git#main
  Previous test report used a different version of SVGO Test Suite.
  Rerun regression tests on main to regenerate test report, then try again.

If there are failing test cases, show them in a bullet point list, probably in red as one would expect in other test runners:

▶ Expected match, but actually mismatched:
  ✖ oxygen-icons-5.116.0/scalable/actions/hidef/tools-rip-audio-cd.svg
  ✖ oxygen-icons-5.116.0/scalable/actions/im-ban-kick-user.svg

Implementation Details:

  • The SVGO Test Suite version will be included in the download from the svg/svgo-test-suite bundle in a file named VERSION.
  • In Relative to …, any … Delta line must use the same units as the field it's relative to.
  • Numbers must be localized according to the system locale. (e.g., 1,000 not 1000 for en-GB)
  • Files Affected can be determined by including a checksum of each optimized SVG in the test report and comparing the checksums. (See the sketch after this list.)
  • It's expected to use a library to make text bold or colored, etc. The capabilities of the library may influence the design. For example, if we make headings bold, there's no need to underline them.
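
For the checksum, something along these lines should be enough. A sketch, assuming Node's built-in crypto module; the field names in the trailing comment are placeholders, not a final schema:

```js
// Hash each optimized SVG so two reports can be diffed to count "Files
// Affected". The algorithm only needs to be stable between runs, not
// cryptographically strong.
import { createHash } from 'node:crypto';

export const checksum = (optimizedSvg) =>
  createHash('md5').update(optimizedSvg).digest('hex');

// In the JSON report, keyed by the SVG's path within the test suite:
// files['oxygen-icons-5.116.0/scalable/actions/im-ban-kick-user.svg'] = {
//   checksum: checksum(optimizedSvg),
//   bytesSaved: originalLength - optimizedLength,
// };
```

For the localized numbers, Number.prototype.toLocaleString() covers the 1,000 vs 1000 case without pulling in a dependency.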

File

We will write a more verbose report to …/svgo-test-report.json, containing raw data for the run. The JSON file should not be concerned with presentation.

This will include everything, down to the checksum of every file that was optimized. This file will not be committed to the Git repository.

The tests will run on a merge to main, and will upload an artifact to GitHub (or cloud storage if GitHub isn't practical for whatever reason). Then, on CI and perhaps locally, we'll generate the same report. The regression tests can compare the results to the last report for main, and print the relative summary.
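
To illustrate the comparison step, a sketch assuming both runs produced the JSON report described above and that the previous report has been downloaded from the artifact for main (the suiteVersion, files, and bytesSaved fields and the artifact path are placeholders, not a final schema):

```js
import fs from 'node:fs/promises';

const current = JSON.parse(await fs.readFile('svgo-test-report.json', 'utf8'));
const previous = JSON.parse(await fs.readFile('main/svgo-test-report.json', 'utf8'));

console.log('▶ Relative to svg/svgo.git#main');
if (previous.suiteVersion !== current.suiteVersion) {
  console.log('  Previous test report used a different version of SVGO Test Suite.');
} else {
  const paths = Object.keys(current.files);
  const affected = paths.filter(
    (path) => previous.files[path]?.checksum !== current.files[path].checksum,
  ).length;
  // Formatting into MiB and localized numbers belongs to the presentation
  // layer; the raw delta is in bytes here.
  const delta = current.bytesSaved - previous.bytesSaved;
  console.log(`  Files Affected: ${affected} / ${paths.length}`);
  console.log(`  Bytes Saved Δ: ${delta >= 0 ? '+' : ''}${delta} B`);
}
```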

Show Progress ✅

While we're doing this revamp anyway, perhaps we should show a loading state of some kind until the results are ready. Better than looking frozen!

It should be a progress bar of some kind that shows how many of the total SVGs have been processed so far.
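
A minimal sketch without any library, just rewriting the current line with a carriage return (a progress-bar package would work just as well if we prefer one):

```js
// Render e.g. [███████░░░░░░░░] 2,431 / 4,959 on a single line.
const renderProgress = (done, total) => {
  const width = 30;
  const filled = Math.round((done / total) * width);
  const bar = '█'.repeat(filled) + '░'.repeat(width - filled);
  process.stdout.write(
    `\r[${bar}] ${done.toLocaleString()} / ${total.toLocaleString()}`,
  );
  if (done === total) {
    process.stdout.write('\n');
  }
};
```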

Related

Feedback

Keep Metrics more Focused ✅

Lorfdail suggested that we should run the bulk optimization and the Playwright step in separate processes, so that we could get a more accurate read on the duration and memory usage without the overhead of the testing method.

This is a great idea, so that will be done as part of this epic as well.

This will hopefully give us slightly more consistent metrics, since it keeps what we're timing/measuring more focused.
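
As a sketch of the split, assuming a hypothetical optimize-worker.js that does the bulk optimization, reports its own metrics over IPC, and exits before the Playwright comparison starts:

```js
import { fork } from 'node:child_process';

const runOptimizationStep = () =>
  new Promise((resolve, reject) => {
    const worker = fork(new URL('./optimize-worker.js', import.meta.url));
    // The worker is expected to call
    // process.send({ bytesSaved, elapsedMs, peakMemoryBytes }) when it's done.
    worker.on('message', resolve);
    worker.on('error', reject);
  });

const metrics = await runOptimizationStep();
// The Playwright comparison then runs in this (or yet another) process, so its
// overhead never shows up in the optimization step's time or memory numbers.
```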

Add Steps to STDOUT

Lorfdail suggested that we should add "Step X" logs to STDOUT while running. We'll consider this, as it's true the current UX isn't the best.

We go from 0 to ~4,800 during the optimization step, only to go back to 0 again for the comparison step. This gives the contributor a false sense of the work remaining.
