
MTE-4764 Querying new crashes that affect many users #292

Open

clarmso wants to merge 12 commits into master from cs/MTE-4764-sentry-new-crashes

Conversation


@clarmso clarmso commented Feb 2, 2026

This PR queries Sentry to see whether there are any new issues that affect more than 1000 users or occur more than 1000 times during the last 3 days (3 days rather than 1, to cover the weekend). If there are spikes, we send a Slack notification. If there are no spikes, we do not send any notification.

Here's a mockup of the Slack notification if new crashes are found:
[Screenshot: mockup of the Slack notification when new crashes are found]

I would like to run the query on preflight and staging.
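At a high level, the flow is: query Sentry for new issues in the current release, keep the ones above the threshold, and post to Slack only when something is found. A simplified sketch of that shape (the helper names and the SLACK_WEBHOOK_URL environment variable below are illustrative, not the exact code in this PR):

import os
import requests

THRESHOLD = 1000  # users or events

def find_spike_issues(issues, threshold=THRESHOLD):
    # Keep issues whose user count or event count exceeds the threshold.
    return [
        issue for issue in issues
        if int(issue.get('userCount', 0)) > threshold
        or int(issue.get('count', 0)) > threshold
    ]

def notify_slack(spike_issues):
    # Only post when there is something to report.
    if not spike_issues:
        return
    text = '\n'.join(
        '<{0}|{1}>'.format(issue['permalink'], issue['title'])
        for issue in spike_issues
    )
    requests.post(os.environ['SLACK_WEBHOOK_URL'], json={'text': text})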

@clarmso clarmso requested review from AaronMT and isabelrios February 2, 2026 22:29
if args.type == 'spike-issues':
    main_spike_issues(args.file, args.project)
else:
    main_rates(args.file, args.project)
@clarmso (Collaborator Author)

I'm considering splitting this file into two: one for rates and one for spike issues.

if int(issue.get('filtered', {}).get('userCount', 0)) > threshold
if int(issue.get('filtered', {}).get('count', 0)) > threshold
if int(issue.get('userCount', 0)) > threshold
if int(issue.get('count', 0)) > threshold
@clarmso (Collaborator Author)

For now, I check all of the userCount and count fields in the issue.

df_issues = pd.DataFrame()
for release_version in release_versions:
    short_release_version = release_version.split('+')[0]
    issues = self.sentry_top_new_issues(short_release_version, statsPeriod=3)
@clarmso (Collaborator Author)

Ideally, I would like to set statsPeriod to 1, but because of the weekend we may miss some issues.
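If we ever want statsPeriod=1 on weekdays, one option could be a weekday-aware window, roughly like this (just an idea, not part of this PR):

from datetime import date

def stats_period_days(today=None):
    # Use a 3-day window on Mondays so weekend issues are not missed,
    # otherwise a 1-day window (illustrative only).
    today = today or date.today()
    return 3 if today.weekday() == 0 else 1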


AaronMT commented Feb 3, 2026

A couple of questions:

spike_issues = [
  issue for issue in issues
  if lifetime.userCount > threshold
  if lifetime.count > threshold
  if filtered.userCount > threshold
  if filtered.count > threshold
  if userCount > threshold
  if count > threshold
]

This is a bit strict in that an issue must exceed the threshold in all 6 fields. In practice, that can easily filter out things you might still consider spikes (e.g., high count but lower userCount, or filtered vs lifetime not both huge).

If the intent is “either users OR count over threshold” the predicate probably wants OR, not AND across everything.
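For example, an OR-style version over the same fields might look something like this (rough sketch, untested; exceeds is just a helper name I made up here):

def exceeds(issue, key, threshold):
    # True when the top-level, lifetime, or filtered value for key exceeds the threshold.
    candidates = (
        issue.get(key, 0),
        (issue.get('lifetime') or {}).get(key, 0),
        (issue.get('filtered') or {}).get(key, 0),
    )
    return any(int(value or 0) > threshold for value in candidates)

spike_issues = [
    issue for issue in issues
    if exceeds(issue, 'userCount', threshold) or exceeds(issue, 'count', threshold)
]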

Re: querying new issues and then applying the threshold

Alerting only on is:new issues with >1000 users and >1000 events within 3 days is possible, but it's a narrow target. It might be fine if you only want "holy crap" regressions, but it's worth confirming with Winnie or whoever that that's the desired sensitivity.

Also, it will keep reporting the same spike issues every time the workflow runs, unless the underlying Sentry query stops returning them. What do we want to do there? Once something spikes, you might want to cache the values so the alert won't fire again while the cache entry exists, and then clear the cache after 48 hours or whatever. Since this is added to existing workflows that run multiple times a day, there's potential for repeat alerts.
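One rough way to do the caching (sketch only, untested; the file name and TTL are placeholders):

import json
import time
from pathlib import Path

CACHE_FILE = Path('spike_issue_cache.json')  # placeholder path
CACHE_TTL_SECONDS = 48 * 60 * 60             # e.g. clear entries after 48 hours

def load_cache():
    # Return {issue_id: first_alert_timestamp}, dropping entries older than the TTL.
    if not CACHE_FILE.exists():
        return {}
    now = time.time()
    cache = json.loads(CACHE_FILE.read_text())
    return {k: v for k, v in cache.items() if now - v < CACHE_TTL_SECONDS}

def filter_already_alerted(spike_issues):
    # Alert only on issues we haven't alerted on within the TTL, then update the cache.
    cache = load_cache()
    fresh = [issue for issue in spike_issues if str(issue['id']) not in cache]
    for issue in fresh:
        cache[str(issue['id'])] = time.time()
    CACHE_FILE.write_text(json.dumps(cache))
    return fresh

The catch is that the file has to persist between workflow runs, so in CI it would need something like a cache or artifact step, otherwise it resets on every run.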

@isabelrios (Collaborator)

I am thinking about new issues... or whether we should query issues in general to check if any of them pass the threshold on users or count. If we query new issues, we may miss an issue that starts appearing frequently. Would that be possible?
How often are we going to run this? We may need to run it not only daily but also when an issue has more than 1K users or counts.


clarmso commented Feb 3, 2026

I am thinking about new issues... or if we should query issues in general to check if any of those pass the number of Users or Count. If we query new issues we may miss a new issue that starts appearing frequently. Would that be possible?

is:new in api_sentry.py restricts the query to new issues. In addition, is:unassigned and is:unresolved ensure the issues are still open and the devs haven't assigned them to anyone yet. These parameters cut down the number of issues that need to be filtered.

    # API: New top issues
    def sentry_top_new_issues(self, release, statsPeriod=3):
        return self.client.http_get(
            (
                'organizations/{0}/issues/'
                '?project={1}'
                '&query=release.version:{2} is:unassigned is:unresolved is:new'
                '&sort=freq&statsPeriod={3}d'
            ).format(
                self.organization_slug, self.sentry_project_id, release,
                statsPeriod
            )
        )

How often are we going to run this? we may need to have this not only daily but also when an issue has more than 1K users or counts

I'm thinking of running the job daily from Monday to Friday and reporting only if there's an issue exceeding 1K users or counts.

@isabelrios (Collaborator)

I am thinking about new issues... or if we should query issues in general to check if any of those pass the number of Users or Count. If we query new issues we may miss a new issue that starts appearing frequently. Would that be possible?

is:new from api_sentry.py restricts the issues to be new ones. In addition, is:unassigned and is:unresolved ensures that the devs haven't assigned them to anyone yet. These parameters cut down on lots of issues to be filtered.

How often are we going to run this? we may need to have this not only daily but also when an issue has more than 1K users or counts

I'm thinking about running the job daily from Monday to Friday and report only if there's an issue exceeding 1K users or counts.

I am wondering if we may need this to run more often to really detect and alert when there is a spike in an issue pointing to a crash.
About new issues, I am not sure I understand the logic in Sentry for new issues... can a new issue start with several repetitions, or are those added as they appear?
We may need a cache mechanism, as Aaron mentioned, so that we store the new issues, update their count/users for x days, and alert if they go over the limit...


clarmso commented Feb 3, 2026

I am wondering if we may need this to run more often to really detect and alert when there is a spike in an issue pointing to a crash.

What frequency would you suggest? I thought the issues would surface within a day.

About new issues, I am not sure I understand the logic in Sentry for new issues... can a new issue start with several repetitions, or are those added as they appear? We may need a cache mechanism, as Aaron mentioned, so that we store the new issues, update their count/users for x days, and alert if they go over the limit...

On the various parameters for is: in Sentry, here are the options and their short descriptions. The options are resolved, unresolved, archived, escalating, new, ongoing, regressed, assigned, unassigned, for_review, linked and unlinked. I determined that the combination of unresolved, new and unassigned is the most appropriate for our work.

[Screenshots: Sentry documentation for the is: filter options]

Repeated occurrences of the same issue increment that issue's count rather than adding a new entry.


AaronMT commented Feb 3, 2026

Other questions:

  • Is culprit going to fit in a cell like that, or is it better to link out (e.g., could it be a massive signature?)

Nits:

  • Remove the alarm and exclamation emoji as the red attachment badge already signifies a critical issue
  • Issue Title -> Issue


clarmso commented Feb 4, 2026

Other questions:

* Is culprit going to fit in a cell like that, or is it better to link out (e.g., could it be a massive signature?)

Nits:

* Remove the alarm and exclamation emoji as the red attachment badge already signifies a critical issue

* Issue Title -> Issue

It looks cleaner with fewer columns. Just the issue link and version should be enough to prompt the developers to investigate.

[Screenshot: updated mockup of the Slack notification]

I'm not 100% sure how the developers use culprit in debugging, so it may be better to leave it out.
