
@anxkhn
Contributor

@anxkhn anxkhn commented Sep 12, 2025

This PR updates the dataset as of 2025-09-09 using the new Selenium-based extraction method (introduced because LeetCode changed their GraphQL API and no longer provides full content via the ugcArticleDiscussionArticles query).

While fetching data, the scraper currently uses max_recs = 2000, which caps the number of posts fetched.

However, this raises an interesting question:

Is there a reason we limit the fetch to 2000 posts?
We’re losing all older entries. This is intended behavior, but is there a reason for it?
I am in favor of creating a fork that stores all the data, even beyond 2000 posts.
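
To illustrate what the cap does, here is a minimal sketch of a paginated fetch loop bounded by max_recs. It is not the repository's actual scraper code: `collect_posts`, `fetch_page`, `fake_fetch`, and `page_size` are hypothetical names, and the sketch assumes posts arrive newest first, so anything past the cap is an older entry that gets dropped.

```python
from typing import Callable, Dict, List


def collect_posts(
    fetch_page: Callable[[int, int], List[Dict]],
    max_recs: int = 2000,
    page_size: int = 50,
) -> List[Dict]:
    """Collect posts page by page until max_recs is reached or pages run out."""
    posts: List[Dict] = []
    skip = 0
    while len(posts) < max_recs:
        page = fetch_page(skip, page_size)
        if not page:  # no more posts available
            break
        posts.extend(page)
        skip += len(page)
    # Trim in case the last page overshot the cap.
    return posts[:max_recs]


def fake_fetch(skip: int, limit: int) -> List[Dict]:
    """Hypothetical stand-in for the real fetch: 3000 posts, id 0 is the newest."""
    total = 3000
    return [{"id": i} for i in range(skip, min(skip + limit, total))]


capped = collect_posts(fake_fetch, max_recs=2000)
everything = collect_posts(fake_fetch, max_recs=5000)
print(len(capped), len(everything))  # 2000 3000
```

With 3000 posts available, the 2000 cap silently drops the 1000 oldest, while a cap above the total returns everything.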

@anxkhn force-pushed the sept-data-refresh branch from 55233ae to 7379ce6 on October 2, 2025 17:22
@anxkhn changed the title from "chore: refresh data with new entries till 9.9.25" to "chore: refresh data with new entries till 30.9.25 and prefill" on Oct 2, 2025
@anxkhn changed the title from "chore: refresh data with new entries till 30.9.25 and prefill" to "chore: refresh data with new entries till 30.9.25 and backfill" on Oct 2, 2025
@anxkhn
Contributor Author

anxkhn commented Oct 2, 2025

  • Updated the data with additional posts from the month of September (~182 entries).
  • Backfilled the values (manually).
  • Updated the max_recs limit to 5000.
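
The backfill in this PR was done manually. As a rough sketch of how a refresh could be merged into existing data without losing older entries, the example below deduplicates by post ID and then applies the cap. `merge_posts` and the `id`, `created_at`, and `title` fields are assumptions made for illustration, not the dataset's actual schema.

```python
from typing import Dict, Iterable, List


def merge_posts(
    existing: Iterable[Dict],
    fresh: Iterable[Dict],
    max_recs: int = 5000,
) -> List[Dict]:
    """Merge freshly scraped posts into existing ones, newest first."""
    by_id: Dict[int, Dict] = {}
    for post in existing:
        by_id[post["id"]] = post
    for post in fresh:
        by_id[post["id"]] = post  # a fresh copy replaces the stored one
    # ISO-8601 date strings sort chronologically as plain strings.
    merged = sorted(by_id.values(), key=lambda p: p["created_at"], reverse=True)
    return merged[:max_recs]


old = [
    {"id": 1, "created_at": "2025-08-01", "title": "old"},
    {"id": 2, "created_at": "2025-08-15", "title": "stale"},
]
new = [
    {"id": 2, "created_at": "2025-08-15", "title": "updated"},
    {"id": 3, "created_at": "2025-09-20", "title": "new"},
]
merged = merge_posts(old, new)
print([p["id"] for p in merged])  # [3, 2, 1]
```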

