@anxkhn anxkhn commented Sep 12, 2025

LeetCode changed the data structure of its compensation posts starting in March 2025, which makes our current parsing logic obsolete. The new GraphQL query used by https://leetcode.com/discuss/ (ugcArticleDiscussionArticles) returns only a summary of each post, not its full content.

After investigating, I could not find any GraphQL query that returns the complete description for these new post IDs.

Approach

  • Introduced a new query (COMP_POSTS_QUERY) for fetching posts created after March 1, 2025.

  • Retained the existing query as COMP_POSTS_QUERY_LEGACY for older posts.

  • Implemented a date-based switch:

    • If the post date is after March 1, 2025 → use the new query.
    • Otherwise → use the legacy query.
  • Since the new query only returns summaries, added a Selenium-based content extraction step:

    • Uses a headless Chrome driver to visit the post page and grab the content via a stable CSS selector.
    • In my testing, LeetCode public pages currently show no rate limiting, so this approach is viable (and the only one I found that works), although it is slower.
  • Updated refresh.py to:

    • Dynamically create and tear down the Chrome driver when needed.
    • Fall back to using summaries if Selenium fails.
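
For reviewers, the flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in refresh.py: the function names, the CONTENT_SELECTOR value, and the reading of the cutoff as midnight UTC are all assumptions made for this example. It also creates one driver per call and lets recent Selenium versions locate the Chrome driver on their own, whereas the PR manages the driver lifecycle in refresh.py and adds webdriver-manager.

```python
from datetime import datetime, timezone
from typing import Callable

# Cutoff from the description above; treating it as midnight UTC is an assumption.
NEW_FORMAT_CUTOFF = datetime(2025, 3, 1, tzinfo=timezone.utc)

# Hypothetical selector for illustration; the PR's actual selector is not shown here.
CONTENT_SELECTOR = "div.post-content"


def select_query_name(post_date: datetime) -> str:
    """Return the name of the query constant to use for a timezone-aware post date."""
    if post_date > NEW_FORMAT_CUTOFF:
        return "COMP_POSTS_QUERY"
    return "COMP_POSTS_QUERY_LEGACY"


def fetch_with_selenium(url: str, timeout: float = 15.0) -> str:
    """Open the post in headless Chrome and return the content element's text."""
    # Imported lazily so the rest of the module works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the content element is present before reading it.
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, CONTENT_SELECTOR))
        )
        return element.text
    finally:
        # Always tear the driver down, even if the page load or wait fails.
        driver.quit()


def get_post_content(
    url: str,
    summary: str,
    fetch_full: Callable[[str], str] = fetch_with_selenium,
) -> str:
    """Return the full post text, falling back to the summary on any failure."""
    try:
        content = fetch_full(url)
    except Exception:
        # Any Selenium error (missing browser, timeout, etc.) falls back here.
        return summary
    # An empty extraction is treated as a failure as well.
    return content or summary
```

Passing the fetcher in as a parameter keeps the fallback logic testable without a browser, which may matter if this ends up running on GitHub Actions.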

Additional Notes

  • I used this to extract ~1400 more posts on my local machine and ran them through an LLM; all of those data changes are in a separate PR to keep this one uncluttered.

  • Some unrelated pre-commit checks were already failing on master; the fixes for those are also included here.

  • This PR adds new dependencies (selenium, webdriver-manager).

  • While Selenium crawling is slower, it has been reliable in my runs so far and gives us the full post content.

  • More discussion is needed on:

    • Whether Selenium should be the long-term solution or if alternatives should be explored.
    • How to name legacy-related code (e.g. _legacy suffix) and organize code structure better.

Looking forward to feedback on:

  • The overall approach (Selenium vs Playwright vs other methods) and how it would scale on GitHub Actions, etc.
  • Any edge cases you come across that you think should be handled.
