fix(task_database): Use stable SHA1 hash for task ID generation#71

Open
alebmorais wants to merge 1 commit into main from fix-stable-task-id

@alebmorais
Owner

@alebmorais alebmorais commented Oct 28, 2025

Replaced the use of Python's built-in `hash()` with `hashlib.sha1` to generate a deterministic `task_id` for tasks without an explicit ID.

The built-in `hash()` is not stable across different Python processes, which could lead to duplicate tasks being created when the application is restarted. This change ensures that the same task payload will always produce the same `task_id`, preventing data duplication.

A new test case has been added to `tests/test_task_database.py` to verify that the task ID generation is stable and predictable.
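
The instability described above can be reproduced with a small sketch. It assumes the old code hashed a string (string hashes are salted per process through `PYTHONHASHSEED`); the helper `run_in_fresh_process` and the sample payload text are illustrative and not taken from this PR.

```python
import os
import subprocess
import sys


def run_in_fresh_process(code: str, seed: str) -> str:
    """Run a Python snippet in a new interpreter with a fixed PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Illustrative payload text; the real payload format is not shown in this PR.
TEXT = '{"name": "Example Task"}'

HASH_SNIPPET = f"print(hash({TEXT!r}))"
SHA1_SNIPPET = (
    "import hashlib; "
    f"print(hashlib.sha1({TEXT!r}.encode('utf-8')).hexdigest())"
)

# Two seeds stand in for two separate application runs.
builtin_values = [run_in_fresh_process(HASH_SNIPPET, seed) for seed in ("1", "2")]
sha1_values = [run_in_fresh_process(SHA1_SNIPPET, seed) for seed in ("1", "2")]

print("built-in hash() equal across runs:", builtin_values[0] == builtin_values[1])
print("SHA1 digest equal across runs:", sha1_values[0] == sha1_values[1])
```

Pinning the seed makes the comparison repeatable; in a default deployment each process picks a random seed, which is why an ID derived from `hash()` of a string can change after a restart.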

Summary by CodeRabbit

  • Bug Fixes

    • Task IDs are now generated consistently and deterministically for improved stability.
  • Tests

    • Added unit tests to validate stable task ID generation behavior.

@gemini-code-assist
Contributor

Summary of Changes

Hello @alebmorais, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where task IDs were not consistently generated across different Python processes, potentially leading to data duplication. By switching to a stable SHA1 hashing mechanism for task ID generation, the system now ensures that the same task payload will always produce the same identifier. A new test has been introduced to validate this improved determinism.

Highlights

  • Deterministic Task ID Generation: Replaced Python's built-in `hash()` function with `hashlib.sha1` to generate stable and deterministic `task_id` values for tasks without an explicit ID. This prevents duplicate tasks from being created when the application restarts.
  • New Test Case: Added a new test file, `tests/test_task_database.py`, with a test case to verify that the task ID generation is now stable and predictable using the SHA1 hash.

@coderabbitai

coderabbitai bot commented Oct 28, 2025

Walkthrough

The _normalise_task function in task_database.py is updated to use deterministic SHA1 hashing for generating task IDs when none exists in the payload, replacing the previous built-in hash function. A new unit test validates that this hashing produces stable, repeatable IDs.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core Implementation: `task_database.py` | Modified `_normalise_task` to compute SHA1 hash of sorted JSON payload as fallback `task_id` instead of using Python's built-in hash function; ensures stable, repeatable IDs. |
| Test Coverage: `tests/test_task_database.py` | Added new test file with `TestTaskDatabase` class; includes `test_normalise_task_generates_stable_id` test that validates SHA1-based `task_id` generation produces expected stable hash ("a5b0c8d49a4c19a78200530eccd3caf4f859e5b5"). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify SHA1 import is correctly included and accessible
  • Validate hash computation logic against the precomputed expected value in the test
  • Confirm JSON payload serialization (with sort_keys=True) produces consistent results
  • Check that the string conversion of the hash digest is properly formatted
  • Ensure test assertion covers the documented contract for stable ID generation
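
The serialization and digest-format items in this checklist can be spot-checked with a short sketch. `fallback_task_id` is a hypothetical stand-in that mirrors the two lines changed in this PR; it is not the repository's `_normalise_task`, and the sample payloads are made up.

```python
import hashlib
import json


def fallback_task_id(payload: dict) -> str:
    """Hypothetical stand-in for the fallback ID logic changed in this PR."""
    # sort_keys=True emits dictionary keys in sorted order (nested dicts too),
    # so insertion order does not change the serialized bytes.
    payload_bytes = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload_bytes).hexdigest()


a = {"name": "Example Task", "status": "pending"}
b = {"status": "pending", "name": "Example Task"}  # same content, other order

# Without sort_keys the two serializations differ...
assert json.dumps(a) != json.dumps(b)
# ...with sort_keys they hash to the same ID.
assert fallback_task_id(a) == fallback_task_id(b)

# hexdigest() returns a lowercase hex string; SHA1 yields 40 characters.
digest = fallback_task_id(a)
assert len(digest) == 40
assert all(ch in "0123456789abcdef" for ch in digest)

# List order is still significant: sort_keys only reorders dictionary keys.
assert fallback_task_id({"tags": ["x", "y"]}) != fallback_task_id({"tags": ["y", "x"]})
```

One consequence worth keeping in mind while reviewing: any change to the serialization arguments (for example `separators` or `ensure_ascii`) would change the bytes and therefore every generated ID.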

Poem

🐰 A hash so stable, a puzzle so fine,
SHA1 ensures the IDs align,
No more chaos from Python's old way,
Our tasks now sleep soundly, hooray, hooray! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The pull request title "fix(task_database): Use stable SHA1 hash for task ID generation" directly and accurately summarizes the main change in this changeset. The title correctly identifies the module (task_database), the nature of the change (replacing hash with SHA1), and the primary objective (ensuring stable task ID generation). The title is concise, specific, and uses clear language without vague terminology or unnecessary noise. A teammate reviewing the commit history would immediately understand that this change addresses task ID stability by switching to a deterministic hashing approach. |

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses a potential data duplication issue by replacing Python's non-deterministic built-in hash() with hashlib.sha1 for task ID generation. This ensures that task IDs are stable across application restarts. The inclusion of a new test case to validate this behavior is a good practice. My review includes a suggestion to use a more modern hashing algorithm for better future-proofing and a minor code cleanup in the new test file.

# Bug fix: Use a stable SHA1 hash of the JSON payload
# to ensure the ID is deterministic.
payload_bytes = json.dumps(payload, sort_keys=True).encode("utf-8")
task_id = hashlib.sha1(payload_bytes).hexdigest()

medium

While SHA1 is sufficient for generating a unique ID in this context, it is a legacy algorithm whose collision resistance has been broken in practice. For better future-proofing and to align with current cryptographic best practices, consider using a more modern hash function from the SHA-2 family, such as SHA-256. This change would also require updating the expected hash value in the corresponding test.

Suggested change
- task_id = hashlib.sha1(payload_bytes).hexdigest()
+ task_id = hashlib.sha256(payload_bytes).hexdigest()
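
To make the effect of this suggestion concrete, here is a minimal sketch comparing the two digests for one payload. The payload is made up for illustration; only the `json.dumps(..., sort_keys=True)` serialization is taken from the diff above.

```python
import hashlib
import json

# Illustrative payload; not the one used in the PR's test.
payload = {"name": "Example Task", "status": "pending"}
payload_bytes = json.dumps(payload, sort_keys=True).encode("utf-8")

sha1_id = hashlib.sha1(payload_bytes).hexdigest()
sha256_id = hashlib.sha256(payload_bytes).hexdigest()

print(len(sha1_id), len(sha256_id))  # 40 64
print(sha1_id == sha256_id)          # False
```

Besides the expected value in the test, two things would change with SHA-256: the ID grows from 40 to 64 hex characters, and any task already stored under a SHA1-derived ID would get a different ID for the same payload. If rows with SHA1 IDs already exist, switching later could presumably reintroduce duplicates unless those rows are migrated, so the choice is cheapest to make before this PR ships.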

@@ -0,0 +1,29 @@
import unittest
import json

medium

The `json` module is imported but not used in this file. Removing unused imports keeps the code clean and avoids confusion; linters typically flag them with codes like F401.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/test_task_database.py (1)

10-26: Good test validating stable ID generation.

The test correctly verifies that _normalise_task produces a deterministic SHA1-based task_id for payloads without an explicit ID. The expected hash value appears to have been verified against the actual implementation.

Consider adding a few more test cases to strengthen coverage:

def test_normalise_task_idempotency(self):
    """Verify that the same payload always produces the same ID."""
    payload = {"name": "Consistent Task", "status": "pending"}
    
    result1 = self.db._normalise_task(payload, "2023-10-27T10:00:00Z")
    result2 = self.db._normalise_task(payload, "2023-10-27T11:00:00Z")
    
    # Same payload should produce same ID regardless of timestamp
    self.assertEqual(result1["task_id"], result2["task_id"])

def test_normalise_task_different_payloads_different_ids(self):
    """Verify that different payloads produce different IDs."""
    payload1 = {"name": "Task A", "status": "pending"}
    payload2 = {"name": "Task B", "status": "pending"}
    
    result1 = self.db._normalise_task(payload1, "2023-10-27T10:00:00Z")
    result2 = self.db._normalise_task(payload2, "2023-10-27T10:00:00Z")
    
    self.assertNotEqual(result1["task_id"], result2["task_id"])
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 21cc09b and 1539ba4.

📒 Files selected for processing (2)
  • task_database.py (2 hunks)
  • tests/test_task_database.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_task_database.py (1)
task_database.py (1)
  • _normalise_task (101-146)
🪛 Ruff (0.14.1)
task_database.py

108-108: Probable use of insecure hash functions in hashlib: sha1 (S324)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (3)
task_database.py (2)

11-11: LGTM: Import added correctly.

The hashlib import is properly positioned and necessary for the SHA1-based ID generation.


105-108: Excellent fix for deterministic task IDs.

The SHA1-based implementation correctly addresses the non-deterministic hash() problem. The use of sort_keys=True ensures consistent JSON serialization, making the generated IDs stable across process restarts.

Regarding the static analysis warning (S324) about SHA1 being insecure: it does not point to a real problem here. SHA1 is used to derive stable identifiers, not for cryptographic operations like signing or password hashing, and the chance of an accidental collision between two distinct payloads is negligible. The known SHA1 collision attacks would only matter if task payloads could be crafted by an adversary.

If you want to future-proof, consider SHA256 (hashlib.sha256) which is more modern, though SHA1 is perfectly acceptable for ID generation.
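
If SHA1 is kept, one way to document the non-security intent in code is the `usedforsecurity` keyword that `hashlib` constructors accept since Python 3.9. The sketch below assumes that Python version; `fallback_task_id` is a hypothetical stand-in, not the repository function. Whether the keyword silences S324 should be verified against the Ruff version in use.

```python
import hashlib
import json


def fallback_task_id(payload: dict) -> str:
    """Hypothetical stand-in: SHA1 used as a plain identifier, not for security."""
    payload_bytes = json.dumps(payload, sort_keys=True).encode("utf-8")
    # usedforsecurity=False (Python 3.9+) states that the digest is not used
    # in a security context; the resulting digest is unchanged.
    return hashlib.sha1(payload_bytes, usedforsecurity=False).hexdigest()


payload = {"name": "Example Task", "status": "pending"}
plain = hashlib.sha1(json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()

# Same bytes in, same digest out: the flag only annotates intent.
print(fallback_task_id(payload) == plain)  # True
```

Because the digest is identical, adding the flag would not require touching the expected hash in the new test.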

tests/test_task_database.py (1)

5-8: LGTM: Test setup is clean.

Using an in-memory database for unit tests is the right approach—fast, isolated, and no cleanup required.
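
As a sketch of how an in-memory database and deterministic IDs work together to prevent duplicates, the example below stores the same payload twice. It assumes SQLite via `sqlite3`; the `tasks` table, its columns, and the helpers `fallback_task_id` and `upsert_task` are hypothetical and do not reflect the schema in `task_database.py`.

```python
import hashlib
import json
import sqlite3


def fallback_task_id(payload: dict) -> str:
    """Hypothetical stand-in for the SHA1 fallback ID from this PR."""
    payload_bytes = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload_bytes).hexdigest()


def upsert_task(conn: sqlite3.Connection, payload: dict) -> str:
    """Store a task under its derived ID, replacing any row with the same ID."""
    task_id = fallback_task_id(payload)
    conn.execute(
        "INSERT OR REPLACE INTO tasks (task_id, payload) VALUES (?, ?)",
        (task_id, json.dumps(payload, sort_keys=True)),
    )
    return task_id


# ":memory:" gives each test its own throwaway database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (task_id TEXT PRIMARY KEY, payload TEXT NOT NULL)")

payload = {"name": "Example Task", "status": "pending"}
first = upsert_task(conn, payload)
# The same task arriving again, here with its keys in a different order.
second = upsert_task(conn, dict(reversed(list(payload.items()))))

row_count = conn.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
print(first == second, row_count)  # True 1
```

With an ID derived from `hash()` of a string, the second insert after a restart could land under a different key and leave two rows for one task.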
