fix(task_database): Use stable SHA1 hash for task ID generation#71
fix(task_database): Use stable SHA1 hash for task ID generation#71alebmorais wants to merge 1 commit intomainfrom
Conversation
Replaced the use of Python's built-in `hash()` with `hashlib.sha1` to generate a deterministic `task_id` for tasks without an explicit ID. The built-in `hash()` is not stable across different Python processes, which could lead to duplicate tasks being created when the application is restarted. This change ensures that the same task payload will always produce the same `task_id`, preventing data duplication. A new test case has been added to `tests/test_task_database.py` to verify that the task ID generation is stable and predictable.
Summary of ChangesHello @alebmorais, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical issue where task IDs were not consistently generated across different Python processes, potentially leading to data duplication. By switching to a stable SHA1 hashing mechanism for task ID generation, the system now ensures that the same task payload will always produce the same identifier. A new test has been introduced to validate this improved determinism. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughThe Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Code Review
This pull request correctly addresses a potential data duplication issue by replacing Python's non-deterministic built-in hash() with hashlib.sha1 for task ID generation. This ensures that task IDs are stable across application restarts. The inclusion of a new test case to validate this behavior is a good practice. My review includes a suggestion to use a more modern hashing algorithm for better future-proofing and a minor code cleanup in the new test file.
| # Bug fix: Use a stable SHA1 hash of the JSON payload | ||
| # to ensure the ID is deterministic. | ||
| payload_bytes = json.dumps(payload, sort_keys=True).encode("utf-8") | ||
| task_id = hashlib.sha1(payload_bytes).hexdigest() |
There was a problem hiding this comment.
While SHA1 is sufficient for generating a unique ID in this context, it is considered a legacy algorithm with known theoretical weaknesses. For better future-proofing and to align with current cryptographic best practices, consider using a more modern hash function from the SHA-2 family, such as SHA-256. This change would also require updating the expected hash value in the corresponding test.
| task_id = hashlib.sha1(payload_bytes).hexdigest() | |
| task_id = hashlib.sha256(payload_bytes).hexdigest() |
| @@ -0,0 +1,29 @@ | |||
| import unittest | |||
| import json | |||
There was a problem hiding this comment.
The json module is imported but not used in this file. Removing unused imports is recommended by PEP 8 to keep the code clean and avoid confusion.1
Style Guide References
Footnotes
-
PEP 8 recommends removing unused imports to improve code readability and maintainability. This is often flagged by linters with codes like F401. ↩
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/test_task_database.py (1)
10-26: Good test validating stable ID generation.The test correctly verifies that
_normalise_taskproduces a deterministic SHA1-based task_id for payloads without an explicit ID. The expected hash value appears to have been verified against the actual implementation.Consider adding a few more test cases to strengthen coverage:
def test_normalise_task_idempotency(self): """Verify that the same payload always produces the same ID.""" payload = {"name": "Consistent Task", "status": "pending"} result1 = self.db._normalise_task(payload, "2023-10-27T10:00:00Z") result2 = self.db._normalise_task(payload, "2023-10-27T11:00:00Z") # Same payload should produce same ID regardless of timestamp self.assertEqual(result1["task_id"], result2["task_id"]) def test_normalise_task_different_payloads_different_ids(self): """Verify that different payloads produce different IDs.""" payload1 = {"name": "Task A", "status": "pending"} payload2 = {"name": "Task B", "status": "pending"} result1 = self.db._normalise_task(payload1, "2023-10-27T10:00:00Z") result2 = self.db._normalise_task(payload2, "2023-10-27T10:00:00Z") self.assertNotEqual(result1["task_id"], result2["task_id"])
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
task_database.py(2 hunks)tests/test_task_database.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_task_database.py (1)
task_database.py (1)
_normalise_task(101-146)
🪛 Ruff (0.14.1)
task_database.py
108-108: Probable use of insecure hash functions in hashlib: sha1
(S324)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (3)
task_database.py (2)
11-11: LGTM: Import added correctly.The hashlib import is properly positioned and necessary for the SHA1-based ID generation.
105-108: Excellent fix for deterministic task IDs.The SHA1-based implementation correctly addresses the non-deterministic hash() problem. The use of
sort_keys=Trueensures consistent JSON serialization, making the generated IDs stable across process restarts.Regarding the static analysis warning about SHA1 being insecure: this is a false positive. SHA1 is appropriate here since you're generating stable identifiers, not performing cryptographic operations like signing or password hashing. The collision resistance of SHA1 is sufficient for this use case.
If you want to future-proof, consider SHA256 (
hashlib.sha256) which is more modern, though SHA1 is perfectly acceptable for ID generation.tests/test_task_database.py (1)
5-8: LGTM: Test setup is clean.Using an in-memory database for unit tests is the right approach—fast, isolated, and no cleanup required.
Replaced the use of Python's built-in
hash()withhashlib.sha1to generate a deterministictask_idfor tasks without an explicit ID.The built-in
hash()is not stable across different Python processes, which could lead to duplicate tasks being created when the application is restarted. This change ensures that the same task payload will always produce the sametask_id, preventing data duplication.A new test case has been added to
tests/test_task_database.pyto verify that the task ID generation is stable and predictable.Summary by CodeRabbit
Bug Fixes
Tests