
Commit 40c726d

wjddusrb03 and claude committed
Initial release: CommitMind v0.1.0 - Semantic Git commit search with TurboQuant compression
- Semantic search for git commit history using sentence embeddings
- TurboQuant vector compression (ICLR 2026) for memory efficiency
- Auto-indexing: search works without manual index step
- CLI commands: index, search, stats, update
- 331 tests across 15 topics, all passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0 parents  commit 40c726d

22 files changed

Lines changed: 4616 additions & 0 deletions

.gitignore

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
__pycache__/
*.py[cod]
*$py.class
*.egg-info/
dist/
build/
*.pkl
.commitmind/
.pytest_cache/
.eggs/
*.egg
.env
.venv/
venv/

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 wjddusrb03

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
# CommitMind

**Semantic search for Git commit history, powered by TurboQuant vector compression (ICLR 2026).**

> Stop searching by keywords. Search by *meaning*.

[![PyPI version](https://img.shields.io/pypi/v/commitmind)](https://pypi.org/project/commitmind/)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

## The Problem

```bash
# Current: keyword matching only
git log --grep="memory leak"  # Only finds commits with exact text "memory leak"
# Misses: "fix kfree_skb double free"
# Misses: "plug UAF in reset path"
# Misses: "resolve dangling pointer"
```

## The Solution

```bash
# CommitMind: semantic search
commitmind search "memory leak"
# >> #1 [0.94] a3f2c1d Fix kfree_skb double free in netfilter
# >> #2 [0.91] b7e4a2f Plug use-after-free in device reset path
# >> #3 [0.87] c9d1b3e Resolve dangling pointer in slab allocator
```

CommitMind understands the **meaning** of your query and finds semantically related commits - even when the exact words don't match.

## How It Works

```
Git commits --> Sentence embeddings --> TurboQuant compression --> Semantic search
                (all-MiniLM-L6-v2)      (7.6x compression)         (asymmetric scoring)
```

1. **Extract** commit messages + file change metadata from git history
2. **Embed** each commit into a 384-dimensional vector (local model, no API needed)
3. **Compress** vectors with TurboQuant (Google's ICLR 2026 algorithm) - 87% memory savings
4. **Search** using asymmetric inner-product estimation (no decompression needed)

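The extraction step above could be approximated by asking `git log` for machine-readable output and parsing it. The sketch below is illustrative only: `Commit`, `parse_git_log`, and `read_git_log` are hypothetical names, and the separator-based format is an assumption rather than CommitMind's actual extractor. It relies on the standard `git log --pretty=format:` placeholders `%h` (abbreviated hash), `%s` (subject), and `%x1f` / `%x1e` (raw bytes used as field and record separators).

```python
import subprocess
from typing import List, NamedTuple

FIELD_SEP = "\x1f"   # emitted by %x1f, separates hash from subject
RECORD_SEP = "\x1e"  # emitted by %x1e, terminates each commit record


class Commit(NamedTuple):
    sha: str
    subject: str


def read_git_log(repo_path: str = ".") -> str:
    """Return raw `git log` output for the repository at repo_path."""
    result = subprocess.run(
        ["git", "log", "--pretty=format:%h%x1f%s%x1e"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_git_log(raw: str) -> List[Commit]:
    """Split separator-delimited `git log` output into Commit tuples."""
    commits = []
    for record in raw.split(RECORD_SEP):
        record = record.strip()  # drop the newline git prints between commits
        if not record:
            continue
        sha, _, subject = record.partition(FIELD_SEP)
        commits.append(Commit(sha=sha, subject=subject))
    return commits


# Hand-written sample in the same format, so the parser runs without a repo.
sample = (
    "a3f2c1d\x1fFix kfree_skb double free in netfilter\x1e\n"
    "b7e4a2f\x1fPlug use-after-free in device reset path\x1e"
)
commits = parse_git_log(sample)
```

Control-character separators are used here because commit subjects can contain almost any printable character, which makes splitting on spaces or pipes fragile.
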
## Installation

```bash
pip install commitmind
```

Or install from source:

```bash
git clone https://github.com/wjddusrb03/commitmind.git
cd commitmind
pip install -e ".[dev]"
```

## Quick Start

```bash
# 1. Index your repository
cd your-project
commitmind index

# Output:
# Indexing complete!
# > 3,842 commits indexed
# > Compressed: 18.2 MB -> 2.4 MB (7.6x)
# > Saved to .commitmind/index.pkl

# 2. Search by meaning
commitmind search "authentication bug fix"

# 3. View stats
commitmind stats
```

## CLI Commands

| Command | Description |
|---|---|
| `commitmind index` | Index commits with TurboQuant compression |
| `commitmind search "query"` | Semantic search over commits |
| `commitmind stats` | Show index statistics |
| `commitmind update` | Add new commits to existing index |

### Options

```bash
# Index with options
commitmind index --max-commits 1000  # Limit to recent 1000 commits
commitmind index --branch main       # Index specific branch
commitmind index --bits 2            # Use 2-bit quantization (more compression)

# Search with options
commitmind search "query" -k 10      # Return top 10 results
```

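The `-k` option above suggests a top-k selection over per-commit similarity scores. A minimal sketch of that selection, assuming scores have already been computed, is shown here. `top_k` is a hypothetical helper and its default `k` is arbitrary, since the README does not state CommitMind's default.

```python
import heapq
from typing import List, Sequence, Tuple


def top_k(scores: Sequence[float], shas: Sequence[str], k: int = 5) -> List[Tuple[float, str]]:
    """Return the k highest-scoring (score, sha) pairs, best first."""
    # nlargest avoids sorting the whole list when k is much smaller than len(scores).
    return heapq.nlargest(k, zip(scores, shas))


# Scores borrowed from the README's sample output, plus one unrelated commit.
scores = [0.87, 0.94, 0.12, 0.91]
shas = ["c9d1b3e", "a3f2c1d", "0000000", "b7e4a2f"]
best = top_k(scores, shas, k=2)
```
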
## Use Cases

- **New team member**: "What authentication changes were made recently?"
- **Bug tracking**: "Find commits related to network timeout issues"
- **Security audit**: "Show all SQL injection related fixes"
- **Code archaeology**: Search Linux kernel's 1M+ commits by meaning
- **Cross-language**: Search English commits with Korean queries (and vice versa)

## Memory Efficiency

Thanks to TurboQuant compression:

| Commits | Uncompressed | CommitMind | Savings |
|---|---|---|---|
| 1,000 | 1.5 MB | 0.2 MB | 87% |
| 10,000 | 15 MB | 2.0 MB | 87% |
| 100,000 | 150 MB | 20 MB | 87% |
| 1,000,000 | 1.5 GB | 200 MB | 87% |

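The table's figures can be reproduced with a little arithmetic. The sketch below assumes the uncompressed vectors are stored as 32-bit floats; that is not stated in the README, but it matches 384 dimensions x 4 bytes, or about 1.5 KB per commit. The 7.6x ratio is the README's own figure. The function names are illustrative.

```python
DIM = 384                # embedding dimension stated in the README
BYTES_PER_FLOAT32 = 4    # assumption: uncompressed vectors are float32
RATIO = 7.6              # compression ratio quoted in the README


def index_sizes_mb(n_commits: int, dim: int = DIM, ratio: float = RATIO):
    """Return (uncompressed_mb, compressed_mb) using decimal megabytes."""
    raw_mb = n_commits * dim * BYTES_PER_FLOAT32 / 1e6
    return raw_mb, raw_mb / ratio


def savings_percent(ratio: float = RATIO) -> float:
    """Fraction of memory saved at a given compression ratio, in percent."""
    return 100.0 * (1.0 - 1.0 / ratio)


raw_mb, compressed_mb = index_sizes_mb(1_000)
```

A 7.6x ratio corresponds to 1 - 1/7.6, or roughly 87% savings, which is why the Savings column stays constant as the commit count grows.
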
## How TurboQuant Works

CommitMind uses [TurboQuant](https://openreview.net/forum?id=mMWatwUUkn) (Google Research, ICLR 2026):

1. **PolarQuant**: Random orthogonal rotation + Lloyd-Max scalar quantization (3-bit)
2. **QJL**: Quantized Johnson-Lindenstrauss residual correction (1-bit)
3. **Asymmetric scoring**: Compute similarity WITHOUT decompressing vectors

This achieves ~7.6x compression with minimal accuracy loss.

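To make the rotation and asymmetric-scoring ideas concrete, here is a deliberately simplified pure-Python sketch. It is not TurboQuant itself and not CommitMind's code. A uniform scalar quantizer stands in for the Lloyd-Max quantizer, the QJL residual-correction stage is omitted entirely, and all function names are hypothetical. It shows only two things: a random orthogonal rotation preserves inner products, and a full-precision query can be scored directly against integer codes through a small table of reconstruction levels.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def random_orthogonal(dim, seed=0):
    """Random orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove components along earlier rows
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows


def rotate(matrix, vec):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def make_levels(bits, limit):
    """Uniform reconstruction levels in [-limit, limit] (stand-in for Lloyd-Max)."""
    n = 2 ** bits
    step = 2.0 * limit / n
    return [-limit + step * (i + 0.5) for i in range(n)]


def encode(rotated, levels):
    """Map each coordinate to the index of its nearest reconstruction level."""
    return [min(range(len(levels)), key=lambda i: abs(levels[i] - x)) for x in rotated]


def asymmetric_score(query_rotated, codes, levels):
    """Inner product of a full-precision query with a quantized vector.

    Only the small `levels` table is consulted; the stored vector is
    never materialized as a float array.
    """
    return sum(q * levels[c] for q, c in zip(query_rotated, codes))


dim = 16
R = random_orthogonal(dim, seed=1)
rng = random.Random(2)
query = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
stored = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])

exact = sum(a * b for a, b in zip(query, stored))
# After rotation, unit-vector coordinates have standard deviation about 1/sqrt(dim).
levels = make_levels(bits=3, limit=3.0 / math.sqrt(dim))
codes = encode(rotate(R, stored), levels)
approx = asymmetric_score(rotate(R, query), codes, levels)
```

Comparing `approx` with `exact` for different `bits` values gives a feel for the accuracy and compression trade-off that the `--bits` option exposes.
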
## Requirements

- Python 3.9+
- Git repository
- CPU only (no GPU required)
- ~500 MB disk for embedding model (downloaded once)

## Contributing

Issues and pull requests are welcome! If you find a bug or have suggestions, please [open an issue](https://github.com/wjddusrb03/commitmind/issues).

## License

MIT License

## Citation

If you use CommitMind in your research:

```bibtex
@software{commitmind2026,
  title={CommitMind: Semantic Git Commit Search with TurboQuant Compression},
  author={wjddusrb03},
  year={2026},
  url={https://github.com/wjddusrb03/commitmind}
}
```

## Related

- [langchain-turboquant](https://github.com/wjddusrb03/langchain-turboquant) - LangChain VectorStore with TurboQuant compression
- [TurboQuant paper](https://openreview.net/forum?id=mMWatwUUkn) - Original ICLR 2026 paper by Google Research

README_KO.md

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
# CommitMind

**A semantic search tool for Git commit history, built on TurboQuant vector compression (ICLR 2026)**

> No more keyword search. Search by *meaning*.

[![PyPI version](https://img.shields.io/pypi/v/commitmind)](https://pypi.org/project/commitmind/)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

## The Problem

```bash
# Current: literal text matching only
git log --grep="메모리 누수"  # Only finds commits containing the text "메모리 누수" (Korean for "memory leak")
# "fix kfree_skb double free" --> not found
# "plug UAF in reset path" --> not found
```

## The Solution

```bash
# CommitMind: meaning-based search
commitmind search "메모리 누수"  # Korean for "memory leak"
# >> #1 [0.94] a3f2c1d Fix kfree_skb double free in netfilter
# >> #2 [0.91] b7e4a2f Plug use-after-free in device reset path
# >> #3 [0.87] c9d1b3e Resolve dangling pointer in slab allocator
```

CommitMind understands the **meaning** of the query and finds related commits. The exact words do not need to appear.

## How It Works

```
Git commits --> Sentence embeddings --> TurboQuant compression --> Semantic search
                (all-MiniLM-L6-v2)      (7.6x compression)         (asymmetric scoring)
```

1. **Extract**: pull commit messages + file change metadata from the git history
2. **Embed**: convert each commit into a 384-dimensional vector (local model, no API needed)
3. **Compress**: compress vectors with TurboQuant (Google, ICLR 2026) - 87% memory savings
4. **Search**: search with asymmetric inner-product estimation (no decompression needed)

## Installation

```bash
pip install commitmind
```

Or install from source:

```bash
git clone https://github.com/wjddusrb03/commitmind.git
cd commitmind
pip install -e ".[dev]"
```

## Quick Start

```bash
# 1. Index the repository
cd your-project
commitmind index

# Output:
# Indexing complete!
# > 3,842 commits indexed
# > Compressed: 18.2 MB -> 2.4 MB (7.6x)
# > Saved to .commitmind/index.pkl

# 2. Search by meaning
commitmind search "인증 버그 수정"  # Korean for "authentication bug fix"

# 3. View stats
commitmind stats
```

## CLI Commands

| Command | Description |
|---|---|
| `commitmind index` | Index commits with TurboQuant compression |
| `commitmind search "query"` | Semantic search over commits |
| `commitmind stats` | Show index statistics |
| `commitmind update` | Index only new commits, adding them to the existing index |

### Options

```bash
# Indexing options
commitmind index --max-commits 1000  # Index only the most recent 1000 commits
commitmind index --branch main       # Index a specific branch
commitmind index --bits 2            # 2-bit quantization (higher compression)

# Search options
commitmind search "query" -k 10      # Return the top 10 results
```

## Use Cases

- **Newly joined developer**: "When were authentication-related changes made in this project?"
- **Bug tracking**: "Show me the history of fixes related to network timeouts"
- **Security audit**: "Find every fix related to SQL injection"
- **Code archaeology**: Meaning-based exploration across the Linux kernel's 1.24 million commits
- **Multilingual search**: English commits can be searched with Korean queries

## Memory Efficiency

Thanks to TurboQuant compression:

| Commits | Uncompressed | CommitMind | Savings |
|---|---|---|---|
| 1,000 | 1.5 MB | 0.2 MB | 87% |
| 10,000 | 15 MB | 2.0 MB | 87% |
| 100,000 | 150 MB | 20 MB | 87% |
| 1,000,000 | 1.5 GB | 200 MB | 87% |

## How TurboQuant Works

CommitMind uses [TurboQuant](https://openreview.net/forum?id=mMWatwUUkn) (Google Research, ICLR 2026):

1. **PolarQuant**: Random orthogonal rotation + Lloyd-Max scalar quantization (3-bit)
2. **QJL**: Quantized Johnson-Lindenstrauss residual correction (1-bit)
3. **Asymmetric scoring**: Compute similarity without decompressing the vectors

It minimizes accuracy loss at a compression ratio of about 7.6x.

## Requirements

- Python 3.9+
- Git repository
- Runs on CPU only (no GPU required)
- About 500 MB of disk for the embedding model (downloaded once, on first use)

## Contributing

Issues and pull requests are welcome! If you find a bug or have a suggestion, please [open an issue](https://github.com/wjddusrb03/commitmind/issues).

## License

MIT License

## Citation

If you use CommitMind in your research:

```bibtex
@software{commitmind2026,
  title={CommitMind: Semantic Git Commit Search with TurboQuant Compression},
  author={wjddusrb03},
  year={2026},
  url={https://github.com/wjddusrb03/commitmind}
}
```

## Related Projects

- [langchain-turboquant](https://github.com/wjddusrb03/langchain-turboquant) - LangChain VectorStore based on TurboQuant compression
- [TurboQuant paper](https://openreview.net/forum?id=mMWatwUUkn) - Original ICLR 2026 paper by Google Research
