matan-h/pyhash-complete

HashTable-based Python autocomplete

Have you ever opened your favorite text editor and had it show you autocomplete like this? (screenshot: os completion in VS Code)

Most Python editors/LSPs sort completions alphabetically. But for this purpose, alphabetic sorting is as good as random. For example, for os the first suggestion is os.CLD_CONTINUED. I had never heard of it before, and to this day there are zero real usages of it on all of GitHub.

This project attempts to bring better autocomplete to Python, using a fast and small precomputed data table. (screenshot: os completion with the new autocomplete)

Based on the ManyTypes4PyDataset (MT4P) dataset v1.7, which contains 5.2K Python repositories.

Generate dataset

The dataset is available in the GitHub releases (the .2 file includes every attribute accessed more than 2 times; .200 uses threshold 200, giving a smaller file with low-frequency entries removed). You can generate the dataset with threshold 7 using the generate script:

go run generate-file.go -t 7 -out output/py_call_freq.7.bin -data ManyTypes4PyDataset-v0.7/processed_projects_complete

(Add -d to also output debug.json, debug-raw.json and debug-projects.json files with <name>:<count> entries.) I made an interactive file-size explorer in my blog post.

Usage

Currently, after downloading or generating the dataset, you can look up entries using the lookup scripts:

go run lookup/lookup-fast.go os.stat

(Note: this script was generated almost entirely by an LLM.) Output:

os.stat: score=117

The same lookup works with the Python script:

python lookup.py 'os.stat' # os.stat: score=117
# list top 10 scores of module (has to be installed)
python lookup.py 'os.*'
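The os.* mode presumably enumerates the installed module's attributes and ranks them by score. Here is a minimal sketch of that idea; the SCORES dict and the top_attrs helper are hypothetical, stand-ins for reading the real binary table:

```python
import importlib

# Hypothetical in-memory score table; the real tool reads the binary file.
SCORES = {"os.path": 200, "os.stat": 117, "os.getcwd": 90, "os.CLD_CONTINUED": 0}

def top_attrs(module_name: str, n: int = 10) -> list[str]:
    """Rank a module's public attributes by frequency score.

    The module has to be importable, which matches the README's note
    that it "has to be installed"."""
    mod = importlib.import_module(module_name)
    names = [f"{module_name}.{a}" for a in dir(mod) if not a.startswith("_")]
    ranked = sorted(names, key=lambda name: SCORES.get(name, 0), reverse=True)
    return ranked[:n]

print(top_attrs("os", 3))
```

Attributes missing from the table simply score 0 and sink to the bottom, which is exactly how os.CLD_CONTINUED stops being the first suggestion.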

Or use the minimal Python editor with autocomplete: python simpleui.py

A fork of the awesome ty LSP is a work in progress. You can use it manually by following this.

FAQ

What is the file format?

The hash score table file format is crafted specifically to be as fast and as small as possible:

  • Header (magic HSCT + version) - 8 bytes
  • Capacity and slot count - 8 bytes
  • Slots: a hash table of 4-byte entries, each containing:
      • hash key (FNV-1a >> 8) - 3 bytes
      • frequency - 1 byte

Rare hash collisions are possible; in that case, the entry with the highest score wins.
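The slot layout above can be sketched in Python. The FNV-1a constants are the standard 32-bit ones; the key byte order and the dict-based collision handling are assumptions for illustration, not the actual Go implementation:

```python
FNV_OFFSET = 0x811C9DC5  # standard 32-bit FNV-1a offset basis
FNV_PRIME = 0x01000193   # standard 32-bit FNV prime

def fnv1a(name: str) -> int:
    """32-bit FNV-1a hash of a dotted attribute name like 'os.stat'."""
    h = FNV_OFFSET
    for b in name.encode("utf-8"):
        h = ((h ^ b) * FNV_PRIME) & 0xFFFFFFFF
    return h

def pack_slot(name: str, freq: int) -> bytes:
    """Pack one 4-byte slot: 3-byte hash key (FNV-1a >> 8) + 1-byte frequency.

    Big-endian key order is an assumption here."""
    key = fnv1a(name) >> 8  # keep the top 24 bits as the stored key
    return key.to_bytes(3, "big") + bytes([min(freq, 255)])

def insert(table: dict, name: str, freq: int) -> None:
    """On a (rare) key collision, keep the highest score, as described above."""
    key = fnv1a(name) >> 8
    table[key] = max(table.get(key, 0), freq)
```

Dropping the low 8 bits of the hash keeps each slot at exactly 4 bytes, trading a slightly higher collision rate for table size.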

Why not use ML/AI

PyCharm and some other editors have an option to use AI, probably trained on a bigger table similar to this project's - and it's always off by default.

Autocomplete should be fast, and ML is usually slow. TB-complete does its computation in less than 8 ms.

Why can the threshold go beyond 255

The -t threshold option takes the raw threshold, i.e. it filters on the number of times an attribute is accessed throughout the MT4P dataset. The debug JSON and the table itself store the normalized frequency (1-255). A lower threshold includes rarer attributes.
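One plausible way to map raw counts onto the 1-255 byte range is log scaling relative to the most frequent attribute. This is a hypothetical sketch, not the tool's actual formula:

```python
import math

def normalize(raw_count: int, max_count: int) -> int:
    """Map a raw access count to a 1-255 frequency byte (log scale, hypothetical)."""
    if raw_count <= 0:
        return 0  # such entries are filtered out by the threshold anyway
    if max_count <= 1:
        return 255
    score = 1 + round(254 * math.log(raw_count) / math.log(max_count))
    return max(1, min(score, 255))
```

A log scale keeps rare-but-real attributes distinguishable instead of letting a few extremely common names flatten everything else to 1.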

Why Go

It's perfect for this project: a well-known (!=nim), fast (!=python) language that hides the low-level controls so I can code it fast (!=rust).