matan-h/pyhash-complete

HashTable-based Python autocomplete

Have you ever opened your favorite text editor and had it show you autocomplete like this? (screenshot: os completion in VS Code)

Most Python editors/LSPs sort completions alphabetically. But for this purpose, alphabetic sorting is as good as random. For example, for os the first suggestion is os.CLD_CONTINUED. I had never heard of it before, and to this day there are zero real usages of it on all of GitHub.

This project attempts to bring better autocomplete to Python, using a fast and small precomputed data table. (screenshot: os completion with the new autocomplete)

Based on the ManyTypes4PyDataset (MT4P) dataset v1.7, which contains 5.2K Python repositories.

Generate dataset

The dataset is available in the GitHub releases (the .2 file includes every attribute accessed more than 2 times; .200 uses threshold 200, giving a smaller file with low-frequency entries removed). You can generate the dataset with threshold 7 using the generate script:

go run generate-file.go -t 7 -out output/py_call_freq.7.bin -data ManyTypes4PyDataset-v0.7/processed_projects_complete

(Add -d to also output debug.json, debug-raw.json and debug-projects.json files with <name>:<count> entries.) I made an interactive file-size explorer in my blog post.

Usage

Currently, after downloading or generating the dataset, you can look up entries using the lookup scripts:

go run lookup/lookup-fast.go os.stat

(Note: this script was generated almost entirely by an LLM.) Output:

os.stat: score=117

The same lookup works with the Python script:

python lookup.py 'os.stat' # os.stat: score=117
# list top 10 scores of module (has to be installed)
python lookup.py 'os.*'
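The os.* mode presumably enumerates the installed module's attributes and ranks them by score. Here is a minimal sketch of that idea; the SCORES dict and the top_attrs helper are hypothetical, stand-ins for reading the real binary table:

```python
import importlib

# Hypothetical in-memory score table; the real tool reads the binary file.
SCORES = {"os.path": 200, "os.stat": 117, "os.getcwd": 90, "os.CLD_CONTINUED": 0}

def top_attrs(module_name: str, n: int = 10) -> list[str]:
    """Rank a module's public attributes by frequency score.

    The module has to be importable, which matches the README's note
    that it "has to be installed"."""
    mod = importlib.import_module(module_name)
    names = [f"{module_name}.{a}" for a in dir(mod) if not a.startswith("_")]
    ranked = sorted(names, key=lambda name: SCORES.get(name, 0), reverse=True)
    return ranked[:n]

print(top_attrs("os", 3))
```

Attributes missing from the table simply score 0 and sink to the bottom, which is exactly how os.CLD_CONTINUED stops being the first suggestion.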

Or use the minimal Python editor with autocomplete: python simpleui.py

A fork of the awesome ty LSP is a work in progress. You can use it manually by following this.

FAQ

What is the file format?

The hash score table file format is crafted specifically to be as fast and as small as possible:

  • Header (magic HSCT + version) - 8 bytes
  • Capacity and slot count - 8 bytes
  • Slots: a hash table of 4-byte entries, each containing:
      • hash key (FNV-1a >> 8) - 3 bytes
      • frequency - 1 byte

Rare hash collisions are possible; in that case, the entry with the highest score wins.
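The slot layout above can be sketched in Python. The FNV-1a constants are the standard 32-bit ones; the key byte order and the dict-based collision handling are assumptions for illustration, not the actual Go implementation:

```python
FNV_OFFSET = 0x811C9DC5  # standard 32-bit FNV-1a offset basis
FNV_PRIME = 0x01000193   # standard 32-bit FNV prime

def fnv1a(name: str) -> int:
    """32-bit FNV-1a hash of a dotted attribute name like 'os.stat'."""
    h = FNV_OFFSET
    for b in name.encode("utf-8"):
        h = ((h ^ b) * FNV_PRIME) & 0xFFFFFFFF
    return h

def pack_slot(name: str, freq: int) -> bytes:
    """Pack one 4-byte slot: 3-byte hash key (FNV-1a >> 8) + 1-byte frequency.

    Big-endian key order is an assumption here."""
    key = fnv1a(name) >> 8  # keep the top 24 bits as the stored key
    return key.to_bytes(3, "big") + bytes([min(freq, 255)])

def insert(table: dict, name: str, freq: int) -> None:
    """On a (rare) key collision, keep the highest score, as described above."""
    key = fnv1a(name) >> 8
    table[key] = max(table.get(key, 0), freq)
```

Dropping the low 8 bits of the hash keeps each slot at exactly 4 bytes, trading a slightly higher collision rate for table size.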

Why not use ML/AI

PyCharm and some other editors have an option to use AI, probably trained on a bigger table similar to this project's - and it's always off by default.

Autocomplete should be fast, and ML is usually slow. TB-complete does its computation in less than 8 ms.

Why can the threshold go beyond 255

The -t threshold option takes the raw threshold, i.e. it filters on the number of times an attribute is accessed throughout the MT4P dataset. The debug JSON and the table itself store the normalized frequency (1-255). A lower threshold includes rarer attributes.
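One plausible way to map raw counts onto the 1-255 byte range is log scaling relative to the most frequent attribute. This is a hypothetical sketch, not the tool's actual formula:

```python
import math

def normalize(raw_count: int, max_count: int) -> int:
    """Map a raw access count to a 1-255 frequency byte (log scale, hypothetical)."""
    if raw_count <= 0:
        return 0  # such entries are filtered out by the threshold anyway
    if max_count <= 1:
        return 255
    score = 1 + round(254 * math.log(raw_count) / math.log(max_count))
    return max(1, min(score, 255))
```

A log scale keeps rare-but-real attributes distinguishable instead of letting a few extremely common names flatten everything else to 1.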

Why Go

It's perfect for this project: a well-known (!=nim), fast (!=python) language that hides the low-level controls so I can code it fast (!=rust).