This is a proof-of-concept project that demonstrates one approach to extracting copyright holders from a source code tree with the help of AI, in this case using the Ollama API and a local model.
This module traverses a directory tree containing source code, looks for markers that indicate possible copyright holders, constructs a window of text where copyright holders might appear, and then asks an AI to return any copyright holders found in that window.
It outputs the cleaned-up matches, which should not contain any markup. The output is UTF-8 regardless of the original file encoding.
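The deterministic part of this pipeline (traverse, find markers, cut windows) can be sketched as below. The specific marker regex and window size are assumptions for illustration, not the project's actual values:

```python
import re
from pathlib import Path

# Markers that often precede copyright holders. The exact patterns and
# the window size here are illustrative assumptions.
MARKER = re.compile(r"copyright|\(c\)|©", re.IGNORECASE)
WINDOW = 200  # characters of context kept after each marker

def copyright_windows(text: str) -> list[str]:
    """Return text snippets around each copyright marker."""
    windows = []
    for m in MARKER.finditer(text):
        start = max(0, m.start() - WINDOW // 4)
        end = min(len(text), m.end() + WINDOW)
        windows.append(text[start:end])
    return windows

def scan_tree(root: str) -> dict[str, list[str]]:
    """Walk a directory tree and collect candidate windows per file."""
    results = {}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            # Decode loosely; downstream output is normalized to UTF-8.
            text = path.read_text(errors="replace")
        except OSError:
            continue
        if windows := copyright_windows(text):
            results[str(path)] = windows
    return results
```

Each collected window would then be sent to the model, which is asked to return only the copyright holders it sees in the snippet.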
It is possible that the AI will output incorrect results, or miss valid copyright holders.
It is suggested that you preprocess the directory tree to unpack any archives, decrypt encrypted files, and do any other preprocessing that might be relevant to your use case.
When you use open source software to build your own product, many open source licenses require you to acknowledge the people who worked on the open source project you used.
This is a proof-of-concept project, and not recommended for production usage.
Some possible things you might want to change for production usage:
- Fine-tune the deterministic code that finds possible copyright holder windows
- Modify prompt to allow email addresses, URLs, and any other data you might care about
- Compare different models to find out which performs best in your use case
- Use different models for extraction and validation
- Use remote APIs for better performance
- Split the work into a main scanner thread and worker AI threads, making multiple AI API calls concurrently to process copyright windows faster
- Run with PyPy or port to a different language for better performance
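The scanner/worker split suggested above could be sketched with a thread pool, where the scanner produces windows and workers call the AI API concurrently. Here `extract_holders` is a placeholder standing in for the real Ollama call:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_holders(window: str) -> list[str]:
    """Stub standing in for an AI API call (e.g. a request to Ollama).

    This trivial string split is only a placeholder so the pipeline
    shape can be shown; the real project asks a model instead.
    """
    holder = window.split("Copyright")[-1].strip()
    return [holder] if holder else []

def process_windows(windows: list[str], workers: int = 4) -> list[list[str]]:
    """Fan candidate windows out to worker threads, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_holders, windows))
```

Because the AI calls are I/O-bound, threads (rather than processes) are usually enough to keep several requests in flight at once.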
One-time setup:
- Install uv
uv sync
Run checks after each change:
make check
Run on a directory, for example this repository:
uv run ./copyright.py .
On an M1 Mac laptop, the last command takes about one minute and prints four copyright holders.
Testing against a random ~50 MB source tree took about 40 minutes and found about 140 results, of which about 6% appeared to be invalid.