This is a proof-of-concept project that demonstrates one approach to extracting copyright holders from a source code tree with the help of AI, in this case using the Ollama API and a local model.
This module traverses a directory tree containing source code, looks for markers that indicate possible copyright holders, constructs a window of text where copyright holders might appear, and then asks an AI to return any copyright holders found in that window.
It outputs the cleaned-up matches, which should not contain any markup. The output is UTF-8 regardless of the original file encoding.
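The deterministic part of this pipeline (traverse, find markers, cut windows) can be sketched as below. The specific marker regex and window size are assumptions for illustration, not the project's actual values:

```python
import re
from pathlib import Path

# Markers that often precede copyright holders. The exact patterns and
# the window size here are illustrative assumptions.
MARKER = re.compile(r"copyright|\(c\)|©", re.IGNORECASE)
WINDOW = 200  # characters of context kept after each marker

def copyright_windows(text: str) -> list[str]:
    """Return text snippets around each copyright marker."""
    windows = []
    for m in MARKER.finditer(text):
        start = max(0, m.start() - WINDOW // 4)
        end = min(len(text), m.end() + WINDOW)
        windows.append(text[start:end])
    return windows

def scan_tree(root: str) -> dict[str, list[str]]:
    """Walk a directory tree and collect candidate windows per file."""
    results = {}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            # Decode loosely; downstream output is normalized to UTF-8.
            text = path.read_text(errors="replace")
        except OSError:
            continue
        if windows := copyright_windows(text):
            results[str(path)] = windows
    return results
```

Each collected window would then be sent to the model, which is asked to return only the copyright holders it sees in the snippet.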
It is possible that the AI will output incorrect results, or miss valid copyright holders.
It is suggested that you preprocess the directory tree to unpack any archives, decrypt encrypted files, and do any other preprocessing that might be relevant to your use case.
When you use open source software to build your own product, many open source licenses require you to acknowledge the people who worked on the open source project you used.
This is a proof-of-concept project, and not recommended for production usage.
Some possible things you might want to change for production usage:
- Fine-tune the deterministic code that finds possible copyright holder windows
- Modify prompt to allow email addresses, URLs, and any other data you might care about
- Compare different models to find out which performs best in your use case
- Use different models for extraction and validation
- Use remote APIs for better performance
- Split the work into a main scanner thread and worker AI threads, making multiple AI API calls concurrently to process copyright windows faster
- Run with PyPy or port to a different language for better performance
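The scanner/worker split suggested above could be sketched with a thread pool, where the scanner produces windows and workers call the AI API concurrently. Here `extract_holders` is a placeholder standing in for the real Ollama call:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_holders(window: str) -> list[str]:
    """Stub standing in for an AI API call (e.g. a request to Ollama).

    This trivial string split is only a placeholder so the pipeline
    shape can be shown; the real project asks a model instead.
    """
    holder = window.split("Copyright")[-1].strip()
    return [holder] if holder else []

def process_windows(windows: list[str], workers: int = 4) -> list[list[str]]:
    """Fan candidate windows out to worker threads, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_holders, windows))
```

Because the AI calls are I/O-bound, threads (rather than processes) are usually enough to keep several requests in flight at once.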
One-time setup:
- Install uv
uv sync
Run checks after each change:
make check
Run on a directory, for example this repository:
uv run ./copyright.py .
On an M1 Mac laptop, the last command takes about one minute and prints four copyright holders.
Testing against a random ~50 MB source tree took about 40 minutes and found about 140 results, of which about 6% appeared to be invalid.