heikkitoivonen/copyright-extraction-poc

Extract copyright holders from source code with help of AI

This is a proof-of-concept project that shows one approach to extracting copyright holders from a source code tree with the help of AI, in this case using the Ollama API and a local model.
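As a rough illustration of what a call to a local Ollama model can look like, here is a minimal sketch using only the standard library. It assumes Ollama's default local address and its `/api/generate` endpoint with streaming disabled. The model name, the prompt wording, and the function names `build_payload`, `parse_reply`, and `ask_ollama` are placeholders for illustration and are not taken from `copyright.py`.

```python
import json
import urllib.request

# Assumptions for illustration: default local Ollama address, a placeholder
# model name, and a hypothetical prompt. None of these come from copyright.py.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.2"


def build_payload(window: str, model: str = MODEL) -> dict:
    """Build the JSON body for a single non-streaming generate request."""
    prompt = (
        "List the copyright holders named in the text below, one per line. "
        "Reply with NONE if there are none.\n\n" + window
    )
    return {"model": model, "prompt": prompt, "stream": False}


def parse_reply(reply: str) -> list[str]:
    """Keep the non-empty lines of the reply, dropping the NONE sentinel."""
    lines = [line.strip() for line in reply.splitlines()]
    return [line for line in lines if line and line.upper() != "NONE"]


def ask_ollama(window: str, url: str = OLLAMA_URL) -> list[str]:
    """Send one window to a running Ollama server and parse its answer."""
    body = json.dumps(build_payload(window)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.loads(response.read().decode("utf-8"))["response"]
    return parse_reply(reply)
```

`ask_ollama` needs a running Ollama server with the chosen model pulled; `build_payload` and `parse_reply` can be exercised without one.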

This module will traverse a directory tree containing source code, look for markers that indicate possible copyright holders, construct a window where copyright holders might appear, and then ask an AI to return the copyright holders, if any, in that window.
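The deterministic part of that pipeline (finding markers and building windows around them) could be sketched as follows. The marker pattern, the window sizes, and the name `find_windows` are assumptions made for this example; the actual module may use different markers and a different windowing strategy.

```python
import re

# Hypothetical marker pattern; the project's real markers may differ.
MARKER = re.compile(r"copyright|\(c\)|©", re.IGNORECASE)


def find_windows(lines: list[str], before: int = 1, after: int = 3) -> list[str]:
    """Return text windows around marker lines, merging overlapping ones."""
    spans: list[list[int]] = []
    for index, line in enumerate(lines):
        if not MARKER.search(line):
            continue
        start = max(0, index - before)
        end = min(len(lines), index + after + 1)  # end is exclusive
        if spans and start <= spans[-1][1]:
            # Overlaps or touches the previous window: extend it instead.
            spans[-1][1] = max(spans[-1][1], end)
        else:
            spans.append([start, end])
    return ["\n".join(lines[start:end]) for start, end in spans]
```

Merging adjacent windows keeps the number of model calls down when a file header lists several copyright lines in a row.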

It outputs the cleaned-up matches, which should not contain any markup. The output is in UTF-8 regardless of the original file encoding.
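One simple way to get UTF-8 output from files of unknown encoding is to try UTF-8 first and fall back to a single-byte encoding that never fails. This is a sketch under that assumption; the fallback choice (Latin-1) and the function names are illustrative, and the project may detect encodings differently.

```python
def decode_source(raw: bytes) -> str:
    """Decode file bytes to text, preferring UTF-8 (with or without a BOM)."""
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumed fallback: Latin-1 maps every byte to a character,
        # so this branch cannot raise.
        return raw.decode("latin-1")


def to_utf8(raw: bytes) -> bytes:
    """Re-encode arbitrary source bytes as UTF-8 for output."""
    return decode_source(raw).encode("utf-8")
```

The Latin-1 fallback can produce wrong characters for files in other legacy encodings, but it guarantees that every file can be read and written back out as valid UTF-8.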

It is possible that the AI will output incorrect results, or miss valid copyright holders.

Before running it, preprocess the directory tree: unpack any archives, decrypt encrypted files, and do any other preprocessing that is relevant to your use case.

Why?

When you use open source software to build your own product, many open source licenses require you to acknowledge the people who worked on the open source projects you used. To do that, you first need to know who the copyright holders are.

Production Usage?

This is a proof-of-concept project, and not recommended for production usage.

Some possible things you might want to change for production usage:

  • Fine-tune the deterministic code that finds possible copyright holder windows
  • Modify the prompt to allow email addresses, URLs, and any other data you might care about
  • Compare different models to find out which performs best in your use case
  • Use different models for extraction and validation
  • Use remote APIs for better performance
  • Split the work into a main scanner thread and worker AI threads, and make multiple AI API calls concurrently to process copyright windows faster
  • Run with PyPy or port to a different language for better performance
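The concurrency idea from the list above could be sketched with a thread pool, since the work is dominated by waiting on AI API calls. The function name `extract_concurrently`, the `ask` callable, and the worker count are hypothetical; `ask` stands in for whatever function sends one window to the model and returns the holders it found.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def extract_concurrently(
    windows: Iterable[str],
    ask: Callable[[str], list[str]],
    workers: int = 4,
) -> list[str]:
    """Query the model for many windows in parallel and deduplicate results."""
    holders: list[str] = []
    seen: set[str] = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input windows.
        for result in pool.map(ask, windows):
            for holder in result:
                if holder not in seen:
                    seen.add(holder)
                    holders.append(holder)
    return holders
```

Whether this helps with a local model depends on how many requests the model server can process at once; with a remote API the gain is likely to be larger.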

Development Environment

One time setup:

  1. Install uv
  2. uv sync

Run checks after each change:

make check

Run on a directory, for example this repository:

uv run ./copyright.py .

On a Mac M1 laptop, the last command takes about one minute and prints four copyright holders.

Testing against a random ~50 MB source tree took about 40 minutes. It found about 140 results, of which about 6% seemed to be invalid.
