Semantic Retrieval

This repository contains the testing code (and probably the final code) we used to extract embeddings from video transcripts for Bizarro-Devin. It holds a few files, each with its own purpose:

embeddings-transformers.js generates embeddings from the transcripts in the transcripts directory.
semantic-retrieval.js retrieves from the embeddings based on a query.
semantic-retrieval-benchmark.js benchmarks the retrieval. In my own tests it took ~180ms per retrieval.
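
To give a rough idea of what retrieving from embeddings based on a query involves, here is a minimal sketch that ranks stored embeddings by cosine similarity to a query embedding. It is not the code from semantic-retrieval.js: the function names cosineSimilarity and topK, the record shape { text, embedding }, and the toy 3-dimensional vectors are all assumptions made for illustration.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k records most similar to the query vector, best match first.
// Assumed (hypothetical) record shape: { text: string, embedding: number[] }.
function topK(queryEmbedding, records, k = 3) {
  return records
    .map((record) => ({
      text: record.text,
      score: cosineSimilarity(queryEmbedding, record.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy vectors stand in for real model output.
const records = [
  { text: "drawing shapes on a canvas", embedding: [1, 0, 0] },
  { text: "looping over an array", embedding: [0, 1, 0] },
  { text: "drawing with color", embedding: [0.9, 0.1, 0] },
];
const results = topK([1, 0, 0], records, 2);
console.log(results.map((r) => r.text));
```

In a real run the query would first be embedded with the same model that produced the stored embeddings, otherwise the similarity scores are not comparable.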

How to use

Generating embeddings

Make sure you've installed all dependencies by running npm install.
Create a directory called transcripts and put all JSON transcript files in it. Each file is the transcript of one video. The transcript JSON should be in the following format:

{
    "text": "full transcript text",
    "chunks": [
        {
            "timestamp": [0.48, 7.04],
            "text": "..."
        }
    ]
}

However, the chunks array is currently not used, so it can be left out.

Create an embeddings directory for the embeddings of each transcript to be written to.
Run node embeddings-transformers.js to generate the embeddings. All embeddings should now be in the embeddings directory, and an embeddings.json file should be present in the current working directory. This embeddings.json file is the combination of all embeddings generated from the transcripts.
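
Before running the script it can help to check that each transcript matches the format described above. The sketch below is one way to do that. It is not part of this repository: validateTranscript is a hypothetical helper, and treating an empty text field as a problem is my own assumption. The chunks array is optional here because it is currently not used.

```javascript
// Check that a parsed transcript matches the documented format.
// Returns a list of problems; an empty list means the transcript is usable.
function validateTranscript(transcript) {
  if (transcript === null || typeof transcript !== "object") {
    return ["transcript must be a JSON object"];
  }
  const problems = [];
  // Assumption: an empty "text" is treated as a problem.
  if (typeof transcript.text !== "string" || transcript.text.length === 0) {
    problems.push('"text" must be a non-empty string');
  }
  // "chunks" may be omitted, but if present it should follow the format.
  if (transcript.chunks !== undefined) {
    if (!Array.isArray(transcript.chunks)) {
      problems.push('"chunks" must be an array when present');
    } else {
      transcript.chunks.forEach((chunk, i) => {
        const ts = chunk && chunk.timestamp;
        const okTimestamp =
          Array.isArray(ts) &&
          ts.length === 2 &&
          ts.every((t) => typeof t === "number");
        if (!okTimestamp) {
          problems.push(`chunks[${i}].timestamp must be [start, end]`);
        }
        if (!chunk || typeof chunk.text !== "string") {
          problems.push(`chunks[${i}].text must be a string`);
        }
      });
    }
  }
  return problems;
}

// A transcript without chunks is enough for generating embeddings.
const minimal = JSON.parse('{ "text": "full transcript text" }');
console.log(validateTranscript(minimal));
```

Each file in the transcripts directory could be passed through JSON.parse and then through a check like this, so a malformed file is reported by name instead of failing somewhere inside the embedding step.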