This repository contains the test code (and probably the final code) we used to extract embeddings from video transcripts for Bizarro-Devin. It contains a few files, each with its own purpose:
- embeddings-transformers.js generates embeddings from the transcripts in the transcripts directory.
- semantic-retrieval.js retrieves results from the embeddings based on a query.
- semantic-retrieval-benchmark.js benchmarks the retrieval. In my own tests, a retrieval took ~180 ms.
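As a rough illustration of how a per-retrieval timing like the one above could be measured, here is a minimal sketch. It is not the code in semantic-retrieval-benchmark.js: the `benchmark` helper and the `retrieve` callback are hypothetical names, and the sketch assumes a Node.js version where `performance` is available as a global (Node 16 or later).

```javascript
// Hypothetical helper: times an async retrieval function over a list of queries.
// `retrieve` is an assumed callback that takes a query string and returns a promise.
async function benchmark(retrieve, queries) {
  const durations = [];
  for (const query of queries) {
    const start = performance.now();
    await retrieve(query); // run sequentially so the timings do not overlap
    durations.push(performance.now() - start);
  }
  const total = durations.reduce((sum, ms) => sum + ms, 0);
  return {
    runs: durations.length,
    meanMs: durations.length ? total / durations.length : 0,
    maxMs: durations.length ? Math.max(...durations) : 0,
  };
}
```

Calling `benchmark(retrieve, queries)` with your own retrieval function resolves to the number of runs plus the mean and maximum duration in milliseconds. The first run often includes model loading, so it may be much slower than the rest.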
To generate the embeddings:
- Make sure you've installed all dependencies by running npm install
- Create a directory called transcripts and put all JSON transcript files in it, one file per video. Each transcript JSON file should have the following format:
{
    "text": "full transcript text",
    "chunks": [
        {
            "timestamp": [0.48, 7.04],
            "text": "..."
        }
    ]
}
However, the chunks array is currently not used, so it can be left out.
- Create an embeddings directory for the embeddings of each transcript to be written to.
- Run node embeddings-transformers.js to generate the embeddings. All embeddings should now be in the embeddings folder, and an embeddings.json file should be present in the current working directory. This embeddings.json file combines all embeddings generated from the transcripts.
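Before generating embeddings, it can help to check that each transcript matches the format described above. The following is a minimal sketch of such a check. `validateTranscript` is a hypothetical helper, not part of this repository, and it treats chunks as optional because the script currently ignores it.

```javascript
// Hypothetical helper: checks a parsed transcript object against the format above.
// Returns a list of problems; an empty list means the transcript looks usable.
function validateTranscript(transcript) {
  if (transcript === null || typeof transcript !== "object") {
    return ["transcript is not a JSON object"];
  }
  const problems = [];
  if (typeof transcript.text !== "string" || transcript.text.trim() === "") {
    problems.push('"text" must be a non-empty string');
  }
  // "chunks" is optional because the embedding script currently does not use it.
  if (transcript.chunks !== undefined) {
    if (!Array.isArray(transcript.chunks)) {
      problems.push('"chunks" must be an array when present');
    } else {
      transcript.chunks.forEach((chunk, i) => {
        const ts = chunk && chunk.timestamp;
        const validTs =
          Array.isArray(ts) && ts.length === 2 && ts.every((t) => typeof t === "number");
        if (!validTs) problems.push(`chunks[${i}].timestamp must be [start, end]`);
        if (!chunk || typeof chunk.text !== "string") {
          problems.push(`chunks[${i}].text must be a string`);
        }
      });
    }
  }
  return problems;
}
```

You could run this on the result of `JSON.parse` for each file in the transcripts directory and skip or fix any file for which the returned list is not empty.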
To retrieve from the embeddings:
- Make sure you've installed all dependencies by running npm install
- Make sure the embeddings you want to retrieve from are in an embeddings.json file. This file already exists if you generated the embeddings by following the steps for generating embeddings above.
- Open the semantic-retrieval.js file and edit your query on line 25.
- Save the file and run node semantic-retrieval.js to retrieve the top 5 results from the embeddings.
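To give an idea of what retrieving the top 5 results can look like, here is a minimal sketch that ranks stored embeddings against a query embedding. It rests on assumptions not stated in this README: that similarity is measured with cosine similarity, and that each entry has the shape `{ text, embedding }`. The actual metric and the layout of embeddings.json may differ, and `cosineSimilarity` and `topK` are hypothetical names.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns 0 when either vector has zero length, to avoid dividing by zero.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Assumed entry shape: { text: string, embedding: number[] }.
// Returns the k entries most similar to the query embedding, best first.
function topK(queryEmbedding, entries, k = 5) {
  return entries
    .map((entry) => ({
      text: entry.text,
      score: cosineSimilarity(queryEmbedding, entry.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The query embedding must come from the same model that produced the stored embeddings, otherwise the scores are not comparable. A linear scan like this is usually sufficient for a few thousand entries.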