This repo contains scripts to:
- Enumerate .m4a files from a Box shared folder and generate a manifest (videos.jsonl)
- Run each file through the Azure Functions pipeline:
  - Submit batch transcription (TranscribeHttp)
  - Write 30s segments JSON to Blob (segments/<video_id>.json)
  - Embed + index segments into Azure AI Search (EmbedAndIndex)
- Query indexed segments (SearchSegments)
- Python 3.11+ recommended
- Azure Functions already deployed (or runnable locally)
- Box shared folder link that contains .m4a files
- Working Box API token:
  - EITHER a Developer Token (quick + expires)
  - OR OAuth tokens (BOX_ACCESS_TOKEN + BOX_REFRESH_TOKEN + client id/secret)
transcribe/
  scripts/
    box_auth.py
    box_shared_folder_manifest.py
  import_videos.py
  videos.jsonl        # generated
  requirements.txt
  .env                # you create this (NOT committed)
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

If you don’t have a requirements.txt for scripts yet, minimally you’ll need:
requests
python-dotenv

(Box listing can be done via raw REST calls, so you may not need boxsdk.)
Create a .env file in the project root (the same directory you run the scripts from).

Set the shared folder URL:
BOX_SHARED_FOLDER_URL=https://tulane.box.com/s/<shared-folder-token>

Choose one auth method.

Developer Token:
BOX_TOKEN=<your_box_developer_token>

OAuth tokens:
BOX_CLIENT_ID=<your_box_client_id>
BOX_CLIENT_SECRET=<your_box_client_secret>
BOX_ACCESS_TOKEN=<your_box_access_token>
BOX_REFRESH_TOKEN=<your_box_refresh_token>

Note: refresh tokens can become invalid if rotated/revoked. If you see invalid_grant, re-run your OAuth login flow and update .env.
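As an illustration of the two auth options above, here is a minimal sketch of how a script might decide which one is configured and build the form body for a token refresh. The function names and the "prefer OAuth when all four values are set" rule are assumptions for this example, not necessarily what box_auth.py does; the token endpoint is Box's standard OAuth 2.0 endpoint.

```python
from typing import Mapping
from urllib.parse import urlencode

# Box's OAuth 2.0 token endpoint (used for refresh_token grants).
BOX_TOKEN_URL = "https://api.box.com/oauth2/token"

OAUTH_KEYS = (
    "BOX_CLIENT_ID",
    "BOX_CLIENT_SECRET",
    "BOX_ACCESS_TOKEN",
    "BOX_REFRESH_TOKEN",
)


def choose_box_auth(env: Mapping[str, str]) -> str:
    """Return "oauth" or "developer_token" based on which variables are set.

    Assumption: OAuth wins when all four OAuth values are present.
    """
    if all(env.get(key) for key in OAUTH_KEYS):
        return "oauth"
    if env.get("BOX_TOKEN"):
        return "developer_token"
    raise ValueError("Set BOX_TOKEN or all of: " + ", ".join(OAUTH_KEYS))


def refresh_form(env: Mapping[str, str]) -> bytes:
    """Build the form-encoded body to POST to BOX_TOKEN_URL for a refresh."""
    return urlencode(
        {
            "grant_type": "refresh_token",
            "refresh_token": env["BOX_REFRESH_TOKEN"],
            "client_id": env["BOX_CLIENT_ID"],
            "client_secret": env["BOX_CLIENT_SECRET"],
        }
    ).encode("ascii")
```

A successful refresh returns a new access token and a new refresh token, so both values in .env need to be updated afterwards; reusing the old refresh token is a common cause of invalid_grant.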
These should be the full function URLs, including ?code=...:
TRANSCRIBE_URL=https://<yourapp>.azurewebsites.net/api/TranscribeHttp?code=...
EMBED_INDEX_URL=https://<yourapp>.azurewebsites.net/api/EmbedAndIndex?code=...
SEARCH_FN_URL=https://<yourapp>.azurewebsites.net/api/SearchSegments?code=...

SEGMENTS_CONTAINER=segments
POLL_SECONDS=15
MAX_ACTIVE=10

The manifest script (scripts/box_shared_folder_manifest.py) reads your Box shared folder and outputs videos.jsonl with one line per .m4a:
{"video_id":"vid_123","media_url":"https://..."}
{"video_id":"vid_456","media_url":"https://..."}

Run:
python scripts/box_shared_folder_manifest.py

Pick one entry from videos.jsonl and confirm it downloads:
python - <<'PY'
import json
with open("videos.jsonl","r") as f:
    print(json.loads(next(f)))
PY
curl -I -L "<media_url>"

The final response should be 200 OK (not an HTML page or a 404). If this fails, Speech won’t be able to fetch the file either.
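Before submitting anything, it can help to sanity-check the whole manifest rather than a single entry. The sketch below only assumes the two fields shown in the sample lines (video_id and media_url); the function name and the specific checks (non-empty fields, https URL, unique ids) are illustrative choices, not part of the repo's scripts.

```python
import json
from typing import Iterable

REQUIRED_FIELDS = ("video_id", "media_url")


def parse_manifest(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL manifest lines and fail fast on malformed entries."""
    entries: list[dict] = []
    seen_ids: set[str] = set()
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        if not isinstance(entry, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                raise ValueError(f"line {lineno}: missing {field}")
        if not entry["media_url"].startswith("https://"):
            raise ValueError(f"line {lineno}: media_url is not an https URL")
        if entry["video_id"] in seen_ids:
            raise ValueError(f"line {lineno}: duplicate video_id")
        seen_ids.add(entry["video_id"])
        entries.append(entry)
    return entries
```

Typical use would be `parse_manifest(open("videos.jsonl", encoding="utf-8"))`. This checks shape only; it does not replace the curl download test.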
The importer (import_videos.py):
- reads videos.jsonl
- submits transcription jobs via TranscribeHttp
- polls until each completes
- indexes segments via EmbedAndIndex

Run:
python import_videos.py

The importer writes a pipeline_state.json file as it runs. If the script stops, you can rerun it and it will resume from the saved state.
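To make the resume behavior concrete, here is a minimal sketch of one way a state file like pipeline_state.json could be loaded, saved, and used to skip finished videos. The state layout (a dict keyed by video_id with a "status" field) and the "indexed" status value are assumptions for illustration; the real importer's format may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Return the saved state, or an empty dict on first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_name, path)


def pending(video_ids: list[str], state: dict) -> list[str]:
    """Video ids that still need work (hypothetical "indexed" terminal status)."""
    return [v for v in video_ids if state.get(v, {}).get("status") != "indexed"]
```

Saving after every status change keeps the file current, so a rerun only processes what `pending` returns.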
Once a few videos are indexed, query your SearchSegments function (export SEARCH_FN_URL in your shell first; values in .env are not loaded into the shell automatically):
curl -X POST "$SEARCH_FN_URL" \
  -H "Content-Type: application/json" \
  -d '{"q":"measles","mode":"hybrid","top":5,"k":40}'

If you get results, your segments are searchable.
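The same query can be issued from Python. This sketch uses only the standard library and mirrors the request body from the curl example; the helper name and its defaults are illustrative, and nothing is assumed about the shape of the response.

```python
import json
import urllib.request


def build_search_request(
    url: str, q: str, mode: str = "hybrid", top: int = 5, k: int = 40
) -> urllib.request.Request:
    """Build a POST request with the same JSON body as the curl example."""
    body = json.dumps({"q": q, "mode": mode, "top": top, "k": k}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(url: str, q: str) -> object:
    """Send the query and return the decoded JSON response as-is."""
    request = build_search_request(url, q)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

For example, `run_search(os.environ["SEARCH_FN_URL"], "measles")` after loading the URL into the environment.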
- Ensure the manifest script is producing working media_urls
- Validate with curl -I -L "<media_url>" (must end in 200)
- If a shared link works in browser but not via curl, it may rely on cookies/redirects. The manifest script should output a direct download URL.
- Speech batch jobs can take time; check your TranscribeHttp function logs / Application Insights
- Consider increasing POLL_SECONDS to reduce throttling
- Reduce MAX_ACTIVE if you see rate-limit behavior
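To show how the two settings interact, here is a rough sketch of a bounded submit-and-poll loop: at most max_active jobs are in flight, and the loop sleeps poll_seconds between polling rounds. The submit and is_done callables are hypothetical stand-ins for calls to TranscribeHttp; this is not the importer's actual code.

```python
import time
from collections import deque
from typing import Callable, Hashable, Iterable


def run_bounded(
    items: Iterable[Hashable],
    submit: Callable[[Hashable], object],
    is_done: Callable[[object], bool],
    max_active: int = 10,
    poll_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Process items with at most max_active jobs in flight at once."""
    queue = deque(items)
    active: dict = {}  # item -> job handle returned by submit()
    finished: list = []
    while queue or active:
        # Top up to the concurrency limit.
        while queue and len(active) < max_active:
            item = queue.popleft()
            active[item] = submit(item)
        # One polling round over everything in flight.
        for item, job in list(active.items()):
            if is_done(job):
                finished.append(item)
                del active[item]
        # Wait before the next round only if jobs are still running.
        if active:
            sleep(poll_seconds)
    return finished
```

A larger poll_seconds means fewer status requests per minute; a smaller max_active means fewer jobs submitted and polled at once. Both reduce pressure on rate limits at the cost of total run time.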
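- Azure AI Search document keys may only contain letters, digits, underscores (_), dashes (-), and equal signs (=), so characters such as : are rejected. If you use segment keys like vid:0001, replace : with _ or -.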
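A small helper can apply that replacement consistently before indexing. The function name is illustrative; the allowed character set follows the Azure AI Search rule for document keys (letters, digits, underscore, dash, equal sign).

```python
import re

# Anything outside the characters Azure AI Search allows in a document key.
_INVALID_KEY_CHARS = re.compile(r"[^A-Za-z0-9_\-=]")


def sanitize_key(raw: str, replacement: str = "_") -> str:
    """Replace every disallowed character in a document key."""
    return _INVALID_KEY_CHARS.sub(replacement, raw)
```

For example, `sanitize_key("vid:0001")` gives `"vid_0001"`. Note that distinct raw keys can collide after replacement (vid:1 and vid_1 both become vid_1), so pick a key scheme where that cannot happen.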
- Do not commit .env, pipeline_state.json, or any token/key material.
- Prefer query keys (read-only) for Search in front-end scenarios.
- For long-term automation, use a Box app auth method approved by your org (not developer token).