In the name of all that is methodical and systematic, behold! TIPS — The Temporal Interview Profiling System — a most noble contraption whereby the souls of candidates may be weighed, measured, and thoroughly examined in the crucible of the modern interview.
Whereas the art of hiring hath long suffered under the yoke of subjectivity and caprice, TIPS emergeth as a beacon of analytical rigour. This system doth process recorded video interviews through a multi-stage pipeline of considerable sophistication, extracting signals both auditory and visual, evaluating responses with the keen eye of Large Language Models, and rendering time-evolving verdicts upon the candidate's worthiness.
The Temporal Interview Profiling System (TIPS) standeth as an automated interview analysis engine, designed to process pre-recorded video interviews and generate comprehensive behavioural and semantic assessments of candidates whomst seek employment.
The system doth operate as a batch processing apparatus, wherein interviews are first recorded and subsequently subjected to six distinct stages of analysis, each building upon the last, until finally there emerge time-evolving candidate scores and verdicts upon hiring.
TIPS striveth to provide an objective, data-driven assessment by analysing:
- Verbal Responses — What the candidate doth uttereth
- Vocal Characteristics — How the candidate doth speaketh
- Visual Behaviour — Whither the candidate doth looketh
Unlike实时 systems of lesser capability, TIPS understandeth not merely what is said, but when it is said, and how the candidate's confidence doth fluctuate throughout the interrogation.
The architecture of TIPS consisteth of three principal components, each serving its sacred purpose:
┌─────────────────────┐ ┌───────────────────────┐ ┌──────────────────┐
│ Interview UI │ │ Backend Pipeline │ │ The Dashboard │
│ (Recording) │────▶│ (Six Stages) │────▶│ (Visualisation)│
└─────────────────────┘ └───────────────────────┘ └──────────────────┘
Pillar I — The Interview Recording UI
A WebRTC-powered browser interface for conducting video interviews with synchronised audio capture from both interviewer and candidate.
Pillar II — The Backend Pipeline
A six-stage processing system that analyseth recorded interviews and extracteth behavioural metrics and semantic relevance scores.
Pillar III — The Dashboard
An interactive web-based visualisation interface for presenting and exploring analysis results in their full splendour.
| Component | Technology |
|---|---|
| Tongue | Python 3.11 |
| Web Framework | FastAPI |
| WebRTC | aiortc |
| ASGI Server | uvicorn |
| Speech-to-Text | faster-whisper (small model) |
| Audio Analysis | librosa, webrtcvad |
| Video Processing | OpenCV, MediaPipe |
| LLM Inference | Transformers + PyTorch |
| LLM Quantization | BitsAndBytes (4-bit NF4) |
| Video Codec | ffmpeg, PyAV |
| Component | Technology |
|---|---|
| Backend Server | FastAPI (Python) |
| Real-time Comm. | WebSocket |
| Video/Audio Capture | WebRTC (Browser API) |
| Media Processing | aiortc (Python) |
| Frontend | HTML5, CSS3, JavaScript (ES6+) |
| Component | Technology |
|---|---|
| Frontend Framework | Vanilla JavaScript (ES6 Modules) |
| Routing | Custom SPA Router |
| Data Visualisation | Chart.js 4.4.2 |
| Styling | CSS3 with Custom Properties |
The LLM component employeth Qwen2.5-3B-Instruct with 4-bit quantisation, enabling operation upon GPUs of modest VRAM (approximately 3GB) whilst maintaining reasonable inference quality.
├── backend/ # The Great Engine (Pipeline)
│ ├── main.py # Pipeline orchestration
│ ├── config/ # Configuration module
│ ├── requirements.txt # Python dependencies
│ ├── src/
│ │ ├── stage0_timebase/ # Stage 0 — Timebase Establishment
│ │ ├── stage1_extraction/ # Stage 1 — Signal Extraction
│ │ │ ├── candidate_audio.py # Stage 1A
│ │ │ ├── interviewer_audio.py# Stage 1B
│ │ │ └── candidate_video.py # Stage 1C
│ │ ├── stage2_temporal/ # Stage 2 — Temporal Segmentation
│ │ ├── stage3_behavior/ # Stage 3 — Behavioural Metrics
│ │ ├── stage4_semantic/ # Stage 4 — Semantic Scoring
│ │ └── stage5_aggregation/ # Stage 5 — Verdict Aggregation
│ ├── trans/ # Interview recordings (inputs)
│ ├── jd/ # Job descriptions
│ ├── output/ # Pipeline JSON outputs
│ └── results/ # Timestamped results
│
├── web_ui/ # The Interview Recording UI
│ ├── server.py # FastAPI WebSocket server
│ ├── index.html # Landing page (role selection)
│ ├── interviewer.html # Interviewer control panel
│ ├── candidate.html # Candidate interface
│ ├── interviewer.js # Interviewer WebRTC client
│ ├── candidate.js # Candidate WebRTC client
│ └── recordings/ # Stored interview sessions
│
├── docs/
│ ├── photos/ # Dashboard screenshots (1-16)
│ ├── TIPS_Body.tex # The great tome of documentation
│ └── dfd.py # Data flow diagram generator
│
├── Interview_scripts/ # LaTeX interview question scripts
│ ├── machine_learning_engineer.tex
│ ├── system_architect.tex
│ └── product_manager.tex
│
├── README.md
└── .gitattributes
The pipeline consisteth of six stages, each with its sacred duty:
| Stage | Name | Description |
|---|---|---|
| 0 | Timebase | Establisheth canonical time base, synchroniseth all streams |
| 1 | Extraction | Three parallel sub-stages extract audio features, video features, and transcriptions |
| 2 | Temporal | Groupeth speaking segments, detecteth silence, paireth Q&A |
| 3 | Behaviour | Computeth confidence, fluency, eye contact, latency |
| 4 | Semantic | Scoreth answer relevance using LLM against Job Description |
| 5 | Aggregation | JD-conditioned scoring with chronological accumulation |
Stage 0 establisheth the canonical time base that synchroniseth all subsequent processing. This foundational stage extracteth:
- Video metadata (FPS, frame count, duration)
- Audio metadata (sample rate, channel count, duration)
- Timestamp alignment between video and audio streams
- Dataset identification for tracking
All subsequent stages depend upon this timebase, for it is the very foundation upon which the temple of analysis is built.
Stage 1 representeth the primary data extraction phase, uniquely designed for parallel execution:
Stage 1A — Candidate Audio Processing
Extracteth RMS energy, pitch, voice activity segments, and speech transcription via Faster-Whisper.
Stage 1B — Interviewer Audio Processing
Transcribeth the interviewer's utterances for question identification, with word-level timestamps.
Stage 1C — Candidate Video Processing
Extracteth facial features, head pose estimation (yaw, pitch, roll), and gaze direction from sampled frames.
These three substages run concurrently, maximising throughput whilst minimising total processing time.
Stage 2 combineth the outputs of Stage 1 to create a unified temporal view:
- Speaking Segment Detection — Identifyeth when each party doth speak
- Q&A Pairing — Mapeth interviewer questions to candidate answers using temporal proximity and silence tolerance
The algorithm thus handleth natural interview flow wherein candidates may pause whilst collecting their thoughts.
Stage 3 computeth behavioural metrics for each candidate speaking segment:
Audio Metrics:
- Pitch (fundamental frequency) — indicateth confidence
- Energy (RMS) — indicateth assertiveness
- Speech rate — indicateth fluency
- Pause density — indicateth thinking time
Video Metrics:
- Face presence ratio — indicateth camera engagement
- Head motion — indicateth nervousness or engagement
- Gaze stability — indicateth eye contact quality
Stage 4 implementeth the core intelligence of TIPS, employing the Qwen2.5-3B-Instruct LLM with 4-bit quantisation to:
- Evaluate semantic relevance between answers and job description (0.0-1.0)
- Extract matched keywords from technical vocabulary
- Assess five competency dimensions:
- Technical Depth
- System Design
- Production Experience
- Communication Clarity
- Problem Solving
- Generate incremental verdicts after each question
Stage 5 aggregateth the incremental assessments to render a final hiring recommendation:
Verdict Options:
- STRONG_HIRE — Exceedeth expectations
- HIRE — Meeteth requirements
- BORDERLINE — Mixed signals
- NO_HIRE — doth not meet requirements
With confidence levels of HIGH, MEDIUM, or LOW, and a natural language justification.
Behold! The Dashboard, a panoptic vessel wherein the entirety of the candidate's performance shall be laid bare unto the analyst's discerning eye.
The Dashboard consisteth of five principal views, each presenting different facets of the analysis:
The grand entryway presenting an overview of the entire interview — candidate name, position sought, and all metadata pertaining to the assessment.
A scrolling timeline of all utterances, colour-coded by speaker, enabling traversal through the interview's chronology. Click upon any segment, and the video shall reposition itself to that moment whilst the transcript appeareth overlaid upon the visual.
Two graphs of great utility:
- Score Trajectory — The candidate's performance over time
- Behavioural Signals — Three crucial metrics (Vocal Energy, Speech Rate, Gaze Stability), each toggleable at the analyst's pleasure
A cornucopia of visualisations:
- Question-Wise Relevance Score — Bar graph for each interrogatory
- Competency Breakdown — Aggregated performance across all dimensions
- Keyword Match Heatmap — Distribution of key terms from the job description
- Behavioural Distributions — Confidence levels, fluency scores, response latencies
- Performance Trajectory — How performance evolved throughout
All question-answer pairs in unified view, each expandable to reveal the full particulars: question, answer, relevance score, behavioural metrics, and competency.
The Interview Recording UI provides a browser-based interface for conducting video interviews:
- Landing Page — Role selection (Interviewer / Candidate)
- Interviewer Panel — Start/control interview, view participant status
- Candidate Interface — Camera/microphone selection, local recording backup
- Server-side Recording — Captures interviewer audio, candidate audio, and candidate video
- Dark/Light Theme — Toggle support
The UI producestandardised output files in MP4 (video) and WAV (audio) formats, ensuring consistent input to the backend pipeline.
The pipeline accepteth the following:
| File | Format | Description |
|---|---|---|
candidate_video.mp4 |
MP4 (H.264) | Candidate's video feed |
candidate_audio.wav |
WAV (48kHz) | Candidate's microphone |
interviewer_audio.wav |
WAV (48kHz) | Interviewer's microphone |
*.md |
Plain text | Job description |
The pipeline producesthe following JSON files in backend/output/:
| File | Description |
|---|---|
timeline.json |
Master timebase synchronisation |
candidate_audio_raw.json |
Audio features + transcription |
candidate_video_raw.json |
Video frame features |
interviewer_transcript.json |
Interviewer speech-to-text |
speaking_segments.json |
Speaking vs silence segments |
qa_pairs.json |
Question-answer mappings |
candidate_behavior_metrics.json |
Behavioural metrics |
relevance_scores.json |
LLM-evaluated relevance scores |
candidate_score_timeline.json |
Time-evolving performance scores + final verdict |
cd backend
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/macOS
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
#iring interview recordings in trans Run pipeline (requ/ and JD in jd/)
python main.pycd web_ui
# Install Python dependencies
pip install fastapi uvicorn aiortc av
# Start the server
python server.py --host 0.0.0.0 --port 8000
# Open in browser
# Interviewer: http://localhost:8000/interviewer.html
# Candidate: http://localhost:8000/candidate.html# Sample recordings and JDs are included within
cd backend
python main.py- Interview Recording (WebRTC)
- Signal Extraction Pipeline
- Temporal Segmentation
- Behavioural Metrics Computation
- LLM-Powered Semantic Scoring
- Dashboard Visualisation
- Real-time streaming (contemplated)
MIT License — freely use, modify, and distribute this system.
Thus concludes our discourse upon TIPS — The Temporal Interview Profiling System. May it serve thee well in the noble quest of finding worthy candidates amongst the multitude of applicants. May thy hiring decisions be ever more informed, thy analysis ever more precise, and thy process ever more streamlined.
For in the final analysis, the truth shall set thee free — and TIPS shall help thee find it.
A systems-level exploration of temporal human-computer interaction in the domain of interviews.
Written in the year of our Lord 2026.




