7 specialized modules for computer vision
Production-ready MediaPipe implementation with separate, testable modules for each capability. All modules support single-frame processing and async streaming.
Reference: https://ai.google.dev/edge/mediapipe/solutions/guide
Capability: Detect faces with 6 keypoints per face
Keypoints:
- Right eye, Left eye
- Nose tip, Mouth center
- Right ear, Left ear
Models:
- Model 0: Short-range (within 2 meters)
- Model 1: Full-range (within 5 meters)
Usage:
from mediapipe import FaceDetector
detector = FaceDetector(model_selection=1, min_detection_confidence=0.5)
# Single frame
detections = detector.detect_single(rgb_image)
for face in detections:
    bbox = face['bbox']  # x, y, width, height
    keypoints = face['keypoints']  # 6 points
    confidence = face['confidence']
# Streaming (must run inside an async function)
async for detections in detector.detect_stream(frame_generator):
    print(f"Found {len(detections)} faces")
detector.close()

Performance: 200 FPS @ 640x480 (RTX 4090)
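When several faces come back, a common follow-up is to keep only the most prominent one. The sketch below assumes each detection is a dict shaped like the ones in the usage example, with `bbox` as an `(x, y, width, height)` sequence. `largest_face` is a hypothetical helper for illustration, not part of the module.

```python
def largest_face(detections, min_confidence=0.5):
    """Return the confident detection with the largest box area, or None."""
    candidates = [d for d in detections if d["confidence"] >= min_confidence]
    if not candidates:
        return None
    # bbox is assumed to be (x, y, width, height), so area = width * height.
    return max(candidates, key=lambda d: d["bbox"][2] * d["bbox"][3])


# Stand-in detections; real ones would come from detector.detect_single().
sample = [
    {"bbox": (10, 10, 40, 50), "keypoints": [], "confidence": 0.9},
    {"bbox": (100, 20, 80, 90), "keypoints": [], "confidence": 0.8},
    {"bbox": (0, 0, 200, 200), "keypoints": [], "confidence": 0.3},
]
best = largest_face(sample)
print(best["bbox"])
```

The third box is the largest but falls below the confidence threshold, so it is skipped.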
Capability: 468-point 3D face landmarks with optional iris refinement
Landmarks:
- Face contours, eyebrows, eyes, nose, mouth
- Optional: Iris landmarks (468 → 478 total)
Usage:
from mediapipe import FaceMesh
mesh = FaceMesh(
    max_num_faces=2,
    refine_landmarks=True,  # Include iris
    min_detection_confidence=0.5
)
faces = mesh.process_single(rgb_image)
for face in faces:
    landmarks = face['landmarks']  # 468 or 478 points
    # Each landmark: {x, y, z, visibility, presence}
    # Extract iris
    if len(landmarks) == 478:
        iris_data = mesh.get_iris_landmarks(landmarks)
        left_iris = iris_data['left_iris']  # 5 points
        right_iris = iris_data['right_iris']  # 5 points
mesh.close()

Performance: 60 FPS @ 640x480 (RTX 4090)
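The 468 → 478 split can also be sketched without the helper. This assumes the ten refined points are appended after the mesh points as two groups of five (indices 468-472 and 473-477), which appears to match the upstream MediaPipe layout. Which group belongs to which eye is left open here and should be checked against `get_iris_landmarks`. `split_iris` is a hypothetical name.

```python
MESH_POINTS = 468  # landmarks without iris refinement
IRIS_POINTS = 5    # per eye (assumed: one center plus four boundary points)


def split_iris(landmarks):
    """Split a 478-point list into mesh points and two 5-point iris groups."""
    if len(landmarks) != MESH_POINTS + 2 * IRIS_POINTS:
        return None  # refinement was off, so there are no iris points
    first_end = MESH_POINTS + IRIS_POINTS
    return {
        "mesh": landmarks[:MESH_POINTS],
        "iris_a": landmarks[MESH_POINTS:first_end],  # indices 468-472
        "iris_b": landmarks[first_end:],             # indices 473-477
    }


# Stand-in landmarks: the index itself, so the slices are easy to inspect.
parts = split_iris(list(range(478)))
print(parts["iris_a"], parts["iris_b"])
```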
Capability: 21-point hand landmarks with gesture recognition
Landmarks (per hand):
- Wrist
- Thumb (4 joints)
- Index, Middle, Ring, Pinky (4 joints each)
Detected Gestures:
- Thumb Up
- Victory / Peace Sign
- Open Palm
- Fist
- Pointing
- OK Sign
- Rock Sign
Usage:
from mediapipe import HandTracker
tracker = HandTracker(max_num_hands=2)
hands = tracker.track_single(rgb_image)
for hand in hands:
    landmarks = hand['landmarks']  # 21 points
    handedness = hand['handedness']  # 'Left' or 'Right'
    gestures = hand['gestures']
    for gesture in gestures:
        print(f"{gesture['name']}: {gesture['confidence']}")
tracker.close()

Performance: 100 FPS @ 640x480 (RTX 4090)
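Gesture labels such as Open Palm or Fist are typically derived from which fingers are extended. The sketch below shows one simple heuristic and is not necessarily what `HandTracker` does internally. It assumes landmarks are dicts with normalized `x`/`y` (y grows downward), the standard MediaPipe hand indexing (fingertips at 8, 12, 16, 20; PIP joints at 6, 10, 14, 18), and a roughly upright hand. The thumb is ignored, so Thumb Up cannot be told apart from Fist here.

```python
FINGER_TIPS = {"index": 8, "middle": 12, "ring": 16, "pinky": 20}
FINGER_PIPS = {"index": 6, "middle": 10, "ring": 14, "pinky": 18}


def extended_fingers(landmarks):
    """Names of fingers whose tip lies above (smaller y than) its PIP joint."""
    return [
        name
        for name, tip in FINGER_TIPS.items()
        if landmarks[tip]["y"] < landmarks[FINGER_PIPS[name]]["y"]
    ]


def rough_gesture(landmarks):
    """Very coarse label based only on the four non-thumb fingers."""
    fingers = set(extended_fingers(landmarks))
    if len(fingers) == 4:
        return "Open Palm"
    if not fingers:
        return "Fist"  # thumb is not checked, so this may also be Thumb Up
    if fingers == {"index", "middle"}:
        return "Victory"
    if fingers == {"index"}:
        return "Pointing"
    return "Other"


def make_hand(raised):
    """Build 21 stand-in landmarks with the named fingertips raised."""
    hand = [{"x": 0.5, "y": 0.5} for _ in range(21)]
    for name in raised:
        hand[FINGER_TIPS[name]]["y"] = 0.2
    return hand


print(rough_gesture(make_hand(["index", "middle"])))
```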
Capability: 33-point body pose with world coordinates
Landmarks:
- Face: nose, eyes, ears, mouth
- Upper body: shoulders, elbows, wrists, hands
- Torso: hips
- Lower body: knees, ankles, feet, heels
World Landmarks: Real-world 3D coordinates (meters, origin at hip center)
Usage:
from mediapipe import PoseTracker
tracker = PoseTracker(
    model_complexity=2,  # 0=lite, 1=full, 2=heavy
    smooth_landmarks=True
)
pose = tracker.track_single(rgb_image)
if pose:
    landmarks = pose['landmarks']  # 33 points (normalized)
    world_landmarks = pose['world_landmarks']  # 33 points (meters)
    confidence = pose['confidence']
    # Calculate joint angles
    angles = tracker.calculate_angles(landmarks)
    print(f"Left elbow: {angles['left_elbow']}°")
    print(f"Right knee: {angles['right_knee']}°")
tracker.close()

Performance: 80 FPS @ 640x480 (RTX 4090)
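A joint angle is usually computed as the angle at the middle joint between the two adjacent limb segments. The sketch below shows that calculation in 2D. It may differ from what `calculate_angles` does internally, and it assumes landmarks are dicts with `x` and `y` keys. `joint_angle` is a hypothetical helper.

```python
import math


def joint_angle(a, b, c):
    """Angle ABC in degrees at vertex b, in the range [0, 180]."""
    v1 = (a["x"] - b["x"], a["y"] - b["y"])
    v2 = (c["x"] - b["x"], c["y"] - b["y"])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return None  # two landmarks coincide, so the angle is undefined
    cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


# Stand-in shoulder, elbow, and wrist forming a right angle at the elbow.
shoulder = {"x": 1.0, "y": 0.0}
elbow = {"x": 0.0, "y": 0.0}
wrist = {"x": 0.0, "y": 1.0}
print(joint_angle(shoulder, elbow, wrist))
```

With normalized image coordinates the result is skewed by the image aspect ratio, since x and y are scaled by width and height respectively. The world landmarks, which are in meters, avoid that distortion.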
Capability: Face mesh + hands + pose in single pass (543 landmarks!)
Components:
- Face: 468 landmarks
- Pose: 33 landmarks
- Left hand: 21 landmarks
- Right hand: 21 landmarks
Usage:
from mediapipe import HolisticTracker
tracker = HolisticTracker(model_complexity=2)
results = tracker.track_single(rgb_image)
if results['face']:
    face_landmarks = results['face']['landmarks']  # 468
if results['pose']:
    pose_landmarks = results['pose']['landmarks']  # 33
if results['left_hand']:
    left_hand_landmarks = results['left_hand']['landmarks']  # 21
if results['right_hand']:
    right_hand_landmarks = results['right_hand']['landmarks']  # 21
tracker.close()

Performance: 40 FPS @ 640x480 (RTX 4090)
Note: Most efficient way to track everything at once
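Any component can be missing in a given frame, so downstream code often needs to know how many of the 543 landmarks are actually present. The sketch below assumes the result dict has the four keys used in the usage example and that a missing component is falsy (`None` or empty). `count_landmarks` is a hypothetical helper.

```python
EXPECTED = {"face": 468, "pose": 33, "left_hand": 21, "right_hand": 21}


def count_landmarks(results):
    """Return the number of landmarks present for each component."""
    counts = {}
    for key in EXPECTED:
        part = results.get(key)
        counts[key] = len(part["landmarks"]) if part else 0
    return counts


# Stand-in result where the left hand is out of frame.
frame_result = {
    "face": {"landmarks": [None] * 468},
    "pose": {"landmarks": [None] * 33},
    "left_hand": None,
    "right_hand": {"landmarks": [None] * 21},
}
counts = count_landmarks(frame_result)
print(sum(counts.values()), "of", sum(EXPECTED.values()))
```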
Capability: Iris landmarks + gaze direction
Provides:
- 5 iris landmarks per eye
- Eye region landmarks
- Gaze direction (normalized [-1, 1])
Usage:
from mediapipe import IrisTracker
tracker = IrisTracker(max_num_faces=1)
eyes = tracker.track_single(rgb_image)
for eye in eyes:
    iris_landmarks = eye['iris_landmarks']  # 5 points
    eye_landmarks = eye['eye_landmarks']  # ~25 points
    gaze = eye['gaze_direction']  # {x, y}
    print(f"{eye['eye']}: Looking {gaze['x']:.2f}, {gaze['y']:.2f}")
tracker.close()

Gaze Interpretation:
- x < 0: Looking left
- x > 0: Looking right
- y < 0: Looking up
- y > 0: Looking down
- x, y near 0: Looking center
Performance: 60 FPS @ 640x480 (RTX 4090)
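The gaze interpretation above can be turned into a label with a small dead zone around the center. This is a sketch only: the `dead_zone` value of 0.15 is an arbitrary assumption that would need tuning, and `describe_gaze` is a hypothetical helper, not part of `IrisTracker`.

```python
def describe_gaze(gaze, dead_zone=0.15):
    """Map a normalized {x, y} gaze in [-1, 1] to a coarse direction label."""
    x, y = gaze["x"], gaze["y"]
    horizontal = "left" if x < -dead_zone else "right" if x > dead_zone else ""
    vertical = "up" if y < -dead_zone else "down" if y > dead_zone else ""
    label = " ".join(part for part in (vertical, horizontal) if part)
    return label or "center"


print(describe_gaze({"x": 0.4, "y": -0.6}))
```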
Capability: Real-time person segmentation with effects
Modes:
- Selfie segmentation (person vs background)
- Hair segmentation (experimental)
Effects:
- Background blur
- Background replacement
- Foreground extraction (RGBA with alpha)
Usage:
import numpy as np
from mediapipe import Segmenter
segmenter = Segmenter(
    model_selection=1,  # 0=general, 1=landscape
    segmentation_type='selfie'
)
result = segmenter.segment_single(rgb_image)
mask_uint8 = result['mask']  # [0, 255]
mask_float = result['mask_float']  # [0.0, 1.0]
# Apply background blur
blurred = segmenter.apply_background(
    image=rgb_image,
    mask=mask_float,
    background=None,  # None = blur
    blur_amount=15
)
# Replace background
new_bg = np.zeros_like(rgb_image)  # Black background
replaced = segmenter.apply_background(
    image=rgb_image,
    mask=mask_float,
    background=new_bg
)
# Extract foreground with alpha
rgba = segmenter.extract_foreground(rgb_image, mask_float)
segmenter.close()

Performance: 100 FPS @ 640x480 (RTX 4090)
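Background replacement and foreground extraction both reduce to per-pixel alpha blending with the float mask. The arithmetic is sketched here in plain Python for a single pixel. It illustrates the formula only and is not the code inside `apply_background` or `extract_foreground`, which presumably operate on whole arrays. Both function names are hypothetical.

```python
def blend_pixel(foreground, background, alpha):
    """Per channel: out = alpha * fg + (1 - alpha) * bg, with alpha in [0, 1]."""
    alpha = max(0.0, min(1.0, alpha))
    return tuple(
        round(alpha * f + (1.0 - alpha) * b)
        for f, b in zip(foreground, background)
    )


def to_rgba(pixel, alpha):
    """Append the mask value as an 8-bit alpha channel."""
    alpha = max(0.0, min(1.0, alpha))
    return (*pixel, round(255 * alpha))


person = (200, 100, 0)
backdrop = (0, 0, 100)
print(blend_pixel(person, backdrop, 0.5))
print(to_rgba(person, 1.0))
```

With NumPy arrays the same expression applies elementwise once the mask has a trailing channel axis, for example `mask_float[..., None]`.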
All modules support async streaming for real-time video:
import asyncio
import cv2

async def process_stream(detector, frames):
    async for results in detector.detect_stream(frames):
        # Process results
        yield results

# Example frame generator
async def camera_frames():
    cap = cv2.VideoCapture(0)
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            yield frame_rgb
            await asyncio.sleep(1/30)  # Cap at roughly 30 FPS
    finally:
        cap.release()  # Release the camera when the generator ends

Always close modules when done:
# Using context manager (if implemented)
with FaceDetector() as detector:
    faces = detector.detect_single(image)
# Manual cleanup
detector = FaceDetector()
try:
    faces = detector.detect_single(image)
finally:
    detector.close()

Modules return empty results on failure (no exceptions for missing detections):
faces = detector.detect_single(image)
if not faces:
    print("No faces detected")
else:
    print(f"Found {len(faces)} faces")

# Run all MediaPipe tests
pytest tests/test_mediapipe.py -v
# Test specific module
pytest tests/test_mediapipe.py::TestFaceDetection -v
pytest tests/test_mediapipe.py::TestHandTracking -v
# Test with real camera
python -c "
from mediapipe import FaceDetector
import cv2
detector = FaceDetector()
cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    faces = detector.detect_single(rgb)
    print(f'Faces: {len(faces)}')
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
detector.close()
"

Core:
mediapipe==0.10.9
opencv-python==4.8.1.78
numpy==1.24.3
Optional:
Pillow==10.1.0 (for additional image formats)
- Choose the right model complexity:
  - Lite (0): Fast, less accurate
  - Full (1): Balanced
  - Heavy (2): Accurate, slower
- Use streaming mode for video:
  - Set static_image_mode=False
  - Enables tracking (faster than per-frame detection)
- Reduce resolution:
  - MediaPipe works well at 640x480
  - Downscale higher-resolution frames before processing
- Use holistic for multiple features:
  - More efficient than running face+hands+pose separately
- Close modules when idle:
  - Free GPU memory when not in use
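Two of the tips above, downscaling and cleanup, can be sketched together. `fit_within` is a hypothetical helper that computes a target size no larger than 640x480 while preserving aspect ratio. `contextlib.closing` from the standard library calls `close()` on exit, which helps when a module does not implement the context manager protocol itself. `DummyModule` is a stand-in for any module that exposes `close()`.

```python
from contextlib import closing


def fit_within(width, height, max_width=640, max_height=480):
    """Return (w, h) scaled down to fit the limits, keeping aspect ratio."""
    scale = min(max_width / width, max_height / height, 1.0)  # never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))


class DummyModule:
    """Stand-in for a module such as FaceDetector; only close() matters here."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


module = DummyModule()
with closing(module):
    # Real code would resize the frame and run detection inside this block.
    size = fit_within(1920, 1080)

print(size, module.closed)
```

The returned tuple is in `(width, height)` order, which is the order `cv2.resize` expects for its `dsize` argument.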
- MediaPipe Official Docs - Complete API reference
- Services - How gRPC layer uses these modules
- Tests - Usage examples