- [2026.03.30] 🔥🔥 We release Penguin-Recap-V! This dataset features multi-granularity video annotations with descriptions across three temporal scales: Dense time-level, Paragraph-level, and Video-level.
- [2026.03.26] 🔥🔥 The evaluation of Penguin-VL on benchmarks is now supported in lmms-eval.
- [2026.03.20] 🔥🔥 We release Penguin-Recap-I, our reconstructed high-quality image training data for Penguin-VL, on Hugging Face.
- [2026.03.17] We release the training code for Penguin-VL; see § Training for details.
- [2026.03.10] Penguin-VL was ranked #1 Paper of the Day on Hugging Face Daily Papers.
- [2026.03.09] Release inference code, vLLM plugin, and Gradio demo for Penguin-VL.
- [2026.03.09] Release Penguin-VL-2B, Penguin-VL-8B, and Penguin Vision Encoder on Hugging Face.
- Release our re-captioned training data - Penguin-Recap-I (Image)
- Release training code
- Release model checkpoint
- Release inference code
Penguin-VL is a compact vision-language model family built to study how far multimodal efficiency can be pushed by redesigning the vision encoder, rather than only scaling data or model size.
Most modern VLMs rely on vision encoders pretrained with large-scale contrastive objectives such as CLIP/SigLIP-style pretraining. Penguin-VL argues that this setup can be suboptimal for multimodal reasoning because contrastive learning favors coarse category-level invariances over the fine-grained signals needed for OCR, document understanding, dense captioning, and complex reasoning. Instead, Penguin-VL introduces Penguin-Encoder, a vision encoder initialized from a text-only LLM, so the visual backbone starts closer to the language model representation space and learns more data-efficiently.
Framework overview of Penguin-VL: an LLM-initialized vision encoder, mixed-supervision pretraining, and efficient video token compression.
- LLM → Vision Encoder initialization (Penguin-Encoder)
  Initialize the vision encoder from a text-only LLM (e.g., Qwen3-0.6B), convert causal attention to bidirectional attention, and add 2D-RoPE for variable-resolution vision tokens.
- Mixed-supervision encoder pretraining
  Warm up the LLM-initialized encoder with a reconstruction/distillation objective under a teacher vision encoder (amplitude / direction / relation losses) to inject visual knowledge stably, then switch to high-resolution alignment.
- Video efficiency via Temporal Redundancy-Aware (TRA) token compression
  Dynamically allocate token budgets across key frames vs. intermediate frames under a global token budget to scale to long videos more efficiently.
- Unified training recipe
  A low-to-high resolution curriculum + instruction tuning strategy that balances image and video capabilities at compact scale.
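To make the TRA idea of sharing one global token budget between key frames and intermediate frames more concrete, here is a minimal sketch. It is not the released implementation: it assumes frames are already labeled as key or intermediate, and the function name `allocate_token_budget` and the parameters `key_weight` and `min_tokens` are hypothetical choices made for illustration only.

```python
def allocate_token_budget(is_key_frame, total_budget, key_weight=4.0, min_tokens=1):
    """Split a global token budget across frames (illustrative only).

    Assumption: a key frame receives `key_weight` times the share of an
    intermediate frame, and every frame is guaranteed `min_tokens`.
    """
    n = len(is_key_frame)
    if n == 0:
        return []
    if total_budget < n * min_tokens:
        raise ValueError("total_budget is too small to give every frame min_tokens")

    weights = [key_weight if is_key else 1.0 for is_key in is_key_frame]
    total_weight = sum(weights)

    # Reserve the guaranteed minimum first, then split the rest proportionally.
    # Flooring each share keeps the sum at or below total_budget.
    remaining = total_budget - n * min_tokens
    return [min_tokens + int(remaining * w / total_weight) for w in weights]


frames = [True, False, False, True, False]  # True marks a key frame
budgets = allocate_token_budget(frames, total_budget=1000)
print(budgets, sum(budgets))
```

Under these assumptions, key frames keep more visual tokens while intermediate frames are compressed harder, and the total never exceeds the global budget.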
Penguin-VL-2B delivers a strong accuracy-efficiency tradeoff across image and video benchmarks, with especially solid gains on OCR-heavy and reasoning-heavy tasks where fine-grained visual understanding matters most.
Benchmark snapshot for Penguin-VL-2B across image and video evaluation suites.
The released checkpoints and encoder weights are listed below.
| Model | Hugging Face |
|---|---|
| Penguin-VL-2B | tencent/Penguin-VL-2B |
| Penguin-VL-8B | tencent/Penguin-VL-8B |
| Penguin Vision Encoder | tencent/Penguin-Encoder |
- Python = 3.11.13 (recommended)
- PyTorch ≥ 2.5 (CUDA 12.4 recommended)
- CUDA ≥ 11.8
# Clone the repository
git clone <repo_url>
cd <repo_name>
# Recommended: create and activate a clean conda environment
conda create -n PenguinVL python=3.11.13 -y
conda activate PenguinVL
# INSTALL ffmpeg if you don't have it on your system
conda install ffmpeg -y # optional
# Install dependencies (inference + Gradio demo)
pip install -r requirements.txt
# NOTE: If you plan to use vLLM, it's recommended to install vLLM before flash-attn (see § vLLM Inference).
# Install Flash Attention (recommended for faster inference)
pip install flash-attn==2.8.3 --no-build-isolation

| Use Case | Recommended |
|---|---|
| Transformers inference | transformers==4.51.3 |
| vLLM inference | Install vLLM separately (see § vLLM Inference) |
Use HuggingFace AutoModelForCausalLM + AutoProcessor for image, video, and text.
python inference/example_penguinvl.py

You can pass a custom --model-path argument to the script (default: tencent/Penguin-VL-8B), for example tencent/Penguin-VL-2B. Supported formats:
- Video: type: "video" with video_path, fps, max_frames
- Image: type: "image" with image_path
- Mixed: image + video + text in one conversation
- Text-only: plain text dialogue
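The supported formats above can be sketched as a small conversation builder. The field names `type`, `video_path`, `fps`, `max_frames`, and `image_path` come from the list above; the surrounding `role`/`content` nesting, the `"text"` entry, and the helper name `build_conversation` are assumptions for illustration, so check inference/example_penguinvl.py for the exact structure the processor expects.

```python
def build_conversation(text, image_path=None, video_path=None, fps=1, max_frames=180):
    """Build a single-turn conversation mixing video, image, and text.

    Assumption: media settings are nested under a key named after the type.
    """
    content = []
    if video_path is not None:
        content.append({
            "type": "video",
            "video": {"video_path": video_path, "fps": fps, "max_frames": max_frames},
        })
    if image_path is not None:
        content.append({"type": "image", "image": {"image_path": image_path}})
    # Text comes last so the question follows the visual inputs.
    content.append({"type": "text", "text": text})
    return [{"role": "user", "content": content}]


mixed = build_conversation(
    "Compare the image with the video.",
    image_path="assets/inputs/horse_poet.png",
    video_path="assets/inputs/polar_bear.mp4",
)
text_only = build_conversation("What is the capital of France?")
print([item["type"] for item in mixed[0]["content"]])
```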
Check out the inference notebook for a GitHub-friendly walkthrough of Penguin-VL across diverse tasks.
Unlike a multi-notebook cookbook, Penguin-VL currently provides one consolidated notebook that covers multiple representative examples in a single place.
| Notebook | Description |
|---|---|
| Inference Recipes | Demonstrations of Penguin-VL for visual code generation, OCR/document parsing, creative image understanding, table extraction, multi-round chart analysis, multi-round video understanding, mixed video+image prompting, and a text-only baseline. |
If you want to re-execute the notebook locally and regenerate the GitHub-previewable output:
export PENGUIN_VL_MODEL_PATH=tencent/Penguin-VL-8B
jupyter nbconvert \
--to notebook \
--execute \
--output 01_penguinvl_inference_recipes.public.ipynb \
--ExecutePreprocessor.timeout=-1 \
--ExecutePreprocessor.kernel_name=penguinvl \
inference/notebooks/01_penguinvl_inference_recipes.source.ipynb

The clean source notebook lives at inference/notebooks/01_penguinvl_inference_recipes.source.ipynb.
Launch a local web UI with image/video upload and chat.
python inference/launch_gradio_demo.py --model-path tencent/Penguin-VL-8B

Then open http://localhost:33666 (or your machine’s IP + port) in a browser.

| Option | Description | Default |
|---|---|---|
| --model-path | Model path or HuggingFace ID | required |
| --server-port | Backend inference server port | 16667 |
| --interface-port | Gradio web UI port | 33666 |
| --nproc | Number of backend worker processes | 1 |
# 2B model, default ports
python inference/launch_gradio_demo.py --model-path tencent/Penguin-VL-2B
# 8B model, custom UI port
python inference/launch_gradio_demo.py --model-path tencent/Penguin-VL-8B --interface-port 8080
# Multi-worker backend
python inference/launch_gradio_demo.py --model-path tencent/Penguin-VL-8B --nproc 4

Installing vLLM 0.11.0 requires PyTorch 2.8 and a compatible version of Flash Attention. This setup may differ from the default Transformers inference environment (which recommends PyTorch ≥ 2.5). To avoid version conflicts, you may need to create a separate environment or upgrade dependencies accordingly.
Install order note: if you plan to use vLLM, it's recommended to install vLLM first, and then install Flash Attention.
- The vLLM plugin targets vLLM 0.11.0 (penguinvl/plugin/vllm/v0_11_0/).
- vLLM is not in requirements.txt by default; install it separately:
  pip install vllm==0.11.0
- Flash Attention / flash-attn import errors (e.g., ImportError: ... undefined symbol: ...): try reinstalling flash-attn:
  pip uninstall flash-attn
  pip install flash-attn --no-cache --no-build-isolation
- cannot find -lcuda during flashinfer build:
  export LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LIBRARY_PATH
  # or /usr/local/cuda/lib64 depending on your CUDA install

# Single GPU
python -m penguinvl.plugin.vllm serve tencent/Penguin-VL-8B
# Multi-GPU (e.g. 8B on 2 GPUs)
python -m penguinvl.plugin.vllm serve tencent/Penguin-VL-8B --port 8000 --tensor-parallel-size 2

Additional options: --host, --max-model-len, etc. (see vLLM 0.11 serve docs).
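Once the server is running, a client could query it roughly as follows. This sketch assumes the plugin's serve command exposes vLLM's standard OpenAI-compatible HTTP API on port 8000, which is the usual behavior of vLLM's serve command but is not confirmed here for the plugin. The helper names `build_chat_request` and `send` and the example image URL are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, model="tencent/Penguin-VL-8B",
                       base_url="http://localhost:8000", image_url=None, max_tokens=256):
    """Build (but do not send) a chat completion request with an optional image."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Describe the image.", image_url="https://example.com/cat.png")
print(request.full_url)
```

Calling `send(request)` only makes sense after the server has finished loading the model.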
Run text, image, video, and batch demos:
# All demos (single GPU)
CUDA_VISIBLE_DEVICES=0 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B
# Text-only
CUDA_VISIBLE_DEVICES=0 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B --demo text
# Image (requires --image-path)
CUDA_VISIBLE_DEVICES=0 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B --demo image --image-path assets/inputs/horse_poet.png
# Video
CUDA_VISIBLE_DEVICES=0 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B --demo video --video-path assets/inputs/polar_bear.mp4
# 8B with tensor parallelism (2 GPUs)
CUDA_VISIBLE_DEVICES=0,1 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B --tensor-parallel-size 2

| Argument | Description |
|---|---|
| --model-path | HuggingFace model name or local path |
| --demo | text / image / batch / video / all |
| --tensor-parallel-size | Number of GPUs for tensor parallelism |
| --max-new-tokens | Max tokens to generate |
| --max-model-len | Max context length |
| --gpu-memory-utilization | GPU memory fraction (0–1) |
We release Penguin-Recap-I as the public image training data accompanying Penguin-VL: https://huggingface.co/datasets/tencent/Penguin-Recap-I
The release currently covers the image-side recap data and contains three subsets:
- datacomp_coyo_penguin
- sa1b_penguin
- openimages_penguin
For datacomp_coyo_penguin, we provide the original image URLs in each JSON entry for downloading.
For sa1b_penguin and openimages_penguin, we provide the training annotations together with image file names / relative paths, so users can map each sample back to the original image resources and download the raw images from OpenDataLab or the official sources:
- OpenDataLab OpenImagesV6: https://opendatalab.com/OpenDataLab/OpenImagesV6/tree/main/raw
- OpenDataLab SA-1B: https://opendatalab.com/OpenDataLab/SA-1B/tree/main/raw
- Official Segment Anything / SA-1B: https://ai.meta.com/datasets/segment-anything/
- Official OpenImages: https://storage.googleapis.com/openimages/web/index.html
Penguin-VL adopts a 4-stage curriculum:
| Stage | Script | Description | Trainable Modules |
|---|---|---|---|
| Stage 1 | vision_encoder_pretrain.sh | Vision encoder warm-up with reconstruction / distillation losses. The LLM-initialized encoder learns to extract visual features under supervision from a VideoLLaMA3 vision encoder teacher. | Vision encoder + projector |
| Stage 2 | vision_encoder_pretrain_hres.sh | High-resolution alignment. Continues from Stage 1 with higher sequence budgets to handle dense text and document images. | All parameters |
| Stage 3 | pretrain.sh | Full multi-modal pre-training on large-scale image and video corpora. | All parameters |
| Stage 4 | sft.sh | Supervised fine-tuning (instruction tuning) on high-quality chat/task data. | All parameters |
Organize all images and videos under a single data_root directory:
data_root/
├── images/
│ ├── image_0001.jpg
│ └── ...
├── videos/
│ ├── video_0001.mp4
│ └── ...
├── annotations_image.jsonl
├── annotations_video.jsonl
└── ...

Each annotation file is a JSONL file where every line is a JSON object in the following format:
Image example:
{
"id": "sample_0001",
"image": ["images/image_0001.jpg"],
"conversations": [
{"from": "human", "value": "<image>\nWhat is shown in the image?"},
{"from": "gpt", "value": "The image shows a golden retriever playing on a beach."}
]
}

Video example:
{
"id": "sample_0002",
"video": ["videos/video_0001.mp4"],
"conversations": [
{"from": "human", "value": "<video>\nBriefly describe what happens in the video."},
{"from": "gpt", "value": "A person assembles a bicycle in a garage, checking each component carefully."}
]
}

Text-only example:
{
"id": "sample_0003",
"conversations": [
{"from": "human", "value": "What is the capital of France?"},
{"from": "gpt", "value": "The capital of France is Paris."}
]
}

Notes
- Multiple annotation files can be passed simultaneously to --data_path.
- If <image>/<video> tokens are absent from the first user turn, they will be prepended automatically.
- Both .json (list) and .jsonl (one object per line) formats are supported. .jsonl with HuggingFace datasets is recommended for large corpora.
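Before launching training, it can help to sanity-check annotation records against the format above. The sketch below validates one record and mimics the automatic token prepending described in the notes. It is an illustration, not the loader used by the training code: the helper name `normalize_record` and the choice to prepend one token per media path are assumptions.

```python
import json

MEDIA_TOKENS = {"image": "<image>", "video": "<video>"}


def normalize_record(record):
    """Validate one annotation record and prepend missing media tokens."""
    conversations = [dict(turn) for turn in record.get("conversations", [])]
    if not conversations:
        raise ValueError(f"record {record.get('id')!r} has no conversations")
    for turn in conversations:
        if turn.get("from") not in ("human", "gpt") or not isinstance(turn.get("value"), str):
            raise ValueError(f"record {record.get('id')!r} has a malformed turn: {turn!r}")

    first_user = next((t for t in conversations if t["from"] == "human"), None)
    if first_user is None:
        raise ValueError(f"record {record.get('id')!r} has no human turn")

    # Assumption: one token per media path, inserted only if none is present.
    for key, token in MEDIA_TOKENS.items():
        paths = record.get(key, [])
        if paths and token not in first_user["value"]:
            first_user["value"] = (token + "\n") * len(paths) + first_user["value"]
    return {**record, "conversations": conversations}


line = json.dumps({
    "id": "sample_0001",
    "image": ["images/image_0001.jpg"],
    "conversations": [
        {"from": "human", "value": "What is shown in the image?"},
        {"from": "gpt", "value": "The image shows a golden retriever playing on a beach."},
    ],
})
fixed = normalize_record(json.loads(line))
print(repr(fixed["conversations"][0]["value"]))
```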
penguinvl/tools/calculate_seqlen.py is a preprocessing utility that runs before training to pre-compute the approximate sequence length of every sample in a JSONL annotation file. The resulting length index can be passed to --data_lengths_path so the dataloader can sort samples by length, reducing padding waste and speeding up training.
The script runs in two phases internally:
- Metadata extraction: resolves each sample's image resolution (via PIL) or video dimensions / frame count (via ffprobe), then writes an enriched <input>_meta.jsonl with width, height, and frames fields added to each record.
- Length estimation: tokenizes all conversation text with the specified tokenizer and adds an estimated visual token count based on the resolution, then saves a length-sorted index tensor to lengths.pt.
Both phases run in parallel across all available CPU cores.
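The length-estimation phase can be pictured with a rough sketch like the one below. It is not the logic of calculate_seqlen.py: the patch size of 14, the squared merge factor, and the function names are assumptions chosen for illustration; only the roles of fps and the max-frames cap follow the description of the script's arguments.

```python
import math


def estimate_frames(duration_seconds, fps=1, max_frames=180):
    """Estimate sampled frames: sample at `fps`, capped at `max_frames`."""
    return max(1, min(max_frames, math.ceil(duration_seconds * fps)))


def estimate_visual_tokens(width, height, frames=1, patch_size=14, merge_size=1):
    """Estimate visual tokens from resolution (hypothetical patching scheme)."""
    patches = math.ceil(width / patch_size) * math.ceil(height / patch_size)
    # Assumption: merging groups merge_size x merge_size neighboring patches.
    return frames * math.ceil(patches / (merge_size ** 2))


def estimate_sequence_length(text_tokens, visual_tokens):
    """Total length is text tokens plus the estimated visual tokens."""
    return text_tokens + visual_tokens


image_tokens = estimate_visual_tokens(448, 448)
video_frames = estimate_frames(10.0, fps=1, max_frames=180)
video_tokens = estimate_visual_tokens(224, 224, frames=video_frames, merge_size=2)
print(image_tokens, video_frames, video_tokens)
```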
python penguinvl/tools/calculate_seqlen.py \
--input /path/to/annotations.jsonl \
--root /path/to/data_root \
--tk-path Qwen/Qwen3-0.6B \
--fps 1 \
--max-frames 180

| Argument | Description | Default |
|---|---|---|
| --input / -i | Input JSONL annotation file | required |
| --root / -r | Root directory for resolving image/video paths | "" |
| --tk-path | Tokenizer path or HuggingFace model ID used for text length estimation | Qwen/Qwen3-0.6B |
| --fps | Frame rate used to estimate the number of video frames | 1 |
| --max-frames | Maximum frame count cap for video length estimation | 180 |
| --chunksize | Lines per worker chunk for imap_unordered | 100 |
| File | Description |
|---|---|
| <input>_meta.jsonl | Copy of the input JSONL with width, height, and frames fields added to each record. |
| <root>/lengths.pt | A torch.LongTensor of length-sorted sample indices. Pass this to --data_lengths_path in the training script. |
After generating lengths.pt, add --data_lengths_path and --group_by_modality_length True to your training script:
--group_by_modality_length True
--data_lengths_path /path/to/data_root/lengths.pt

This enables length-sorted batching, which significantly reduces padding overhead when training on datasets with high length variance (e.g. mixed image + video data).
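A toy calculation shows why sorting by length cuts padding. The sequence lengths below are made up, and the sketch only pads each batch to its longest sample; the actual sampler behind --group_by_modality_length may also shuffle or group by modality, which is not modeled here.

```python
def padding_waste(lengths, batch_size):
    """Count pad tokens when each batch is padded to its longest sample."""
    waste = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        waste += sum(max(batch) - n for n in batch)
    return waste


# Made-up mix of short image samples and long video samples.
lengths = [120, 4000, 150, 3900, 130, 4100, 110, 3950]

# Length-sorted sample indices, analogous in spirit to the index in lengths.pt.
sorted_indices = sorted(range(len(lengths)), key=lengths.__getitem__)

unsorted_waste = padding_waste(lengths, batch_size=2)
sorted_waste = padding_waste([lengths[i] for i in sorted_indices], batch_size=2)
print(unsorted_waste, sorted_waste)
```

In this toy case, pairing samples of similar length drops the padded tokens from 15440 to 180.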
Training scripts live in scripts/train/. Edit the following variables at the top of each script before launching:
| Variable | Description | Example |
|---|---|---|
| DATA_DIR | Root directory of your dataset | /data/penguinvl_data |
| OUTP_DIR | Root directory for checkpoints | work_dirs |
| WANDB_PROJECT | W&B project name | penguinvl_qwen3_exp |
| ARG_WORLD_SIZE | Number of nodes | 1 |
| ARG_NPROC_PER_NODE | Number of GPUs per node | 8 |
| GLOBAL_BATCH_SIZE | Effective global batch size | 128 |
| LOCAL_BATCH_SIZE | Per-GPU batch size | 2 |
Gradient accumulation is derived automatically:
GRADIENT_ACCUMULATION_STEPS = GLOBAL_BATCH_SIZE / (WORLD_SIZE × NPROC_PER_NODE × LOCAL_BATCH_SIZE)
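As a quick check of the formula, the sketch below computes the accumulation steps for the example values in the table (global batch 128, 1 node, 8 GPUs, per-GPU batch 2). The function name is hypothetical, and raising an error on a non-divisible batch size is an assumption; the shell scripts may handle that case differently.

```python
def gradient_accumulation_steps(global_batch_size, world_size, nproc_per_node, local_batch_size):
    """Apply GLOBAL_BATCH_SIZE / (WORLD_SIZE * NPROC_PER_NODE * LOCAL_BATCH_SIZE)."""
    samples_per_step = world_size * nproc_per_node * local_batch_size
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not a multiple of "
            f"{samples_per_step} samples per optimizer micro-step"
        )
    return global_batch_size // samples_per_step


single_node = gradient_accumulation_steps(128, world_size=1, nproc_per_node=8, local_batch_size=2)
two_nodes = gradient_accumulation_steps(128, world_size=2, nproc_per_node=8, local_batch_size=2)
print(single_node, two_nodes)
```

With the table's example values this gives 8 steps on one node and 4 steps on two nodes.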
bash scripts/train/vision_encoder_pretrain.sh [NUM_NODES] [NUM_GPUS_PER_NODE]

Key arguments specific to Stage 1:
--model_path Qwen/Qwen3-1.7B # LLM part
--vision_encoder Cyril666/SFL-Encoder-Pretrained-Qwen3 # LLM-initialized vision encoder (converted from Qwen/Qwen3-0.6B, with the layer parameter names modified)
--use_reconstruct True # Enable Stage 1 reconstruction / distillation loss
--vision_encoder_teacher DAMO-NLP-SG/VL3-SigLIP-NaViT # VideoLLaMA3 vision encoder teacher checkpoint
--model_max_length 4096
--mm_max_length 2048

bash scripts/train/vision_encoder_pretrain_hres.sh [NUM_NODES] [NUM_GPUS_PER_NODE]

Loads from stage_1 checkpoint. Increases context budgets for high-resolution inputs:
--model_max_length 16384
--mm_max_length 10240

bash scripts/train/pretrain.sh [NUM_NODES] [NUM_GPUS_PER_NODE]

Loads from stage_2 checkpoint. All three modules (vision encoder, projector, LLM) are jointly trained.
bash scripts/train/sft.sh [NUM_NODES] [NUM_GPUS_PER_NODE]

Loads from stage_3 checkpoint. Uses high-quality instruction-following data for final alignment.
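Since each stage loads the checkpoint of the previous one, the four scripts are meant to run in order. The sketch below only plans the command lines as a dry run; the helper name `plan_commands` and the `start_stage` parameter are hypothetical, and it assumes the paths inside each script have already been edited as described above.

```python
# Stage scripts in curriculum order, as listed in the stage table.
STAGE_SCRIPTS = [
    "vision_encoder_pretrain.sh",       # Stage 1
    "vision_encoder_pretrain_hres.sh",  # Stage 2
    "pretrain.sh",                      # Stage 3
    "sft.sh",                           # Stage 4
]


def plan_commands(num_nodes=1, gpus_per_node=8, start_stage=1):
    """Return the ordered command lines, optionally resuming from a later stage."""
    if not 1 <= start_stage <= len(STAGE_SCRIPTS):
        raise ValueError(f"start_stage must be between 1 and {len(STAGE_SCRIPTS)}")
    return [
        ["bash", f"scripts/train/{script}", str(num_nodes), str(gpus_per_node)]
        for script in STAGE_SCRIPTS[start_stage - 1:]
    ]


commands = plan_commands()
for command in commands:
    print(" ".join(command))
```

Each planned command could be handed to `subprocess.run(command, check=True)` so that a failing stage stops the chain before the next one starts.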
| Argument | Description | Default |
|---|---|---|
| --model_type | Model architecture type | penguinvl_qwen3 |
| --model_path | Path to LLM backbone or previous stage checkpoint | - |
| --vision_encoder | Path or HF ID of the vision encoder | - |
| --vision_projector_type | Projector architecture | mlp2x_gelu |
| --use_reconstruct | Enable the Stage 1 visual reconstruction / distillation loss | False |
| --vision_encoder_teacher | VideoLLaMA3 vision encoder teacher checkpoint | None |
| --data_path | Space-separated list of annotation files | - |
| --data_folder | Root folder for all media files | - |
| --fps | Video sampling frame rate | 1 |
| --max_frames | Maximum number of frames per video | 180 |
| --image_merge_size | Token merge factor for images | 1 |
| --video_merge_size | Token merge factor for video frames | 2 |
| --model_max_length | Maximum total sequence length (truncation) | 512 |
| --mm_max_length | Maximum visual token budget per sample | 10240 |
| --llm_lr | Learning rate for the LLM backbone | None |
| --vision_encoder_lr | Learning rate for the vision encoder | None |
| --vision_projector_lr | Learning rate for the MLP projector | None |
| --embedding_lr | Learning rate for embedding layers | None |
| --deepspeed | DeepSpeed config path | scripts/zero1.json |
| --gradient_checkpointing | Enable gradient checkpointing | True |
| --use_batch_flattening | Flatten variable-length sequences in a batch | True |
The scripts support multi-node training via torchrun. Pass WORLD_SIZE, NPROC_PER_NODE, MASTER_ADDR, MASTER_PORT, and RANK as environment variables or positional arguments:
# Node 0 (master)
WORLD_SIZE=2 NPROC_PER_NODE=8 MASTER_ADDR=<node0_ip> MASTER_PORT=16667 RANK=0 \
bash scripts/train/sft.sh
# Node 1
WORLD_SIZE=2 NPROC_PER_NODE=8 MASTER_ADDR=<node0_ip> MASTER_PORT=16667 RANK=1 \
bash scripts/train/sft.sh

.
├── penguinvl/ # Core model and processor code
│ ├── plugin/vllm/ # vLLM plugin (v0_11_0)
│ ├── tools/ # Tool scripts
│ └── train/ # Training code
├── scripts/ # Training scripts
├── inference/
│ ├── example_penguinvl.py # Transformers inference example
│ ├── test_vllm_infer.py # vLLM inference demo
│ ├── launch_gradio_demo.py # Gradio local demo
│ ├── notebooks/ # Executed and source Jupyter notebooks
│ ├── server/ # Backend for Gradio
│ ├── interface/ # Gradio UI
│ └── transformers_api/ # Transformers model/processor wrappers
├── assets/
│ ├── framework.png # README framework figure
│ ├── 2b_results.png # README benchmark figure
│ └── inputs/ # Demo images and videos
└── requirements.txt
This project is released under the Apache 2.0 License.
If you use Penguin-VL in your research, please cite:
@article{Penguin-VL,
title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
journal={arXiv preprint arXiv:2603.06569},
year={2026}
}

If you find this project useful, please consider giving it a ⭐ on GitHub. Issues and PRs are welcome.


