GitHub - tencent-ailab/Penguin-VL: Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders [Technical Report]

Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

📰 News

[2026.03.30] 🔥🔥 We release Penguin-Recap-V! This dataset features multi-granularity video annotations with descriptions across three temporal scales: Dense time-level, Paragraph-level, and Video-level.
[2026.03.26] 🔥🔥 The evaluation of Penguin-VL on benchmarks is now supported in lmms-eval.
[2026.03.20] 🔥🔥 We release Penguin-Recap-I, our reconstructed high-quality image training data for Penguin-VL, on Hugging Face.
[2026.03.17] We release training code for Penguin-VL; see § Training for details.
[2026.03.10] Penguin-VL was ranked #1 Paper of the Day on Hugging Face Daily Papers.
[2026.03.09] Release inference code, vLLM plugin, and Gradio demo for Penguin-VL.
[2026.03.09] Release Penguin-VL-2B, Penguin-VL-8B, and Penguin Vision Encoder on Hugging Face.

📌 TODO

Release our re-captioned training data - Penguin-Recap-I (Image)
Release training code
Release model checkpoint
Release inference code

✨ Overview

Penguin-VL is a compact vision-language model family built to study how far multimodal efficiency can be pushed by redesigning the vision encoder, rather than only scaling data or model size.

Most modern VLMs rely on vision encoders pretrained with large-scale contrastive objectives such as CLIP/SigLIP-style pretraining. Penguin-VL argues that this setup can be suboptimal for multimodal reasoning because contrastive learning favors coarse category-level invariances over the fine-grained signals needed for OCR, document understanding, dense captioning, and complex reasoning. Instead, Penguin-VL introduces Penguin-Encoder, a vision encoder initialized from a text-only LLM, so the visual backbone starts closer to the language model representation space and learns more data-efficiently.

Framework overview of Penguin-VL: an LLM-initialized vision encoder, mixed-supervision pretraining, and efficient video token compression.

Highlights

LLM → Vision Encoder initialization (Penguin-Encoder)
Initialize the vision encoder from a text-only LLM (e.g., Qwen3-0.6B), convert causal attention to bidirectional attention, and add 2D-RoPE for variable-resolution vision tokens.
Mixed-supervision encoder pretraining
Warm up the LLM-initialized encoder with a reconstruction/distillation objective under a teacher vision encoder (amplitude / direction / relation losses) to inject visual knowledge stably, then switch to high-resolution alignment.
Video efficiency via Temporal Redundancy-Aware (TRA) token compression
Dynamically allocate token budgets across key frames vs. intermediate frames under a global token budget to scale to long videos more efficiently.
Unified training recipe
A low-to-high resolution curriculum + instruction tuning strategy that balances image and video capabilities at compact scale.
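
To make the first highlight concrete, the snippet below contrasts a causal attention mask with a bidirectional one for a short token sequence. It is only an illustration of the general idea, written with plain Python lists and hypothetical helper names; it is not the Penguin-Encoder implementation.

```python
def causal_mask(n):
    """Token i may attend to tokens 0..i only (decoder-style LLM)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token may attend to every other token (encoder style)."""
    return [[1] * n for _ in range(n)]


# Print both masks for a 4-token sequence; 1 = attention allowed.
for name, mask in (("causal", causal_mask(4)),
                   ("bidirectional", bidirectional_mask(4))):
    print(name)
    for row in mask:
        print(" ".join(str(v) for v in row))
```

For image patches there is no natural left-to-right order, which is why a full (bidirectional) mask is the usual choice for a vision encoder.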

📈 Results

Penguin-VL-2B delivers a strong accuracy-efficiency tradeoff across image and video benchmarks, with especially solid gains on OCR-heavy and reasoning-heavy tasks where fine-grained visual understanding matters most.

Benchmark snapshot for Penguin-VL-2B across image and video evaluation suites.

The released checkpoints and encoder weights are listed below.

📦 Model Zoo

Model	Hugging Face
Penguin-VL-2B	tencent/Penguin-VL-2B
Penguin-VL-8B	tencent/Penguin-VL-8B
Penguin Vision Encoder	tencent/Penguin-Encoder

🛠️ Environment Setup

Requirements

Python = 3.11.13 (recommended)
PyTorch ≥ 2.5 (CUDA 12.4 recommended)
CUDA ≥ 11.8

Installation

# Clone the repository
git clone <repo_url>
cd <repo_name>

# Recommended: create and activate a clean conda environment
conda create -n PenguinVL python=3.11.13 -y
conda activate PenguinVL

# Optional: install ffmpeg if it is not already available on your system
conda install ffmpeg -y

# Install dependencies (inference + Gradio demo)
pip install -r requirements.txt

# NOTE: If you plan to use vLLM, it's recommended to install vLLM before flash-attn (see § vLLM Inference).
# Install Flash Attention (recommended for faster inference)
pip install flash-attn==2.8.3 --no-build-isolation

Version Notes

Use Case	Recommended
Transformers inference	`transformers==4.51.3`
vLLM inference	Install vLLM separately (see § vLLM Inference)

🤖 Inference (Transformers)

Use HuggingFace AutoModelForCausalLM + AutoProcessor for image, video, and text.

python inference/example_penguinvl.py

The script accepts a --model-path argument (default: tencent/Penguin-VL-8B); set it to tencent/Penguin-VL-2B to run the smaller model. Supported input formats:

Video: type: "video" with video_path, fps, max_frames
Image: type: "image" with image_path
Mixed: image + video + text in one conversation
Text-only: plain text dialogue
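
As a rough sketch, a mixed conversation might be assembled in Python as shown below. The field names type, video_path, fps, max_frames, and image_path come from the list above; the role/content nesting, the "video"/"image" sub-keys, the default values, and the helper names are assumptions for illustration. inference/example_penguinvl.py remains the reference for the exact schema.

```python
def video_item(video_path, fps=1, max_frames=180):
    # Nesting under a "video" key is an assumption; the defaults are placeholders.
    return {"type": "video",
            "video": {"video_path": video_path, "fps": fps, "max_frames": max_frames}}


def image_item(image_path):
    # Nesting under an "image" key is likewise an assumption.
    return {"type": "image", "image": {"image_path": image_path}}


def text_item(text):
    return {"type": "text", "text": text}


def user_turn(*items):
    """Bundle several content items into one user message."""
    return {"role": "user", "content": list(items)}


# Mixed prompt: image + video + text in a single user turn.
conversation = [
    user_turn(
        image_item("assets/inputs/horse_poet.png"),
        video_item("assets/inputs/polar_bear.mp4", fps=1, max_frames=180),
        text_item("Describe the image, then summarize the video."),
    )
]
print(conversation)
```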

📓 Cookbook

Check out the inference notebook for a GitHub-friendly walkthrough of Penguin-VL across diverse tasks.
Unlike a multi-notebook cookbook, Penguin-VL currently provides one consolidated notebook that covers multiple representative examples in a single place.

Notebook	Description
Inference Recipes	Demonstrations of Penguin-VL for visual code generation, OCR/document parsing, creative image understanding, table extraction, multi-round chart analysis, multi-round video understanding, mixed video+image prompting, and a text-only baseline.

If you want to re-execute the notebook locally and regenerate the GitHub-previewable output:

export PENGUIN_VL_MODEL_PATH=tencent/Penguin-VL-8B

jupyter nbconvert \
  --to notebook \
  --execute \
  --output 01_penguinvl_inference_recipes.public.ipynb \
  --ExecutePreprocessor.timeout=-1 \
  --ExecutePreprocessor.kernel_name=penguinvl \
  inference/notebooks/01_penguinvl_inference_recipes.source.ipynb

The clean source notebook lives at inference/notebooks/01_penguinvl_inference_recipes.source.ipynb.

🤗 Gradio Demo (Local UI)

Launch a local web UI with image/video upload and chat.

Quick Start

python inference/launch_gradio_demo.py --model-path tencent/Penguin-VL-8B

Then open http://localhost:33666 (or your machine’s IP + port) in a browser.

Options

Option	Description	Default
`--model-path`	Model path or HuggingFace ID	required
`--server-port`	Backend inference server port	16667
`--interface-port`	Gradio web UI port	33666
`--nproc`	Number of backend worker processes	1

Examples

# 2B model, default ports
python inference/launch_gradio_demo.py --model-path tencent/Penguin-VL-2B

# 8B model, custom UI port
python inference/launch_gradio_demo.py --model-path tencent/Penguin-VL-8B --interface-port 8080

# Multi-worker backend
python inference/launch_gradio_demo.py --model-path tencent/Penguin-VL-8B --nproc 4

⚡ vLLM Inference

Installing vLLM 0.11.0 requires PyTorch 2.8 and a compatible version of Flash Attention. This setup may differ from the default Transformers inference environment (which recommends PyTorch ≥ 2.5). To avoid version conflicts, you may need to create a separate environment or upgrade dependencies accordingly.
Install order note: if you plan to use vLLM, it's recommended to install vLLM first, and then install Flash Attention.

Environment

The vLLM plugin targets vLLM 0.11.0 (penguinvl/plugin/vllm/v0_11_0/).
vLLM is not in requirements.txt by default; install it separately:

pip install vllm==0.11.0

Troubleshooting

Flash Attention / flash-attn import errors (e.g., ImportError: ... undefined symbol: ...): try reinstalling flash-attn:

pip uninstall flash-attn
pip install flash-attn --no-cache --no-build-isolation

cannot find -lcuda during flashinfer build:

export LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LIBRARY_PATH
# or /usr/local/cuda/lib64 depending on your CUDA install

Start vLLM Server

# Single GPU
python -m penguinvl.plugin.vllm serve tencent/Penguin-VL-8B

# Multi-GPU (e.g. 8B on 2 GPUs)
python -m penguinvl.plugin.vllm serve tencent/Penguin-VL-8B --port 8000 --tensor-parallel-size 2

Additional options: --host, --max-model-len, etc. (see vLLM 0.11 serve docs).
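
Once the server is running, it can presumably be queried like any other vLLM server through an OpenAI-compatible chat completions endpoint. The sketch below assumes that the plugin's serve command exposes the standard /v1/chat/completions route on localhost:8000 and that the served model name equals the model path; neither is confirmed by this README. The helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, model="tencent/Penguin-VL-8B",
                       base_url="http://localhost:8000", max_tokens=128):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,  # assumed to match the path passed to `serve`
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `ask("Introduce yourself in one sentence.")` should return the reply text; pass `base_url=` if a different `--host` or `--port` was used.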

vLLM Demo Script

Run text, image, video, and batch demos:

# All demos (single GPU)
CUDA_VISIBLE_DEVICES=0 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B

# Text-only
CUDA_VISIBLE_DEVICES=0 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B --demo text

# Image (requires --image-path)
CUDA_VISIBLE_DEVICES=0 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B --demo image --image-path assets/inputs/horse_poet.png

# Video
CUDA_VISIBLE_DEVICES=0 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B --demo video --video-path assets/inputs/polar_bear.mp4

# 8B with tensor parallelism (2 GPUs)
CUDA_VISIBLE_DEVICES=0,1 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B --tensor-parallel-size 2

Argument	Description
`--model-path`	HuggingFace model name or local path
`--demo`	`text` | `image` | `batch` | `video` | `all`
`--tensor-parallel-size`	Number of GPUs for tensor parallelism
`--max-new-tokens`	Max tokens to generate
`--max-model-len`	Max context length
`--gpu-memory-utilization`	GPU memory fraction (0–1)

🗝️ Training

Training Data

We release Penguin-Recap-I as the public image training data accompanying Penguin-VL: https://huggingface.co/datasets/tencent/Penguin-Recap-I

The release currently covers the image-side recap data and contains three subsets:

datacomp_coyo_penguin
sa1b_penguin
openimages_penguin

For datacomp_coyo_penguin, we provide the original image URLs in each JSON entry for downloading.

For sa1b_penguin and openimages_penguin, we provide the training annotations together with image file names / relative paths, so users can map each sample back to the original image resources and download the raw images from OpenDataLab or the official sources:

OpenDataLab OpenImagesV6: https://opendatalab.com/OpenDataLab/OpenImagesV6/tree/main/raw
OpenDataLab SA-1B: https://opendatalab.com/OpenDataLab/SA-1B/tree/main/raw
Official Segment Anything / SA-1B: https://ai.meta.com/datasets/segment-anything/
Official OpenImages: https://storage.googleapis.com/openimages/web/index.html

Training Pipeline Overview

Penguin-VL adopts a 4-stage curriculum:

Stage	Script	Description	Trainable Modules
Stage 1	`vision_encoder_pretrain.sh`	Vision encoder warm-up with reconstruction / distillation losses. The LLM-initialized encoder learns to extract visual features under supervision from a VideoLLaMA3 vision encoder teacher.	Vision encoder + projector
Stage 2	`vision_encoder_pretrain_hres.sh`	High-resolution alignment. Continues from Stage 1 with higher sequence budgets to handle dense text and document images.	All parameters
Stage 3	`pretrain.sh`	Full multi-modal pre-training on large-scale image and video corpora.	All parameters
Stage 4	`sft.sh`	Supervised fine-tuning (instruction tuning) on high-quality chat/task data.	All parameters

Step 1: Prepare Training Data

Organize all images and videos under a single data_root directory:

data_root/
├── images/
│   ├── image_0001.jpg
│   └── ...
├── videos/
│   ├── video_0001.mp4
│   └── ...
├── annotations_image.jsonl
├── annotations_video.jsonl
└── ...

Each annotation file is a JSONL file where every line is a JSON object in the following format:

Image example:

{
    "id": "sample_0001",
    "image": ["images/image_0001.jpg"],
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in the image?"},
        {"from": "gpt",   "value": "The image shows a golden retriever playing on a beach."}
    ]
}

Video example:

{
    "id": "sample_0002",
    "video": ["videos/video_0001.mp4"],
    "conversations": [
        {"from": "human", "value": "<video>\nBriefly describe what happens in the video."},
        {"from": "gpt",   "value": "A person assembles a bicycle in a garage, checking each component carefully."}
    ]
}

Text-only example:

{
    "id": "sample_0003",
    "conversations": [
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt",   "value": "The capital of France is Paris."}
    ]
}

Notes

Multiple annotation files can be passed simultaneously to --data_path.

If <image> / <video> tokens are absent from the first user turn, they will be prepended automatically.

Both .json (list) and .jsonl (one object per line) formats are supported. .jsonl with HuggingFace datasets is recommended for large corpora.
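
The record format and the auto-prepend note above can be mimicked in a few lines of Python. This is an illustrative re-implementation based only on the description here, with hypothetical helper names; it is not the loader code in penguinvl/train/, which may behave differently (for example with several images per sample).

```python
import json


def ensure_media_token(record):
    """Prepend <image>/<video> to the first human turn if it is missing."""
    for key, token in (("image", "<image>"), ("video", "<video>")):
        if key not in record:
            continue
        first_human = next(t for t in record["conversations"] if t["from"] == "human")
        if token not in first_human["value"]:
            # Assumption: a single token is prepended, followed by a newline.
            first_human["value"] = f"{token}\n{first_human['value']}"
    return record


def to_jsonl_line(record):
    """Serialize one sample as a single JSONL line."""
    return json.dumps(ensure_media_token(record), ensure_ascii=False)


sample = {
    "id": "sample_0001",
    "image": ["images/image_0001.jpg"],
    "conversations": [
        {"from": "human", "value": "What is shown in the image?"},
        {"from": "gpt", "value": "The image shows a golden retriever playing on a beach."},
    ],
}
print(to_jsonl_line(sample))
```

Writing one such line per sample to annotations_image.jsonl yields a file in the .jsonl layout described above.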

Step 1.5: (Optional) Pre-compute Sequence Lengths

penguinvl/tools/calculate_seqlen.py is a preprocessing utility that runs before training to pre-compute the approximate sequence length of every sample in a JSONL annotation file. The resulting length index can be passed to --data_lengths_path so the dataloader can sort samples by length, reducing padding waste and speeding up training.

The script runs in two phases internally:

Metadata extraction — resolves each sample's image resolution (via PIL) or video dimensions / frame count (via ffprobe), then writes an enriched <input>_meta.jsonl with width, height, and frames fields added to each record.
Length estimation — tokenizes all conversation text with the specified tokenizer and adds an estimated visual token count based on the resolution, then saves a length-sorted index tensor to lengths.pt.

Both phases run in parallel across all available CPU cores.
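
A simplified view of the second phase: the estimated length of a sample is its text token count plus a visual token count derived from resolution and frame count. The patch size of 14, the way the merge size is applied, and the function names below are assumptions chosen for illustration; the actual formula in calculate_seqlen.py may differ.

```python
import math


def estimate_visual_tokens(width, height, frames=1, patch_size=14, merge_size=1):
    """Rough visual token count for one image (frames=1) or one video."""
    patches_w = math.ceil(width / patch_size)
    patches_h = math.ceil(height / patch_size)
    per_frame = math.ceil(patches_w / merge_size) * math.ceil(patches_h / merge_size)
    return frames * per_frame


def estimate_frames(duration_s, fps=1, max_frames=180):
    """Sampled frame count: duration times fps, capped at max_frames."""
    return max(1, min(max_frames, int(duration_s * fps)))


def estimate_sample_length(num_text_tokens, width, height, frames=1, merge_size=1):
    """Text tokens plus estimated visual tokens."""
    return num_text_tokens + estimate_visual_tokens(
        width, height, frames, merge_size=merge_size)


# 448x448 image with 30 text tokens: 32 * 32 patches + 30.
print(estimate_sample_length(30, 448, 448))
# 10-minute 448x448 video at 1 fps, capped at 180 frames, merge_size=2.
print(estimate_sample_length(30, 448, 448, frames=estimate_frames(600), merge_size=2))
```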

Usage

python penguinvl/tools/calculate_seqlen.py \
    --input  /path/to/annotations.jsonl \
    --root   /path/to/data_root \
    --tk-path Qwen/Qwen3-0.6B \
    --fps 1 \
    --max-frames 180

Argument	Description	Default
`--input` / `-i`	Input JSONL annotation file	required
`--root` / `-r`	Root directory for resolving image/video paths	`""`
`--tk-path`	Tokenizer path or HuggingFace model ID used for text length estimation	`Qwen/Qwen3-0.6B`
`--fps`	Frame rate used to estimate the number of video frames	`1`
`--max-frames`	Maximum frame count cap for video length estimation	`180`
`--chunksize`	Lines per worker chunk for `imap_unordered`	`100`

Outputs

File	Description
`<input>_meta.jsonl`	Copy of the input JSONL with `width`, `height`, and `frames` fields added to each record.
`<root>/lengths.pt`	A `torch.LongTensor` of length-sorted sample indices. Pass this to `--data_lengths_path` in the training script.

Connecting to the Training Script

After generating lengths.pt, add --data_lengths_path and --group_by_modality_length True to your training script:

--group_by_modality_length True
--data_lengths_path /path/to/data_root/lengths.pt

This enables length-sorted batching, which significantly reduces padding overhead when training on datasets with high length variance (e.g. mixed image + video data).
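
To see why grouping by length helps, the toy example below counts padding tokens when batches are formed in arrival order versus after sorting by length. The sample lengths are made up, and real training usually shuffles within length groups rather than sorting strictly, so this only shows the direction of the effect.

```python
def padding_tokens(lengths, batch_size):
    """Total pad tokens when each batch is padded to its longest sample."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


# Hypothetical lengths: short image samples interleaved with long video samples.
lengths = [300, 9000, 350, 8800, 280, 9100, 320, 8700]

unsorted_pad = padding_tokens(lengths, batch_size=2)
sorted_pad = padding_tokens(sorted(lengths), batch_size=2)
print(f"arrival order: {unsorted_pad} pad tokens")
print(f"length-sorted: {sorted_pad} pad tokens")
```

In this made-up case the interleaved order pads every short image sample up to a video-length sequence, while the sorted order pairs samples of similar length.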

Step 2: Configure Training Scripts

Training scripts live in scripts/train/. Edit the following variables at the top of each script before launching:

Variable	Description	Example
`DATA_DIR`	Root directory of your dataset	`/data/penguinvl_data`
`OUTP_DIR`	Root directory for checkpoints	`work_dirs`
`WANDB_PROJECT`	W&B project name	`penguinvl_qwen3_exp`
`ARG_WORLD_SIZE`	Number of nodes	`1`
`ARG_NPROC_PER_NODE`	Number of GPUs per node	`8`
`GLOBAL_BATCH_SIZE`	Effective global batch size	`128`
`LOCAL_BATCH_SIZE`	Per-GPU batch size	`2`

Gradient accumulation is derived automatically:

GRADIENT_ACCUMULATION_STEPS = GLOBAL_BATCH_SIZE / (WORLD_SIZE × NPROC_PER_NODE × LOCAL_BATCH_SIZE)
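
With the example values from the table (1 node, 8 GPUs per node, global batch size 128, per-GPU batch size 2), the formula gives 128 / (1 × 8 × 2) = 8 accumulation steps. The small helper below reproduces that arithmetic; the divisibility check is an addition for illustration, and the training scripts may handle that case differently.

```python
def gradient_accumulation_steps(global_batch, world_size, nproc_per_node, local_batch):
    """GLOBAL_BATCH_SIZE / (WORLD_SIZE * NPROC_PER_NODE * LOCAL_BATCH_SIZE)."""
    samples_per_step = world_size * nproc_per_node * local_batch
    if global_batch % samples_per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not a multiple of {samples_per_step}")
    return global_batch // samples_per_step


print(gradient_accumulation_steps(128, 1, 8, 2))  # 1 node x 8 GPUs
print(gradient_accumulation_steps(128, 2, 8, 2))  # 2 nodes x 8 GPUs
```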

Step 3: Run Training

Stage 1 — Vision Encoder Pretraining

bash scripts/train/vision_encoder_pretrain.sh [NUM_NODES] [NUM_GPUS_PER_NODE]

Key arguments specific to Stage 1:

--model_path        Qwen/Qwen3-1.7B                            # LLM part
--vision_encoder    Cyril666/SFL-Encoder-Pretrained-Qwen3      # LLM-initialized vision encoder (converted from Qwen/Qwen3-0.6B, with layer parameter names modified)
--use_reconstruct   True                                       # Enable Stage 1 reconstruction / distillation loss
--vision_encoder_teacher DAMO-NLP-SG/VL3-SigLIP-NaViT          # VideoLLaMA3 vision encoder teacher checkpoint
--model_max_length  4096
--mm_max_length     2048

Stage 2 — High-Resolution Encoder Pretraining

bash scripts/train/vision_encoder_pretrain_hres.sh [NUM_NODES] [NUM_GPUS_PER_NODE]

Loads from stage_1 checkpoint. Increases context budgets for high-resolution inputs:

--model_max_length  16384
--mm_max_length     10240

Stage 3 — Full Pre-training

bash scripts/train/pretrain.sh [NUM_NODES] [NUM_GPUS_PER_NODE]

Loads from stage_2 checkpoint. All three modules (vision encoder, projector, LLM) are jointly trained.

Stage 4 — Supervised Fine-tuning

bash scripts/train/sft.sh [NUM_NODES] [NUM_GPUS_PER_NODE]

Loads from stage_3 checkpoint. Uses high-quality instruction-following data for final alignment.

Key Training Arguments Reference

Argument	Description	Default
`--model_type`	Model architecture type	`penguinvl_qwen3`
`--model_path`	Path to LLM backbone or previous stage checkpoint	—
`--vision_encoder`	Path or HF ID of the vision encoder	—
`--vision_projector_type`	Projector architecture	`mlp2x_gelu`
`--use_reconstruct`	Enable the Stage 1 visual reconstruction / distillation loss	`False`
`--vision_encoder_teacher`	VideoLLaMA3 vision encoder teacher checkpoint	`None`
`--data_path`	Space-separated list of annotation files	—
`--data_folder`	Root folder for all media files	—
`--fps`	Video sampling frame rate	`1`
`--max_frames`	Maximum number of frames per video	`180`
`--image_merge_size`	Token merge factor for images	`1`
`--video_merge_size`	Token merge factor for video frames	`2`
`--model_max_length`	Maximum total sequence length (truncation)	`512`
`--mm_max_length`	Maximum visual token budget per sample	`10240`
`--llm_lr`	Learning rate for the LLM backbone	`None`
`--vision_encoder_lr`	Learning rate for the vision encoder	`None`
`--vision_projector_lr`	Learning rate for the MLP projector	`None`
`--embedding_lr`	Learning rate for embedding layers	`None`
`--deepspeed`	DeepSpeed config path	`scripts/zero1.json`
`--gradient_checkpointing`	Enable gradient checkpointing	`True`
`--use_batch_flattening`	Flatten variable-length sequences in a batch	`True`

Distributed Training (Multi-node)

The scripts support multi-node training via torchrun. Pass WORLD_SIZE, NPROC_PER_NODE, MASTER_ADDR, MASTER_PORT, and RANK as environment variables or positional arguments:

# Node 0 (master)
WORLD_SIZE=2 NPROC_PER_NODE=8 MASTER_ADDR=<node0_ip> MASTER_PORT=16667 RANK=0 \
    bash scripts/train/sft.sh

# Node 1
WORLD_SIZE=2 NPROC_PER_NODE=8 MASTER_ADDR=<node0_ip> MASTER_PORT=16667 RANK=1 \
    bash scripts/train/sft.sh

📁 Project Structure

.
├── penguinvl/                    # Core model and processor code
│   ├── plugin/vllm/              # vLLM plugin (v0_11_0)
│   ├── tools/                    # Tool scripts
│   └── train/                    # Training code
├── scripts/                      # Training scripts
├── inference/
│   ├── example_penguinvl.py      # Transformers inference example
│   ├── test_vllm_infer.py        # vLLM inference demo
│   ├── launch_gradio_demo.py     # Gradio local demo
│   ├── notebooks/                # Executed and source Jupyter notebooks
│   ├── server/                   # Backend for Gradio
│   ├── interface/                # Gradio UI
│   └── transformers_api/         # Transformers model/processor wrappers
├── assets/
│   ├── framework.png             # README framework figure
│   ├── 2b_results.png            # README benchmark figure
│   └── inputs/                   # Demo images and videos
└── requirements.txt

📄 License

This project is released under the Apache 2.0 License.

📚 Citation

If you use Penguin-VL in your research, please cite:

@article{Penguin-VL,
  title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
  author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
  journal={arXiv preprint arXiv:2603.06569},
  year={2026}
}

If you find this project useful, please consider giving it a ⭐ on GitHub. Issues and PRs are welcome.
