Perception Models is a state-of-the-art framework from Facebook Research for image and video perception. It has two primary components: the Perception Encoder (PE), a family of vision encoders for visual feature extraction, and the Perception Language Model (PLM), a vision-language decoder for multimodal understanding and reasoning. PE is designed to excel at image and video understanding and surpasses encoders such as SigLIP2, InternVideo2, and DINOv2 across multiple benchmarks. PLM builds on PE to deliver results competitive with leading multimodal systems such as QwenVL2.5 and InternVL3, while remaining fully reproducible with open data. The project supports research applications ranging from visual recognition and dense prediction to fine-grained multimodal understanding, and it also releases several large-scale open datasets for image and video perception.
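Because PE is intended to serve as a drop-in visual backbone, the sketch below shows one plausible way to extract pooled image features from a PE checkpoint through timm. It is a minimal sketch under assumptions: the model identifier and image path are placeholders, so substitute the actually released PE variant and your own input.

```python
# Minimal sketch: using a Perception Encoder (PE) checkpoint as a frozen image
# feature extractor via timm. The model id below is an assumption, not a
# confirmed release name.
import timm
import torch
from PIL import Image

model_name = "vit_pe_core_large_patch14_336"  # hypothetical timm model id
model = timm.create_model(model_name, pretrained=True, num_classes=0)
model.eval()

# Build the preprocessing pipeline that matches the checkpoint's training config.
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

image = Image.open("example.jpg").convert("RGB")  # placeholder input image
batch = transform(image).unsqueeze(0)             # shape: (1, 3, H, W)

with torch.no_grad():
    features = model(batch)  # pooled embedding, shape: (1, feature_dim)

print(features.shape)
```

With `num_classes=0`, timm returns the encoder's pooled embedding rather than classification logits, which is the form most downstream recognition or dense-prediction heads would consume.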
## Features
- Combines Perception Encoder (PE) for vision encoding and Perception Language Model (PLM) for multimodal decoding
- State-of-the-art performance in image, video, and vision-language benchmarks
- Open, reproducible models using freely available datasets for transparency
- Multiple PE variants specialized for core, language-aligned, and spatial tasks
- PLM available in 1B, 3B, and 8B parameter sizes for flexible research needs
- Integrated with popular tools such as Hugging Face Transformers, timm, and lmms-eval (see the usage sketch after this list)
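For the Hugging Face Transformers integration, the following is a hedged sketch of prompting a PLM checkpoint through the `image-text-to-text` pipeline. The checkpoint id and image URL are assumptions; substitute the PLM size (1B, 3B, or 8B) and inputs you actually intend to use.

```python
# Minimal sketch: querying a Perception Language Model (PLM) checkpoint via the
# Transformers "image-text-to-text" pipeline. The checkpoint id is an
# assumption, not a confirmed release name.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="facebook/Perception-LM-1B",  # hypothetical checkpoint id
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.jpg"},  # placeholder image
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=64)

# Inspect the generated response.
print(outputs[0]["generated_text"])
```

Evaluation with lmms-eval follows that project's standard model/task configuration rather than any PLM-specific interface, so it is not sketched here.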