Agentic, Reasoning, and Coding (ARC) foundation models
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
Code for the paper "Evaluating Large Language Models Trained on Code"
LongBench v2 and LongBench (ACL'25 & '24)
A.S.E (AICGSecEval) is a repository-level benchmark for evaluating the security of AI-generated code
Leaderboard Comparing LLM Performance at Producing Hallucinations
CodeGeeX: An Open Multilingual Code Generation Model (KDD 2023)
Designed for text embedding and ranking tasks
Benchmark LLMs by fighting in Street Fighter 3
Capable of understanding text, audio, vision, and video
Advanced language and coding AI model
The official repo of the Qwen chat & pretrained large language models
ChatGLM2-6B: An Open Bilingual Chat LLM
GLM-4.5: Open-source LLM for intelligent agents by Z.ai
Open-source evaluation toolkit of large multi-modality models (LMMs)
Unleashing 10,000+ Word Generation from Long Context LLMs
LISA: Reasoning Segmentation via Large Language Model
OpenCompass is an LLM evaluation platform
Qwen-Image is a powerful image generation foundation model
Test-Time Reinforcement Learning
A Gym environment for web task automation
Open-source model for program synthesis
AI-Driven Exploration in the Space of Code
Hypernetworks that adapt LLMs for specific benchmark tasks
Driving with Graph Visual Question Answering