PaSa is an open-source “paper search agent” built on large language models (LLMs) that automates academic literature retrieval with human-like decision making. Instead of translating a query into keywords and returning a flat list of matching papers, PaSa uses a dual-agent architecture (Crawler + Selector) that iteratively searches, reads, analyzes, and filters academic publications, simulating how a researcher might dig through citation networks, expand references, and evaluate relevance based on both metadata and content.

Given a complex scholarly question (for example, “Which works focus on non-stationary reinforcement learning with UCB-based value methods?”), PaSa decomposes the task. The Crawler generates search queries, retrieves candidate papers (via search tools and citation expansion), and adds them to a “paper queue.” The Selector then reads each paper’s abstract or full text (depending on what’s available) and decides which papers are relevant.
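
The Crawler and Selector workflow described above can be illustrated with a small, self-contained sketch. This is not PaSa's code: the `Paper` class, the `CORPUS` data, and the `search`, `select`, and `run_agent` functions are all hypothetical, and simple keyword matching stands in for the LLM-driven decisions the real agents make. The sketch also assumes references are expanded for every paper taken from the queue, which the description above does not specify.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Paper:
    paper_id: str
    title: str
    abstract: str
    references: list = field(default_factory=list)


# Toy in-memory corpus standing in for a real search tool (hypothetical data).
CORPUS = {
    "p1": Paper("p1", "UCB value iteration for non-stationary RL",
                "non-stationary reinforcement learning with UCB bonuses",
                ["p2", "p3"]),
    "p2": Paper("p2", "Restarted Q-learning with UCB",
                "UCB exploration in non-stationary reinforcement learning"),
    "p3": Paper("p3", "Image segmentation survey",
                "a survey of convolutional segmentation models"),
}


def search(query):
    """Stand-in search tool: papers whose title contains every query word."""
    words = query.lower().split()
    return [p for p in CORPUS.values()
            if all(w in p.title.lower() for w in words)]


def select(paper, required_terms):
    """Stand-in Selector: keep a paper if its abstract has all required terms."""
    text = paper.abstract.lower()
    return all(term in text for term in required_terms)


def run_agent(queries, required_terms, max_papers=50):
    """Crawl from the search queries, expand citations, and filter by relevance."""
    queue = deque()   # the "paper queue" of candidates awaiting a decision
    seen = set()      # avoids queuing the same paper twice

    # Crawler step 1: seed the queue with search results.
    for query in queries:
        for paper in search(query):
            if paper.paper_id not in seen:
                seen.add(paper.paper_id)
                queue.append(paper)

    relevant = []
    processed = 0
    while queue and processed < max_papers:
        paper = queue.popleft()
        processed += 1

        # Selector: decide whether this candidate is relevant.
        if select(paper, required_terms):
            relevant.append(paper.paper_id)

        # Crawler step 2: citation expansion (assumed here for every paper).
        for ref_id in paper.references:
            if ref_id in CORPUS and ref_id not in seen:
                seen.add(ref_id)
                queue.append(CORPUS[ref_id])

    return relevant


found = run_agent(["non-stationary RL"], ["ucb", "non-stationary"])
print(found)
```

In this toy run the search finds only `p1` by title, and `p2` is reached solely through `p1`'s reference list, which is the kind of discovery that citation expansion is meant to enable. The `max_papers` cap is an assumed safeguard to bound the crawl.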
Features
- Dual-agent architecture (Crawler + Selector) that enables iterative search, citation expansion, and content-based selection rather than simple keyword matching
- Workflows trained with reinforcement learning (on synthetic and real query datasets) to optimize recall and precision for complex, nuanced academic queries
- Support for automatic citation network traversal: starting from initial hits, the agent can expand references to discover relevant works beyond the first set of search results
- End-to-end pipeline (query → search → paper retrieval → reading & evaluation → filtered results) that minimizes manual intervention
- Public datasets (AutoScholarQuery for training; RealScholarQuery for evaluation), open-source code, and pretrained models, enabling reproducible research and custom fine-tuning
- Benchmarked performance showing higher recall than standard search engines and naive LLM-based search on real-world academic queries
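
For readers unfamiliar with the recall and precision metrics mentioned in the features above, the following is a minimal sketch of how they are conventionally computed for one query. The function name `precision_recall` and the paper IDs are hypothetical and for illustration only; this is not PaSa's evaluation code, and the numbers are not benchmark results.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query's result set.

    retrieved: paper IDs the search system returned
    relevant:  paper IDs annotated as truly relevant (ground truth)
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)

    # Precision: share of returned papers that are actually relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant papers that the system managed to return.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Made-up example: 3 papers returned, 4 papers truly relevant, 2 in common.
precision, recall = precision_recall(
    retrieved=["p1", "p2", "p4"],
    relevant=["p1", "p2", "p3", "p5"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Citation expansion mainly targets recall, since it surfaces relevant papers that the initial queries miss, while the Selector's filtering mainly protects precision.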