| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| README.md | 2026-05-31 | 2.1 kB | |
| v3.10.30 -- 4-dataset BEIR (rank 3_11 mean) + config-divergence finding source code.tar.gz | 2026-05-31 | 25.4 MB | |
| v3.10.30 -- 4-dataset BEIR (rank 3_11 mean) + config-divergence finding source code.zip | 2026-05-31 | 28.5 MB | |
| Totals: 3 Items | 54.0 MB | 1 | |
What ships
A fourth BEIR dataset (SciDocs) joins NFCorpus, SciFact, and ArguAna. New finding: no single pipeline wins on every dataset.
SciDocs results
| Pipeline | nDCG@10 | Rank | Note |
|---|---|---|---|
| dense alone (BGE-base) | 0.211 | 2/11 | |
| Lucene RRF (no rerank) | 0.203 | | -0.008 vs. dense alone (RRF hurt) |
Dense alone trails only BGE-large (335M, 0.225) and beats BM25, GTR-XL (1.2B), and every other published baseline.
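For readers unfamiliar with RRF, here is a minimal sketch of reciprocal rank fusion as it is commonly defined: each document scores the sum of `1 / (k + rank)` over the input rankings. The function name `rrf_fuse`, the constant `k=60`, and the toy document ids are illustrative assumptions, not ruflo's actual implementation or settings.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids with reciprocal rank fusion.

    `k=60` is a commonly used default, assumed here for illustration.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy rankings, not taken from any BEIR run.
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d2", "d4"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)
```

Because every input ranking gets equal weight, a weaker ranker can pull down documents that the stronger one placed well. That is one plausible reading of the small SciDocs drop, though the numbers above do not prove it.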
4-dataset mean leaderboard
| System | Params | NFCorpus | SciFact | ArguAna | SciDocs | Mean |
|---|---|---|---|---|---|---|
| BGE-large (published) | 335M | 0.380 | 0.722 | 0.636 | 0.225 | 0.491 |
| SPLADE++ (published) | 110M | 0.347 | 0.704 | 0.521 | 0.159 | 0.433 |
| ruflo best (per-dataset) | 110M | 0.358 | 0.683 | 0.432 | 0.211 | 0.421 |
| GTR-XL (1.2B) | 1.2B | 0.343 | 0.662 | 0.439 | 0.174 | 0.405 |
| GenQ | 110M | 0.319 | 0.644 | 0.493 | 0.143 | 0.400 |
| BM25 (Lucene published) | — | 0.325 | 0.679 | 0.397 | 0.158 | 0.390 |
Rank 3 of 11 on the 4-dataset mean. Beats GTR-XL with roughly one-tenth the parameters (110M vs. 1.2B). Loses only to SPLADE++ (-0.012, basically tied) and BGE-large (-0.070, mostly the ArguAna gap).
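The Mean column and the gaps quoted above can be rechecked from the per-dataset scores. The sketch below copies the six rows shown in the table (the other five of the eleven ranked systems are not listed here, so the rank it computes is only among these six). Variable names are illustrative.

```python
# nDCG@10 per dataset, in table order: NFCorpus, SciFact, ArguAna, SciDocs.
scores = {
    "BGE-large (published)": [0.380, 0.722, 0.636, 0.225],
    "SPLADE++ (published)": [0.347, 0.704, 0.521, 0.159],
    "ruflo best (per-dataset)": [0.358, 0.683, 0.432, 0.211],
    "GTR-XL (1.2B)": [0.343, 0.662, 0.439, 0.174],
    "GenQ": [0.319, 0.644, 0.493, 0.143],
    "BM25 (Lucene published)": [0.325, 0.679, 0.397, 0.158],
}

# Unweighted mean over the four datasets.
means = {name: sum(vals) / len(vals) for name, vals in scores.items()}
leaderboard = sorted(means, key=means.get, reverse=True)

for position, name in enumerate(leaderboard, start=1):
    print(f"{position}. {name:26s} {means[name]:.3f}")

ruflo = "ruflo best (per-dataset)"
gap_splade = means[ruflo] - means["SPLADE++ (published)"]
gap_bge = means[ruflo] - means["BGE-large (published)"]

# Share of the BGE-large gap in the mean that comes from ArguAna alone.
arguana_part = (scores[ruflo][2] - scores["BGE-large (published)"][2]) / 4
print(f"gap vs SPLADE++:  {gap_splade:+.3f}")
print(f"gap vs BGE-large: {gap_bge:+.3f} (ArguAna contributes {arguana_part:+.3f})")
```

ArguAna alone accounts for about 0.051 of the roughly 0.070 gap to BGE-large, which is what "mostly the ArguAna gap" refers to.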
The config-divergence finding
After 4 datasets, no single pipeline wins everywhere:
| Dataset | Best config | What hurts |
|---|---|---|
| NFCorpus | Lucene + RRF + CE rerank | nothing |
| SciFact | Lucene + RRF + CE rerank | nothing |
| ArguAna | Lucene + RRF (no CE) | CE rerank actively hurts |
| SciDocs | dense alone | RRF hurt by 0.008 |
The four datasets split across three distinct best configs (NFCorpus and SciFact share one). Automatic pipeline selection would need a per-corpus calibrator (cheap, no GPU needed; tracked).
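One way such a calibrator could work, sketched under stated assumptions: run each candidate config over a small labelled sample of the corpus, score it with nDCG@10, and keep the winner. Everything below (`ndcg_at_k`, `pick_config`, the data shapes, the toy queries) is hypothetical and not the tracked design. The nDCG here is the linear-gain variant.

```python
import math


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query; `relevance` maps doc id -> graded gain."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_ids[:k])
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


def pick_config(runs, qrels, k=10):
    """Return (best config name, mean nDCG@k per config).

    runs:  config name -> {query id -> ranked doc ids}
    qrels: query id -> {doc id -> gain}
    """
    mean_ndcg = {}
    for config, per_query in runs.items():
        vals = [
            ndcg_at_k(per_query.get(qid, []), rel, k)
            for qid, rel in qrels.items()
        ]
        mean_ndcg[config] = sum(vals) / len(vals)
    best = max(mean_ndcg, key=mean_ndcg.get)
    return best, mean_ndcg


# Hypothetical two-query calibration sample.
qrels = {"q1": {"a": 1}, "q2": {"b": 1}}
runs = {
    "dense alone": {"q1": ["a", "x"], "q2": ["b", "y"]},
    "Lucene + RRF": {"q1": ["x", "a"], "q2": ["b", "y"]},
}

best, per_config = pick_config(runs, qrels)
print(best, per_config)
```

On a real corpus the sample would need enough labelled queries for the differences between configs to be meaningful. A gap as small as the 0.008 seen on SciDocs may not be distinguishable on a small sample.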
Honest limits
- 4/18 BEIR datasets. The 0.421 mean is suggestive, not BEIR-average.
- Zero-shot — NFCorpus and ArguAna train splits remain unused.
- The 5 biggest BEIR datasets (TREC-COVID, FiQA, HotpotQA, NQ, DBPedia, all >50k docs) remain GPU-gated.
Install
```bash
npx ruflo@3.10.30 # latest / alpha / v3alpha all aligned
```
Full ADR: v3/docs/adr/ADR-091-scidocs-and-config-divergence.md