...It is structured into 7,500 training problems and 1,000 test problems. These aren’t trivial exercises: many require multi-step reasoning that chains several arithmetic operations and carries intermediate results forward (e.g. “If she sold half as many in May… how many in total?”). The problems are written by human authors rather than generated automatically, which gives the set linguistic variety and realism. The repository stores each problem and its answer as a pair in a strict format (e.g. JSONL), and the set is widely used in research to benchmark model performance on word problems. Errors such as incorrect answers or ambiguous statements are reported through the issue tracker, and contributions that clean or expand the set are possible.
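
As a rough illustration of what reading problem and answer pairs from a JSONL file might look like, here is a minimal sketch. The field names `question` and `answer`, the helper `read_problems`, and the sample records are all assumptions made for this example; the text above does not confirm the repository's actual schema.

```python
import json
from typing import Iterable, Iterator

# Assumed field names for illustration only; check the real files before relying on them.
REQUIRED_FIELDS = ("question", "answer")


def read_problems(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty JSONL line, checking the assumed fields."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        yield record


# Made-up sample records, not taken from the dataset.
sample = [
    '{"question": "Ana sold 10 cups in April and half as many in May. How many in total?", "answer": "15"}',
    "",
    '{"question": "A box holds 4 rows of 6 eggs. How many eggs are in the box?", "answer": "24"}',
]

problems = list(read_problems(sample))
```

To read from disk, the same function can take an open file handle, for example `read_problems(open(path, encoding="utf-8"))`, since iterating over a text file yields one line at a time. Failing loudly on a missing field is one possible design choice; it matches the strict-formatting goal and makes malformed records easy to report.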