Name | Modified | Size | Downloads / Week |
---|---|---|---|
README.md | 2025-06-03 | 1.3 kB | |
v1.2.0 source code.tar.gz | 2025-06-03 | 16.9 MB | |
v1.2.0 source code.zip | 2025-06-03 | 17.1 MB | |
Totals: 3 Items | | 34.0 MB | 1 |
What's Changed
- Code-prose-composition tagger by @no0p in https://github.com/allenai/dolma/pull/247
- Support for WARC resource record types by @no0p in https://github.com/allenai/dolma/pull/248
- Bump artifacts version to 4.4.1 by @no0p in https://github.com/allenai/dolma/pull/252
- Sanitize-concat-fim by @cmwilhelm in https://github.com/allenai/dolma/pull/253
- Use original s3 path to delete local cache by @CodeCreator in https://github.com/allenai/dolma/pull/257
- Safe tokenization by skipping failing docs by @soldni in https://github.com/allenai/dolma/pull/245
- Update RedPajama branch link by @guspan-tanadi in https://github.com/allenai/dolma/pull/263
- Skipping empty tagger key instead of erroring out by @Whattabatt in https://github.com/allenai/dolma/pull/262
- Tokenizer over custom fields and without IDs; BOS/EOS tokens by @soldni in https://github.com/allenai/dolma/pull/266
New Contributors
- @no0p made their first contribution in https://github.com/allenai/dolma/pull/247
- @CodeCreator made their first contribution in https://github.com/allenai/dolma/pull/257
- @guspan-tanadi made their first contribution in https://github.com/allenai/dolma/pull/263
Full Changelog: https://github.com/allenai/dolma/compare/v1.1.2...v1.2.0