Name | Modified | Size | Downloads / Week |
---|---|---|---|
README.md | 2024-06-27 | 707 Bytes | |
trafilatura-1.11.0 source code.tar.gz | 2024-06-27 | 32.6 MB | |
trafilatura-1.11.0 source code.zip | 2024-06-27 | 33.0 MB | |
Totals: 3 Items | | 65.7 MB | 0 |
Breaking change:
- metadata is now skipped by default (#613); to include it in all output formats, use:
  - with_metadata=True (Python)
  - --with-metadata (CLI)
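Because the default has flipped, code that relied on metadata appearing in the output needs to request it explicitly. The following is a minimal sketch, assuming the input is an HTML string already in memory. The helper names `extraction_options` and `extract_with_metadata` are hypothetical and not part of trafilatura; only `extract()` and its `with_metadata` and `output_format` arguments come from the library.

```python
from typing import Any, Dict, Optional


def extraction_options(include_metadata: bool = True,
                       output_format: str = "txt") -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for trafilatura.extract()."""
    # Since 1.11.0 metadata is skipped unless with_metadata=True is passed.
    return {"with_metadata": include_metadata, "output_format": output_format}


def extract_with_metadata(html: str) -> Optional[str]:
    """Hypothetical wrapper: extract text from an HTML string, keeping metadata."""
    # Imported lazily so extraction_options() is usable without trafilatura.
    from trafilatura import extract

    return extract(html, **extraction_options(include_metadata=True))
```

Keeping the options in one place makes the opt-in visible at every call site, which helps when upgrading code written against earlier releases.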

Extraction:
- add HTML as output format (#614)
- better and faster baseline extraction (#619)
- better handling of HTML/XML elements (#628)
- XPath rules added with @felipehertzer (#540)
- fix: avoid faulty readability_lxml content (#635)

Evaluation:
- new scripts and data with @LydiaKoerber (#606, #615)
- additional data with @swetepete (#197)

Maintenance:
- docs extended and updated, added page on deduplication (#618)
- review code, add tests and types in part of the submodules (#620, #623, #624, #625)