Trafilatura v1.12.0
| Name | Modified | Size |
| --- | --- | --- |
| README.md | 2024-07-30 | 812 Bytes |
| trafilatura-1.12.0 source code.tar.gz | 2024-07-30 | 31.7 MB |
| trafilatura-1.12.0 source code.zip | 2024-07-30 | 32.2 MB |

Totals: 3 items, 63.9 MB

Breaking change:

- enforce fixed list of output formats, deprecate `-out` on the CLI (#647)
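To illustrate what enforcing a fixed list of output formats can mean for calling code, here is a minimal, standard-library-only sketch. It is not Trafilatura's implementation: the function name `validate_output_format` and the constant `SUPPORTED_FORMATS` are hypothetical, and the set of format names is an assumption made for the example.

```python
# Hypothetical illustration of validating against a fixed list of formats.
# Neither the names nor the exact set below are taken from Trafilatura's code.
SUPPORTED_FORMATS = frozenset(
    {"csv", "json", "html", "markdown", "txt", "xml", "xmltei"}  # assumed list
)


def validate_output_format(name: str) -> str:
    """Return the normalized format name, or raise ValueError if unknown."""
    normalized = name.strip().lower()
    if normalized not in SUPPORTED_FORMATS:
        expected = ", ".join(sorted(SUPPORTED_FORMATS))
        raise ValueError(
            f"Unsupported output format {name!r}; expected one of: {expected}"
        )
    return normalized
```

Rejecting unknown names up front, rather than silently falling back to a default, makes a typo in the requested format fail loudly at the call site.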

Faster, more accurate extraction:

- review link and structure checks (#653)
- improve justext fallback (#652)
- baseline: prevent LXML error in JSON-LD (#643), do not use as backup extraction (#646)
- review XPaths for undesirable content (#645)

Bugfixes and maintenance:

- CLI fix: markdown format should trigger `include_formatting` (#649)
- images fix: use a length threshold on `src` attribute (#654)
- XML-TEI: replace RelaxNG by DTD, remove pickle, and update (#655)
- formatting & markdown fix: add newlines (#656)
- table fix: prevent MemoryError & ValueError during conversion to text (#658)
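The images fix above mentions a length threshold on the `src` attribute. A minimal sketch of that idea follows, assuming the goal is to skip implausibly long values such as inlined data URIs. The helper `keep_image_src` and the cutoff `MAX_SRC_LENGTH` are hypothetical and chosen for illustration only; they do not reflect the value or logic used in the library.

```python
from typing import Optional

# Assumed cutoff for illustration; not the threshold Trafilatura uses.
MAX_SRC_LENGTH = 1000


def keep_image_src(src: Optional[str]) -> Optional[str]:
    """Return the stripped src if it is non-empty and short enough, else None."""
    if not src:
        return None
    cleaned = src.strip()
    if not cleaned or len(cleaned) > MAX_SRC_LENGTH:
        return None
    return cleaned
```

A simple length check like this is cheap to run on every image element and avoids copying very large attribute values into the output.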

Documentation:

- update `crawls.rst`: `known` is an unexpected argument, by @tommytyc in #638

Source: README.md, updated 2024-07-30