unfluff
Automatically extract body content (and other cool stuff) from HTML
...It supports caching internal representations to speed up repeated extractions. While its language support is best for English, it is still widely used in web-content-processing pipelines. The repository notes some limitations (e.g., languages like Chinese/Arabic/Korean may not be well-supported). Because of its simplicity and focused purpose, it can be a reliable building block in backend services or CLI tools.