BlogTEX is an ad hoc blog post extraction algorithm, written in Java, for the TREC Blog08 dataset. It includes an optimized sentence model for identifying sentence boundaries in each blog post. Its output can be customized through its config file.
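The description above does not say how BlogTEX's sentence model works. As a rough illustration of sentence boundary detection in Java, here is a minimal sketch using the standard library's `java.text.BreakIterator`. It is not BlogTEX's own model, and the class name `SentenceSplitSketch` and the method `splitSentences` are hypothetical names chosen for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: this is NOT BlogTEX's sentence model, only a
// standard-library stand-in for splitting extracted post text into sentences.
public class SentenceSplitSketch {

    // Splits plain text into trimmed, non-empty sentences.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);

        int start = it.first();
        // Walk consecutive boundaries; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Assumed input: blog post text already stripped of markup.
        String post = "I like blogs. Do you? Yes!";
        for (String s : splitSentences(post)) {
            System.out.println(s);
        }
    }
}
```

A general-purpose splitter like this can mis-split on abbreviations or unusual punctuation, which is common in blog text and is presumably the kind of case a tuned sentence model is meant to handle better.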