texrex is free software for processing ARC data files from web crawls and turning them into a corpus of web documents. Currently, it can only read ARC files, but other input modules can be developed quickly.
Note: To test it adequately, you need a few ARC files containing documents in a European language.
It performs HTML stripping, codepage and entity conversion, perfect-duplicate removal, high-precision boilerplate detection, text quality assessment, in-document paragraph deduplication, w-shingling, and server IP geolocation. Multi-threading is available to speed up processing.
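To illustrate one of the listed steps: w-shingling generally represents a document as the set of its contiguous sequences of w tokens, so that near-duplicate documents can be found by comparing those sets. The following is a minimal Python sketch of that general idea, not texrex's own code; the function names `shingles` and `jaccard`, the whitespace tokenization, and the default `w=3` are assumptions made for this example.

```python
def shingles(text, w=3):
    """Return the set of w-token shingles of a text (hypothetical helper)."""
    # Assumption: lowercase whitespace tokenization; a real tool may differ.
    tokens = text.lower().split()
    if len(tokens) < w:
        # Texts shorter than w tokens yield one short shingle, or none if empty.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets, in the range 0.0 to 1.0."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox leaps over the lazy dog"
    print(jaccard(shingles(doc1), shingles(doc2)))
```

Documents whose similarity exceeds a chosen threshold would be treated as near-duplicates; the threshold and the value of w are tuning choices that this sketch leaves open.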