texrex

Web corpus creation software, ARC file processor

Download texrex-neuedimensionen-data-2014-06-23.tgz
Linux

Description

texrex is free software for processing ARC data files from web crawls and turning them into a corpus of web documents. Currently, it is limited to reading ARC files, but other input modules can be developed quickly.
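To give a feel for the input format, here is a minimal Python sketch that parses the header line of a single ARC record. It assumes the Internet Archive ARC version 1 layout, where each record starts with one line of five space-separated fields (URL, IP address, archive date, content type, length in bytes). The names `ArcHeader` and `parse_arc_header` are hypothetical and are not taken from texrex, which is written in Pascal and may read ARC files differently.

```python
from typing import NamedTuple


class ArcHeader(NamedTuple):
    """Fields of an ARC v1 record header line (hypothetical helper type)."""
    url: str
    ip_address: str
    archive_date: str   # 14-digit timestamp, YYYYMMDDhhmmss
    content_type: str
    length: int         # size of the record body in bytes


def parse_arc_header(line: str) -> ArcHeader:
    """Split one ARC v1 header line into its five fields.

    Assumes the v1 layout: URL, IP, date, content type, length,
    separated by single spaces. Raises ValueError on anything else.
    """
    fields = line.rstrip("\r\n").split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    url, ip_address, archive_date, content_type, length = fields
    return ArcHeader(url, ip_address, archive_date, content_type, int(length))


if __name__ == "__main__":
    sample = "http://example.org/ 192.0.2.1 20140623120000 text/html 1234\n"
    print(parse_arc_header(sample))
```

The `length` field tells a reader how many bytes of document content follow the header, which is what lets a processor step from one record to the next.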

Note: To test it adequately, you need a few ARC files containing documents in a European language.

It performs HTML stripping, codepage and entity conversion, perfect duplicate removal, high-precision boilerplate detection, text quality assessment, in-document paragraph deduplication, w-shingling, and server IP geolocation. Multi-threading is available to speed up processing.
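Two of the listed steps, perfect duplicate removal and w-shingling, can be illustrated with a short Python sketch. This is a generic textbook version under stated assumptions (whitespace tokenization, a shingle width of 4, SHA-1 digests for exact matching), not texrex's actual implementation; the function names `shingles`, `resemblance`, and `drop_exact_duplicates` are hypothetical.

```python
import hashlib


def shingles(text: str, w: int = 4) -> set:
    """Return the set of w-token shingles (contiguous token windows).

    Tokenization by whitespace and w=4 are assumptions for illustration.
    Texts shorter than w tokens yield a single shingle of all tokens.
    """
    tokens = text.split()
    if not tokens:
        return set()
    if len(tokens) < w:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a: str, b: str, w: int = 4) -> float:
    """Jaccard similarity of the two shingle sets (1.0 = identical sets)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def drop_exact_duplicates(docs: list) -> list:
    """Keep the first occurrence of each byte-identical document."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    docs = ["the quick brown fox jumps", "the quick brown fox jumps", "a b c d"]
    unique = drop_exact_duplicates(docs)
    print(len(unique))
    print(resemblance("the quick brown fox jumps", "the quick brown fox sleeps"))
```

Exact hashing only catches byte-identical documents, while shingle overlap gives a graded score for near-duplicates; that is why a pipeline would typically use both.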

texrex Web Site


Additional Project Details

Languages

English

Intended Audience

Science/Research

User Interface

Command-line

Programming Language

Pascal

Registered

2011-06-09
