I'm working on a project that needs to run precise calculations over a set of large files (300 GB) in WARC format, an archiving format that can store data of any type (PDF, DOC, TXT, HTML, ...).

So my problem is: how can I access these files with NLTK in order to process them?
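To make the question concrete, here is a rough sketch of the kind of access I mean, not a finished solution. It assumes uncompressed WARC/1.0 records with a `Content-Length` header. The function name `iter_warc_records` and the sample record are made up for illustration. It streams one record at a time, so the whole 300 GB never has to sit in memory, and it falls back to a plain regex if NLTK is not installed.

```python
import io
import re

try:
    # Regex-based NLTK tokenizer; needs no extra data downloads.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback using the same pattern, so the sketch runs without NLTK.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)


def iter_warc_records(stream):
    """Yield (headers, payload_bytes) for each record in a binary WARC stream.

    Hypothetical helper: reads one record at a time using Content-Length.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


# Tiny in-memory record standing in for a real file (made-up sample data).
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 12\r\n"
    b"\r\n"
    b"Hello, WARC!"
    b"\r\n\r\n"
)

for headers, payload in iter_warc_records(io.BytesIO(sample)):
    # Only plain-text payloads are tokenized here; PDF, DOC or HTML
    # would need a text-extraction step first.
    if headers.get("Content-Type", "").startswith("text/plain"):
        tokens = wordpunct_tokenize(payload.decode("utf-8", "replace"))
        print(headers["WARC-Type"], tokens)
```

For a real file I would presumably replace `io.BytesIO(sample)` with `open(path, "rb")`, or `gzip.open(path, "rb")` for `.warc.gz` files, but I am not sure this is the right approach at this scale.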

Any idea would be very helpful.