BROWN CORPUS - TXM BINARY VERSION

"A Standard Corpus of Present-Day Edited American English, for use with Digital Computers." by W. N. Francis and H. Kucera (1964), Department of Linguistics, Brown University, Providence, Rhode Island, USA. Revised 1971, Revised and Amplified 1979. http://www.hit.uib.no/icame/brown/bcm.html. License: May be used for non-commercial purposes.

This version derives from the "Brown Corpus (TEI XML Version)" available from the NLTK Corpora web page: http://www.nltk.org/nltk_data

It was adapted to the TXM platform by the Textométrie research project http://textometrie.ens-lyon.fr.

Documentation

A) 'type' and 'decls' text properties (metadata)

'type' encodes the text category and 'decls' its one letter code, for example: type='Press: Reportage' and decls='A'.

BROWN CORPUS MANUAL, section 1 - CONTENTS, Brown University, 1964. Revised 1971

B) 'enpos' and 'enlemma' word properties

The 'enpos' and 'enlemma' word properties have been computed by TreeTagger.

Brown Penn Treebank TreeTagger Tagset Cheat Sheet (1)
- Beatrice Santorini, Part-of-Speech Tagging Guidelines for the Penn Treebank Project, March 15, 1991

C) 'type' and 'subtype' word properties

The 'type' and 'subtype' part-of-speech word properties come from the original BROWN corpus, recoded for the "TEI XML Version".

Quote from the "TEI XML Version" encoding documentation:

The original POS tagging scheme combined several codes into a single tag, combining them with minus and plus signs. It also combined in a single scheme morpho-syntactic categories such as "NN" for noun, with contextual ones, such as "TL" for words within titles. In this version the contextual codes have been separated out and appear as the value of the @subtype attribute. Where multiple tags were assigned to the same word, they are separated by spaces rather than + signs, as the value of the @type attribute. Where multiple tags were assigned to the same word, the word is explicitly marked as a "multiword", using a <mw> element.

The word "whaddya" (E01.2) originally had the code "wdt+ber+pp". Since this is the only occurrence of "pp" in the corpus, I have assumed it was an error for "pps" and hand-edited the text accordingly. Similarly, the erroneous tagging of the word "You're" (J31.28) as "ppss+ber-n" rather than "ppss+ber-nc" has been manually corrected.

BROWN CORPUS MANUAL, section 4 - THE TAGGED VERSION, Brown University, 1964. Revised 1971
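
The recoding described in the quote above can be illustrated with a minimal Python sketch. It is not the conversion script used for the "TEI XML Version": the function name `split_brown_tag`, the set of contextual codes (TL, HL, NC) and the lowercase output are assumptions made for this illustration only.

```python
from typing import NamedTuple

# Assumption: the contextual codes are the suffixes TL (title), HL (headline)
# and NC (cited word). Only "TL" and "nc" are named in the text above.
CONTEXT_CODES = {"tl", "hl", "nc"}


class RecodedTag(NamedTuple):
    type: str        # morpho-syntactic tags, space-separated
    subtype: str     # contextual codes, space-separated (may be empty)
    multiword: bool  # True when several tags were assigned to one word


def split_brown_tag(tag: str) -> RecodedTag:
    """Split an original combined tag into 'type' and 'subtype' values."""
    parts = tag.lower().split("-")
    subtypes = []
    # Peel contextual codes off the end; any other hyphen stays in the tag.
    while len(parts) > 1 and parts[-1] in CONTEXT_CODES:
        subtypes.insert(0, parts.pop())
    core = "-".join(parts)
    # Tags joined by "+" become a space-separated list.
    types = core.split("+")
    return RecodedTag(" ".join(types), " ".join(subtypes), len(types) > 1)


print(split_brown_tag("ppss+ber-nc"))
print(split_brown_tag("NN-TL"))
```

Under these assumptions, "ppss+ber-nc" yields type "ppss ber", subtype "nc" and is flagged as a multiword, while "NN-TL" yields type "nn" and subtype "tl". Hyphenated tags that do not end in one of the assumed contextual codes are returned unchanged in 'type'.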

Using the ready-to-use binary version provided here

To directly load the (compiled) binary corpus version into TXM desktop (version 0.6 or higher):

download the 'brown-bin.txm' file;
launch TXM;
call the File / Load command on the 'brown-bin.txm' file;
the BROWN corpus is ready to use.

Building the binary version yourself

To import the Brown corpus into TXM from its source files yourself:

download the brown_tei.zip file from http://www.nltk.org/nltk_data/packages/corpora/brown_tei.zip;
unzip the source files;
delete the following files from the unzipped source directory: BrownXML.dtd, BrownXML.rnc, BrownXML.rng, BrownXML.xsd, Corpus.xml, README, tei.css;
download the filter-teibrown4txm.xsl file from this folder;
launch TXM;
call the File / Import / 'XML/w+CSV' import module:
1. select the source directory;
2. select the 'filter-teibrown4txm.xsl' file in the "Front XSLT" parameter;
3. keep the 'annotate corpus' option checked, with the language set to 'en', if you want to run TreeTagger on the corpus on the fly;
4. click on 'Start import'.
the BROWN corpus is ready to use.
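
The download, unzip and delete steps above can also be scripted before launching TXM. The following is a minimal Python sketch, not part of the TXM tooling: the function names `fetch_and_unpack` and `prune_non_corpus_files` are hypothetical, and the layout of the unzipped archive is not assumed, so the directory that actually contains the files must be passed explicitly.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the instructions above.
BROWN_TEI_URL = "http://www.nltk.org/nltk_data/packages/corpora/brown_tei.zip"

# Files to delete from the unzipped source directory (list from the steps above).
NON_CORPUS_FILES = [
    "BrownXML.dtd", "BrownXML.rnc", "BrownXML.rng", "BrownXML.xsd",
    "Corpus.xml", "README", "tei.css",
]


def fetch_and_unpack(dest_dir, url=BROWN_TEI_URL):
    """Download the archive into dest_dir and extract it there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / "brown_tei.zip"
    urllib.request.urlretrieve(url, str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return dest


def prune_non_corpus_files(source_dir):
    """Delete the listed non-corpus files; return the names actually removed."""
    removed = []
    for name in NON_CORPUS_FILES:
        path = Path(source_dir) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

A typical use would be to call `fetch_and_unpack("brown_src")`, locate the directory holding the extracted XML files, and pass that directory to `prune_non_corpus_files` before selecting it as the source directory in the import module.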

If you want to give your own binary version of that corpus to someone else, select the 'BROWN' corpus and call the 'Export corpus' command to build the ZIP binary.

Please address any enquiries about the TXM conversion to textometrie@ens-lyon.fr