Files in this folder:

  README.markdown          2018-12-10   4.5 kB
  brown-bin.txm            2017-11-03   79.5 MB
  filter-teibrown4txm.xsl  2013-07-10   5.0 kB

BROWN CORPUS - TXM BINARY VERSION

"A Standard Corpus of Present-Day Edited American English, for use with Digital Computers." by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA. Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. License: May be used for non-commercial purposes.

This version derives from the "Brown Corpus (TEI XML Version)" available from the NLTK Corpora web page: http://www.nltk.org/nltk_data

It was adapted to the TXM platform by the Textométrie research project http://textometrie.ens-lyon.fr.

Documentation

A) 'type' and 'decls' text properties (metadata)

'type' encodes the text category and 'decls' its one-letter code, for example: type='Press: Reportage' and decls='A'.
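As an illustration of how the one-letter code relates to the category, here is a minimal Python sketch. Only the pair 'A' / 'Press: Reportage' is confirmed by this README; the other labels follow the Brown Corpus manual and may not match the exact 'type' strings stored in this corpus. The `category_of` helper is hypothetical, and it assumes that text identifiers start with the category letter, as in the locations "E01.2" and "J31.28" quoted further down.

```python
# Category labels per the Brown Corpus manual (assumption: the 'type'
# strings in the TXM corpus may be spelled differently, except for 'A').
# The letters I, O and Q are not used.
BROWN_CATEGORIES = {
    "A": "Press: Reportage",
    "B": "Press: Editorial",
    "C": "Press: Reviews",
    "D": "Religion",
    "E": "Skills and Hobbies",
    "F": "Popular Lore",
    "G": "Belles Lettres, Biography, Memoirs, etc.",
    "H": "Miscellaneous",
    "J": "Learned",
    "K": "General Fiction",
    "L": "Mystery and Detective Fiction",
    "M": "Science Fiction",
    "N": "Adventure and Western Fiction",
    "P": "Romance and Love Story",
    "R": "Humor",
}


def category_of(text_id):
    """Return (decls, type) for an identifier such as 'A01' or 'E01.2'.

    Hypothetical helper: assumes the first character of the identifier
    is the category letter. Raises KeyError for an unknown letter.
    """
    code = text_id[:1].upper()
    return code, BROWN_CATEGORIES[code]
```

For instance, `category_of("A01")` yields the 'decls' and 'type' values given in the example above.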

B) 'enpos' and 'enlemma' word properties

The 'enpos' and 'enlemma' word properties have been computed by TreeTagger.

C) 'type' and 'subtype' word properties

The 'type' and 'subtype' part-of-speech word properties come from the original BROWN corpus, recoded for the "TEI XML Version".

Quote from the "TEI XML Version" encoding documentation:

The original POS tagging scheme combined several codes into a single tag, combining them with minus and plus signs. It also combined in a single scheme morpho-syntactic categories such as "NN" for noun, with contextual ones, such as "TL" for words within titles. In this version the contextual codes have been separated out and appear as the value of the @subtype attribute. Where multiple tags were assigned to the same word, they are separated by spaces rather than + signs, as the value of the @type attribute. Where multiple tags were assigned to the same word, the word is explicitly marked as a "multiword", using a <mw> element.

The word "whaddya" (E01.2) originally had the code "wdt+ber+pp". Since this is the only occurrence of "pp" in the corpus, I have assumed it was an error for "pps" and hand-edited the text accordingly. Similarly, the erroneous tagging of the word "You're" (J31.28) as "ppss+ber-n" rather than "ppss+ber-nc" has been manually corrected.
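The recoding described in the quote can be sketched in Python as follows. This is only an illustration of the rule, not the code used to build the "TEI XML Version". The function name `split_brown_tag` is hypothetical, and the set of contextual codes ("tl", "hl", "nc") is an assumption taken from the original Brown tagset; other markers, such as the foreign-word prefix, are not handled.

```python
# Contextual codes that are moved to @subtype (assumed set: titles,
# headlines, cited words). Only "TL" and "nc" appear in the quote above.
CONTEXTUAL_CODES = {"tl", "hl", "nc"}


def split_brown_tag(tag):
    """Split an original Brown tag into (type, subtype) strings.

    'ppss+ber-nc' -> ('ppss ber', 'nc')
    """
    parts = tag.split("-")
    subtypes = []
    # Peel contextual codes off the right-hand end of the tag.
    while len(parts) > 1 and parts[-1].lower() in CONTEXTUAL_CODES:
        subtypes.insert(0, parts.pop())
    # Whatever remains is the morpho-syntactic part; '+' joins several
    # tags assigned to the same word, rewritten with spaces.
    core = "-".join(parts)
    return " ".join(core.split("+")), " ".join(subtypes)


def is_multiword(tag):
    """True when several tags were assigned to the same word."""
    type_value, _ = split_brown_tag(tag)
    return len(type_value.split()) > 1
```

With this sketch, the corrected tag of "You're" splits into the type "ppss ber" and the subtype "nc", and `is_multiword` reports that the word would be marked as a multiword.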

Using the ready-to-use binary version provided here

To directly load the (compiled) binary corpus version into TXM desktop (version 0.6 or higher):

  1. download the 'brown-bin.txm' file;

  2. launch TXM;

  3. call the File / Load command on the 'brown-bin.txm' file;

  4. the BROWN corpus is ready to use.

Building the binary version yourself

To import the Brown corpus into TXM from its source files yourself:

  1. download the brown_tei.zip file from http://www.nltk.org/nltk_data/packages/corpora/brown_tei.zip;

  2. unzip the source files;

  3. delete the following files from the unzipped source directory: BrownXML.dtd, BrownXML.rnc, BrownXML.rng, BrownXML.xsd, Corpus.xml, README, tei.css;

  4. download the filter-teibrown4txm.xsl file from this folder;

  5. launch TXM;

  6. call the File / Import / 'XML/w+CSV' import module:

    1. select the source directory;

    2. select the 'filter-teibrown4txm.xsl' file in the "Front XSLT" parameter;

    3. keep the 'annotate corpus' option checked, with the language set to 'en', if you want TreeTagger to annotate the corpus on the fly;

    4. click on 'Start import'.

  7. the BROWN corpus is ready to use.
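Steps 1 to 3 above can also be scripted. The following Python sketch uses only the standard library; the function names and the default file names are illustrative, not part of TXM. It assumes the archive may contain a top-level folder, so it looks for the files to delete at any depth.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL given in step 1.
BROWN_TEI_URL = "http://www.nltk.org/nltk_data/packages/corpora/brown_tei.zip"

# Files listed in step 3.
FILES_TO_DELETE = {
    "BrownXML.dtd", "BrownXML.rnc", "BrownXML.rng", "BrownXML.xsd",
    "Corpus.xml", "README", "tei.css",
}


def download_archive(url=BROWN_TEI_URL, target="brown_tei.zip"):
    """Step 1: download the archive (requires network access)."""
    path, _headers = urllib.request.urlretrieve(url, target)
    return Path(path)


def prepare_source_dir(archive, dest):
    """Steps 2 and 3: unzip the archive, then delete the listed files.

    Returns the sorted names of the files that were deleted.
    """
    dest = Path(dest)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    removed = []
    # Materialise the listing first so that deleting files does not
    # interfere with the directory traversal.
    for path in list(dest.rglob("*")):
        if path.is_file() and path.name in FILES_TO_DELETE:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Calling `prepare_source_dir(download_archive(), "brown_src")` would leave a directory that can be selected as the source directory in step 6; the XSL file from step 4 still has to be downloaded from this folder by hand.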

If you want to give your own binary version of that corpus to someone else, select the 'BROWN' corpus and call the 'Export corpus' command to build the ZIP binary.

Please address any enquiries about the TXM conversion to textometrie@ens-lyon.fr

Notes

  1. adapted from https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Source: README.markdown, updated 2018-12-10