Balie / News: Recent posts

Balie & YooName thesis

Balie, and its NER extension called YooName, are thoroughly described in the PhD thesis titled "Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision".

The thesis was successfully defended some weeks ago and is now available online at http://cogprints.org/5859/.

Posted by David N. 2007-12-16

Balie Ungava release is available!

The most important release of Balie to date is now available. Codenamed "Ungava", it comprises the latest named entity recognition capabilities.

Posted by David N. 2007-07-31

Balie 1.65 released - qTag removed

qTag (external product) was removed from Balie in this release.

Note that we maintain a distinct qTag wrapper in the CVS. It offers the exact same part-of-speech annotation function.

Release 1.65 also corrects some minor bugs and allows NER alias network information to be printed in XML output.

Posted by David N. 2007-07-04

Balie evolution codenamed YooName

YooName (http://www.yooname.com) is a semi-supervised Named Entity Recognition system using components of Balie.

YooName currently handles 100 Named Entity types: person, location, organization, facility, product, event, natural object, etc.

Visit YooName: http://www.yooname.com

Posted by David N. 2007-05-17

Important: Java 1.5 is now required!

- Since Balie v1.5, Java 1.5 is required (the code uses typed collections and enumerations).

- Balie v1.2 was the last version compatible with earlier versions of Java.
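
For readers unfamiliar with these Java 1.5 features, here is a minimal sketch of a typed collection and an enumeration. The names Java5FeaturesSketch and EntityType are invented for illustration and are not taken from the Balie code base.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example; none of these names come from Balie.
class Java5FeaturesSketch {

    // An enumeration (new in Java 1.5) replaces loose integer constants.
    enum EntityType { PERSON, LOCATION, ORGANIZATION }

    // A typed collection (generics, new in Java 1.5): the compiler
    // guarantees the list only holds EntityType values, so no casts are needed.
    static List<EntityType> sampleTypes() {
        List<EntityType> types = new ArrayList<EntityType>();
        types.add(EntityType.PERSON);
        types.add(EntityType.LOCATION);
        return types;
    }

    public static void main(String[] args) {
        // The enhanced for loop is also a Java 1.5 addition.
        for (EntityType type : sampleTypes()) {
            System.out.println(type);
        }
    }
}
```

Code like this does not compile on Java 1.4 or earlier, which is why the minimum version had to change.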

Posted by David N. 2007-01-18

Balie 1.5

Compliance with Java 1.5 and code improvements:

- Generalization of NER module (lower accuracy but greater scalability)

- Many token features are now available: word-level (casing, numeric, single character), morphology, functional features from the literature, etc.

- Code beautification, maintenance, etc.

Enjoy!

Posted by David N. 2006-09-26

Balie 1.2: Major version

Balie 1.2 is a major improvement.
A baseline named entity recognition was added.
The language identification module was upgraded.

Posted by David N. 2005-12-16

Balie 1.18

Balie 1.18 released today.

Lots of bug fixes and speed improvements, but nothing major.

CAUTION: the main "Tokenizer" call takes one additional parameter (a boolean that turns the part-of-speech tagger on or off). This change is not backward compatible.

Posted by David N. 2005-10-24

Balie 1.17

Minor bug fixes only.

Posted by David N. 2005-08-01

Balie 1.16 now ready

This release contains no new features but many bug fixes. As of today, Balie has successfully run through millions of documents, which demonstrates its speed and stability.

Posted by David N. 2005-06-28

Balie Welcomes a New Developer!

We are happy to welcome Divan_Roulant, an experienced Java developer. Divan_Roulant will work on morphology (e.g., inflectional suffixes) as well as maintenance and bug fixes on the core modules of Balie.

Posted by David N. 2005-05-23

Balie and CohenWrapper new releases

The main change in Balie is the new Iterator object for looping through the TokenList. This is a more intuitive way to handle the text.
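
The iterator idea can be sketched as follows. This is not Balie's actual API: SimpleTokenList and its methods are hypothetical stand-ins that only illustrate the iterator pattern over a list of tokens.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a token list; not the Balie TokenList class.
class SimpleTokenList implements Iterable<String> {
    private final List<String> tokens = new ArrayList<String>();

    void add(String token) {
        tokens.add(token);
    }

    // Exposing an Iterator hides the underlying storage from callers.
    public Iterator<String> iterator() {
        return tokens.iterator();
    }

    public static void main(String[] args) {
        SimpleTokenList list = new SimpleTokenList();
        list.add("Balie");
        list.add("tokenizes");
        list.add("text");

        // Explicit iterator usage instead of index arithmetic.
        Iterator<String> it = list.iterator();
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Callers walk the tokens in order without knowing how they are stored, which is presumably what makes this style feel more natural than index-based access.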

CohenWrapper was redesigned and generalized. It is now easier to use on other tasks. Two new features were added (compared with Cohen's published algorithm): the row and column numbers of the innermost table. An optional noise filter and an optional cost-sensitive classification module were also added.

Posted by David N. 2005-05-21

Balie Technology to be Presented at AI'2005

Balie's text processing capabilities have been used to develop a domain-independent acronym resolution module. It will be presented at the Eighteenth Canadian Conference on Artificial Intelligence (AI'2005).

http://www.iro.umontreal.ca/~ai05/

Posted by David N. 2005-05-07

Balie Javadoc now available!

The entire Javadoc has been written for Balie.
Writing it helped resolve some architecture issues and refactor the code. The Javadoc can be consulted online here: http://balie.sourceforge.net/doc/

Posted by David N. 2005-04-12

Balie evolves and a new package is created

Balie 1.11 is now available.
It corrects some minor bugs.
The main modification is the removal of all corpus and intermediate files.
Those files are now in an independent package (BalieCorpora 1.0).

Posted by David N. 2005-04-07

Balie v1.1 - major release

Balie now matures to a new and more powerful version. Architecture was revised following programming principles and design patterns. Code was inspected using jlint.

Here are the main changes:

- tested under Linux! (bash scripts added),
- corpus size was increased (all files are now UTF-8),
- language identification now returns probability estimates,
- better charset handling,
- many bug fixes (major: the language identification optimization was corrupting the model),
- code inspection (IDEA inspector and jlint),
- better punctuation classes,
- improved language-specific class architecture

Posted by David N. 2005-03-17

Balie technical report

This report presents Balie, a system for multilingual textual information extraction (IE). IE consists of finding and structuring data in free-written text. Examples of IE tasks include named entity recognition (NER), identification of proteins, abbreviation resolution, keyphrase extraction, identification of semantic roles and a large number of more specific tasks (e.g., finding e-mail addresses or URLs). Some tasks, like the last ones, can be tackled easily with regular expressions or simple rules. However, more complex tasks such as NER require an efficient and flexible architecture. Balie is driven by the need for such a flexible IE architecture. ...
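
As an illustration of the simple end of that spectrum, the sketch below finds e-mail addresses and URLs with java.util.regex. The class name and both patterns are assumptions made for this example: they are deliberately simplified, are not the rules Balie uses, and will miss or mis-bound some real-world addresses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Balie.
class SimplePatternExtractor {
    // Simplified, illustrative patterns (assumptions, not a full grammar).
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern URL =
        Pattern.compile("https?://[^\\s<>\"]+");

    // Collect every non-overlapping match of the pattern, in text order.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<String>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    static List<String> findEmails(String text) {
        return findAll(EMAIL, text);
    }

    static List<String> findUrls(String text) {
        return findAll(URL, text);
    }

    public static void main(String[] args) {
        String text = "Write to info@example.org or visit http://example.org/docs today";
        System.out.println(findEmails(text));
        System.out.println(findUrls(text));
    }
}
```

A few lines of pattern matching are enough here because e-mail addresses and URLs have a fairly regular surface form. Named entities do not, which is the motivation for a richer architecture.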

Posted by David N. 2005-02-04

Balie 1.05 & CohenWrapper 1.01

Balie 1.05: major improvements for language identification module.

CohenWrapper 1.01: now includes probability estimates to force at least 1 positive node (in news content extraction)

----

Balie is designed to support textual information extraction tasks.

Balie supports English, French, Spanish, German and now, Romanian!

Given a text, it will (1) identify the language, (2) tokenize the text, (3) detect sentence boundaries and (4) guess part-of-speech for each token. ...

Posted by David N. 2005-01-27

Balie v1.04 - now supports Romanian!

Balie is designed to support textual information extraction tasks.

Balie supports English, French, Spanish, German and now, Romanian!

Given a text, it will (1) identify the language, (2) tokenize the text, (3) detect sentence boundaries and (4) guess part-of-speech for each token.

Posted by David N. 2005-01-08

Balie v1.03 now available

New feature:
- Language identification now works for Romanian (by onutza)

Bug fixes for Balie v1.03:
- Tokenizer bug on tokens with trailing punctuation (by e2lance)
- Sentence boundary detection (SBD) bug (caused by the tokenizer bug) also fixed
- SBD: better handling of opening and closing quotes
- SBD now uses an unpruned J48 tree

Posted by David N. 2005-01-02

Balie v1.02 is here

Balie is designed to support textual information extraction tasks.

Given a text, it will (1) identify the language, (2) tokenize the text, (3) detect sentence boundaries and (4) guess part-of-speech for each token.

-------

This is a minor release with 1 change:

1- The tokenlist can now be expressed as an XML string. It is useful as an intermediate representation and for visualization using XSL.
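
To give a rough idea of what such a representation could look like, here is a minimal sketch that serializes a list of tokens to an XML string. The element names (tokenlist, token) and the class TokenListXmlSketch are assumptions made for this example; Balie's actual XML format may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical serializer; element names are assumptions, not Balie's format.
class TokenListXmlSketch {
    // Escape the five characters that are special in XML.
    // The ampersand must be replaced first so later entities are not re-escaped.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Wrap each token in a <token> element inside a <tokenlist> root.
    static String toXml(List<String> tokens) {
        StringBuilder xml = new StringBuilder("<tokenlist>");
        for (String token : tokens) {
            xml.append("<token>").append(escape(token)).append("</token>");
        }
        return xml.append("</tokenlist>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml(Arrays.asList("Balie", "R&D", "<ok>")));
    }
}
```

A well-formed string like this can be handed to any XSL processor, which is what makes it convenient for visualization.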

Posted by David N. 2004-12-21

Balie extension: introducing Cohen Wrapper 1.0!

Balie is designed to support textual information extraction tasks.

Given a text, it will (1) identify the language, (2) tokenize the text, (3) detect sentence boundaries and (4) guess part-of-speech for each token.

---

The Cohen Wrapper is an extension to Balie.
This is a web page wrapper that can be trained to recognize desired parts of a web page.

Two wrappers are already trained:
(1) a wrapper that recognizes news items in a newspaper homepage;
(2) a wrapper that recognizes relevant text in a news article.

Posted by David N. 2004-12-17

Balie - now in the SourceForge CVS

The latest version of Balie is now in the SourceForge CVS:

Host: cvs.sourceforge.net
Path: /cvsroot/balie

The check-in was performed from the Eclipse IDE. Therefore, you'll find Eclipse classpath and project files.

Enjoy!

Posted by David N. 2004-12-01

Balie v.1.01 is available!

Balie is designed to support textual information extraction tasks.

Given a text, it will (1) identify the language, (2) tokenize the text, (3) detect sentence boundaries and (4) guess part-of-speech for each token.

-------

This is a minor release with 2 changes:

1- The sentence boundary detection training was sped up.

2- A TermFrequency hashtable was added in the tokenlist. It maps each term to its frequency in the text.
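
The term-to-frequency mapping can be sketched as follows. This is an independent illustration, not Balie's implementation: the class TermFrequencySketch, the method countTerms and the choice to lowercase terms before counting are all assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of a term-frequency table; not Balie's code.
class TermFrequencySketch {
    // Map each term to the number of times it occurs in the token array.
    static Map<String, Integer> countTerms(String[] tokens) {
        Map<String, Integer> frequencies = new HashMap<String, Integer>();
        for (String token : tokens) {
            // Assumption: counting is case-insensitive.
            String term = token.toLowerCase(Locale.ROOT);
            Integer current = frequencies.get(term);
            frequencies.put(term, current == null ? 1 : current + 1);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "cat", "saw", "the", "dog"};
        System.out.println(countTerms(tokens));
    }
}
```

Keeping such a table alongside the tokens means later stages can look up a term's frequency in constant time instead of rescanning the text.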

Posted by David N. 2004-11-29

Sample IE task

Find a sample (really simple) program in the project documentation:

http://sourceforge.net/docman/?group_id=124581

Posted by David N. 2004-11-26