-
I'm wondering how encodings are handled.
In my case, I have HTML in memory as a byte[] (could be from an HTTP connection, from a file, or several other sources). I know the encoding or have a pretty good guess (I'm using http://icu-project.org to detect encodings if not known).
I see an HtmlCleaner constructor that takes an InputStream and a charset name. I can see no other constructor I...
2007-05-24 06:27:42 UTC in HtmlCleaner
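For the byte[] case described in the post above, a minimal sketch using only the JDK: it wraps the in-memory bytes in a ByteArrayInputStream, which could then be handed to any constructor that accepts an InputStream plus a charset name, such as the one the post mentions. The class name BytesToStream and the helper names asStream and decode are hypothetical, and no HtmlCleaner call is shown because the post does not give the exact signature.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BytesToStream {

    /** Wraps an in-memory byte[] as an InputStream without copying the bytes. */
    static InputStream asStream(byte[] html) {
        return new ByteArrayInputStream(html);
    }

    /** Decodes the bytes with the named charset (known or detected elsewhere). */
    static String decode(byte[] html, String charsetName) throws IOException {
        // Charset.forName throws an unchecked exception if the name is unknown.
        try (Reader reader = new InputStreamReader(asStream(html), Charset.forName(charsetName))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample bytes in ISO-8859-1; a real caller would already hold the byte[].
        byte[] html = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(html, "ISO-8859-1"));
    }
}
```

Decoding to a String first is only one option; passing asStream(html) together with the charset name leaves the decoding to the consumer.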
-
Caused by: java.lang.NullPointerException
at org.htmlcleaner.HtmlCleaner.makeTree(HtmlCleaner.java:502)
at org.htmlcleaner.HtmlTokenizer.addToken(HtmlTokenizer.java:89)
at org.htmlcleaner.HtmlTokenizer.tagStart(HtmlTokenizer.java:365)
at org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:333)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:361)
at...
2007-05-24 05:41:37 UTC in HtmlCleaner
-
Great!
What are the plans for releasing source code for the 2.x C++ enablement layer?
2006-11-16 21:41:40 UTC in UIMA Framework
-
Currently the C++ enablement layer uses ICU 2.8. My
annotators use ICU 3.0. The current version of ICU is
3.6 (Unicode 5).
I could convince the annotator team to upgrade to ICU
3.x (x > 0), but I can't convince them to go back to
2.8. So any version of ICU from 3.0 to 3.6 would be
fine.
2006-10-14 00:22:24 UTC in UIMA Framework
-
I would like to be able to create a CAS and call a C++
annotator outside a Java virtual machine, i.e. in a
stand-alone C++ executable. I'm told that xcasDriver
can do this. I don't even need an Analysis Engine; I
would be happy if I could run just a single C++
annotator all by itself.
2006-10-14 00:17:57 UTC in UIMA Framework
-
I would like to get the source code for the C++ enablement
layer. This would let me:
1. Support platforms other than Linux and Windows.
For example, I must support Solaris.
2. Link a different version of the ICU libraries that
is compatible with my C++ annotators.
2006-10-14 00:12:53 UTC in UIMA Framework
-
Class com.ibm.uima.cas.impl.Serialization is not
documented in the JavaDoc.
Since class CAS doesn't implement Serializable,
Serialization is the only way to (efficiently)
transport a CAS from one Java process to another (via RMI
or Jini).
Also, Serialization seems to be missing a method.
There is a "CASSerializer serializeCAS(CAS)" method
but no way to deserialize it. There is a...
2006-10-11 22:26:42 UTC in UIMA Framework
-
For example, in
docs\examples\src\com\ibm\uima\examples\RunAE.java,
CollectionReader.setCasInitializer() is called, which
is deprecated.
The examples should show the right way to do things.
In this case, it should show how to use a multi-SOFA
annotator instead of a CAS Initializer. For example,
it might use the XmlDetagger annotator.
I'm sure there are other places the examples...
2006-08-30 04:05:23 UTC in UIMA Framework
-
For example, in
docs\examples\src\com\ibm\uima\examples\RunAE.java,
CollectionReader.setCasInitializer() is called, which
is deprecated.
The examples should show the right way to do things.
In this case, it should show how to use a multi-SOFA
annotator instead of a CAS Initializer. For example,
it might use the XmlDetagger annotator.
I'm sure there are other places the examples...
2006-08-30 00:51:19 UTC in UIMA Framework
-
In the Component Descriptor Editor, if you are in the
Source tab and make a mistake (for example, an import
path is wrong and the file can't be read), then when
you go to another tab, the editor will pop up an Error
dialog. You click OK, and it pops up another
(identical) Error dialog. Click OK, and the process
repeats without end.
Your only choice is to kill the java.exe process.
2006-08-10 06:12:09 UTC in UIMA Framework