I'm really impressed that this effort is up and running.
A query about the POS tagger interface: is it possible to permit the tagger
to accept POS tags in the tokenized input, i.e. to have input in the form
of a set of tokens with an optional set of POS tags per token?
Thanks!
We're definitely open to suggestions for the interfaces. Could you give me an example of where this type of input would be useful for a POS tagger? Is it that you have some sort of prior knowledge about what the possible tags are?
Gann
Anonymous
-
2000-06-07
(I'm still in the process of looking through the system, so please excuse any oversights.)
Tagging is usually thought of as an initial process. However, if you're dealing with
messy data, you need to do a lot of preprocessing before you can even
get to tagging. This preprocessing may identify, for example, known chunks of text
which can be offered as single units, with or without associated features.
An example of this type of input would be:
the
vw beetle NN
is
groovy JJ
.
As you can see, the text is already tokenized and there are POS tags associated with
some of these tokens. I've noticed when looking at other POS taggers that this type of
feature is present in a few but absent from the majority. Of course,
if you have it, it simply removes (or replaces) a certain amount of work that the
tagger itself has to do.
One overhead of this approach is that it stops the string being the universal
data representation and introduces a more complex object. I believe, though, that a simple
representation would be a string plus an attribute feature list.
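As a rough illustration only, the "string plus attribute feature list" idea could be sketched like this. The names `Token`, `parse_line`, `parse_input` and the `KNOWN_TAGS` inventory are hypothetical and not part of the project; the sketch also assumes that tags appear as trailing fields on a line and can be recognised from a known tag inventory.

```python
from dataclasses import dataclass, field

# Hypothetical tag inventory (an assumption); a real tagger would supply its own.
KNOWN_TAGS = {"DT", "NN", "VBZ", "JJ"}

@dataclass
class Token:
    """A token string plus an open-ended attribute list."""
    text: str
    attrs: dict = field(default_factory=dict)

def parse_line(line):
    """Split one input line into token text and optional trailing POS tags."""
    parts = line.split()
    tags = []
    # Peel known tags off the right, always leaving at least one word as text.
    while len(parts) > 1 and parts[-1] in KNOWN_TAGS:
        tags.insert(0, parts.pop())
    token = Token(" ".join(parts))
    if tags:
        token.attrs["pos"] = tags
    return token

def parse_input(text):
    """Parse one-token-per-line input, skipping blank lines."""
    return [parse_line(line) for line in text.splitlines() if line.strip()]

tokens = parse_input("the\nvw beetle NN\nis\ngroovy JJ\n.")
for tok in tokens:
    print(repr(tok.text), tok.attrs)
```

Tokens without a pre-assigned tag simply carry an empty attribute dictionary, so plain strings remain the common case and the extra structure is only paid for where it is used.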
If you think this is a reasonable extension of the tagger interface, I'd be happy to draft
a more formal specification etc.
matt
Okay, I understand now.
So, you are suggesting that there might be an earlier module that identifies certain tokens and knows their part of speech. That seems reasonable, especially in restricted domains.
Fortunately, we already have a nice structured data representation that will make this pretty easy to do. All of our preprocessing components use XML as their data representation. Some of them, like POS tagging, also have a lower-level data representation that can be used if desired.
Basically, one would have to change our tagger so that when a POS tag is specified for a token in the XML, it considers only that tag for the token when searching for the highest-probability tag sequence. This should have the added bonus of speeding up the search. The pre-preprocessing you are suggesting would simply go earlier in the pipeline.
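To make the idea concrete, here is a minimal sketch of a search that honours pre-assigned tags. It is not the project's tagger: it assumes a toy bigram (HMM-style) model with made-up probabilities, and `constrained_viterbi`, `START`, `TRANS` and `EMIT` are hypothetical names. Where a token comes with allowed tags, only those are considered; elsewhere the full tag set is searched.

```python
import math

TAGSET = {"DT", "NN", "VBZ", "JJ"}

# Toy model with invented numbers (assumptions); unseen events get 0.01.
START = {"DT": 0.5}
TRANS = {("DT", "NN"): 0.5, ("NN", "VBZ"): 0.5, ("VBZ", "JJ"): 0.5}
EMIT = {("the", "DT"): 0.9, ("is", "VBZ"): 0.9}

def constrained_viterbi(tokens, allowed):
    """Return the best tag sequence; allowed[i] is a set of tags or None."""
    if not tokens:
        return []
    # Candidate tags per position: the pre-assigned ones, else the whole tag set.
    cands = [sorted(a) if a else sorted(TAGSET) for a in allowed]
    # best[i][tag] = (log probability of best path ending in tag, previous tag)
    best = [{t: (math.log(START.get(t, 0.01))
                 + math.log(EMIT.get((tokens[0], t), 0.01)), None)
             for t in cands[0]}]
    for i in range(1, len(tokens)):
        column = {}
        for t in cands[i]:
            score, prev = max(
                (best[i - 1][p][0] + math.log(TRANS.get((p, t), 0.01)), p)
                for p in cands[i - 1])
            column[t] = (score + math.log(EMIT.get((tokens[i], t), 0.01)), prev)
        best.append(column)
    # Follow back-pointers from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(tokens) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

words = ["the", "vw beetle", "is", "groovy"]
print(constrained_viterbi(words, [None, {"NN"}, None, {"JJ"}]))
```

A constrained position contributes one candidate instead of the whole tag set, so both loops over tags shrink at that position, which is where the speed-up would come from.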
Are you at all interested in implementing this? It would be a good chance to get into the system... although I'm afraid that that area isn't particularly well documented.
Gann