TBXTools / Tickets / #6 Freeling API functionalities

Antoni Oliver - 2020-12-22

Hello:

Sorry for the delay in my answer. The Freeling API connects with Freeling to tag the text and puts the output in this special format.

You can use any tagger and adapt the output to have the same format.

Please note that the POS tags may differ from one tagger to another, so the POS patterns should be changed accordingly.

Please also remember that the project has moved to GitHub, so the latest versions will be available only there: https://github.com/aoliverg/TBXTools

Best regards

Antoni

Bahgat Ahmed - 2021-01-24

Thank you very much for your answer, Dr. Antoni.

I have another question, please. Here are the details:

I did what you said. Moreover, I have experimented with different taggers and lemmatizers. I tested them against your ready-made tagged sample corpus "corpus-control-JRC-tagged-eng.txt" and compared two ratios: true terms to fake terms, and fake terms to all extracted terms. To compute these ratios I used your terminology file "JRC-control-evaluation-terms2g3g-eng.txt" as the reference: every term in that file counts as a true term, and every term extracted by TBXTools that is not in that file counts as a fake term.

Using only your linguistic extraction, what I found after tagging the same sample corpus "corpus-control-JRC-seg-eng.txt" is that some of the tagger and lemmatizer combinations I tried obtained:
1 - A higher true-to-fake ratio than the one I obtained with your ready-made tagged corpus "corpus-control-JRC-tagged-eng.txt".
2 - However, as far as I remember, your tagged corpus obtained a lower fake-to-all-extracted ratio than some of those combinations.
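
The two ratios described above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `evaluate_terms` is hypothetical, and stripping and lowercasing the terms before comparing them is an assumption that may not match how the evaluation file was built.

```python
def evaluate_terms(extracted, reference):
    """Compare extracted term candidates against a reference term list.

    A candidate found in the reference counts as "true", any other
    candidate counts as "fake" (the definition used in this thread).
    """
    # Assumption: terms are compared after stripping and lowercasing.
    extracted = {t.strip().lower() for t in extracted if t.strip()}
    reference = {t.strip().lower() for t in reference if t.strip()}

    n_true = len(extracted & reference)
    n_fake = len(extracted - reference)

    return {
        "true": n_true,
        "fake": n_fake,
        # True-to-fake ratio; infinite when there are no fake terms.
        "true_to_fake": n_true / n_fake if n_fake else float("inf"),
        # Fake-to-all-extracted ratio; 0.0 when nothing was extracted.
        "fake_to_all": n_fake / len(extracted) if extracted else 0.0,
    }


if __name__ == "__main__":
    result = evaluate_terms(
        ["member state", "legal basis", "of the", "in order"],
        ["member state", "legal basis", "european union"],
    )
    print(result)
```

For a single list of candidates the two ratios are tied together: fake_to_all equals 1 / (1 + true_to_fake). A tagger that scores better on one ratio therefore also scores better on the other, so conflicting rankings may indicate that the two ratios were computed on different candidate lists.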

It therefore seems that some taggers and lemmatizers do a better job than others. So my question is: are the three taggers accepted in your code (I have attached the snippet below) the best ones for extracting terminology with the linguistic method of TBXTools? Or are there more promising tagger and lemmatizer combinations that would give better results with your tool? I don't know exactly how your tool handles taggers and lemmatizers internally; I mean, I don't know what criteria would determine which one is best.

def load_sl_tagged_corpus(self,corpusfile,format="TBXTools",encoding="utf-8"):
    '''Loads the source language tagged corpus. 3 formats are allowed:
    - TBXTools: The internal format used by TBXTools. One tagged segment per line.
        f1|l1|t1|p1 f2|l2|t2|p2 ... fn|ln|tn|pn
    - Freeling: One token per line and segments separated by blank lines
        f1 l1 t1 p1
        f2 l2 t2 p2
        ...
        fn ln tn pn
    - Conll: One of the output formats given by the Stanford Core NLP analyzer. One token per line and segments separated by blank lines
        id1 f1 l1 t1 ...
        id2 f2 l2 t2 ...
        ...
        idn fn ln tn ...
    '''
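
To make the relationship between the first two formats in this docstring concrete, here is a rough sketch that turns one-token-per-line input (blank line between segments) into one segment per line, with the fields of each token joined by "|". It is not the TBXTools loader itself; the function name `freeling_to_tbxtools` and the sample values are made up for illustration.

```python
def freeling_to_tbxtools(lines):
    """Convert token-per-line input into one tagged segment per line.

    Each non-blank input line holds the whitespace-separated fields of
    one token; a blank line closes the current segment.
    """
    segments = []
    current = []
    for line in lines:
        fields = line.split()
        if not fields:
            # Blank line: close the segment collected so far, if any.
            if current:
                segments.append(" ".join(current))
                current = []
            continue
        # Join the fields of one token with "|", whatever their number.
        current.append("|".join(fields))
    if current:
        # Input may end without a trailing blank line.
        segments.append(" ".join(current))
    return segments


if __name__ == "__main__":
    sample = [
        "The the DT 1",
        "cats cat NNS 1",
        "",
        "Dogs dog NNS 1",
    ]
    for segment in freeling_to_tbxtools(sample):
        print(segment)
```

The sketch joins however many fields each token line has, so it makes no assumption about what the fourth field means.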

I am sorry for the long question; I wanted to describe my problem clearly so that it is easier to help me.

Thank you for your tool and for your efforts.

I am looking forward to receiving your response.

Best Regards,
Bahgat Ahmed

Last edit: Bahgat Ahmed 2021-01-24
- Antoni Oliver - 2021-01-25
  
  Hello:
  
  You can use any tagger BUT:
  
  POS patterns may need to be changed if your tagger uses a different tagset.
  
  The format for a tagged corpus should be as described, that is, each
  token should be represented as word_form|lemma|tag and each of these tokens
  should be separated by spaces.
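
As an illustration of the format just described, the sketch below writes one segment as space-separated word_form|lemma|tag tokens. It assumes the tagger returns a list of (word, tag) pairs, which is the shape returned by nltk.pos_tag. The function name `to_tbxtools_line` and the `lemmatize` parameter are hypothetical, and the default `str.lower` is only a placeholder, not a real lemmatizer.

```python
def to_tbxtools_line(tagged_tokens, lemmatize=str.lower):
    """Format (word, tag) pairs as 'form|lemma|tag form|lemma|tag ...'.

    `lemmatize` is any callable mapping a word form to its lemma.
    The default only lowercases the form; plug in a real lemmatizer
    to get meaningful lemmas.
    """
    parts = []
    for form, tag in tagged_tokens:
        lemma = lemmatize(form)
        parts.append(f"{form}|{lemma}|{tag}")
    return " ".join(parts)


if __name__ == "__main__":
    # Hand-written pairs standing in for real tagger output.
    pairs = [("Cats", "NNS"), ("sleep", "VBP")]
    print(to_tbxtools_line(pairs))
```

Word forms that themselves contain "|" or a space would break this format, so they would need to be escaped or filtered before writing the corpus.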
  
  Remember that we have moved our repository to Github:
  https://github.com/aoliverg/TBXTools
  
  Best regards
  
  Antoni Oliver
  
  [tickets:#6] Freeling API functionalities
  https://sourceforge.net/p/tbxtools/tickets/6/
  
  Status: open
  Milestone: 2.0
  Labels: freeling linguistic freeling_api
  Created: Wed Dec 16, 2020 06:41 PM UTC by Bahgat Ahmed
  Last Updated: Tue Dec 22, 2020 07:16 PM UTC
  Owner: Antoni Oliver
  
  Dear Eng. Antoni,
  
  Does the Freeling API do anything apart from tagging the input
  corpus in a special format so that the linguistic_term_extraction method
  can extract terminology?

  I ran into a problem with the Freeling API, so I replaced it with the
  nltk.pos_tag function and added some code so that the final output
  format is
  word1|word1|tag1 word2|word2|tag2 ... and so on, with tokens separated by spaces.

  However, after doing this only 7 true terms were extracted from the
  text, against 1000+ fake terms. So I am wondering whether what I did
  was right or not.
  

Bahgat Ahmed - 2021-02-03

Thank you for your answer, Dr. Antoni. I am very sorry for my late follow-up question.

So do you mean that if the tagger uses a tagset different from those of the formats mentioned in your code (TBXTools, Freeling, or Conll), the POS patterns must be modified?

For example, if the tagger tags a personal pronoun as "PPER", while the corresponding Conll tag is "PRP", should I replace every "PPER" tag with "PRP" for TBXTools to work correctly?

So he|he|PPER must become he|he|PRP?

Thank you for mentioning that you have moved the repository to Github.

Thank you, Dr. Antoni.
I am looking forward to receiving your answer.

Best Regards,
Bahgat Ahmed

Last edit: Bahgat Ahmed 2021-02-03

- Antoni Oliver - 2021-02-03
  
  Hello:
  
  In the patterns you should use the same tags as your tagger. If your
  tagger uses PPER, you should use PPER in the POS patterns. Remember that you
  can use regular-expression wildcards to shorten or group patterns.
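
The general idea of wildcards in POS patterns can be illustrated with Python's re module. This is a generic sketch and not the pattern-file syntax of TBXTools, which should be checked in the project itself; the function name `find_candidates` and the space-separated pattern notation are assumptions made for this example. Each element of the pattern is a regular expression that must match the whole tag of one token, so NN.* covers NN, NNS, NNP and so on, and JJ|NN.* groups two alternatives in one slot.

```python
import re


def find_candidates(segment, pattern):
    """Return word sequences whose tags match a sequence of tag regexes.

    segment: 'form|lemma|tag form|lemma|tag ...' (tag is the third field)
    pattern: space-separated regular expressions, one per token
    """
    tokens = [tok.split("|") for tok in segment.split()]
    tag_regexes = [re.compile(p) for p in pattern.split()]
    width = len(tag_regexes)

    hits = []
    # Slide a window of `width` tokens over the segment.
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        # Every regex must match the complete tag of its token.
        if all(rx.fullmatch(tok[2]) for rx, tok in zip(tag_regexes, window)):
            hits.append(" ".join(tok[0] for tok in window))
    return hits


if __name__ == "__main__":
    seg = ("the|the|DT legal|legal|JJ basis|basis|NN "
           "of|of|IN member|member|NN states|state|NNS")
    # One pattern replaces four: JJ NN, JJ NNS, NN NN, NN NNS, ...
    print(find_candidates(seg, "JJ|NN.* NN.*"))
```

The patterns are written with whatever tags the tagger emits: with a tagger that outputs PPER for personal pronouns, the pattern would contain PPER, and the corpus would be left unchanged.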
  
  Best regards
  
  Antoni
  

Bahgat Ahmed - 2021-02-14

Thank you for your answer, Dr. Antoni.

But how can I use regular-expression wildcards to shorten or group patterns? Could you provide a simple example of how to use them, since they aren't mentioned in the documentation?

I am looking forward to receiving your response.

Best Regards,
Bahgat Ahmed
