#6 Freeling API functionalities

Milestone: 2.0
Status: open
Updated: 2021-02-14
Created: 2020-12-16
Private: No

Dear Eng. Antoni,

Does the Freeling API do anything besides tagging the input corpus and writing it in a special format so that the linguistic_term_extraction method can extract terms?

I ran into a problem with the Freeling API, so I replaced it with the nltk.pos_tag function and added some code so that the final output has the format
word1|word1|tag1 + space + word2|word2|tag2 + space + ... and so on

However, after doing this, only 7 true terms were extracted from the text, against 1000+ fake terms. So I am wondering whether what I did was right.
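
A minimal sketch of the conversion described in the question, assuming the tagger returns (form, tag) pairs shaped like the output of nltk.pos_tag. The function name to_tbxtools_line and the lowercasing fallback are hypothetical and not part of TBXTools; a real lemmatizer can be passed in instead of the fallback.

```python
from typing import Callable, Iterable, Tuple


def to_tbxtools_line(
    tagged: Iterable[Tuple[str, str]],
    lemmatize: Callable[[str, str], str] = lambda form, tag: form.lower(),
) -> str:
    """Join (form, tag) pairs into 'form|lemma|tag' tokens separated by spaces."""
    tokens = []
    for form, tag in tagged:
        # Assumption: forms contain neither '|' nor spaces; escape them otherwise.
        lemma = lemmatize(form, tag)
        tokens.append(f"{form}|{lemma}|{tag}")
    return " ".join(tokens)


# (form, tag) pairs shaped like the output of nltk.pos_tag (Penn Treebank tags).
pairs = [("Member", "NNP"), ("States", "NNPS"), ("shall", "MD"), ("comply", "VB")]
line = to_tbxtools_line(pairs)
print(line)
```

Writing the word form in the lemma slot, as in word1|word1|tag1, keeps the format valid but leaves inflected forms where a lemma is expected, which may affect the extracted candidates.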


Discussion

  • Antoni Oliver

    Antoni Oliver - 2020-12-22

    Hello:

    Sorry for the delay in my answer. The Freeling API connects to Freeling to tag the text and puts the output in this special format.

    You can use any tagger and adapt the output to have the same format.

    Please note that POS tags may differ from one tagger to another, so the POS patterns should be changed accordingly.
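
One way to see which tags a given tagger actually produces, before adapting the patterns, is to count them in the tagged corpus. This is only a sketch: tag_inventory is a hypothetical helper, not a TBXTools function, and it assumes one segment per line with the tag in the third |-separated field.

```python
from collections import Counter


def tag_inventory(lines):
    """Count the POS tags found in 'form|lemma|tag' tokens, one segment per line."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            fields = token.split("|")
            if len(fields) >= 3:  # skip tokens that do not follow the format
                counts[fields[2]] += 1
    return counts


sample = [
    "the|the|DT common|common|JJ position|position|NN",
    "he|he|PPER signs|sign|VBZ the|the|DT agreement|agreement|NN",
]
inventory = tag_inventory(sample)
for tag, count in inventory.most_common():
    print(tag, count)
```

Any tag that appears in the patterns but not in this inventory can never match, which is one thing to check when a new tagger yields few true terms.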

    Please also remember that the project has moved to GitHub, so the latest versions will be available only there: https://github.com/aoliverg/TBXTools

    Best regards

    Antoni

     
  • Bahgat Ahmed

    Bahgat Ahmed - 2021-01-24

    Thank you very much for your answer, Dr. Antoni.

    I have another question, please. Here are the details:

    I did what you said. I also experimented with different taggers and lemmatizers, and tested them against your ready-made tagged sample corpus "corpus-control-JRC-tagged-eng.txt". I compared two ratios: true terms to fake terms, and fake terms to all extracted terms. To compute them, I treated every term in your evaluation file "JRC-control-evaluation-terms2g3g-eng.txt" as a true term, and every term extracted by TBXTools that is not in that file as a fake term.

    Using only your linguistic extraction, after tagging your sample corpus "corpus-control-JRC-seg-eng.txt" myself, I found that some of the tagger and lemmatizer combinations I tried obtained:
    1 - A higher true-to-fake ratio than the one I obtained with your ready-made tagged corpus "corpus-control-JRC-tagged-eng.txt".
    2 - However, as far as I remember, your tagged corpus obtained a lower fake-to-all-extracted ratio than some of those combinations.
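
The two ratios described above can be computed as in the following sketch. The function extraction_ratios and the sample terms are hypothetical; the only assumption taken from the post is that a term counts as true when it appears in the reference file and as fake otherwise.

```python
def extraction_ratios(extracted, reference):
    """Compare extracted term candidates against a reference list of true terms."""
    extracted = set(extracted)
    reference = set(reference)
    true_terms = extracted & reference   # candidates found in the reference list
    fake_terms = extracted - reference   # candidates absent from the reference list
    true_to_fake = len(true_terms) / len(fake_terms) if fake_terms else float("inf")
    fake_to_all = len(fake_terms) / len(extracted) if extracted else 0.0
    return {
        "true": len(true_terms),
        "fake": len(fake_terms),
        "true_to_fake": true_to_fake,
        "fake_to_all": fake_to_all,
    }


reference = {"member state", "common position", "internal market"}
extracted = ["member state", "common position", "following day", "same time", "other hand"]
ratios = extraction_ratios(extracted, reference)
print(ratios)
```

Under this definition every extracted term is either true or fake, so fake_to_all equals 1 / (1 + true_to_fake): for a single run the two ratios always move in opposite directions.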

    So it seems that some taggers and lemmatizers do a better job than others. My question is: are the three accepted taggers mentioned in your code (I have attached the snippet below) the best ones for extracting terms with the linguistic method of TBXTools? Or could there be more promising tagger and lemmatizer combinations that would let me make better use of your tool? I don't know exactly how your tool handles taggers and lemmatizers internally; I mean, I don't know what criteria it uses to determine which is best.

    def load_sl_tagged_corpus(self,corpusfile,format="TBXTools",encoding="utf-8"):
            '''Loads the source language tagged corpus. 3 formats are allowed:
    
            - TBXTools: The internal format used by TBXTools. One tagged segment per line.
                    f1|l1|t1|p1 f2|l2|t2|p2 ... fn|ln|tn|pn
            - Freeling: One token per line and segments separated by blank lines
                    f1 l1 t1 p1
                    f2 l2 t2 p2
                    ...
                    fn ln tn pn
            - Conll: One of the output formats given by the Stanford Core NLP analyzer. One token per line and segments separated by blank lines
                    id1 f1 l1 t1 ...
                    id2 f2 l2 t2 ...
                    ...
                    idn fn ln tn ...
            '''
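
The relation between the Freeling and TBXTools layouts described in the docstring can be sketched as follows. This is an illustrative converter, not TBXTools code: the name freeling_to_tbxtools is hypothetical, and the fourth field p is assumed to be the probability that Freeling prints after the tag.

```python
def freeling_to_tbxtools(text):
    """Convert 'form lemma tag prob' lines (blank line between segments)
    into one 'form|lemma|tag|prob' segment per line."""
    segments = []
    current = []
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:  # a blank line closes the current segment
            if current:
                segments.append(" ".join(current))
                current = []
            continue
        current.append("|".join(fields[:4]))
    if current:  # the last segment may lack a trailing blank line
        segments.append(" ".join(current))
    return segments


freeling_output = """The the DT 1
Commission commission NN 1

Member member NN 0.9
States state NNS 1
"""
converted = freeling_to_tbxtools(freeling_output)
for segment in converted:
    print(segment)
```

According to the docstring, load_sl_tagged_corpus already accepts the Freeling layout through its format argument, so a converter like this would only be needed for a tagger whose output is close to, but not exactly, one of the three layouts.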
    

    I am sorry for the long question; I wanted to explain my problem clearly so that you can better help me.

    Thank you for your tool and for your efforts.

    I am looking forward to receiving your response.

    Best Regards,
    Bahgat Ahmed

     

    Last edit: Bahgat Ahmed 2021-01-24
    • Antoni Oliver

      Antoni Oliver - 2021-01-25

      Hello:

      You can use any tagger, BUT:

      • The POS patterns may need to be changed if your tagger uses a
        different tagset.
      • The format of the tagged corpus must be as described: each token is
        represented as word_form|lemma|tag, and tokens are separated by
        spaces.
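
A quick check of the second point can be written as below. It is a sketch under the assumption that each token needs at least three non-empty |-separated fields; check_tagged_line is a hypothetical helper, not part of TBXTools.

```python
def check_tagged_line(line, min_fields=3):
    """Return the tokens of a tagged segment that do not look like form|lemma|tag."""
    bad = []
    for token in line.split():
        fields = token.split("|")
        # A token is malformed if a field is missing or empty.
        if len(fields) < min_fields or any(f == "" for f in fields[:min_fields]):
            bad.append(token)
    return bad


good_line = "the|the|DT common|common|JJ position|position|NN"
bad_line = "the|DT common|common|JJ position||NN"

print(check_tagged_line(good_line))
print(check_tagged_line(bad_line))
```

Running such a check over the whole corpus before extraction helps separate formatting problems from tagset problems.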

      Remember that we have moved our repository to GitHub:
      https://github.com/aoliverg/TBXTools

      Best regards

      Antoni Oliver

      Antoni Oliver González
      Estudis d'Arts i Humanitats

      Director of the Master's in Translation and Technologies
      aoliverg@uoc.edu
      Status: open
      Milestone: 2.0
      Labels: freeling linguistic freeling_api
      Created: Wed Dec 16, 2020 06:41 PM UTC by Bahgat Ahmed
      Last Updated: Tue Dec 22, 2020 07:16 PM UTC
      Owner: Antoni Oliver


  • Bahgat Ahmed

    Bahgat Ahmed - 2021-02-03

    Thank you for your answer, Dr. Antoni. I am very sorry for my late follow-up question.

    So do you mean that if the tagger uses a tagset different from those of the formats mentioned in your code (TBXTools, Freeling, or Conll), the POS patterns must be modified?

    For example, if the tagger tags a personal pronoun as "PPER" while the corresponding Conll tag is "PRP", should I replace every "PPER" tag with "PRP" for TBXTools to work correctly?

    So must he|he|PPER become he|he|PRP?

    Thank you for mentioning that you have moved the repository to GitHub.

    Thank you, Dr. Antoni.
    I am looking forward to receiving your answer.

    Best Regards,
    Bahgat Ahmed

     

    Last edit: Bahgat Ahmed 2021-02-03
    • Antoni Oliver

      Antoni Oliver - 2021-02-03

      Hello:

      In the patterns you should use the same tags as your tagger. If your
      tagger uses PPER, you should use PPER in the POS patterns. Remember that
      you can use regular-expression wildcards to shorten or group patterns.
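
The pattern syntax that TBXTools itself expects is not shown in this thread, so the following is not TBXTools code. It is only a sketch, using Python's re module, of how regular-expression wildcards can shorten or group tag patterns; find_candidates and the sample segments are hypothetical.

```python
import re


def find_candidates(tagged_line, tag_patterns):
    """Return word sequences whose tags match tag_patterns position by position."""
    tokens = [tok.split("|") for tok in tagged_line.split()]
    compiled = [re.compile(p) for p in tag_patterns]
    n = len(compiled)
    found = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # Each tag (third field) must match its pattern completely.
        if all(rx.fullmatch(tok[2]) for rx, tok in zip(compiled, window)):
            found.append(" ".join(tok[0] for tok in window))
    return found


line = "the|the|DT internal|internal|JJ market|market|NN rules|rule|NNS apply|apply|VBP"

# "NN.*" groups NN, NNS, NNP and NNPS in one pattern instead of four.
noun_noun = find_candidates(line, ["NN.*", "NN.*"])
# "JJ|VBN" is an alternation: adjective or past participle before a noun.
adj_noun = find_candidates(line, ["JJ|VBN", "NN.*"])
# A tagger-specific tag such as PPER is used exactly as the tagger writes it.
pron_verb = find_candidates("he|he|PPER signs|sign|VBZ", ["PPER", "VB.*"])

print(noun_noun, adj_noun, pron_verb)
```

The same idea carries over to any tagset: the patterns are written with the tags the tagger emits, and wildcards collapse families of related tags into a single pattern.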

      Best regards

      Antoni


  • Bahgat Ahmed

    Bahgat Ahmed - 2021-02-14

    Thank you for your answer, Dr. Antoni.

    But how can I use regular-expression wildcards to shorten or group patterns? Could you provide a simple example of how to use them, since they aren't mentioned in the documentation?

    I am looking forward to receiving your response.

    Best Regards,
    Bahgat Ahmed

     
