Which metric?

Andreas
2007-04-17
2013-04-25
  • Andreas
    2007-04-17

    Hi,

    I'm looking for a metric that is suitable for the following task:
    I have a database containing various sentences, which may have prefixes or postfixes.

    For example:

    "forcast: The weather is nice" 
    "The weather is nice - Andreas said"

    However, these pre/postfixed sentences have been generated from an original sentence. In the examples, the original is: "The weather is nice"

    The mission is:
    Given an (original) sentence, return the sentences from the database that are similar to it. The following constraint must hold:

    The results must have a length close to the original's.
    Thus the following sentence should not be returned: "The weather is nice, because the sun is shining". Or it should be returned with a low relevance value.

    thanks in advance,
    Andreas

     
    • Andreas
      2007-04-17

      In addition to the previous use case, I also need a solution to the following (consider this an independent task):

      This time the database contains both original and pre/postfixed sentences.

      And here is the mission:
      Given a sentence (even a prefixed or postfixed one), return similar sentences.

      Thus the following sentences should be considered similar, and any of them could be used to trigger the query:

      "forcast: The weather is nice"
      "The weather is nice - Andreas said"
      "The weather is nice"

      cheers,
      Andreas

       
      • ReverendSam
        2007-04-18

        This looks like a longest-similar-segment match that does not heavily penalise differing prefix and postfix segments, so SmithWaterman seems best.

        If you also do not want to penalise extra chunks inside the text, e.g. "The weather is looking nice" (i.e. less penalisation for "looking"), then SmithWatermanGotoh is better.

        hope this helps

        good luck
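
To make the local-alignment idea above concrete, here is a minimal Python sketch of a Smith-Waterman style score. It is not the SimMetrics implementation: the function names, the scoring values (`match`, `mismatch`, `gap`), and the normalisation by the shorter string's length are all assumptions chosen for illustration, and it uses a simple linear gap cost rather than the affine gaps of the Gotoh variant.

```python
def smith_waterman(a, b, match=1.0, mismatch=-2.0, gap=-0.5):
    """Best local alignment score between a and b (linear gap cost).

    The scoring values are illustrative assumptions, not library defaults.
    """
    prev = [0.0] * (len(b) + 1)
    best = 0.0
    for ch_a in a:
        curr = [0.0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            up = prev[j] + gap        # skip a character of a
            left = curr[j - 1] + gap  # skip a character of b
            # Clamping at 0 lets an alignment start anywhere, so
            # unmatched prefixes and postfixes cost nothing.
            cell = max(0.0, diag, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def local_similarity(a, b, match=1.0, mismatch=-2.0, gap=-0.5):
    """Normalise the local score to [0, 1] by the shorter string's length."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 1.0 if len(a) == len(b) else 0.0
    return smith_waterman(a, b, match, mismatch, gap) / (match * shorter)


original = "The weather is nice"
for candidate in ("forcast: The weather is nice",
                  "The weather is nice - Andreas said",
                  "The weather is looking nice"):
    print(candidate, local_similarity(original, candidate))
```

With these assumed parameters, the two pre/postfixed sentences score as a perfect local match against the original, while the sentence with "looking" inserted scores lower because the extra chunk has to be bridged or cut off.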

         
    • ReverendSam
      2007-04-18

      This sounds like a job for a combination of metrics. ChapmanMeanLength is made for similar purposes. It is embarrassingly simple, yet effective: it is simply a measure of length-matching similarity (1 = same length, 0 = widely different length). It should never be used on its own, only in combination with other metrics. Combined with something general like SmithWatermanGotoh, it should provide a good indicator.

      e.g.

      ChapmanMeanLengthResult * SmithWatermanGotohResult = final result

      (The above will possibly need tweaking/weighting for the exact data, but it should be a good starting point.)

      Hope this helps, good luck
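
The multiplication suggested above can be sketched in a few lines of Python. Everything here is a stand-in rather than the library's own code: `length_similarity` is a hypothetical min/max length ratio (not necessarily the formula ChapmanMeanLength uses), and `content_similarity` uses the longest common substring from Python's `difflib` in place of SmithWatermanGotoh.

```python
from difflib import SequenceMatcher


def length_similarity(a, b):
    """Hypothetical length measure: 1.0 for equal lengths, towards 0 as they diverge."""
    longer = max(len(a), len(b))
    return 1.0 if longer == 0 else min(len(a), len(b)) / longer


def content_similarity(a, b):
    """Stand-in content measure: longest common substring / shorter length."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 1.0 if len(a) == len(b) else 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size / shorter


def combined_score(a, b):
    """Length result * content result = final result, as in the post above."""
    return length_similarity(a, b) * content_similarity(a, b)


original = "The weather is nice"
database = [
    "forcast: The weather is nice",
    "The weather is nice - Andreas said",
    "The weather is nice, because the sun is shining",
]
# Rank database sentences by combined relevance, best first.
for sentence in sorted(database, key=lambda s: combined_score(original, s), reverse=True):
    print(round(combined_score(original, sentence), 3), sentence)
```

All three database sentences contain the original verbatim, so the content measure alone cannot separate them; the length factor is what pushes the much longer "because the sun is shining" sentence to the bottom of the ranking. As the post notes, the two factors may need weighting for real data.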