Menu

Mod to fuzzy matching to get all segments

2008-08-17
2013-04-24
  • Martin Wunderlich

    Hi there,

    I have slightly modified the fuzzy machting function tinytm_get_fuzzy_matches so that it will retrieve all segments for the respective language combination, if an empty string is pass in as the source text. The source text for the PostgreSQL function is below. I thought this could be useful, to have something like a primitive TM export function.

    Cheers,

    Martin

    -- Function: tinytm_get_fuzzy_matches(character varying, character varying, character varying, character varying, character varying)

    -- DROP FUNCTION tinytm_get_fuzzy_matches(character varying, character varying, character varying, character varying, character varying);

    CREATE OR REPLACE FUNCTION tinytm_get_fuzzy_matches(character varying, character varying, character varying, character varying, character varying)
      RETURNS SETOF tinytm_fuzzy_search_result AS
    $BODY$
    DECLARE
        p_source_lang        alias for $1;
        p_target_lang        alias for $2;
        p_source_text        alias for $3;
        p_tag_string        alias for $4;
        p_penalty_string    alias for $5;

        result            tinytm_fuzzy_search_result%ROWTYPE;
        row            RECORD;
    BEGIN
        IF p_source_text = '' THEN
            FOR row IN
                SELECT
                    tinytm_levenshtein(ts.source_text,ts.source_text) + 100 as levenshtein,
                    ts.source_text,
                    ts.target_text
                FROM        
                    tinytm_segments ts
                    LEFT OUTER JOIN tinytm_languages sl ON (ts.source_lang_id = sl.language_id)
                    LEFT OUTER JOIN tinytm_languages tl ON (ts.target_lang_id = tl.language_id)
                WHERE
                    source_lang_id in (select * from tinytm_lang_ids_from_lang(p_source_lang)) and
                    target_lang_id in (select * from tinytm_lang_ids_from_lang(p_target_lang))
            LOOP
                result.score := row.levenshtein;
                result.source_text = row.source_text;
                result.target_text = row.target_text;
                RETURN NEXT result;
            END LOOP;
           
        ELSE
            FOR row IN
                SELECT
                    100 * (source_len - levenshtein) / source_len as score,
                    ts.source_text,
                    ts.target_text,
                    source_len
                FROM
                    (select distinct
                        tinytm_levenshtein(trim(p_source_text), ts.source_text) * 1.0 as levenshtein,
                        length(p_source_text) * 1.0 as source_len,
                        sl.language as source_lang,
                        tl.language as target_lang,
                        source_text,
                        target_text
                    from
                        tinytm_segments ts
                        LEFT OUTER JOIN tinytm_languages sl ON (ts.source_lang_id = sl.language_id)
                        LEFT OUTER JOIN tinytm_languages tl ON (ts.target_lang_id = tl.language_id)
                    where
                        source_lang_id in (select * from tinytm_lang_ids_from_lang(p_source_lang)) and
                        target_lang_id in (select * from tinytm_lang_ids_from_lang(p_target_lang))
                    order by
                        levenshtein
                    LIMIT 20
                    ) ts
                where
                    source_len > 5
            LOOP
                IF row.score > 50.0 THEN
                    result.score := row.score::numeric(6,1);
                    result.source_text = row.source_text;
                    result.target_text = row.target_text;
                    RETURN NEXT result;
                END IF;
            END LOOP;
        END IF;
        RETURN;
    END;$BODY$
      LANGUAGE 'plpgsql' VOLATILE;
    ALTER FUNCTION tinytm_get_fuzzy_matches(character varying, character varying, character varying, character varying, character varying) OWNER TO "admin";

     
    • Frank Bergmann

      Frank Bergmann - 2009-01-20

      Hi Martin,

      I'd be concerned with performance if there are literally millions of segments in the DB. What do you think?

      Cheers!
      Frank

       
      • Martin Wunderlich

        That's true. There could be a parameter to limit the number of segments returned. But it would be useful to have this option. It's like a TM export. 

        Cheers,

        Martin

         
    • Frank Bergmann

      Frank Bergmann - 2009-01-22

      Hi Martin,

      I understand your point of an export. But maybe that would be possible with a direct access to the SQL tables?

      It really would be no problem to insert your changes into the code. However, my "female intuition" (a difficult to explain collection of experiences during my last 20 years in IT...) tells me that this option will create problems in the future... Imagine some implementation not catching the case of an empty string, and everything will hang then on a slow network connection or something similar...

      It would be cleaner to create an explicit API call for that purpose, actually.

      Cheers!
      Frank

       
      • Martin Wunderlich

        Hi Frank,

        I guess an explicit API call would be the best solution. You could just added to the stored procedures and then it can easily be replicated by the web services interface.

        Cheers,

        Martin

         

Log in to post a comment.