
TMX inline formatting tags support

2009-01-15
2013-04-24
  • Peter Tužinský

    I am currently working on tag support for TinyTM segments. My prototype adds two fields, source_tagged (text) and target_tagged (text), which store the plain segment text with placeholders, for example ({1} text {2}), where {1} is the start of a tag and {2} its end. Two further fields, source_tmx_tags (text) and target_tmx_tags (text), contain the TMX fragments of these tags. Two more fields, source_tag_ids (varchar[]) and target_tag_ids (varchar[]), contain arrays of xid identifiers, used when header <tag> definitions of TMX are used for a placeholder. Later I will add another form of tag normalization, in which a placeholder stands not for the start or end of a tag but for the largest continuous run of formatting information.
    Example:
    The planned form gives {1}title{2}text{3}, where {1} = <a href="" title=", {2} = " > and {3} = </a>. The form currently in use, based on <bpt> and <ept>, gives {1}{2}title{3}{4}text{5}{6}, where {1} = start of the bpt containing <a href="" title=", {2} = start of the sub containing title, {3} = end of the sub, {4} = end of the bpt with ">, {5} = start of the ept containing </a>, and {6} = end of the ept. The first form uses fewer placeholders and can be substituted quickly; the second stores the structure of the TMX inline formatting tags and is more general. For searching, the first may be more useful, and it is close to what Idiom uses.
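A minimal sketch of how the coarse placeholder form from the example could be expanded back into markup, or stripped down to plain text. This is an illustration in Python, not the prototype's Java code; the function names `restore_tags` and `strip_placeholders` and the dictionary layout are assumptions made for the example.

```python
import re

# Matches numbered placeholders such as {1}, {2}, {12}.
PLACEHOLDER = re.compile(r"\{(\d+)\}")


def restore_tags(tagged_text, fragments):
    """Replace each {n} placeholder with its stored markup fragment.

    `fragments` maps the placeholder number to the markup it stands for
    (hypothetical layout, chosen for this sketch only).
    """
    return PLACEHOLDER.sub(lambda m: fragments[int(m.group(1))], tagged_text)


def strip_placeholders(tagged_text):
    """Drop all placeholders, leaving only the text between them."""
    return PLACEHOLDER.sub("", tagged_text)


# Coarse form taken from the example in the post.
coarse = "{1}title{2}text{3}"
coarse_fragments = {1: '<a href="" title="', 2: '" >', 3: "</a>"}

restored = restore_tags(coarse, coarse_fragments)
plain = strip_placeholders(coarse)
```

Here `restored` is the original anchor element with its title attribute and link text, which is the quick substitution the coarse form is meant to allow.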
    The search functions have not been changed yet, so this structure does nothing so far. I am currently planning the plSQL procedures needed to allow searching on tagged segments. I have a Java command-line TMX parser and importer for TinyTM that stores the formatting.
    Please comment on this approach. The code will be published soon, once it has been cleaned up properly and thoroughly commented.

     
    • Frank Bergmann

      Frank Bergmann - 2009-01-20

      Hi Peter,

      > My prototype solution proposes to add source_tagged (text),
      > target_tagged (text) fields

      Ok.... Could you please provide a complete (updated) DDL (data definition language - create table xxxx ...) for the TinyTM tables?

      Storing key-value lists such as your tag - content structure in a database is an important challenge and there is apparently no ideal solution. My main considerations would be:

      - Performance: This needs to be fast when processing millions of segments...

      - Compatibility: Maybe the code should also work with MySQL or Oracle?

      Storing placeholders in an array seems an elegant solution for PostgreSQL, but I'm not 100% sure about the performance of such an option. The main question is: Do we need to look into the placeholder array when searching for segments? If the answer is no, then everything is OK...

      In my old days working at Volkswagen I've seen an apparently very dirty way to code key-value pairs in the DB as a long "text" field: "key1=value1, key2=value2, ...". I was amazed by this "dirty" structure, but veteran DB administrators told me that this was the result of a long optimization project and that (with IBM DB2) this was by far the fastest solution...
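For illustration, the "key1=value1, key2=value2" text encoding can be sketched as follows. This is only a guess at the idea, not the DB2 implementation mentioned above; `encode_pairs` and `decode_pairs` are hypothetical names, and the sketch assumes that keys and values never contain "," or "=" (real tag fragments such as href="" would need escaping first).

```python
def encode_pairs(pairs):
    """Serialize a dict into one 'key1=value1, key2=value2' string."""
    for key, value in pairs.items():
        # Assumption of this sketch: separators may not appear in the data.
        if any(sep in key or sep in value for sep in (",", "=")):
            raise ValueError("keys and values must not contain ',' or '='")
    return ", ".join(f"{key}={value}" for key, value in pairs.items())


def decode_pairs(text):
    """Parse the string produced by encode_pairs back into a dict."""
    if not text:
        return {}
    return dict(item.split("=", 1) for item in text.split(", "))


encoded = encode_pairs({"key1": "value1", "key2": "value2"})
decoded = decode_pairs(encoded)
```

The whole list is read and written as a single text value, so the database never has to look inside it, which is consistent with the question of whether lookups need the placeholder contents at all.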

      > Codes will be published soon when cleaned properly
      > and thoroughly commented.

      Publish early and frequently - Don't worry too much about commenting and cleaning. If you declare your code "prototype" nothing happens if you publish dirty stuff. As long as it works :-)

      Cheers!
      Frank

       
    • Peter Tužinský

      Where is the best place to put prototypes? In CVS? In a new branch? Or somewhere else, with a hyperlink from the forums?

       
    • Peter Tužinský

      The tag information is not used in lookups. Currently I use arrays to store the xid attributes of tags defined in the header, and for tags defined in the segment I use the tiny:phstart and tiny:phend attributes to find which placeholders are used for which tag. Lookups will use only the numeric placeholders, which will all be treated equally, at least in the beginning. If we wanted lookups based on the semantics of the formatting tags, we would need a different structure. This is the obvious solution for not losing the formatting, but the lookup will be based on the number of placeholders. The source_tmx_tags field will be used for formatting reconstruction only.
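As a rough illustration of a lookup that treats all numeric placeholders equally, the sketch below replaces every {n} with one generic token and compares segments by placeholder count. It is written in Python for brevity and is not the planned database procedure; `lookup_key`, `placeholder_count` and `same_placeholder_count` are hypothetical names.

```python
import re

# Matches numbered placeholders such as {1}, {2}, {12}.
PLACEHOLDER = re.compile(r"\{\d+\}")


def lookup_key(tagged_text):
    """Replace every numbered placeholder with the same generic token."""
    return PLACEHOLDER.sub("{}", tagged_text)


def placeholder_count(tagged_text):
    """Count the numbered placeholders in a tagged segment."""
    return len(PLACEHOLDER.findall(tagged_text))


def same_placeholder_count(first, second):
    """True when two tagged segments carry the same number of placeholders."""
    return placeholder_count(first) == placeholder_count(second)


key = lookup_key("{1}title{2}text{3}")
count = placeholder_count("{1}{2}title{3}{4}text{5}{6}")
```

With this kind of key, the tag contents never take part in the match; they are only needed afterwards, when the formatting is reconstructed from source_tmx_tags.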

      Here is the DDL (strangely, I have the id columns not as varchar[] but as text in array format; I will correct that):

      -- Table: tinytm_segments

      -- DROP TABLE tinytm_segments;

      CREATE TABLE tinytm_segments
      (
        segment_id integer NOT NULL,
        segment_key character varying(100),
        parent_id integer,
        owner_id integer NOT NULL,
        creation_date timestamp with time zone NOT NULL,
        creation_ip character varying(50) NOT NULL,
        customer_id integer,
        segment_type_id integer NOT NULL,
        text_type character varying(50),
        document_key character varying(1000),
        subject_area_id integer,
        source_lang_id integer NOT NULL,
        target_lang_id integer NOT NULL,
        tags text,
        source_text text NOT NULL,
        target_text text NOT NULL,
        tagged_source_text text,
        tagged_target_text text,
        source_tmx_tags text,
        target_tmx_tags text,
        source_tag_ids text,
        target_tag_ids text,
        CONSTRAINT tinytm_segment_pk PRIMARY KEY (segment_id),
        CONSTRAINT tinytm_segment_parent_fk FOREIGN KEY (parent_id)
            REFERENCES tinytm_segments (segment_id) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE NO ACTION,
        CONSTRAINT tinytm_segment_type_fk FOREIGN KEY (segment_type_id)
            REFERENCES tinytm_segment_types (segment_type_id) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE NO ACTION,
        CONSTRAINT tinytm_segments_creation_user_fk FOREIGN KEY (owner_id)
            REFERENCES tinytm_users (user_id) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE NO ACTION,
        CONSTRAINT tinytm_source_lang_fk FOREIGN KEY (source_lang_id)
            REFERENCES tinytm_languages (language_id) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE NO ACTION,
        CONSTRAINT tinytm_subject_area_fk FOREIGN KEY (subject_area_id)
            REFERENCES tinytm_subject_areas (subject_area_id) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE NO ACTION,
        CONSTRAINT tinytm_target_lang_fk FOREIGN KEY (target_lang_id)
            REFERENCES tinytm_languages (language_id) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE NO ACTION
      )
      WITHOUT OIDS;
      ALTER TABLE tinytm_segments OWNER TO postgres;

      -- Index: id_idx

      -- DROP INDEX id_idx;

      CREATE UNIQUE INDEX id_idx
        ON tinytm_segments
        USING btree
        (segment_id);

      -- Index: source_hash

      -- DROP INDEX source_hash;

      CREATE INDEX source_hash
        ON tinytm_segments
        USING hash
        (source_text);

      -- Index: source_trgm_idx

      -- DROP INDEX source_trgm_idx;

      CREATE INDEX source_trgm_idx
        ON tinytm_segments
        USING gist
        (source_text gist_trgm_ops);

       
    • Frank Bergmann

      Frank Bergmann - 2009-01-22

      Hi Peter,

      > Where is the best place to place prototypes? Into CVS?

      CVS/SVN is a bit tricky, because you can never ever delete mistakes or old versions (that's the idea of a CVS). Maybe the easiest way is to publish the code in the Download area and then post here on the forum with a link to the download area announcing the new code.

      I assume you've got the permissions to publish code, is that right?

      Cheers!
      Frank

       
    • Peter Tužinský

      JAR file in the download section: http://downloads.sourceforge.net/tinytm/tmximport-alpha.jar?use_mirror=
      I have uploaded the first version of my import tool.

       
