Tokenizing HTML/JSP files

Bill Rich
2004-08-28
2004-08-30
  • Bill Rich

    Bill Rich - 2004-08-28

    I need to parse HTML, JSP, or fragments of these file types to find the translatable strings. Once I find a translatable string I want to remove it from the file and leave a token in its place. The extracted strings would be translated, and then the file with the tokens would be used to recreate the original file with all the translatable strings replaced by their translated versions.

    Translatable strings can be found in open text or in attributes of tags. Sometimes a translatable string will contain tags within it. These tags must be maintained within the string in order to make the string meaningful to the translator.

    I also need to be able to get some context information from the Parser. Items like where the string is used in terms of tags within the file. I will know the file name ahead of time to complete the context information.

    The tokens used to replace the extracted strings could be in almost any form but would probably be best if they were in the form of an HTML tag. Hopefully, the tokenized file could then be parsed by the JerichoParser when trying to replace the tokens with the translated strings.

    Would the JerichoParser support something like this? Of course, the tokenizer code and the code to recognize translatable strings would need to be outside the JerichoParser.

     
    • Martin Jericho

      Martin Jericho - 2004-08-29

      Hi Bill,

      > I also need to be able to get some context information from the Parser. Items
      > like where the string is used in terms of tags within the file.

      When you say "in terms of tags within the file" I assume you mean the hierarchy of elements the string is nested within.  At present the Jericho HTML Parser does not build any sort of parse tree, although this functionality may be included in a future version.  It is quite straightforward, however, to build a parse tree yourself by iterating through the list of elements provided by the Source.findAllElements() method.  The only thing that is not straightforward is deciding how to deal with badly structured HTML.

      What other sorts of context information would you require?

      > The tokens used to replace the extracted strings could be in almost any form
      > but would probably be best if they were in the form of an HTML tag. Hopefully,
      > the tokenized file could then be parsed by the JerichoParser when trying to
      > replace the tokens with the translated strings.

      You couldn't use HTML tags, because the translatable strings can appear within attribute values of other HTML tags, which would make the document very tricky to parse again.

      The best form for the tokens would be XML processing instructions, e.g.:
      <?TranslatableString id="123"?>
      or for something more compact, something like:
      <?TS 123?>

      You could then use Source.findAllStartTags("?TS") to find all the tokens and replace them with the translated strings as usual using an OutputDocument object.  Note that I have just discovered an issue with the parser: it doesn't automatically treat a search beginning with '?' as a processing instruction, so it will try to parse the tag's contents as attributes.  This should not cause any real problems for you, but I will look into it.
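
      The replacement step described above can be sketched without the Jericho API at all, using a plain regular expression over the tokenised document (the <?TS n?> token form comes from this post; the class and method names below are purely illustrative):

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenReplacer {
    // Matches processing-instruction style tokens such as <?TS 123?>
    private static final Pattern TOKEN = Pattern.compile("<\\?TS\\s+(\\d+)\\?>");

    /** Replaces each <?TS n?> token with the translation registered under id n. */
    public static String detokenize(String tokenized, Map<Integer, String> translations) {
        Matcher m = TOKEN.matcher(tokenized);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int id = Integer.parseInt(m.group(1));
            // Keep the token untouched if no translation was supplied for this id
            String replacement = translations.getOrDefault(id, m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> t = Map.of(1, "Hallo", 2, "Welt");
        System.out.println(detokenize("<p title=\"<?TS 1?>\"><?TS 2?></p>", t));
        // prints: <p title="Hallo">Welt</p>
    }
}
```

      With the parser itself you would instead locate the tokens via Source.findAllStartTags and register the replacements on an OutputDocument, but the splice is the same idea.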

      Hope this helps,
      Martin

       
      • Martin Jericho

        Martin Jericho - 2004-08-30

        > If I understand correctly I can use the Source.findAllElements() to get all
        > the elements in the file then use various Element methods to get at the content
        > of the elements and find any translatable strings. I would then use IOutputSegment
        > or StringOutputSegment to form the tokenized element to put in the tokenized
        > file.

        Not quite right.  Jericho HTML Parser doesn't recognise translatable strings at all, so you have to use something else for that.  This would be performed directly on the source file, and would have to return the start and end character positions of each TS found.

        You then create an OutputDocument from the Source, add a StringOutputSegment containing a token and covering each TS segment, and call the OutputDocument.toString() method to get your finished tokenised document.
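
        The OutputDocument/StringOutputSegment splice amounts to replacing known (begin, end) character spans with tokens while copying everything else through untouched. A library-free sketch of that operation (the Span type and the sample spans here are illustrative, not Jericho types):

```java
import java.util.List;

public class Tokenizer {
    /** One translatable-string occurrence: [begin, end) character positions in the source. */
    public record Span(int begin, int end) {}

    /**
     * Replaces each span with a <?TS n?> token, numbering spans in document order.
     * Spans must be non-overlapping and sorted by begin position.
     */
    public static String tokenize(String source, List<Span> spans) {
        StringBuilder out = new StringBuilder();
        int pos = 0, id = 1;
        for (Span s : spans) {
            out.append(source, pos, s.begin());   // copy untouched source up to the span
            out.append("<?TS ").append(id++).append("?>");
            pos = s.end();                        // skip the original string
        }
        out.append(source, pos, source.length()); // copy the tail
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hello</p>";
        // span covering "Hello" (positions 3..8)
        System.out.println(tokenize(html, List.of(new Span(3, 8))));
        // prints: <p><?TS 1?></p>
    }
}
```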

        The Source.findAllElements() method is only used to build a parse tree so that you can determine the context information for each TS from its position in the source file.  Alternatively you could recursively call Source.findEnclosingElement(int pos) on each position instead of building a parse tree if the process doesn't have to be particularly efficient.
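
        Building the context path from a list of element spans can be sketched as follows: each element is reduced to a tag name plus the span it covers, and the path is every element whose span contains the TS position. The Elem type and the sample spans are made up for illustration; the real parser would supply these positions:

```java
import java.util.ArrayList;
import java.util.List;

public class ContextFinder {
    /** A simplified element: tag name plus the [begin, end) span it covers in the source. */
    public record Elem(String name, int begin, int end) {}

    /**
     * Returns the names of all elements whose span contains pos, outermost first.
     * This mimics calling findEnclosingElement() repeatedly, but over a prebuilt list.
     */
    public static List<String> contextPath(List<Elem> elements, int pos) {
        List<String> path = new ArrayList<>();
        for (Elem e : elements) {               // assumes elements are listed outermost-first
            if (e.begin() <= pos && pos < e.end()) {
                path.add(e.name());
            }
        }
        return path;
    }

    public static void main(String[] args) {
        List<Elem> elems = List.of(
            new Elem("html", 0, 100),
            new Elem("body", 6, 93),
            new Elem("td",   40, 60));
        System.out.println(contextPath(elems, 50)); // prints: [html, body, td]
        System.out.println(contextPath(elems, 95)); // prints: [html]
    }
}
```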

        > I agree that using an XML PI would be best for the string replacements. This
        > will match nicely with using an XML file format for extracted strings file.

        Note that the resulting tokenised document would not necessarily be a valid XML document, as PIs aren't allowed in attribute values either.  It would just make it easier to read for humans than using "normal" tags.

        > As for the context information I would need anything that could give a clue
        > as to where and how the string is used in the source file.
        >
        > For example: if the string is an attribute value then I would need the key name,
        > the enclosing tag name, the file name, and the fact that it is an attribute
        > value. If the string is open text I would need the enclosing tag name and the
        > file name and the fact that it was text enclosed in a tag.

        To determine whether it's an attribute value you would have to see whether the start tag of the element returned by findEnclosingElement() spans the TS, and if so, iterate through its attributes to see which one the TS resides in.
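
        That check can be sketched the same way: if the TS position falls inside the start tag's span, scan the attribute value spans to find which one contains it. The types, names, and span values below are illustrative, not the real Jericho API:

```java
public class AttributeLocator {
    /** A simplified attribute: its name plus the [begin, end) span of its value. */
    public record Attr(String name, int valueBegin, int valueEnd) {}

    /**
     * If pos lies within the start tag [tagBegin, tagEnd), returns the name of the
     * attribute whose value span contains it, or null if pos is outside the tag
     * or not inside any attribute value.
     */
    public static String attributeAt(int tagBegin, int tagEnd, Attr[] attrs, int pos) {
        if (pos < tagBegin || pos >= tagEnd) return null; // TS is in element content, not the tag
        for (Attr a : attrs) {
            if (a.valueBegin() <= pos && pos < a.valueEnd()) return a.name();
        }
        return null;
    }

    public static void main(String[] args) {
        // e.g. <img alt="Photo" title="Click">, with made-up value spans 10..15 and 23..28
        Attr[] attrs = { new Attr("alt", 10, 15), new Attr("title", 23, 28) };
        System.out.println(attributeAt(0, 31, attrs, 12)); // prints: alt
        System.out.println(attributeAt(0, 31, attrs, 40)); // prints: null
    }
}
```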

        > Do you know of any parsers for Java code or C resource files?

        Not personally, but I think JavaCC has publicly available grammars for Java and C code.

         
    • Bill Rich

      Bill Rich - 2004-08-29

      Hi Martin,
      Thanks for the quick response.

      If I understand correctly I can use the Source.findAllElements() to get all the elements in the file, then use various Element methods to get at the content of the elements and find any translatable strings. I would then use IOutputSegment or StringOutputSegment to form the tokenized element to put in the tokenized file.

      I agree that using an XML PI would be best for the string replacements. This will match nicely with using an XML file format for the extracted strings file. I am thinking about using XLIFF for this file; it has translatable units to identify the strings. It is possible that I may not need to write a tokenized file at all, since XLIFF has a section for something like a tokenized representation of the file.

      As for the context information I would need anything that could give a clue as to where and how the string is used in the source file.

      For example: if the string is an attribute value then I would need the key name, the enclosing tag name, the file name, and the fact that it is an attribute value. If the string is open text I would need the enclosing tag name and the file name and the fact that it was text enclosed in a tag.

      What I want context information for is to give the translator as much information as possible so the translation is appropriate to the use of the words. Often we find that a word in English is translated differently depending on how it is used. For example if the word is used as a column heading it may get one translation and if it is used as a menu item it may get a slightly different translation.

      It looks like the JerichoParser may be able to fill the parser role I have in mind right now for HTML and JSP files.

      Do you know of any parsers for Java code or C resource files?

      Thanks. Bill

       
    • Bill Rich

      Bill Rich - 2004-08-30

      OK, I understand that I need some code that will determine what is a TS and what is not. What I really need from the Jericho Parser is the location of each segment in the file which it will provide.

      I do need a fairly efficient process since there is at least one project that has close to 1000 HTML and JSP files in it. We may, at times, need to run this process once a week. I have an Ant file that controls the execution of the tools but I do need each tool to finish in a reasonable amount of time since an engineer is usually waiting for the output. I will probably stay with generating a tree and work from there.

      We are probably going to use JavaCC as another parser, but I was just hoping there might be something else.

      Thanks.  Bill

       
      • Martin Jericho

        Martin Jericho - 2004-08-30

        > OK, I understand that I need some code that will determine what is a TS and
        > what is not.

        I get the feeling we're on different wavelengths.  I have been assuming that you are talking about HTML files that use the GNU concept of a translatable string.  Although I haven't had any experience with it myself, I assumed you had already marked all translatable strings as _("this is a translatable string") as is the norm in C source files.  See:
        http://www2.iro.umontreal.ca/~gnutra/po/HTML/
        http://www.gnu.org/software/gettext/manual/html_mono/gettext.html

        Now after a bit more reading I don't think there is any standard mentioned for translatable strings in HTML files, and your statement above (and in previous posts) makes it sound like your strings aren't marked as translatable at all.

        If this is the case, I would recommend you do some research into the Translation Project above to see if there has been any work done already with HTML pages and some standard tools available.  Maybe ask around on some forums, as it is bound to be a very common requirement.  Maybe XSLT would be a better fit?  If there is no established standard for marking translatable strings in HTML, you might want to document what you are doing and make it available to others.

        If you indeed just have normal HTML documents with no marking of translatable strings, then your original suggestion is definitely more appropriate.  Use the Source.findAllElements() method to determine segments of text and the values of attributes like alt and title.  Without manual intervention this is likely to report many segments that aren't translatable, but your translator can simply mark them as such, and the original text can then be placed back in the document instead of a token.
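
        As a crude illustration of this candidate-harvesting approach, the sketch below pulls inter-tag text and alt/title attribute values out with regular expressions; a real implementation would of course use the parser's element model rather than regexes, and all names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CandidateExtractor {
    // Double-quoted alt/title attribute values, and text between a '>' and the next '<'
    private static final Pattern ATTR = Pattern.compile("\\b(?:alt|title)\\s*=\\s*\"([^\"]*)\"");
    private static final Pattern TEXT = Pattern.compile(">([^<>]+)<");

    /** Collects alt/title attribute values and inter-tag text as translation candidates. */
    public static List<String> candidates(String html) {
        List<String> out = new ArrayList<>();
        Matcher a = ATTR.matcher(html);
        while (a.find()) out.add(a.group(1));
        Matcher t = TEXT.matcher(html);
        while (t.find()) {
            String s = t.group(1).trim();
            if (!s.isEmpty()) out.add(s);   // skip whitespace-only segments
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(candidates("<img alt=\"Logo\"><p>Hello</p>"));
        // prints: [Logo, Hello]
    }
}
```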

        Sorry if there has been a major misunderstanding here.

         
    • Bill Rich

      Bill Rich - 2004-08-30

      sigh... I wish I could get the developers to help by marking text as translatable. It would be even better if they would just put the translatable text in a resource bundle. But, such is the life of an L10N engineer. We are left to our own inventions to make sure the product works in the target language. The only standards I can hope to impose on development are in the area of internationalization, or making sure there are no impediments to running in locales other than English.

      None of the source files I deal with have any reliable markings in them to indicate what is translatable and what is not. Even a comment claiming that a section of a file is not translatable is probably wrong. Most developers don't know or care what is translatable, only that it works in the source language.

      I have a ton of Java code that tokenizes files of different forms and extracts the strings so that we can decide what is translatable and either translate it or not. I have tried using Trados and SDLX, two very popular translation tools, but they are full of inconsistencies and lack the ability to differentiate between two strings based on their context.

      Thanks for the pointers to the other projects. I will check them out.

      Whether I can publicly document my work is a question that can only be answered by my clients. I would like to form a project to develop an L10N toolkit that can be used as a front end to many of the translation tools. I am proposing that to my client, so we shall have to wait and see what happens.

      Thanks.  Bill

       
