Segmentation rules are hard-coded in bitext2tmx ('; ', ' : ', '? ' and '! ', from what I have seen in the source code), and exceptions (e.g., abbreviations such as "Ms.") cannot be defined.
It would be nice if bitext2tmx let users edit the segmentation rules.
Logged In: YES
user_id=1111672
I agree.
User-configurable segmentation rules should be put in place.
An SRX implementation is one possibility.
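To make the idea concrete, here is a minimal sketch of SRX-style rules: each rule has a "before" and an "after" pattern and is either a break rule or an exception, and the first rule that matches at a position decides. This is not bitext2tmx code (bitext2tmx is written in Java); the names Rule, RULES and segment, and the rule set itself, are assumptions made up for illustration.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    is_break: bool  # True: split here; False: exception, do not split
    before: str     # regex that must match the text ending at the position
    after: str      # regex that must match the text starting at the position


# Hypothetical rule set. Exceptions are listed first so they win over
# the general break rule, as in SRX where the first matching rule decides.
RULES = [
    Rule(False, r"\b(?:Mr|Mrs|Ms|Dr)\.", r"\s"),
    Rule(True, r"[.?!;:]", r"\s+"),
]


def segment(text: str, rules: List[Rule] = RULES) -> List[str]:
    """Split text into segments using ordered break/exception rules."""
    segments = []
    start = 0
    for pos in range(1, len(text)):
        for rule in rules:
            before_ok = re.search(r"(?:%s)\Z" % rule.before, text[:pos])
            after_ok = re.match(rule.after, text[pos:])
            if before_ok and after_ok:
                if rule.is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule decides; skip the rest
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


print(segment("Ms. Smith arrived. Was she late? No; she was early."))
# ['Ms. Smith arrived.', 'Was she late?', 'No; ' is split too:]
```

With the rules above, the call returns four segments: "Ms. Smith arrived.", "Was she late?", "No;" and "she was early."; the "Ms." exception prevents a split after the abbreviation. A user-editable version would load the rule list from a file instead of hard-coding it.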
Logged In: YES
user_id=1111672
Originator: NO
Okay, already noted from another tracker issue. Segmentation rules will be implemented towards version 1.0.
Raymond
I will add that, for the segmentation rules, it would be very useful if it were possible to specify a user-defined binary sequence, or at least to specify a custom list of symbols that represent common binary/non-printable characters.
For example, I have a file that I generated by dumping a column out of an Excel file into a text file. The lines are thus separated by 'Carriage Return' 'Newline' (i.e., hex '0D0A'). This is the standard DOS line separator for text files, but bitext2tmx gets confused and can't distinguish the different lines of text correctly, even if I select 'Split by Line Feed'.
When I select the 'Split by Line Feed' option, blank lines are inserted at various positions in my Source and Target lists, so I can't create a valid TMX file because the Source and Target texts do not line up.
Adding segmentation rules is the way to specify how to handle special sequences of characters, in the form of regular expressions.
Adding binary sequences is moot when you have regular expressions. Line endings, carriage returns, line feeds, and so forth would all be taken care of by such rules, along with many other possibilities.
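As an illustration of that point, a single regular expression can treat CR+LF, a lone CR and a lone LF as the same line break, so no separate "binary sequence" setting is needed. The sketch below is only an example of the technique, not how bitext2tmx works; LINE_BREAK and split_lines are hypothetical names.

```python
import re

# CR+LF is listed first so a DOS line ending counts as one break, not two.
LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_lines(text, keep_blank=False):
    """Split text on any common line ending; drop blank lines by default."""
    lines = LINE_BREAK.split(text)
    if keep_blank:
        return lines
    return [line for line in lines if line.strip()]


print(split_lines("one\r\ntwo\nthree\rfour\r\n"))
# ['one', 'two', 'three', 'four']
```

Dropping blank lines is a design choice for this sketch; pass keep_blank=True if empty rows carry meaning in your texts.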
In regard to "split by line feed", that option does exactly what it says. It splits on the line feeds that are in your texts. If you have line feeds all over the place then you will get new rows. This is the way it is supposed to work.
Note: bitext2tmx is not supposed to automagically do all your work for you to create a TMX. It tries to align as best it can, but hardly anyone's text is perfect for this (garbage in equals garbage out). That is why you are able to edit after it "attempts" to align segments. In any case, you are expected to check the alignments by eye and make sure they are correct before exporting to TMX. If you can't create a valid TMX with B2T, it is because you have not done the work to produce one.