I'm curious what you guys have thought about this. It sounds like an interesting addition.
Considering it for a few minutes, I came up with...
Pros:
It might be a nice feature for teams of translators: instead of having to copy the TM files around, they could just point to the same DB for a project. I think some commercial CAT software has this sort of thing for teams/companies, and they probably charge body parts for it as an addition--one doesn't even bother to list prices.
It might be a nice way to organize client TMs (for a single translator), rather than having them in files copied around between folders.
Cons:
It would probably decrease performance compared to file access.
DB files are slightly harder to back up than regular files, mostly because they can get large. This might not be an issue in this case.
Cons: you lose the freedom to move around files in folders, to edit them with nearly any text editor (regexp, etc.), and to produce them with simple tools.
Didier
That's true. It could have some sort of "export/import" functionality to mitigate that somewhat.
IMHO, using a DB backend is not such a good idea.
A DB is not good for searching for fuzzy matches: the client would need to read almost all the records.
A DB is also not so good for synchronizing translated data within a single project.
If you really need a multiuser environment, it would be much better to implement an application server inside OmegaT. It could use a web service to transfer data. In that case, one OmegaT instance would be the server, and many other instances (on other computers) could be clients. This will be much simpler after the OmegaT core refactoring.
I think fuzzy searching could be done reasonably well using a DB. OmegaT could just ask the DB to return results that contain words in the string, and parse through those in the normal manner. Admittedly, I don't know how it works "in the normal manner", yet.
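As a rough illustration of that idea, here is a minimal Python sketch (not OmegaT code). It assumes a hypothetical two-table SQLite schema (`segment` and `word`) and uses `difflib.SequenceMatcher` as a stand-in scorer, since how OmegaT scores matches is not described here. The database only narrows the candidates to segments that share at least one word with the query; the scoring still happens on the client.

```python
import difflib
import re
import sqlite3


def tokenize(text):
    """Lowercase a segment and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_tm(segments):
    """Create an in-memory TM with a word index (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE segment (id INTEGER PRIMARY KEY, source TEXT, target TEXT)")
    db.execute("CREATE TABLE word (token TEXT, segment_id INTEGER)")
    db.execute("CREATE INDEX word_token ON word (token)")
    for source, target in segments:
        cur = db.execute(
            "INSERT INTO segment (source, target) VALUES (?, ?)", (source, target)
        )
        db.executemany(
            "INSERT INTO word (token, segment_id) VALUES (?, ?)",
            [(token, cur.lastrowid) for token in set(tokenize(source))],
        )
    return db


def fuzzy_matches(db, query, threshold=0.5):
    """Fetch segments sharing at least one word with the query, then score them locally."""
    tokens = sorted(set(tokenize(query)))
    if not tokens:
        return []
    placeholders = ",".join("?" * len(tokens))
    rows = db.execute(
        "SELECT DISTINCT s.source, s.target FROM segment s "
        "JOIN word w ON w.segment_id = s.id "
        f"WHERE w.token IN ({placeholders})",
        tokens,
    ).fetchall()
    query_tokens = tokenize(query)
    scored = [
        (difflib.SequenceMatcher(None, query_tokens, tokenize(src)).ratio(), src, tgt)
        for src, tgt in rows
    ]
    # Best score first; candidates below the threshold are dropped.
    return sorted((m for m in scored if m[0] >= threshold), reverse=True)
```

Note that very common words ("the", "of") make almost every segment a candidate, so a real implementation would probably need stop words or some ranking on the database side.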
Why do you say it's not good for synchronizing data in a project? You could even keep track of who submitted particular translations.
The application server is an interesting idea as well. How would it handle the organization and synchronization of client specific memories?
Wes
I'm realizing that what I suggested in my first comment on fuzzy matching is probably not the best way to do it, but I'm sure there is a way to implement the current approach on top of a database.
Alex did not write that it is not possible. I think the problem with fuzzy matching on a DB is performance. Even "fast" DBs are awfully slow compared to OmegaT. Check the trial versions of various DB-based tools. And those DBs are on your own disk; think about what it would be like if they were online.
> It might be a nice feature for teams of translators: instead of having to copy the TM files around, they could just point to the same DB for a project.
Users can already do that. OmegaT uses legacy TMs that are in a specific directory (as defined in the project settings), so if a group of translators all select the same /tm directory (on a network) in their project settings, they are effectively sharing a DB.
This repository of legacy TMs doesn't include the ongoing work in the current project(s), of course, but that can be taken care of easily: all work done so far in a project is output in the TMX file when the project is compiled (=target documents created), so all that is needed is a way for a new export of a TMX file to be detected, and the file then copied to the designated /tm directory. This has already been discussed in the past (I don't recall where: RFE, OmT or OmT-dev). It's also indirectly the subject of an existing RFE, since a feature I'd like to have is for the location to which the TMX files are exported by a project to be user-definable in the project settings in the same way that source files, legacy TMs, etc. already are.
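The "detect a new export and copy it" step could be as small as the following Python sketch. It is only an illustration under assumptions: the function name and both paths are hypothetical, and it would have to be run by something external (a scheduled task or a polling loop), since nothing here is an existing OmegaT feature.

```python
import shutil
from pathlib import Path


def publish_tmx(exported, shared_tm_dir):
    """Copy an exported TMX into the shared /tm directory when the shared copy is missing or older.

    Returns True when a copy was made. Both paths are placeholders for this sketch.
    """
    exported = Path(exported)
    target = Path(shared_tm_dir) / exported.name
    if target.exists() and target.stat().st_mtime >= exported.stat().st_mtime:
        return False  # the shared copy is already up to date
    target.parent.mkdir(parents=True, exist_ok=True)
    temporary = target.with_name(target.name + ".part")
    shutil.copy2(exported, temporary)  # copy2 preserves the modification time
    temporary.replace(target)          # rename last, so readers never see a half-written file
    return True
```

Copying to a temporary name and renaming afterwards is a design choice for the shared-directory case: other translators' OmegaT instances may be reading the /tm directory while the copy is in progress.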
> It might be a nice way to organize client TMs (for a single translator), rather than having them in files copied around between folders.
Quite apart from the speed argument (for or against), the current arrangement of TMs in the form of TMX files that are simply dropped into the desired directory (or the project settings pointed to the desired directory) is actually a very simple, user-friendly and easily understood way of managing TMs. There may - or may not - be a speed advantage in using a dedicated database, but the concept of files and directories is about as simple as it gets in computing. No need for users to get to grips with a new "database" application.
Marc
> Users can already do that. OmegaT uses legacy TMs that are in a specific directory (as defined in the project settings), so if a group of translators all select the same /tm directory (on a network) in their project settings, they are effectively sharing a DB.
That would probably work well enough for sharing. Has anyone tried it that way?
Wes
> Alex did not write it is not possible.
Right. It is possible to use a DB as TM storage. But the fuzzy matching algorithm needs to compare the current segment with ALL the segments in the TM, because we calculate a "distance" between segments.
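To make the cost concrete, here is a minimal Python sketch of that kind of full scan. It assumes a word-level Levenshtein distance and a simple percentage formula purely for illustration; it is not OmegaT's actual implementation. The point is that the distance depends on the query segment, so nothing can be precomputed and every entry has to be visited.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: lists of words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map the distance to a 0-100 score (an assumed formula, for illustration)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (longest - edit_distance(a, b)) / longest)


def best_matches(segment, memory, limit=3):
    """Score the segment against every TM entry and keep the best ones."""
    query = segment.lower().split()
    scored = [(similarity(query, source.lower().split()), source) for source in memory]
    return sorted(scored, reverse=True)[:limit]
```

With the TM held in memory this loop is cheap per entry; with a remote database, each of those entries would first have to travel over the network.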
Even if we did not use "fuzzy" matching and created indexes on words instead, the database would be huge and we would still need to read many rows. That would be a serious performance issue even with the database on a local network, and a database accessed over the Internet would be unusable. A client-server architecture would solve this issue.
There are also synchronization issues. For example, if one user translates a segment, the other users will only see this translation some time later, because there is no way to push information from one central database to all the clients. A client-server model allows this.
As Marc wrote, this architecture is only required when many users work on one project. If they just need to use a common TM, it is enough to share this TM as files.
The changes needed in OmegaT to support client-server are not so huge. I have a detailed vision of the implementation. But I don't see many users who need it, so there is no sense in spending time on it yet, IMHO.
Currently, at our company we are using a database as a backend for the TM storage. For translations, we use the translation memory of our own software package and a number of scripts to extract data from it in various formats:
1. Glossary in tbx (combination of metadata of the system and the translation of the involved concepts).
2. Existing translations
And we use Kettle (now Pentaho Data Integration) to import the TMX files into the database, in two ways:
1. Import the TMX itself.
2. Import the target files (since the source is not always segmented correctly).
It works fine, but requires quite some IT skills to get up and running.
I can imagine an addition, similar to something we have created for Excel.
We have a software add-on for Excel (Invantive Control) that allows you to download data from a data warehouse into Excel. You can change the data in Excel under control of the add-on (so you cannot change everything, only specific areas), and then synchronize the changes back through a web service or direct database call when you are online again. During synchronization back to the data warehouse, validations take place.
An example of such a validation is checking whether the same business object has been changed in the meantime by another user. Since you are working offline, you cannot lock it directly.
I can imagine that the same concept could be applied to OmegaT: synchronize with a table of a specific structure in any JDBC database or web service.
Tables:
1. Projects pjt
2. Translation memories tmx (with fk to pjt)
3. Glossaries gly
4. Glossary entries gey (with fk to gly).
5. Synchronization counter (just a monotonically increasing counter, incremented once before every synchronization rather than once per entry). Also needs to include the UTC time.
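The table list above could look roughly like this in SQLite, shown here through Python's `sqlite3` module. The table names `pjt`, `tmx`, `gly` and `gey` come from the list; every column name, the `sync` table name and the helper functions are assumptions made for this sketch.

```python
import sqlite3

SCHEMA = """
CREATE TABLE pjt (                       -- projects
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE tmx (                       -- translation memory entries
    id      INTEGER PRIMARY KEY,
    pjt_id  INTEGER NOT NULL REFERENCES pjt (id),
    source  TEXT NOT NULL,
    target  TEXT NOT NULL,
    sync_id INTEGER NOT NULL             -- synchronization in which the entry last changed
);
CREATE TABLE gly (                       -- glossaries
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE gey (                       -- glossary entries
    id          INTEGER PRIMARY KEY,
    gly_id      INTEGER NOT NULL REFERENCES gly (id),
    term        TEXT NOT NULL,
    translation TEXT NOT NULL
);
CREATE TABLE sync (                      -- one row per synchronization
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    utc_time TEXT NOT NULL
);
"""


def create_schema(db):
    """Create the tables and enforce the foreign keys (off by default in SQLite)."""
    db.execute("PRAGMA foreign_keys = ON")
    db.executescript(SCHEMA)


def next_sync_id(db):
    """Increment the counter once per synchronization and record the UTC time."""
    cur = db.execute(
        "INSERT INTO sync (utc_time) VALUES (strftime('%Y-%m-%dT%H:%M:%SZ', 'now'))"
    )
    db.commit()
    return cur.lastrowid
```

`AUTOINCREMENT` is used so that counter values are never reused, and SQLite's `'now'` is already in UTC, which covers the UTC requirement without relying on the translators' clocks.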
In addition, you might want to maintain some history/audit tables and security, but that's standard.
You can download translations and tbx without locking.
In the TMX header you can register the source and the highest transaction ID in the database at the time of download.
Then translate. With every change to an entry, OmegaT registers that the entry has changed since the last synchronization. We might want to use timestamps for this, but that requires all system clocks to be closely synchronized. In real life, freelance translators often have no clock synchronization through NTP or the like.
At end of the day, press synchronize. OmegaT tries to play all changes into the database. And then reloads a fresh copy, including changes of others.
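The download / translate / synchronize cycle described above can be sketched in Python as follows. Everything here is hypothetical: `TmServer` is an in-memory stand-in for the database or web service, `TmClient` stands in for OmegaT, and the conflict rule (reject a change if someone else changed the same entry after the client's last download) is one possible reading of the validation step.

```python
class TmServer:
    """In-memory stand-in for the central database (illustration only)."""

    def __init__(self):
        self.sync_counter = 0
        self.entries = {}  # source -> (target, sync id of the last change)

    def snapshot(self):
        """Download without locking: all translations plus the highest sync id."""
        translations = {src: tgt for src, (tgt, _) in self.entries.items()}
        return translations, self.sync_counter

    def synchronize(self, changes, base_sync_id):
        """Apply offline changes and return the sources that were rejected."""
        self.sync_counter += 1  # incremented once per synchronization, not per entry
        rejected = []
        for source, target in changes.items():
            current = self.entries.get(source)
            changed_by_others = current is not None and current[1] > base_sync_id
            if changed_by_others and current[0] != target:
                rejected.append(source)
            else:
                self.entries[source] = (target, self.sync_counter)
        return rejected


class TmClient:
    """Stand-in for an OmegaT instance working offline."""

    def __init__(self, server):
        self.server = server
        self.memory, self.base_sync_id = server.snapshot()
        self.changed = set()

    def translate(self, source, target):
        self.memory[source] = target
        self.changed.add(source)  # a change flag instead of a local timestamp

    def synchronize(self):
        """Push local changes, reload a fresh copy, and report conflicts as (mine, theirs)."""
        mine = {src: self.memory[src] for src in self.changed}
        rejected = self.server.synchronize(mine, self.base_sync_id)
        self.memory, self.base_sync_id = self.server.snapshot()
        self.changed = set()
        return {src: (mine[src], self.memory[src]) for src in rejected}
```

Because the comparison uses the server's counter rather than wall-clock time, it does not matter whether the translators' system clocks agree.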
Effectively it would help our company to reduce the latency between when a translator finishes his job and the results are published. But that might not be the case for others; we are still trying to force the translation process into a continuous process.
I thought I would revive this ticket... Even though it might go against OmegaT's cornerstone, I think using a database is an interesting idea. I see the limitations of using files, and I don't see why using a database has to be slower (all the CAT tools I know except Transit NXT use databases, not text files, and they are not particularly slower than OmegaT, which can be a nightmare when you add many large TMX files to the /tm folder). Besides, this ticket was created 14 years ago and last commented on almost a decade ago, so perhaps there has been some progress in the world of databases/querying/indexing... For example, someone mentions above that files are handier because one can move them around. Well, SQLite databases are just files, so there's no difference. Not to mention that having to move files around and put them in the right place can be a disadvantage too for many users (I often read comments about what is better or worse expressed as absolute truths; we often forget that different users have different needs and different approaches). Onward!
I will try to set up a TeamBase server for DGT-OmegaT at some point in the near future. If I manage to do that, I'll record the time it takes OmegaT to read the TMX and give me matches, and compare it against how long it takes DGT-OmegaT to query the database.
I don't know if this works the same way as the SQL DB in TeamBase, but the people who created the system used for the database also offer patches for OmegaT that allow an instance of OmegaT to be used as a server. I will probably have a look at this option first, as it is likely to be easier/more straightforward.
I'll take notes as I set up the server in case that's useful at some point down the line.
Last edit: Damien Rembert 2021-05-28