We came up with the idea of writing a pair-generator for training and testing. A simple CSV reader, some string manipulations (e.g. swap letters, change case, clear fields or fill in random data, …) and as a result a pair-xml file. Since we would know what we manipulated we wouldn’t need a person to mark the pairs as match/hold/no-match. The script could do that.
Well, it feels like tricking the system by training a model with such artificial data. Would we end up with a badly trained model? What are your thoughts on this idea?
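For concreteness, here is a minimal sketch of the kind of generator described above. Everything in it is hypothetical: the class and method names, the labelling rule (one edit = match, one edit per field = hold, a different source record = no-match) and the XML element names are assumptions made for illustration, not ChoiceMaker's actual pair format. It needs Java 8 or later because of String.join.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: build labelled record pairs by corrupting clean CSV rows. */
public class PairGenerator {

    /** Two records plus the decision implied by how the second one was produced. */
    public static final class LabelledPair {
        public final String[] left;
        public final String[] right;
        public final String decision; // "match", "hold" or "no-match"

        public LabelledPair(String[] left, String[] right, String decision) {
            this.left = left;
            this.right = right;
            this.decision = decision;
        }
    }

    /** Swaps the characters at positions i and i + 1; returns s unchanged if out of range. */
    public static String swapAdjacent(String s, int i) {
        if (i < 0 || i + 1 >= s.length()) {
            return s;
        }
        char[] chars = s.toCharArray();
        char tmp = chars[i];
        chars[i] = chars[i + 1];
        chars[i + 1] = tmp;
        return new String(chars);
    }

    /** Applies `edits` random manipulations (swap letters, upper-case, clear field) to a copy. */
    public static String[] corrupt(String[] fields, int edits, Random rnd) {
        String[] out = fields.clone();
        for (int k = 0; k < edits; k++) {
            int f = rnd.nextInt(out.length);
            String value = out[f];
            switch (rnd.nextInt(3)) {
                case 0:
                    if (value.length() >= 2) {
                        out[f] = swapAdjacent(value, rnd.nextInt(value.length() - 1));
                    }
                    break;
                case 1:
                    out[f] = value.toUpperCase();
                    break;
                default:
                    out[f] = "";
                    break;
            }
        }
        return out;
    }

    /** Assumed labelling rule: 1 edit = match, one edit per field = hold, other record = no-match. */
    public static List<LabelledPair> generate(List<String[]> records, Random rnd) {
        List<LabelledPair> pairs = new ArrayList<LabelledPair>();
        for (int i = 0; i < records.size(); i++) {
            String[] r = records.get(i);
            pairs.add(new LabelledPair(r, corrupt(r, 1, rnd), "match"));
            pairs.add(new LabelledPair(r, corrupt(r, r.length, rnd), "hold"));
            if (records.size() > 1) {
                pairs.add(new LabelledPair(r, records.get((i + 1) % records.size()), "no-match"));
            }
        }
        return pairs;
    }

    /** Serialises one pair; the element names are invented, not ChoiceMaker's schema. */
    public static String toXml(LabelledPair p) {
        return "<pair decision=\"" + p.decision + "\">"
                + "<left>" + escape(String.join("|", p.left)) + "</left>"
                + "<right>" + escape(String.join("|", p.right)) + "</right>"
                + "</pair>";
    }

    /** Escapes the characters that would break XML element content. */
    public static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<String[]>();
        // Naive CSV parsing (no quoted fields); -1 keeps trailing empty fields.
        records.add("John,Smith,1970-01-01".split(",", -1));
        records.add("Mary,Jones,1982-05-17".split(",", -1));
        for (LabelledPair p : generate(records, new Random(42))) {
            System.out.println(toXml(p));
        }
    }
}
```

The labels here are only as meaningful as the rule that assigns them: the script knows how many edits it applied, but whether a heavily edited record should count as a hold rather than a match is a judgment the sketch simply hard-codes.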
There's been some prior work along the lines you've suggested:
(*) Mauricio Hernandez, UIS dbgen,
(*) Peter Christen and Agus Pudjijono, Febrl data set generator
(*) Chirag Viradiya and me, A Data Generator
The problem is that real duplicates are always stranger than ones that any of these tools can manufacture, and the strange cases are usually the ones that are most important in training an accurate model. Chirag and I have an idea about combining his data generator with ChoiceMaker's clues to search for (and manufacture) realistically difficult training data, but right now we don't have much more than an idea about how to proceed.
So the short answer is that you're unlikely to get an accurate model by training on artificial data, at least for right now.
Unfortunately, the "A Data Generator" software does not seem to be available. Do you have a beta release of it?
The "A Data Generator" project has not released any builds, but the stuff that's in the Subversion repository does build a couple of example applications, and the project is under active development. Check out the tutorial and the build instructions on the web site (both of which were released last week):
At first glance I found the tutorial a bit unclear… Before studying it in more depth, could you tell me whether that software can produce a marked record file with a generic schema?
It is possible to have a marked record PAIR file with a generic schema, although you'll need to write your own XML exporter. Also, the ADG framework currently supports only single-table schemas, but I need to extend it to multi-table schemas for my own use, so hopefully that will change soon.
I'm looking for interesting demos of the ADG framework. If you're willing to collaborate on an open-source demo of how the ADG framework applies to your project - something we could post to SourceForge as another example of how to develop a data generator - then I'd be interested in working with you on the development. My schedule is quite full right now, so you'd end up doing most of the work, but I could certainly cut down the learning curve for you.
My goal is not to build data generator software but to have a set of marked record PAIRs to train a machine learning system and then run my experiment.
Since this is the topic of my thesis, I may publish a paper and, of course, I will cite your software. This is the best I can do for your project.
Do you think you can help me? I have tried to build the ChoiceMaker libraries, but Maven outputs an error. Do you want me to post it here, or do you have a support channel for ADG?
I'd be happy to take a look at the Maven error that you are encountering. The ADG forums are probably more appropriate sites for this discussion.
I saw your support request on the ADG project, but it turns out that I'm not (yet) able to respond to it. I don't have your email address, so for now, I must respond in this forum.
The problem that you're encountering with the ADG build is that you appear to be using JDK 1.3 or 1.4. Version 1.4.2 is the correct version for ChoiceMaker 2.x, but the ADG uses ChoiceMaker 3.x. Version 3.x of ChoiceMaker needs a later version of the JDK, namely JDK 1.6 or higher. (JDK 1.6 is recommended; JDK 1.7 should work, but isn't well tested.)
Let me know if you see this response.
Thank you very much for the help…
Actually I think that I'm using jdk 1.6 (http://packages.debian.org/squeeze/default-jdk for amd64).
If you prefer to move this discussion to email, this is my address: piepoli <dot> antonio <at> gmail <dot> com
I'm still having problems with the Support Request forum for the ADG project, so I'll keep responding in this forum.
The reason I thought you might be using JDK 1.3 is that the error messages contained the detail "generics are not supported in -source 1.3 (use -source 5 or higher to enable generics)". Somehow, you need to compile with the "-source 5" setting. I've never had this problem on any of the JDKs that I've used on Redhat Linux, Mac and Windows, but I've never tried using Debian Linux.
I googled for similar problems and found the following article:
"Fixing 'use -source 5 or higher to enable generics' during Maven compilation"
I can't test this since I don't have Debian installed. Would you try modifying some of the pom.xml files to see if it works? If it does, I'll modify the remaining pom.xml files. The best pom.xml file to try is the one for the choicemaker-shared project (which is the first project that needs to be built).
Let me know what you find out.
No plugin tag for the standard POM…
Malformed POM /home/antonio/adg/choicemaker-shared/pom.xml: Unrecognised tag: 'plugin'
According to http://maven.apache.org/pom.html#What_is_the_POM it seems that the plugin tag is available only in the super POM.
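For what it's worth, Maven reports "Unrecognised tag: 'plugin'" when a <plugin> element sits somewhere the POM schema does not allow it, for example directly under <project> or <build>. It has to be nested inside <build><plugins>. Below is a sketch of where the compiler setting would go in choicemaker-shared/pom.xml. The 1.6 level is an assumption based on the JDK recommendation given earlier in this thread (1.5 is the minimum for generics), and the placeholder comment stands for whatever the file already contains; the actual contents of that pom.xml are not shown in this thread.

```xml
<project>
  <!-- ... existing groupId, artifactId, dependencies, etc. stay as they are ... -->
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <!-- Assumed language level; must be 1.5 or higher to enable generics -->
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
```

If the file already has a <build> or <plugins> section, the <plugin> block should be merged into it rather than adding a second one. The "-source 1.3" in the error message suggests the compiler plugin is passing an old default language level to javac, which could happen even with JDK 1.6 installed.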
I don't use Debian, but I tried starting up a Debian Squeeze x64 workstation at Amazon EC2. Unfortunately, I didn't have much success. Many of the tools are unfamiliar to me - I'm more familiar with Redhat Linux.
I suspect that a key difference is that you're using OpenJDK, whereas I'm using a Sun/Oracle JDK. I haven't been able to verify this, because of my unfamiliarity with Debian.
I'm a bit stuck right now, and I won't have any more time to work on this until next weekend.