I was wondering if there is an easy / GUI guided way to generate Marked-Record-Pairs for training and testing. Unfortunately the Analyzer Source->Training->New->Database function isn’t working (java.lang.ExceptionInInitializerError, ChoiceMaker_Analyzer_20110730-0933, j2re-1_4_2_19-windows-i586-p). Flat-file and XML do work, but I can’t imagine that writing such a file by hand is the only possible solution. So is this functionality hidden in the part that isn’t working, or is there a separate tool somewhere?
Maybe there is an easy / GUI guided way to generate Clue-files as well?
Any hint would be really appreciated.
MARKED RECORD PAIRS
Do you have a record layout schema already? e.g. an XML file like "SimplePersonRecords.schema" http://links.rph.cx/piWYTp ?
Do you have a preferred database in mind? There are undocumented (as yet) tools in Analyzer for Oracle and SqlServer databases that will create pairs of marked records from database records.
I need to get this stuff documented and up on SourceForge, so if you tell me what you need, I'll work on that first.
EASY CLUE FILES
Unfortunately, no. There is no GUI for generating clue files automatically. However, I have some hacked together BASH scripts that will generate generic clue files from record-layout schemas. These scripts generate simple clues like checking for exact matches or naive differences between field values. I'd like to turn these into an easy-to-use GUI tool, but I haven't started this work yet.
I'd be happy to share these BASH hacks, but only if they're going to be useful to you.
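To give a rough idea of what "exact match" and "naive difference" checks between field values amount to, here is a minimal Python sketch. It is not the BASH scripts mentioned above and it does not emit ChoiceMaker clue-file syntax; the function name `generic_clues` and the `same_`/`diff_` naming are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional


def generic_clues(
    fields: List[str],
    q: Dict[str, Optional[str]],
    m: Dict[str, Optional[str]],
) -> Dict[str, bool]:
    """Evaluate two generic checks per field for a record pair (q, m).

    same_<field>: both values are present and equal.
    diff_<field>: both values are present and not equal.
    A missing or empty value fires neither check.
    """
    results: Dict[str, bool] = {}
    for field in fields:
        a, b = q.get(field), m.get(field)
        both_present = bool(a) and bool(b)
        results[f"same_{field}"] = both_present and a == b
        results[f"diff_{field}"] = both_present and a != b
    return results


if __name__ == "__main__":
    # Field names here are made up; in practice they would come from
    # the record-layout schema.
    q = {"name": "Acme Corp", "city": "Berlin"}
    m = {"name": "Acme Corp", "city": "Munich"}
    for clue, fired in generic_clues(["name", "city", "zip"], q, m).items():
        print(clue, fired)
```

Treating a missing value as "no evidence either way" is one possible design choice; a real clue set might handle nulls differently.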
Can you describe your de-dupe project publicly? Or would you prefer an offline discussion?
I work with Mike and just wanted to explain what we are doing from a non-technical perspective. Essentially we have databases with lots of companies. We are using external feeds from other companies with additional data and content about those companies. The only thing that the feeds have in common with our data is the business name and address information, so we are trying to perform name and address matching so that we can link the unique identifier from our data to the unique ID from theirs.
Hope this makes sense
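As a rough illustration of the linking task described above, here is a minimal Python sketch that pairs our IDs with feed IDs when the normalized name and address agree exactly. This is only a naive baseline, not how ChoiceMaker matches records; the record keys (`id`, `name`, `address`), the suffix list, and the function names are all assumptions made for the example.

```python
import re
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Assumed list of legal-form tokens to ignore in business names.
LEGAL_SUFFIXES: FrozenSet[str] = frozenset(
    {"inc", "ltd", "llc", "corp", "co", "gmbh", "ag"}
)


def normalize(text: str, drop: FrozenSet[str] = frozenset()) -> str:
    """Lowercase, replace punctuation with spaces, drop unwanted tokens."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(t for t in tokens if t not in drop)


def match_key(record: Dict[str, str]) -> Tuple[str, str]:
    """Build a comparison key from the business name and address."""
    return (
        normalize(record["name"], LEGAL_SUFFIXES),
        normalize(record["address"]),
    )


def link_ids(
    ours: Iterable[Dict[str, str]], feed: Iterable[Dict[str, str]]
) -> List[Tuple[str, str]]:
    """Return (our_id, feed_id) pairs whose normalized keys are equal."""
    index: Dict[Tuple[str, str], str] = {}
    for rec in ours:
        # Keep the first of our records seen for each key.
        index.setdefault(match_key(rec), rec["id"])
    links = []
    for rec in feed:
        key = match_key(rec)
        if key in index:
            links.append((index[key], rec["id"]))
    return links


if __name__ == "__main__":
    ours = [{"id": "A1", "name": "Acme, Inc.", "address": "12 Main St."}]
    feed = [
        {"id": "F9", "name": "ACME Inc", "address": "12 main st"},
        {"id": "F10", "name": "Other Ltd", "address": "1 Side Rd"},
    ]
    print(link_ids(ours, feed))
```

Exact-key matching like this misses typos, abbreviations, and reordered address parts, which is the gap a trained matching model is meant to cover.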
(*) You haven't asked for it yet, but you're going to need some documentation from me about building ChoiceMaker Server.
Getting the server to run and knowing how to work with it is priority 1 for me now. So everything you manage to write down is really appreciated.
(*) Will a Java webstart app work as a pair markup tool for your clients?
(*) Would you be willing to help test and debug Reviewer?
We have to think about that. But it would be nice to have the code so that we can estimate the amount of work.
(*) I need to get some documentation together about configuring the SourceForge version of ChoiceMaker to work with MySql (and Oracle and SqlServer)
Yeah, that would be nice too.
Do you have plans to migrate everything to Java 1.6 or 1.7?
I have some draft documentation on building CM Server that is client specific right now. I'll strip out the client specific stuff and start putting it up on the Wiki.
After the CM Server documentation is up, I'll work on putting the CM Reviewer code onto SourceForge.
Yes, I've written up some plans to migrate everything to Java 1.7. I've added a topic in the Open Discussion forum that describes one approach (http://links.rph.cx/qfIHsv). Comments and feedback are welcome.
Perfect. I'm looking forward to reading the server documentation.
I'll deliver the server documentation in stages, because there's some code that needs to be added to the SourceForge repository in order to get the full server functionality.
First, I'll write up how to build the server to do batch de-duplication of XML and CSV files. Next, I'll document how to configure the MySql wrapper so that it will work in Analyzer (and CM Server), and then I'll document how to build the server to do batch de-duplication of a SQL database. Finally, I'll document how to build the server to do online (i.e. interactive) record matching.
Each of these steps will take a few days. My goal is to get the first bit of server documentation up by this weekend.
Thanks for updating the wiki.
I was able to follow all the steps described in the wiki and ended up with “models.jar” and “urm.ejb.jar”. Now I am stuck at the last point since the documentation stops there ;-) It would be great if you could finish the documentation sooner rather than later, since I am required to report on a possible matching solution based on ChoiceMaker by the end of this sprint, which is scheduled for this Friday. It would be very nice to see the Server working at least once by then.
I'll work on this tomorrow morning (Wed, 9/7).
Sorry this is taking longer than I expected.
I have a first draft of how to build the server to do batch de-duplication of XML and CSV files, along with some directions on configuring JBoss to run ChoiceMaker.
There have been a number of changes to existing projects in CVS, plus a number of new ones have been added. You'll need to get the latest version of all the projects under the 2.5.x and model_projects CVS directories.
Let me know if you have questions or comments, or if the directions aren't clear.