Question regarding Marked-Record-Pairs creation

  • mmi

    mmi - 2011-08-09

    I was wondering if there is an easy / GUI guided way to generate Marked-Record-Pairs for training and testing. Unfortunately the Analyzer Source->Training->New->Database function isn’t working (java.lang.ExceptionInInitializerError, ChoiceMaker_Analyzer_20110730-0933, j2re-1_4_2_19-windows-i586-p). Flat-file and XML do work, but I can’t imagine that writing such a file by hand is the only possible solution. So is this functionality hidden in the part that isn’t working, or is there a separate tool somewhere?
    Maybe there is an easy / GUI guided way to generate Clue-files as well?
    Any hint would be really appreciated,


  • Rick Hall

    Rick Hall - 2011-08-09

    Mike -


    Do you have a record layout schema already? e.g. an XML file like "SimplePersonRecords.schema" ?

    Do you have a preferred database in mind? There are (as yet) undocumented tools in Analyzer for Oracle and SqlServer databases that will create pairs of marked records from database records.

    I need to get this stuff documented and up on SourceForge, so if you tell me what you need, I'll work on that first.


    Unfortunately, no. There is no GUI for generating clue files automatically. However, I have some hacked-together BASH scripts that will generate generic clue files from record-layout schemas. These scripts generate simple clues, such as checks for exact matches or naive differences between field values. I'd like to turn these into an easy-to-use GUI tool, but I haven't started this work yet.

    I'd be happy to share these BASH hacks, but only if they're going to be useful to you.
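
    To give a rough idea of what "exact match" and "naive difference" clues mean in practice, here is a minimal Python sketch. It is not ChoiceMaker's clue-file syntax and not the BASH scripts themselves; the field names, the clue names and both functions are hypothetical, and the list of fields stands in for whatever a record-layout schema would provide.

```python
# Hypothetical sketch: derive generic "clues" from a list of field names.
# This is NOT ChoiceMaker's clue-file syntax; every name here is illustrative.

def make_generic_clues(fields):
    """Return a dict mapping a clue name to a predicate over a record pair."""
    clues = {}
    for field in fields:
        # Bind `field` through a default argument so each lambda keeps its own field.
        clues[f"same_{field}"] = lambda q, m, f=field: (
            q.get(f) is not None and q.get(f) == m.get(f)
        )
        clues[f"diff_{field}"] = lambda q, m, f=field: (
            q.get(f) is not None
            and m.get(f) is not None
            and q.get(f) != m.get(f)
        )
    return clues


def fire(clues, q, m):
    """Return the sorted names of the clues that fire for the pair (q, m)."""
    return sorted(name for name, test in clues.items() if test(q, m))


if __name__ == "__main__":
    fields = ["name", "city", "zip"]  # assumed field names
    q = {"name": "Acme Ltd", "city": "Vienna", "zip": "1010"}
    m = {"name": "Acme Ltd", "city": "Wien", "zip": None}
    print(fire(make_generic_clues(fields), q, m))
```

    In this sketch a missing value fires neither clue for its field, so a blank zip code counts neither for nor against a match.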

    Can you describe your de-dupe project publicly? Or would you prefer an offline discussion?

    • Rick
  • Matt Adamson

    Matt Adamson - 2011-08-11

    I work with Mike and just wanted to explain what we are doing from a non-technical perspective. Essentially we have databases with lots of companies. We are using external feeds from other companies with additional data and content about those companies. The only thing that the feeds have in common with our data is the business name and address information, so we are trying to perform name and address matching so that we can link the unique identifier from our data to the unique ID from theirs.

    Hope this makes sense
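
    As a naive baseline for the linking task described above, the Python sketch below joins two record sets on a normalized name and address. It only illustrates the goal (our ID to their ID); it is not how ChoiceMaker matches records, and the record layout (`id`, `name`, `address`) and function names are assumptions.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def link_ids(ours, theirs):
    """Map each of our IDs to the feed IDs whose normalized name and
    address are exactly equal to ours."""
    index = {}
    for rec in theirs:
        key = (normalize(rec["name"]), normalize(rec["address"]))
        index.setdefault(key, []).append(rec["id"])

    links = {}
    for rec in ours:
        key = (normalize(rec["name"]), normalize(rec["address"]))
        if key in index:
            links[rec["id"]] = index[key]
    return links


if __name__ == "__main__":
    # Made-up example records.
    ours = [
        {"id": "A1", "name": "Acme, Ltd.", "address": "1 Main St."},
        {"id": "A2", "name": "Globex", "address": "5 High Rd"},
    ]
    theirs = [
        {"id": "X9", "name": "ACME LTD", "address": "1 Main St"},
        {"id": "X7", "name": "Globex Corp", "address": "5 High Road"},
    ]
    print(link_ids(ours, theirs))
```

    An exact join like this links only the first pair. Near-misses such as "Globex" versus "Globex Corp" need fuzzier comparison, which is why marked record pairs and clues come into the picture.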


  • mmi

    mmi - 2011-08-12

    (*) You haven't asked for it yet, but you're going to need some documentation from me about building ChoiceMaker Server.
    Getting the server to run and knowing how to work with it is priority 1 for me now. So anything you manage to write down is really appreciated.

    (*) Will a Java webstart app work as a pair markup tool for your clients?
    (*) Would you be willing to help test and debug Reviewer?
    We have to think about that. But it would be nice to have the stuff to be able to evaluate the amount of work.

    (*) I need to get some documentation together about configuring the SourceForge version of ChoiceMaker to work with MySql (and Oracle and SqlServer)
    Yeah, that would be nice too.

    Do you have plans to migrate the whole thing to Java 1.6 or 1.7?

  • Rick Hall

    Rick Hall - 2011-08-12

    Mike -

    I have some draft documentation on building CM Server that is client specific right now. I'll strip out the client specific stuff and start putting it up on the Wiki.

    After the CM Server documentation is up, I'll work on putting the CM Reviewer code onto SourceForge.

    Yes, I've written up some plans to migrate everything to Java 1.7. I've added a topic in the Open Discussion forum that describes one approach. Comments and feedback are welcome.

    • Rick
  • mmi

    mmi - 2011-08-16

    Perfect. I'm looking forward to reading the server documentation.

  • Rick Hall

    Rick Hall - 2011-08-17

    Mike -

    I'll deliver the server documentation in stages, because there's some code that needs to be added to the SourceForge repository in order to get the full server functionality.

    First, I'll write up how to build the server to do batch de-duplication of XML and CSV files. Next, I'll document how to configure the MySql wrapper so that it will work in Analyzer (and CM Server), and then I'll document how to build the server to do batch de-duplication of a SQL database. Finally, I'll document how to build the server to do online (i.e. interactive) record matching.

    Each of these steps will take a few days. My goal is to get the first bit of server documentation up by this weekend.

    Best regards.

    • Rick
  • mmi

    mmi - 2011-09-06

    Thanks for updating the wiki.

    I was able to follow all the steps described in the wiki and ended up with the “models.jar” and the “urm.ejb.jar”. Now I am stuck at the last point since the documentation stops there ;-) It would be great if you could finish the documentation sooner rather than later, since I am required to report on a possible matching solution based upon ChoiceMaker by the end of this sprint, which is scheduled for this Friday. It would be very nice to see the Server working at least once by then.

    Br, Mike

  • Rick Hall

    Rick Hall - 2011-09-07

    Mike -

    I'll work on this tomorrow morning (Wed, 9/7).

    • Rick
