Can S-Match help with this?

Help
2014-05-28
2014-05-29
  • John Colaizzi

    John Colaizzi - 2014-05-28

    I'm looking for a way to match data in a flat file to a hierarchical structure, and I'm not sure whether S-Match can help with this. My ideal result would be two columns of matched numbers: the hierarchy code and the id of the matching data row.

    Something like:

    5420 37888
    5230 37955
    ... ...

    I would appreciate any feedback or suggestions.

    The hierarchy looks like:

    5000 Yogurt, not specified as to type of milk or flavor
    5100 Yogurt, plain, not specified as to type of milk
    5110 Yogurt, plain, whole milk
    5120 Yogurt, plain, lowfat milk
    5130 Yogurt, plain, nonfat milk
    5200 Yogurt, vanilla, not specified as to type of milk
    5210 Yogurt, vanilla, whole milk
    5220 Yogurt, vanilla flavor, lowfat milk
    5230 Yogurt, vanilla flavor, nonfat milk
    5300 Yogurt, chocolate, not specified as to type of milk
    5310 Yogurt, chocolate, whole milk
    5320 Yogurt, chocolate, nonfat milk
    5400 Yogurt, fruit variety, not specified as to type of milk
    5410 Yogurt, fruit variety, whole milk
    5420 Yogurt, fruit variety, lowfat milk
    5430 Yogurt, fruit variety, nonfat milk
    5500 Yogurt, other flavor, not specified as to type of milk
    5510 Yogurt, other flavor, whole milk
    5520 Yogurt, other flavor, lowfat milk
    5530 Yogurt, other flavor, nonfat milk

    and the data looks like:

    315650 RASPBERRY YOGURT DRINK 0.825PT YOGURT LOW FAT RASPBERRY
    38120 RASPBERRY YOGURT SMOOTHIE 0.775PT YOGURT LOW FAT ROCKING RASPBERRY
    334232 BLUEBERRY YOGURT DRINK 0.825PT YOGURT LOW FAT BLUEBERRY
    201275 BLUEBERRY YOGURT 1.5% MILKFAT 0.5PT YOGURT 1.5% MILK FAT BLUEBERRY
    37888 RASPBERRY YOGURT 2% MILKFAT 2PT YOGURT 2% MILK FAT RASPBERRY
    433258 BLUEBERRY CREAM YOGURT 0.375PT YOGURT LOW FAT BLUEBERRY CREAM
    273316 PINEAPPLE YOGURT 0.375PT YOGURT NOT STATED PINEAPPLE
    273317 MANGO YOGURT LOW FAT 0.375PT YOGURT LOW FAT MANGO
    37928 BLUEBERRIES N CREAM YOGURT 99% FAT FREE 0.5PT YOGURT 99% FAT FREE BLUEBERRY AND CREAM
    37930 KEY LIME PIE YOGURT LOW FAT 0.5PT YOGURT LOW FAT KEY LIME PIE
    37955 VANILLA YOGURT 99% FAT FREE 0.375PT YOGURT 99% FAT FREE VANILLA
    37963 CHOCOLATE CREME YOGURT LOW FAT 0.5PT YOGURT LOW FAT CHOCOLATE CREAM
    37981 PLAIN YOGURT NONFAT 0.5PT YOGURT NONFAT PLAIN
    37988 PINA COLADA YOGURT LOW FAT 0.5PT YOGURT LOW FAT PINA COLADA
    414288 YOGURT LOW FAT 0.3313PT YOGURT LOW FAT CARAMELIZED ALMONDS
    414292 BLUEBERRY BLISS YOGURT 0.3313PT YOGURT LOW FAT BLUEBERRY BLISS
    38048 APPLE CINNAMON YOGURT 0.375PT YOGURT LOW FAT APPLE CINNAMON
    38054 <blank> YOGURT 1% MILK FAT FRENCH VANILLA WITH RASPBERRY

    There are 5 columns: an id, a description, a category (always Yogurt in this sample), the fat content, and the flavor.

    Thanks
    John

     
  • Aliaksandr Autayeu

    John, as far as I understand, you want product matching. You can do that with S-Match, but in the default configuration you are unlikely to get industrially usable results (e.g. F1 > 90%). To get better-than-default results, you'll need to hack S-Match a bit in a few places.

    One general product-matching hack I can imagine would modify three places: 1) add your specific concepts to WordNet; 2) introduce some kind of facets (e.g. product, taste, packaging); and 3) update the parser and matcher to take them into account. If you already have the facets separated into columns, that makes it much easier:

    1) boils down to a "select distinct" over the columns and a bit of script writing for extJWNL. Hopefully the values on your data side cover most values from both sides.

    2) is largely done on the data side thanks to the separate columns, but you might need to get some facets, like packaging, out of the description. A couple of regexes might do the job. The hierarchy side might be a little more difficult, but it is feasible: it looks like there is a pattern.

    3) given separate columns on the data side (and the pattern on the hierarchy side), modifying the parser is not very difficult, and given the above, the matcher might only need to be taught how to match weights (packaging) properly.
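
    To make the facet idea concrete, here is a minimal sketch in plain Python. It does not use S-Match, WordNet, or extJWNL; it only illustrates reducing both sides to (flavor, milk) facets and joining on them. It assumes the five data columns are already split, uses a subset of the hierarchy from the question, and the fruit list, the 2% milk-fat cutoff, and all function names are illustrative assumptions. Treating "FAT FREE" as nonfat follows the 5230/37955 pair in the question.

```python
import re

# Subset of the hierarchy from the question: "<code> <label>" per line.
HIERARCHY = """\
5000 Yogurt, not specified as to type of milk or flavor
5200 Yogurt, vanilla, not specified as to type of milk
5230 Yogurt, vanilla flavor, nonfat milk
5300 Yogurt, chocolate, not specified as to type of milk
5400 Yogurt, fruit variety, not specified as to type of milk
5420 Yogurt, fruit variety, lowfat milk
5430 Yogurt, fruit variety, nonfat milk
"""

# Assumption: a hand-written fruit list stands in for a real vocabulary.
FRUITS = {"RASPBERRY", "BLUEBERRY", "PINEAPPLE", "MANGO", "APPLE"}


def parse_label(label):
    """Turn a hierarchy label into a (flavor, milk) pair; None = unspecified."""
    flavor = milk = None
    for part in label.split(", ")[1:]:      # skip the leading "Yogurt"
        if part.startswith("not specified"):
            continue
        word = part.split()[0]              # "vanilla flavor" -> "vanilla"
        if part.endswith(" milk"):
            milk = word
        else:
            flavor = word
    return flavor, milk


def build_index(text):
    """Map each (flavor, milk) pair to its hierarchy code."""
    index = {}
    for line in text.splitlines():
        code, label = line.split(" ", 1)
        index[parse_label(label)] = code
    return index


def milk_facet(fat):
    """Normalize the fat-content column to whole/lowfat/nonfat or None."""
    if "NONFAT" in fat or "FAT FREE" in fat:
        return "nonfat"
    if "LOW FAT" in fat:
        return "lowfat"
    m = re.search(r"(\d+(?:\.\d+)?)% MILK FAT", fat)
    if m:
        # Assumption: up to 2% milk fat counts as lowfat, more as whole.
        return "lowfat" if float(m.group(1)) <= 2 else "whole"
    return None                             # e.g. "NOT STATED"


def flavor_facet(flavor):
    """Normalize the flavor column to the hierarchy's flavor groups."""
    tokens = set(flavor.split())
    for name in ("PLAIN", "VANILLA", "CHOCOLATE"):
        if name in tokens:
            return name.lower()
    return "fruit" if tokens & FRUITS else "other"


def extract_pints(description):
    """Pull a packaging size such as '0.825PT' out of the description."""
    m = re.search(r"(\d+(?:\.\d+)?)PT\b", description)
    return float(m.group(1)) if m else None


def match(row, index):
    """Return the most specific hierarchy code for one data row, or None."""
    _id, _description, _category, fat, flavor = row
    f, m = flavor_facet(flavor), milk_facet(fat)
    for key in ((f, m), (f, None), (None, None)):   # fall back to broader nodes
        if key in index:
            return index[key]
    return None


ROWS = [
    ("37888", "RASPBERRY YOGURT 2% MILKFAT 2PT", "YOGURT", "2% MILK FAT", "RASPBERRY"),
    ("37955", "VANILLA YOGURT 99% FAT FREE 0.375PT", "YOGURT", "99% FAT FREE", "VANILLA"),
    ("37963", "CHOCOLATE CREME YOGURT LOW FAT 0.5PT", "YOGURT", "LOW FAT", "CHOCOLATE CREAM"),
]

INDEX = build_index(HIERARCHY)
for row in ROWS:
    print(match(row, INDEX), row[0])        # two columns: hierarchy code, data id
```

    With these three rows the sketch reproduces the two pairs from the question (5420 37888 and 5230 37955) and falls back to 5300 for the lowfat chocolate row, because this hierarchy subset has no lowfat chocolate node. The packaging size from extract_pints is not used in the match, since the hierarchy shown carries no size information.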

     
