I'm looking for a way to match data in a flat file to a hierarchical structure, and I'm not sure whether S-Match can help with this. My ideal result would be two columns of matched numbers: the data id and the hierarchy code.
I would appreciate any feedback or suggestions.
The hierarchy looks like:
5000 Yogurt, not specified as to type of milk or flavor
5100 Yogurt, plain, not specified as to type of milk
5110 Yogurt, plain, whole milk
5120 Yogurt, plain, lowfat milk
5130 Yogurt, plain, nonfat milk
5200 Yogurt, vanilla, not specified as to type of milk
5210 Yogurt, vanilla, whole milk
5220 Yogurt, vanilla flavor, lowfat milk
5230 Yogurt, vanilla flavor, nonfat milk
5300 Yogurt, chocolate, not specified as to type of milk
5310 Yogurt, chocolate, whole milk
5320 Yogurt, chocolate, nonfat milk
5400 Yogurt, fruit variety, not specified as to type of milk
5410 Yogurt, fruit variety, whole milk
5420 Yogurt, fruit variety, lowfat milk
5430 Yogurt, fruit variety, nonfat milk
5500 Yogurt, other flavor, not specified as to type of milk
5510 Yogurt, other flavor, whole milk
5520 Yogurt, other flavor, lowfat milk
5530 Yogurt, other flavor, nonfat milk
and the data looks like:
315650 RASPBERRY YOGURT DRINK 0.825PT YOGURT LOW FAT RASPBERRY
38120 RASPBERRY YOGURT SMOOTHIE 0.775PT YOGURT LOW FAT ROCKING RASPBERRY
334232 BLUEBERRY YOGURT DRINK 0.825PT YOGURT LOW FAT BLUEBERRY
201275 BLUEBERRY YOGURT 1.5% MILKFAT 0.5PT YOGURT 1.5% MILK FAT BLUEBERRY
37888 RASPBERRY YOGURT 2% MILKFAT 2PT YOGURT 2% MILK FAT RASPBERRY
433258 BLUEBERRY CREAM YOGURT 0.375PT YOGURT LOW FAT BLUEBERRY CREAM
273316 PINEAPPLE YOGURT 0.375PT YOGURT NOT STATED PINEAPPLE
273317 MANGO YOGURT LOW FAT 0.375PT YOGURT LOW FAT MANGO
37928 BLUEBERRIES N CREAM YOGURT 99% FAT FREE 0.5PT YOGURT 99% FAT FREE BLUEBERRY AND CREAM
37930 KEY LIME PIE YOGURT LOW FAT 0.5PT YOGURT LOW FAT KEY LIME PIE
37955 VANILLA YOGURT 99% FAT FREE 0.375PT YOGURT 99% FAT FREE VANILLA
37963 CHOCOLATE CREME YOGURT LOW FAT 0.5PT YOGURT LOW FAT CHOCOLATE CREAM
37981 PLAIN YOGURT NONFAT 0.5PT YOGURT NONFAT PLAIN
37988 PINA COLADA YOGURT LOW FAT 0.5PT YOGURT LOW FAT PINA COLADA
414288 YOGURT LOW FAT 0.3313PT YOGURT LOW FAT CARAMELIZED ALMONDS
414292 BLUEBERRY BLISS YOGURT 0.3313PT YOGURT LOW FAT BLUEBERRY BLISS
38048 APPLE CINNAMON YOGURT 0.375PT YOGURT LOW FAT APPLE CINNAMON
38054 <blank> YOGURT 1% MILK FAT FRENCH VANILLA WITH RASPBERRY
There are 5 columns: an id, a description, a category (always Yogurt in this sample), the fat content, and the flavor.
John, as far as I understand, you want product matching. S-Match can do that, but in its default configuration you are unlikely to get industrially usable results (e.g. F1 > 90%). To do better than the defaults, you'll need to hack S-Match a bit in a few places.
One general product-matching hack I can imagine would modify three places: 1) add your domain-specific concepts to WordNet; 2) introduce some kind of facets (e.g. product, taste, packaging); and 3) update the parser and matcher to take them into account.
If your facets are already separated into columns, that makes it much easier. Namely:
1) boils down to a "select distinct" and a bit of script writing for extJWNL. Hopefully your data side covers most of the values from both sides.
2) is largely done on the data side thanks to the separate columns, but you might need to pull some facets, like packaging, out of the description. A couple of regexes might do the job. The hierarchy side might be a little more difficult, but it is feasible: the labels look like they follow a pattern.
3) Given separate columns on the data side (and a pattern on the hierarchy side), modifying the parser is not very difficult, and given the above, the matcher might only need to be taught how to match weights (packaging) properly.
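To make the facet idea concrete, here is a rough sketch in plain Python. It does not use S-Match or extJWNL at all; it only illustrates splitting both sides into (flavor, milk) facets and backing off to a less specific node when no exact node exists. The function names, the FRUITS word list, the 2% cut-off for "lowfat" and the treatment of "99% FAT FREE" as nonfat are all my assumptions, not anything taken from the data owner or from S-Match.

```python
import re

# Hierarchy labels copied from the question; they follow "Yogurt, <flavor>, <milk>".
HIERARCHY = {
    5000: "Yogurt, not specified as to type of milk or flavor",
    5100: "Yogurt, plain, not specified as to type of milk",
    5110: "Yogurt, plain, whole milk",
    5120: "Yogurt, plain, lowfat milk",
    5130: "Yogurt, plain, nonfat milk",
    5200: "Yogurt, vanilla, not specified as to type of milk",
    5210: "Yogurt, vanilla, whole milk",
    5220: "Yogurt, vanilla flavor, lowfat milk",
    5230: "Yogurt, vanilla flavor, nonfat milk",
    5300: "Yogurt, chocolate, not specified as to type of milk",
    5310: "Yogurt, chocolate, whole milk",
    5320: "Yogurt, chocolate, nonfat milk",
    5400: "Yogurt, fruit variety, not specified as to type of milk",
    5410: "Yogurt, fruit variety, whole milk",
    5420: "Yogurt, fruit variety, lowfat milk",
    5430: "Yogurt, fruit variety, nonfat milk",
    5500: "Yogurt, other flavor, not specified as to type of milk",
    5510: "Yogurt, other flavor, whole milk",
    5520: "Yogurt, other flavor, lowfat milk",
    5530: "Yogurt, other flavor, nonfat milk",
}

# Hypothetical word list; in practice build it with a "select distinct" on the flavor column.
FRUITS = {"RASPBERRY", "BLUEBERRY", "PINEAPPLE", "MANGO", "APPLE"}


def parse_node(label):
    """Split a hierarchy label into (flavor, milk); None means 'not specified'."""
    flavor = milk = None
    for part in (p.strip().lower() for p in label.split(",")[1:]):
        if part.startswith("not specified"):
            continue
        if part.endswith("milk"):
            milk = part.replace(" milk", "")
        elif part == "other flavor":
            flavor = "other"
        else:
            flavor = part.replace(" flavor", "")
    return flavor, milk


def milk_facet(fat):
    """Map the fat-content column to a hierarchy milk type (the mapping is an assumption)."""
    fat = fat.upper()
    if "NONFAT" in fat or "FAT FREE" in fat:
        return "nonfat"
    pct = re.search(r"(\d+(?:\.\d+)?)% MILK ?FAT", fat)
    if "LOW FAT" in fat or (pct and float(pct.group(1)) <= 2.0):
        return "lowfat"
    return "whole" if "WHOLE" in fat else None


def flavor_facet(flavor):
    """Map the flavor column to a hierarchy flavor group."""
    words = set(re.findall(r"[A-Z]+", flavor.upper()))
    if "PLAIN" in words:
        return "plain"
    if words & FRUITS:
        return "fruit variety"
    for name in ("VANILLA", "CHOCOLATE"):
        if name in words:
            return name.lower()
    return "other"


def package_pints(description):
    """Pull the package size (e.g. '0.825PT') out of the description, if present."""
    m = re.search(r"(\d+(?:\.\d+)?)PT\b", description.upper())
    return float(m.group(1)) if m else None


NODES = {parse_node(label): code for code, label in HIERARCHY.items()}


def match(flavor, fat):
    """Return the most specific hierarchy code, backing off when a node is missing."""
    f, m = flavor_facet(flavor), milk_facet(fat)
    for key in ((f, m), (f, None), (None, None)):
        if key in NODES:
            return NODES[key]


# (id, description, fat content, flavor) rows taken from the sample data.
ROWS = [
    (315650, "RASPBERRY YOGURT DRINK 0.825PT", "LOW FAT", "RASPBERRY"),
    (273316, "PINEAPPLE YOGURT 0.375PT", "NOT STATED", "PINEAPPLE"),
    (37963, "CHOCOLATE CREME YOGURT LOW FAT 0.5PT", "LOW FAT", "CHOCOLATE CREAM"),
    (37981, "PLAIN YOGURT NONFAT 0.5PT", "NONFAT", "PLAIN"),
]
for row_id, _desc, fat, flavor in ROWS:
    print(row_id, match(flavor, fat))  # the two columns asked for
```

Note that there is no "chocolate, lowfat" node in the hierarchy, so the chocolate row falls back to 5300. Whether a fallback like that is acceptable, and how mixed flavors such as "FRENCH VANILLA WITH RASPBERRY" should be grouped, is a decision for whoever owns the data.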