Referential Cell Processor
A fast, programmer-friendly, free CSV library for Java
Brought to you by:
jamesbassett,
kbg
It would be really handy to be able to use a referential validation step in a csv file. CSV fix has this on the validate there, but being able to do so from this java library instead would be really handy.
Would be the most useful if it could pull from both other fields in this file, as well as other files, and being able to define the mapping to both. Some of the files I work with have references to "Parent" fields in a record, meaning that I need to make sure the value specified there exists, in order to prevent ugly error messages from the consuming application.
Hi Adam,
Could you could give a quick example (with CSV) to show what you mean?
Have you tried using cell processors to do this referential validation?
James
Two examples from real life. First from the attached file here, is a data load of e-commerce data, so it has things that have a product and variant, or product and item relationships, such as buying a shirt online, it comes in different sizes, so you need information about both the overall shirt, as well as things specific to the size that you're buying. Attached, you'll see that there's a base "product" that has partnumber of Cords, and a few items with a parentPartNumber of Cords. Essentially, I'd like to validate that anything referenced in that parentPartNumber column already exists earlier in the file in the PartNumber Field. Second example in next post with extra csv.
Next example is a little more complicated, but is adding information to those same products/items we were working with in the last one. In this case, I'm adding attribute information, and need to make sure that PartNumbers referenced here exist in the other file, in the PartNumber field.
As for trying it with CellProcessors, haven't gotten there yet, just was doing a library evaluation yesterday, and this came closest to the validation types I'd need and still able to work through a Java program, so I can make the validation more dynamic on files like these. These files get interesting because not every column is required, nor are the above examples complete (there's other columns that can be added in to be handled).
Hi Adam, thanks for all the info. I've put together a working example of how you might achieve this with Super CSV and cell processors.
I've created 2 new cell processors:
RecordValue
(which simply records each value it encounters and adds it to a Collection)IsValueIn
(which checks that the value is in the Collection of allowed values). You can't use theIsIncludedIn
processor that comes with Super CSV as it doesn't allow modification of the collection.Using them together you can do referential validation. The cell processor array looks like this (where
partNumbers
is just aHashSet
):See the attached
CatalogValidation
class to see this in action. I've also demonstrated how the validation on the second file can be done. If you run it with the files you attached you'll see that it (correctly) fails validation with the following exception:I'm considering adding these processors (in some form) to Super CSV, as these kind of requirements keep coming up.
Let me know what you think!
Last edit: James Bassett 2013-01-06
Looks good to me, should be a nice way to guarantee referential integrity on these files.
One other question though, any easy way to processor definition based on row content in some form? I'm guessing I could just extend optional in some form, though it's not entirely clear how I'd delineate one row's iteration from another. Example from above would be ensure that every "Item" had a parentPartNumber defined, since a normal Optional(new IsValueIn) would allow free-standing item variants without products.
I think you're asking whether you can validate that Items have a parentPartNumber but skip that validation for Products. This is essentially cross-validation (validation involving more than 1 column).
The only way I can think of to achieve this with cell processors is to write a custom cell processor that inspects the value in another column, and decides what to do based on the value. Every cell processor has access to the
CsvContext
, which contains the raw (unprocessed) values.So you can write a processor such as the following. It behaves like
Optional
if the value in a given column (where column numbers start at 1) equals the value its expecting. Otherwise, it simply delegates to the next processor in the chain.I replaced the processor definition for the parentPartNumber column with
and it works for Products and Items but it fails validation on the Packages at the end of the file.
I'll leave that as an exercise for you to solve, but you can see that you can achieve cross-validation if you really have to.
Hope this helps.
Makes sense. The fun thing I'm working with here that makes that even more complicated, is that these files, when they're eventually parsed, are keyed off of the column headers, not the csv location, so doing a restriction in the cell processor around a specific column would be less than ideal on face value. Though I guess with passing in the column # on setup of processor makes that more workable... Just interesting to work with since I have to look up which cell processors to use by header also.
The RecordValue/IsValueIn CellProcessors are working great, btw. If you want it, I have some modifications to that and the SuppressException code to make error messages more friendly to non-technical audiences, likely generating these files in excel.
Thanks for all the quick responses.
Yeah feel free to attach any modifications - it'd be good to get some ideas.
I'm currently working on adding skip comment support - if you have any ideas there just let me know.
Bam! Another quick response ;)
Here's the tweaks. Minor adjustments to IsValueIn, NotNull, and IsIncludedIn to be a bit more verbose with their specific validation messages. Added a new exception type, thrown by SuppressWarnings, the "PrettySuperCsvCellProcessorException", which has a bit more of a non-technical friendly output format (no class names, accepts a fieldname concept). SuppressWarnings itself got a bit smarter. To begin with, it rewrites the CsvContext so it can accurately display error columns - those were broken in the version you posted to StackOverflow. I also made it so that it knows the fieldName it's working with, so it accepts that, and passes it on to the thrown exception. I invoked it a bit more dynamically, using a HashMap as my master store for the "other" validations I'm doing on fields, then do just a "theProcessors.add( new SuppressException( field, fieldMapping.get(field) ) );" to add each resulting CellProcessor wrapped in a new SuppressException as appropriate.
I've added the 2 processors (now called
Collector
andIsElementOf
) for the upcoming release.[r278]
Related
Commit: [r278]