Babeldoc: Universal Document Processor / Feature Requests / #31 Add pattern matching to FlatToXml for segmented lines.

#31 Add pattern matching to FlatToXml for segmented lines.

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2003-12-24

Created: 2003-12-24

Creator: Mitch Christensen

Private: No

This is an enhancement, more than a bug.

When processing character-delimited (CSV) data using
segmented lines, having to use a fixed-column pattern
matching scheme is a problem. With CSV data, fields are
variable length. As such, if you are matching a pattern
beyond the beginning of a given line, the position of your
target string fluctuates from line to line.

For example, if you are looking for the string "sample" in
the fifth field, and any of the first four fields are
variable length, how will you know which character column
the string "sample" begins in?
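
To make the problem concrete, here is a small sketch that computes where the fifth field starts in two comma-delimited rows. The class name, the helper method and the sample rows are invented for illustration and are not part of Babeldoc.

```java
// Illustration only: class, method and sample rows are hypothetical.
class ColumnOffsetDemo {

    // Returns the zero-based character offset at which the given
    // zero-based field starts in a comma-delimited line, or -1 if the
    // line has too few fields. Quoted fields are not handled.
    static int fieldOffset(String line, int fieldIndex) {
        int offset = 0;
        for (int i = 0; i < fieldIndex; i++) {
            int comma = line.indexOf(',', offset);
            if (comma < 0) {
                return -1;
            }
            offset = comma + 1;
        }
        return offset;
    }

    public static void main(String[] args) {
        String shortRow = "1,a,b,c,sample,x";
        String longRow = "1001,alpha,beta,gamma,sample,x";
        System.out.println(fieldOffset(shortRow, 4)); // prints 8
        System.out.println(fieldOffset(longRow, 4));  // prints 22
    }
}
```

The same field lands at offset 8 in one row and 22 in the other, so no single <segment-column/> value can identify both.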

A *much* better solution (IMHO) is to support regular
expression based pattern matching to recognize line
segments. I've implemented this approach in my own
local build of 1.2.0 RC2 as follows:

Added support for a <segment-pattern/> element within
the <segment/> element. This is in addition to the
<segment-column/>, <segment-width/> and <segment-value/>
elements. With <segment-pattern/> I can do the following:

<segment-pattern>[0-9][0-9][0-9][0-9].*</segment-pattern>

to specify lines that begin with four consecutive digits,
or:

<segment-pattern>.*sample.*</segment-pattern>

to identify lines containing the string "sample" regardless
of where it lies in the line.
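
Both patterns can be tried out directly against java.util.regex. The following is only a sketch with invented sample lines and invented class and method names. One thing it shows is that a whole-line match (which is what String.matches() performs) needs the leading or trailing .* for the pattern to behave as "begins with" or "contains".

```java
import java.util.regex.Pattern;

// Illustration only: class, method names and sample lines are hypothetical.
class SegmentPatternDemo {

    // The two patterns quoted in the request, compiled once.
    static final Pattern FOUR_DIGITS =
            Pattern.compile("[0-9][0-9][0-9][0-9].*");
    static final Pattern HAS_SAMPLE = Pattern.compile(".*sample.*");

    // True if the whole line matches "four digits, then anything".
    static boolean startsWithFourDigits(String line) {
        return FOUR_DIGITS.matcher(line).matches();
    }

    // True if the whole line matches "anything, sample, anything".
    static boolean containsSample(String line) {
        return HAS_SAMPLE.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(startsWithFourDigits("1001,alpha,beta")); // true
        System.out.println(startsWithFourDigits("x1001,alpha"));     // false
        System.out.println(containsSample("1,a,b,c,sample,x"));      // true
        System.out.println(containsSample("1,a,b,c,d,x"));           // false
    }
}
```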

Here is a brief outline of the changes required to support
this enhancement:

Modified
com.babeldoc.conversion.flatfile.digester.DigesterConversionUnmarshaller.dofigureDigester(),
~ line 458, to include:

digester.addCallMethod("conversion/line-segments/segment/segment-pattern",
    "setSegmentPattern", 0);

Modified
com.babeldoc.conversion.flatfile.digester.LineSegment.java
as follows:
o Add a 'segmentPattern' String attribute with
setter/getters.
o Set default values for segmentColumn, segmentValue
and segmentWidth in case they are never set in the
FlatToXml conversion specification (i.e. <segment-pattern/>
is used exclusively).
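
A rough sketch of what such a bean could look like follows. This is not the actual Babeldoc LineSegment source: the class name, the field types and the default values are assumptions, and the new attribute is called segmentPattern here so that it lines up with the "setSegmentPattern" method named in the digester rule. With a parameter count of 0, Digester's addCallMethod passes the element body to that method as a single String argument, which is why the setter takes a String.

```java
// Hypothetical sketch; field types and default values are assumptions.
class LineSegmentSketch {

    // Defaults chosen so that a segment using only <segment-pattern/>
    // never triggers the fixed-column comparison (width 0 disables it).
    private int segmentColumn = 0;
    private int segmentWidth = 0;
    private String segmentValue = "";

    // null means "no pattern configured for this segment".
    private String segmentPattern;

    public int getSegmentColumn() { return segmentColumn; }
    public void setSegmentColumn(int column) { this.segmentColumn = column; }

    public int getSegmentWidth() { return segmentWidth; }
    public void setSegmentWidth(int width) { this.segmentWidth = width; }

    public String getSegmentValue() { return segmentValue; }
    public void setSegmentValue(String value) { this.segmentValue = value; }

    public String getSegmentPattern() { return segmentPattern; }
    public void setSegmentPattern(String pattern) {
        this.segmentPattern = pattern;
    }

    public static void main(String[] args) {
        LineSegmentSketch segment = new LineSegmentSketch();
        segment.setSegmentPattern(".*sample.*");
        System.out.println(segment.getSegmentWidth());   // prints 0
        System.out.println(segment.getSegmentPattern()); // prints .*sample.*
    }
}
```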

Modified
com.babeldoc.conversion.flatfile.LineSegmentData.java
to support the new segmentPattern attribute.

Modified
com.babeldoc.conversion.flatfile.FlatFileConverter.handleSegmentedLine(),
~ line 448, as follows:

// do regex pattern matching or fixed column identification
// (Mitch)
if ((pattern != null && line.matches(pattern))
    || ((width > 0)
        && (line.length() > (column + width))
        && line.substring(column, column + width).equals(value))) {
    Element segmentElement =
        handleLineSegmentElement(element, lineSegment);

to apply pattern matching in addition to exact string
matching.
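
The condition can be exercised on its own, outside the converter. The sketch below copies the boolean logic from the snippet into a standalone helper; the class name, the method name and the sample lines are invented, and nothing else from FlatFileConverter is reproduced.

```java
// Illustration only: a standalone copy of the matching condition.
class SegmentMatchDemo {

    // A line belongs to a segment if it matches the regex pattern, or
    // if the text at [column, column + width) equals the fixed value.
    // The length test is the strict '>' used in the posted snippet, so
    // a value that ends exactly at the end of the line is not matched.
    static boolean matchesSegment(String line, String pattern,
                                  int column, int width, String value) {
        boolean regexHit = pattern != null && line.matches(pattern);
        boolean fixedHit = width > 0
                && line.length() > column + width
                && line.substring(column, column + width).equals(value);
        return regexHit || fixedHit;
    }

    public static void main(String[] args) {
        // Pattern only, fixed-column fields left at width 0.
        System.out.println(
            matchesSegment("1,a,b,c,sample,x", ".*sample.*", 0, 0, "")); // true
        // Fixed column only, no pattern.
        System.out.println(
            matchesSegment("HDR,2003-12-24", null, 0, 3, "HDR"));        // true
        // Neither a pattern nor a positive width: never matches.
        System.out.println(matchesSegment("DTL,1", null, 0, 0, ""));     // false
    }
}
```

Because the two tests are joined with ||, existing specifications that only use <segment-column/>, <segment-width/> and <segment-value/> keep working unchanged.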

I think this is it.

I've tested this here, and this works nicely.

The following caveats apply:

o If your regular expression includes XML-sensitive
characters (such as < or &), you can wrap it in a CDATA
section. This works fine.
o This requires JDK 1.4 or later for the pattern matching
(String.matches() and java.util.regex were added in 1.4).
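
The CDATA caveat can be checked with a few lines of code. This sketch uses the JDK's built-in DOM parser as a stand-in rather than Babeldoc's Digester setup, and the class name, method name and sample pattern are made up, so treat it as a demonstration of how XML parsers hand over CDATA content, not of Babeldoc itself.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Illustration only: uses the JDK DOM parser, not Digester.
class CdataPatternDemo {

    // Parses a one-element XML string and returns its text content.
    static String readPattern(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The '<' and '>' in the regex would be invalid without CDATA.
        String xml =
            "<segment-pattern><![CDATA[.*<total>.*]]></segment-pattern>";
        String pattern = readPattern(xml);
        System.out.println(pattern);                       // prints .*<total>.*
        System.out.println("a,<total>,9".matches(pattern)); // prints true
    }
}
```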

Discussion

Dejan Krsmanovic - 2003-12-24

I guess this should be tracked as RFE, not a Bug.

Dejan Krsmanovic - 2003-12-24

labels: 531667 -->

assigned_to: triphop --> nobody

bruce mcdonald - 2003-12-24

True that.
