Release file: matchdo-1.0.tgz (2020-05-02, 36.5 kB)
DESCRIPTION
Matchdo is a command-line tool that matches names contained in two data sets.
(Matchdo was used in one organisation to match thousands of computer accounts
with over 100,000 HR records.)
INSTALL
First install Text::LevenshteinXS from CPAN or from your distribution's package repository.
Matchdo has been tuned to work best with this version.
(Using a non-XS version means waiting a long time for your output.)
If your data contains UTF-8 characters, install Text::Levenshtein::XS instead
and replace 'use Text::LevenshteinXS' with 'use Text::Levenshtein::XS' in Matchdo.pm.
cp matchdo.pl Matchdo.pm matchdo.conf indexdo.conf /usr/local/bin/
chmod +x /usr/local/bin/matchdo.pl
cp matchdo2list.pl /usr/local/bin
chmod +x /usr/local/bin/matchdo2list.pl
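For readers unfamiliar with the metric behind the Text::LevenshteinXS dependency: Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. The Python sketch below illustrates the metric only. It is not Matchdo's code, and how Matchdo turns distances into a match decision is not described in this README.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (illustrative sketch)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # drop ch_a
                current[j - 1] + 1,      # drop ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Jon", "John"))     # 1
print(levenshtein("Smith", "Smyth"))  # 1
```

A pure-language loop like this is the kind of implementation the note about non-XS versions warns about: it is fine for a handful of names but slow across very large data sets.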
USAGE
cat source-data |matchdo.pl [index-data] [field-options-string] [islike-options-string(s)] >output-file
cat list-source.txt |matchdo.pl list-index.txt "gn;sn;^gn;^sn;^dept;^location;%indexname;%debug;^id:u;@id;" >list-matchdo.txt
Then use 'matchdo2list.pl' to convert the matchdo output to pipe-delimited CSV
(see the example/ folder).
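If you would rather drive matchdo.pl from a script than from a shell pipeline, the same call can be made with the source file on stdin and the output file on stdout. The sketch below is an assumption-laden illustration: the helper names are hypothetical, it assumes matchdo.pl is on your PATH, and it leaves out the optional islike-options-string(s).

```python
import subprocess


def matchdo_command(index_file, field_options, script="matchdo.pl"):
    """Build the argument list for one matchdo.pl run (hypothetical helper)."""
    # Passing a list avoids shell quoting problems with ';', '^' and '%'.
    return [script, index_file, field_options]


def run_matchdo(source_file, index_file, field_options, output_file,
                script="matchdo.pl"):
    """Feed source_file to matchdo.pl on stdin and save its stdout."""
    with open(source_file, "rb") as src, open(output_file, "wb") as out:
        subprocess.run(matchdo_command(index_file, field_options, script),
                       stdin=src, stdout=out, check=True)
```

Calling run_matchdo("list-source.txt", "list-index.txt", options, "list-matchdo.txt") corresponds to the 'cat ... |matchdo.pl ... >list-matchdo.txt' form shown in USAGE.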
QUICK START
see: example/readme-example.txt
1. match sourcefile names against indexfile names
2. make the CSV (delimited with the bar character '|')
3. make the duplicates CSV
. field identifiers
:g givenName, :s surname, :i ID field, :u unique field in indexfile
'^' is used to identify indexfile fieldnames
'@' is used to print the duplicates found
. field identifier usage
':u' must be given with an indexfile fieldname (it identifies matched/found indexfile records)
':g :s' must be given with the appropriate sourcefile and indexfile fieldnames
':i' is optional with a sourcefile fieldname
':i' and ':u' are often both given to the same indexfile fieldname (i.e. when employeeID is the unique field)
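To make the notation concrete, here is a small Python sketch that splits a field-options string into fieldnames, prefixes and identifiers. It is not Matchdo's own parser (that lives in the Perl code and is not shown here). The function name, the "kind" labels and the returned layout are assumptions made for illustration. The '%' prefix is treated as marking the system fieldnames used in the quick-start command.

```python
# Prefix characters as described in this README; the labels are illustrative.
PREFIX_KINDS = {"": "source", "^": "index", "%": "system", "@": "duplicates"}


def parse_field_options(options: str) -> list:
    """Split 'name:g;^name:i:u;%debug;...' into one dict per field."""
    fields = []
    for token in options.split(";"):
        token = token.strip()
        if not token:
            continue  # tolerate a trailing ';'
        prefix = token[0] if token[0] in "^%@" else ""
        name, *flags = token[len(prefix):].split(":")
        fields.append({
            "name": name,
            "kind": PREFIX_KINDS[prefix],
            "flags": set(flags),  # e.g. {"g"}, {"i", "u"} or empty
        })
    return fields


options = ("dn;givenName:g;sn:s;%indexname;%debug;%matchStatus;"
           "^fname:g;^lname:s;^dept;^location;^employeeid:i:u")
for field in parse_field_options(options):
    print(field["kind"], field["name"], sorted(field["flags"]))
```

Printing the parsed fields this way is a quick check that ':g', ':s' and ':u' ended up on the fieldnames you intended before starting a long matching run.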
1. use matchdo.pl to match sourcefile names against indexfile names
. fieldnames can be in any order
. these 3 system fieldnames should always be given:
%indexname;%debug;%matchStatus
cat list-source.txt |matchdo/matchdo.pl list-index.txt "dn;givenName:g;sn:s;%indexname;%debug;%matchStatus;^fname:g;^lname:s;^dept;^location;^employeeid:i:u" >list-matchdo.txt
2. use matchdo2list.pl to make the CSV
. list only the specified inputfile fields (dn,givenName,sn)
. list all indexfile fields
. use employeeid to link the indexfile with the inputfile matchStatus ID
cat list-matchdo.txt |matchdo/matchdo2list.pl -l employeeid dn givenName sn list-index.txt >list-matches.txt
note:
-l <fieldname>: specifies the fieldname(s) to link files with
fieldnames to display go after the filename, inputfile fieldnames go first
3. use matchdo2list.pl to make the dups CSV
. list only the specified inputfile fields (dn,givenName,sn)
. list all indexfile fields
. use employeeid to link the indexfile with the inputfile matchStatus IDs
cat list-matchdo.txt |matchdo/matchdo2list.pl -d1 -l employeeid dn givenName sn list-index.txt >list-matches-dups.txt
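The lists written in steps 2 and 3 are bar-delimited, so they can be loaded with any CSV library for further checking. A minimal Python sketch follows. It assumes the default CSV quoting rules are acceptable for your data and makes no assumption about a header row (this README does not say whether one is written). The sample row is made up.

```python
import csv
import io


def read_bar_delimited(text_stream):
    """Yield each row of a '|' delimited file as a list of strings."""
    yield from csv.reader(text_stream, delimiter="|")


# Hypothetical sample data; a real run would open list-matches.txt instead.
sample = io.StringIO("cn=jsmith|John|Smith|1234\n")
for row in read_bar_delimited(sample):
    print(row)
```

For a real file, open it with open("list-matches.txt", newline="") and pass the file object to read_bar_delimited.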
see also: readme-matchdo.txt
END
Greg Breheny