Name              Modified     Size      Downloads / Week
matchdo-1.0.tgz   2020-05-02   36.5 kB
README.txt        2020-05-02   3.0 kB
Totals: 2 items                39.5 kB   0

DESCRIPTION

 Matchdo is a command-line tool that matches names contained in two data sets.

 (Matchdo was used in one organisation to match thousands of computer accounts
 with over 100,000 HR records.)


INSTALL

 First install Text::LevenshteinXS from CPAN or from your distribution's package repository.
 Matchdo has been tuned to work best with this module.
 (Using a non-XS version will leave you waiting a long time for your output.)
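
 Before running matchdo it may be worth confirming that the XS module actually loads.
 A minimal sketch, assuming perl is on your PATH; the helper name have_perl_module
 is hypothetical and not part of matchdo:

```shell
#!/bin/sh
# have_perl_module is a hypothetical helper, not part of matchdo.
# It returns 0 if perl can load the named module, non-zero otherwise.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

if have_perl_module Text::LevenshteinXS; then
    echo "Text::LevenshteinXS is installed"
else
    echo "Text::LevenshteinXS not found; install it from CPAN or your distribution's packages"
fi
```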

 If your data contains UTF-8 characters, install Text::Levenshtein::XS instead
 and replace 'use Text::LevenshteinXS' with 'use Text::Levenshtein::XS' in Matchdo.pm.
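
 The replacement can be made by hand in an editor, or scripted. A small sketch using sed,
 assuming Matchdo.pm is in the current directory; MATCHDO_PM and switch_to_utf8_module
 are hypothetical names used only for this illustration:

```shell
#!/bin/sh
# Sketch: switch Matchdo.pm to the UTF-8 capable module.
# MATCHDO_PM is a hypothetical variable; point it at your copy of Matchdo.pm.
MATCHDO_PM="${MATCHDO_PM:-Matchdo.pm}"

switch_to_utf8_module() {
    # -i.bak edits the file in place and keeps the original as <file>.bak
    sed -i.bak 's/use Text::LevenshteinXS/use Text::Levenshtein::XS/' "$1"
}

if [ -f "$MATCHDO_PM" ]; then
    switch_to_utf8_module "$MATCHDO_PM"
    # Show the resulting 'use' line so the change can be checked by eye
    grep -n 'Levenshtein' "$MATCHDO_PM"
fi
```

 Running it a second time changes nothing, because the old module name no longer appears in the file.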

 cp matchdo.pl Matchdo.pm matchdo.conf indexdo.conf /usr/local/bin/
 chmod +x /usr/local/bin/matchdo.pl

 cp matchdo2list.pl /usr/local/bin
 chmod +x /usr/local/bin/matchdo2list.pl


USAGE

  cat source-data |matchdo.pl [index-data] [field-options-string] [islike-options-string(s)] >output-file

  cat list-source.txt |matchdo.pl list-index.txt "gn;sn;^gn;^sn;^dept;^location;%indexname;%debug;^id:u;@id;" >list-matchdo.txt

  Then use 'matchdo2list.pl' to convert the matchdo output to pipe-delimited CSV

  (see the example/ folder)


QUICK START

    see: example/readme-example.txt

	1. match sourcefile names against indexfile names
	2. make CSV (bar '|' delimited CSV)
	3. make dups CSV

	. field identifiers

		:g givenName, :s surname, :i ID field, :u unique field in indexfile
		'^' is used to identify indexfile fieldnames
		'@' is used to print the duplicates found

		. field identifier usage

			':u' must be given with an indexfile fieldname (it identifies matched/found indexfile records)
			':g :s' must be given with the appropriate sourcefile and indexfile fieldnames
			':i' is optional with a sourcefile fieldname
			':i:u' are often given to the same indexfile fieldname (i.e. when employeeID is the unique field)

	1. use matchdo.pl to match sourcefile names against indexfile names

	. fieldnames can be in any order
	. these 3 system fieldnames should always be given:
		%indexname;%debug;%matchStatus

	cat list-source.txt |matchdo/matchdo.pl list-index.txt "dn;givenName:g;sn:s;%indexname;%debug;%matchStatus;^fname:g;^lname:s;^dept;^location;^employeeid:i:u" >list-matchdo.txt
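
	The field-options string in the command above is easier to read when split on ';'.
	The sketch below only illustrates the syntax described in this section. describe_fields
	is a hypothetical helper, not part of matchdo, and it assumes that a fieldname without
	a '^', '%' or '@' prefix belongs to the sourcefile:

```shell
#!/bin/sh
# describe_fields is a hypothetical helper for reading a field-options string;
# it only illustrates the syntax and is not part of matchdo.
describe_fields() {
    printf '%s\n' "$1" | tr ';' '\n' | while IFS= read -r field; do
        [ -n "$field" ] || continue         # skip the empty entry after a trailing ';'
        name=${field%%:*}                   # text before the first ':'
        case "$field" in
            *:*) flags=${field#*:} ;;       # identifiers after the name, e.g. "i:u"
            *)   flags="-" ;;
        esac
        case "$name" in
            "^"*) kind="index";  name=${name#?} ;;   # indexfile fieldname
            "%"*) kind="system"; name=${name#?} ;;   # system fieldname
            "@"*) kind="dups";   name=${name#?} ;;   # print duplicates found
            *)    kind="source" ;;                   # assumed: sourcefile fieldname
        esac
        printf '%s %s %s\n' "$kind" "$name" "$flags"
    done
}

describe_fields "dn;givenName:g;sn:s;%indexname;%debug;%matchStatus;^fname:g;^lname:s;^dept;^location;^employeeid:i:u"
```

	Each output line gives the kind of field, its name, and its identifiers, so
	'^employeeid:i:u' is reported as 'index employeeid i:u'.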


	2. use matchdo2list.pl to make the CSV

	. list only the specified inputfile fields (dn,givenName,sn)
	. list all indexfile fields
	. use employeeid to link the indexfile with the inputfile matchStatus ID

		cat list-matchdo.txt |matchdo/matchdo2list.pl -l employeeid dn givenName sn list-index.txt >list-matches.txt

	note:
		-l <fieldname>: specifies the fieldname(s) to link files with
		fieldnames to display go after the filename, inputfile fieldnames go first
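
	Because the result is bar-delimited, individual columns can be pulled out with awk.
	A rough sketch: bar_column is a hypothetical helper, and the sample rows are made up
	for illustration. They are not real matchdo2list.pl output, so check the column order
	of your own list-matches.txt first:

```shell
#!/bin/sh
# bar_column is a hypothetical helper: print field number $1 of the bar-delimited file $2.
bar_column() {
    awk -F'|' -v n="$1" '{ print $n }' "$2"
}

# Example with made-up data (not real matchdo2list.pl output)
sample=$(mktemp)
printf 'cn=jsmith|John|Smith\ncn=mjones|Mary|Jones\n' > "$sample"
bar_column 3 "$sample"
rm -f "$sample"
```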


	3. use matchdo2list.pl to make the dups CSV

	. list only the specified inputfile fields (dn,givenName,sn)
	. list all indexfile fields
	. use employeeid to link the indexfile with the inputfile matchStatus IDs

		cat list-matchdo.txt |matchdo/matchdo2list.pl -d1 -l employeeid dn givenName sn list-index.txt >list-matches-dups.txt
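
	The three steps can be collected into one shell function. This is only a sketch built
	from the commands above: run_matchdo_example is a hypothetical name, the input files
	are assumed to be in the current directory, and '<file' is used in place of 'cat file |'
	(the two are equivalent here):

```shell
#!/bin/sh
# run_matchdo_example is a hypothetical wrapper around the three QUICK START commands.
# $1 is the directory holding matchdo.pl and matchdo2list.pl (default: matchdo).
run_matchdo_example() {
    dir=${1:-matchdo}
    fields="dn;givenName:g;sn:s;%indexname;%debug;%matchStatus;^fname:g;^lname:s;^dept;^location;^employeeid:i:u"

    # 1. match sourcefile names against indexfile names
    "$dir/matchdo.pl" list-index.txt "$fields" <list-source.txt >list-matchdo.txt || return 1
    # 2. make the CSV
    "$dir/matchdo2list.pl" -l employeeid dn givenName sn list-index.txt <list-matchdo.txt >list-matches.txt || return 1
    # 3. make the dups CSV
    "$dir/matchdo2list.pl" -d1 -l employeeid dn givenName sn list-index.txt <list-matchdo.txt >list-matches-dups.txt || return 1
}
```

	Call it as 'run_matchdo_example matchdo' from the directory containing list-source.txt and list-index.txt.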

see also: readme-matchdo.txt

END


Greg Breheny

Source: README.txt, updated 2020-05-02