README
This document describes the role and structure of the data cleaning program and its main features.
The following image shows the structure of the GUI window; the role of each of its parts is described below.
1. The button "Import CSV" opens a browse window where you select the data set to be cleaned, which must be in CSV format.
2. The check box "Enable GoogleMaps API" enables a Google Maps search for the address provided. The search returns a numeric value between 1 and 8 that indicates the accuracy of the address. Next to the check box and its label is a text box that sets the certainty value at which the values in the original database are replaced by the data returned by Google Maps.
3. The Statistics pane shows the results of the operation: the total number of records assessed, the number of records cleaned, the total operation time, and the percentage of records at each Google Maps certainty level.
4. The corresponding region contains a table with the cleaned data, organized into the columns Name, Middle Name, Last Name, Company Name, Street, Ext., Int., Block, Lot, Cologne, ZIP, City, State, Maps, GeoCoords.
5. The tables in the corresponding region list the patterns (strings) that characterize each name and address entry, along with the frequency of each pattern. They are included so that patterns not yet handled can be added to the program to refine the data cleaning.
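The threshold described in item 2 and the certainty percentages described in item 3 can be sketched as follows. This is a minimal illustration, not the program's code: the function names are hypothetical, and it assumes that a higher value means a more accurate match and that replacement happens when the certainty is equal to or above the threshold.

```python
from collections import Counter

def choose_record(original, maps_data, certainty, threshold):
    """Return the Google Maps data when certainty reaches the threshold.

    Assumption: 8 is the most accurate level and the comparison is ">=".
    """
    if not 1 <= certainty <= 8:
        raise ValueError("certainty must be between 1 and 8")
    return maps_data if certainty >= threshold else original

def certainty_percentages(certainties):
    """Percentage of records at each certainty level (1 to 8)."""
    total = len(certainties)
    if total == 0:
        return {}
    counts = Counter(certainties)
    return {level: 100.0 * counts[level] / total for level in range(1, 9)}
```

For example, with a threshold of 6, a record whose search returned certainty 7 would take the Google Maps values, while one with certainty 4 would keep its original values.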
With regard to the program code, note the following:
* The class CleanerThread contains all the data cleaning code.
* It begins by loading the data set to be cleaned, which is selected through a browse window.
* Next, the initial values of the Statistics panel are set, and the structure of the table in which the organized data will appear is defined.
* Then, string arrays are defined that hold the reference words used to identify particular elements of an address. They are used to organize the address columns.
* Next comes the address region. It begins with a pre-format step that removes unknown characters and performs the character substitutions needed for proper processing. The input is then split into words (using the space as a separator), and each word is characterized by comparing it against the arrays of the different types of identifiers. The resulting characterization of each input string is used to place the items in the appropriate columns by means of regular expressions. Finally, the city and state are validated against the ZIP code provided, using a vector that contains the elements of an external database.
* The name region follows. It begins by classifying every entry that contains S.A, S. A, or more than three words separated by "." as a company name. Then, as with the addresses, the input is split into words (using the space as a separator) and each word is classified as a name, a last name, both, or neither, according to its presence in vectors that contain the elements of external databases of names and last names. The resulting characterization of each input string is used to place the items in the appropriate columns by means of regular expressions.
* The method LoadDB generates a vector of names or last names from the elements of the databases whose locations are specified at the beginning of the method.
* The method LoadDBCP generates a vector of postal codes, cities and states from the elements of the database whose location is specified at the beginning of the method.
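The characterize-then-match approach described for the address region can be sketched as below. This is only an illustration under stated assumptions: the identifier words, the one-letter codes, the regular expression and every function name are hypothetical, since the actual arrays and patterns used by CleanerThread are not listed in this document. Each word is mapped to one character, so a regular expression applied to the pattern string yields positions that index directly into the word list.

```python
import re

# Hypothetical identifier arrays; the program's real lists are not shown here.
IDENTIFIERS = {
    "S": {"CALLE", "AVENIDA", "AV"},  # street identifiers
    "B": {"MANZANA", "MZ"},           # block identifiers
    "L": {"LOTE", "LT"},              # lot identifiers
}

# Hypothetical pattern: optional street identifier, street words,
# exterior number, then optional block and lot.
ADDRESS_PATTERN = re.compile(
    r"^S?(?P<street>W+)(?P<ext>N)(?:B(?P<block>[NW]))?(?:L(?P<lot>[NW]))?$"
)

def preformat(raw):
    """Uppercase, turn '.' and ',' into spaces, drop other unknown characters."""
    text = raw.upper().replace(".", " ").replace(",", " ")
    text = re.sub(r"[^A-Z0-9 ]", "", text)
    return " ".join(text.split())

def characterize(words):
    """Return one code per word: an identifier code, N (number) or W (word)."""
    codes = []
    for word in words:
        for code, vocabulary in IDENTIFIERS.items():
            if word in vocabulary:
                codes.append(code)
                break
        else:
            codes.append("N" if word.isdigit() else "W")
    return "".join(codes)

def split_address(raw):
    """Return (pattern string, columns dict), or (pattern string, None) if unmatched."""
    words = preformat(raw).split()
    pattern = characterize(words)
    match = ADDRESS_PATTERN.match(pattern)
    if match is None:
        return pattern, None
    columns = {}
    for name in ("street", "ext", "block", "lot"):
        start, end = match.span(name)  # (-1, -1) when the group did not take part
        columns[name] = " ".join(words[start:end]) if start != -1 else ""
    return pattern, columns
```

Under these assumptions, `split_address("Calle Benito Juarez 123, Mz. 4 Lt. 7")` produces the pattern string `SWWNBNLN`. Pattern strings that match no expression come back with `None`, which corresponds to the kind of entry the pattern frequency tables are meant to surface. The name region could follow the same idea, with codes for name, last name, both, or neither.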