Tabulator Code
File | Date | Author | Commit |
---|---|---|---|
bin | 2012-01-25 | schroedl | [r5] |
CHANGES | 2012-01-25 | schroedl | [r5] |
COPYING | 2009-06-17 | schroedl | [r1] Initial check in |
README | 2011-11-25 | schroedl | [r4] |
----------------------------------------------------------------------
Welcome to Tabulator!
----------------------------------------------------------------------

URL:    http://tabulator.sourceforge.net
Author: Stefan Schroedl <stefan.schroedl@gmx.de>
Date:   2011/11/25 release 1.1
        2009/06/16 release 1.0

----------------------------------------------------------------------
1. What is Tabulator?
----------------------------------------------------------------------

Tabular text files (a.k.a. tab-delimited, csv, or flat file format)
can sometimes be a more convenient and efficient alternative to
relational databases; this is particularly the case for sequential
batch processing of large amounts of data, where index-based access
is not a priority. Unix already provides many tools for this, such as
'cut', 'paste', 'join', and 'sort'. Tabulator is a collection of
command-line tools for Unix/Linux platforms that build on these
programs, but make them easier and more flexible to use. In
particular, they

* allow columns to be referenced by name rather than by position,
  with the names taken from the first line of the file; this can make
  scripts more readable and more robust to changes in the input data
  format.
* automate file format recognition (delimiter, compression, etc.).
* check the file format (e.g., consistent number of columns).
* offer expanded functionality.

----------------------------------------------------------------------
2. Installation
----------------------------------------------------------------------

Installation is easy: just unpack the tarball and add the unpacked
directory to your PATH.

----------------------------------------------------------------------
3. License
----------------------------------------------------------------------

Tabulator is licensed under the GNU General Public License Version 3.0
(GPLv3); see the COPYING file for details.

----------------------------------------------------------------------
4. Implementation Notes
----------------------------------------------------------------------

The scripts have been developed over time to help me with various data
processing tasks, and were not designed from the outset to be released
in one package. Therefore, some scripts are implemented in Python and
some in Perl, and there might be redundant functionality between
scripts.

----------------------------------------------------------------------
5. Documentation
----------------------------------------------------------------------

Here is a brief list of the programs together with their main
functionality. Each one provides more documentation and examples when
called with the '-m' or '-h' options. A common assumption is that the
first line of each input file contains the column names.

* tblcat: concatenate files of the same data format without header
  repetition

* tblcmd: execute a program on the body of a file (e.g., sort, uniq),
  without affecting the header

* tbldesc: for each column, summarize type (e.g., char, int, float),
  percentage of undefined values, min/max/mean/median/std etc. Can
  also provide correlation coefficients with a target column.

  Example: Suppose file is

    name,house_nr,height,shoe_size
    arthur,42,6.1,11.5
    berta,101,5.5,8
    chris,333,5.9,10
    don,77,5.9,12.5

  Then 'tbldesc file' prints:

    summarizing file_desc (4 lines, target column: shoe_size)
    field  name       type   char%  uniq  min  max   avg   std    mse   corr    prob%
    1      name       char   100    4     [arthur; berta; chris; don]
    2      house_nr   int    0      4     42   333   138   114    172   -0.287  71.25
    3      height     float  0      3     5.5  6.1   5.85  0.218  4.89  0.812   18.82
    4      shoe_size  float  0      4     8    12.5  10.5  1.7    0.0   1.0     0.00

* tblmap: simple line-wise ("map") computation similar to awk.

  Example: Compute the ratio of columns 'sales' and 'client' for
  lines where the column 'region' has value 'us':

    tblmap -s'region=="us"' -c'sales_per_client=sales/client' <file>

* tblred: compute ("reduce") aggregations (e.g., sum, average) over
  groups of keys.

  Example:

    tblred -k'region' 'sales_ratio=sales/sum(sales)'

  computes, for each line, the proportion of column 'sales' to the
  total sales over all lines with the same value of column 'region'.

* tbljoin: in contrast to Unix 'join', the input files don't have to
  be pre-sorted, and multiple join columns can be specified.

  Example: Suppose file1 is

    name,street,house
    zorro,desert road,5
    john,main st,2
    arthur,pan-galactic bypass,42
    arthur,main st,15

  and file2 is

    name,street,phone
    john,main st,654-321
    arthur,main st,121-212
    john,round cir,123-456

  Then 'tbljoin file1 file2' gives

    name,street,house,phone
    arthur,main st,15,121-212
    john,main st,2,654-321

* tblhist: computation and plotting of the histogram of column values

* tblsplit: split a file into several ones based on a column value

  Example: Suppose file is

    continent,country
    americas,us
    americas,mx
    europe,de
    europe,fr

  Then 'tblsplit -rk'continent' file' generates two files:

    file.select.continent=americas:

      country
      us
      mx

    file.select.continent=europe:

      country
      de
      fr

* tbltex: formatting for latex tables

* tbltranspose: transposition of rows and columns

* tbluniq: check for and cut out duplicate columns; also, discover
  value dependencies.

----------------------------------------------------------------------
6. Limitations
----------------------------------------------------------------------

* There is no special interpretation of block delimiters like `'` or
  `"`; it is the user's responsibility to ensure that the column
  delimiter cannot occur within column values.

* tbljoin, tblred, tblhist, tbluniq, tblcat, and tbltex have some
  restrictions when run as a filter, because they need to read their
  input more than once.
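
For readers who want to see what joining by column name without pre-sorting can look like, the tbljoin example from section 5 can be sketched in Python as below. This is only an illustration under stated assumptions, not Tabulator's actual implementation: the function name join_by_shared_columns is hypothetical, the join keys are assumed to be all column names the two headers share, and the result is sorted by key purely to mirror the output order shown in the example. In line with the limitations above, quote characters get no special treatment.

```python
import csv
import io
from collections import defaultdict


def join_by_shared_columns(text1, text2, delimiter=","):
    """Inner-join two header-first tables on every column name they share.

    Hypothetical helper for illustration; not part of Tabulator.
    """
    def read(text):
        # QUOTE_NONE: quote characters are ordinary data, as in section 6.
        reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                            quoting=csv.QUOTE_NONE)
        rows = [row for row in reader if row]  # skip blank lines
        return rows[0], rows[1:]

    header1, body1 = read(text1)
    header2, body2 = read(text2)

    # Join keys: columns present in both headers, in first-file order.
    keys = [name for name in header1 if name in header2]
    extra2 = [name for name in header2 if name not in keys]
    idx1 = [header1.index(k) for k in keys]
    idx2 = [header2.index(k) for k in keys]
    extra_idx2 = [header2.index(name) for name in extra2]

    # Hash the second table by key, so neither input has to be sorted.
    lookup = defaultdict(list)
    for row in body2:
        lookup[tuple(row[i] for i in idx2)].append(row)

    joined = []
    for row in body1:
        key = tuple(row[i] for i in idx1)
        for match in lookup.get(key, []):
            joined.append(row + [match[i] for i in extra_idx2])

    # Assumption: sort by key only to match the order in the README example.
    joined.sort(key=lambda r: tuple(r[i] for i in idx1))
    return [header1 + extra2] + joined


FILE1 = """name,street,house
zorro,desert road,5
john,main st,2
arthur,pan-galactic bypass,42
arthur,main st,15
"""

FILE2 = """name,street,phone
john,main st,654-321
arthur,main st,121-212
john,round cir,123-456
"""

for out_row in join_by_shared_columns(FILE1, FILE2):
    print(",".join(out_row))
```

Run on the two sample files, the sketch prints the same three lines as the tbljoin example: the header, then the 'arthur,main st' row, then the 'john,main st' row. Building a hash table over one input is what removes the pre-sorting requirement of Unix 'join', at the cost of holding that input in memory.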