----------------------------------------------------------------------
Welcome to Tabulator!
----------------------------------------------------------------------
URL:
http://tabulator.sourceforge.net
Author:
Stefan Schroedl <stefan.schroedl@gmx.de>
Date:
2011/11/25 release 1.1
2009/06/16 release 1.0
----------------------------------------------------------------------
1. What is Tabulator?
----------------------------------------------------------------------
Tabular text files (a.k.a. tab-delimited, CSV, or flat-file format)
can sometimes be a more convenient and efficient alternative to
relational databases; this is particularly the case for sequential
batch processing of large amounts of data, where index-based access
is not a priority.
Unix already provides many tools for this kind of processing, such as
'cut', 'paste', 'join', and 'sort'. Tabulator is a collection of
command-line tools for Unix/Linux platforms that build on these
programs but make them easier and more flexible to use. In particular,
they
* allow columns to be referenced by name rather than by position, as
  indicated by the first line in a file; this can make scripts
  more readable and more robust to changes in the input data format.
* automate file format recognition (delimiter, compression etc).
* check file format (e.g., consistent number of columns).
* offer expanded functionality.
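
To illustrate the first and third points, here is a minimal Python
sketch of name-based column access combined with a column-count check.
It is not taken from the Tabulator sources; the function name
select_columns and the sample data are assumptions made for
illustration only.

```python
def select_columns(lines, names, delimiter=","):
    """Yield the requested columns, looked up by header name, for each row.

    Hypothetical helper, not part of Tabulator.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\n").split(delimiter)
    # Map each column name to its position once, using the header line.
    index = {name: pos for pos, name in enumerate(header)}
    for line_no, line in enumerate(lines, start=2):
        fields = line.rstrip("\n").split(delimiter)
        # Format check: every row must have as many fields as the header.
        if len(fields) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(fields)}"
            )
        yield [fields[index[name]] for name in names]


data = [
    "name,house_nr,height,shoe_size",
    "arthur,42,6.1,11.5",
    "berta,101,5.5,8",
]
rows = list(select_columns(data, ["shoe_size", "name"]))
print(rows)  # [['11.5', 'arthur'], ['8', 'berta']]
```

Because the lookup goes through the header, the same call keeps working
if columns are reordered or new ones are added to the input.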
----------------------------------------------------------------------
2. Installation
----------------------------------------------------------------------
Installation is easy: just unpack the tarball and add the unpacked
directory to your PATH.
----------------------------------------------------------------------
3. License
----------------------------------------------------------------------
Tabulator is licensed under the GNU General Public License Version 3.0
(GPLv3), see the COPYING file for details.
----------------------------------------------------------------------
4. Implementation Notes
----------------------------------------------------------------------
The scripts have been developed over time to help me with various data
processing tasks, and were not designed from the outset to be released
in one package. Therefore, some scripts are implemented in Python, and
some in Perl; and there might be redundant functionality between
scripts.
----------------------------------------------------------------------
5. Documentation
----------------------------------------------------------------------
Here is a brief list of the programs together with their main
functionality. Each one provides more documentation and examples when
called with the '-m' or '-h' options. A common assumption is that
the first line in input files contains the column names.
* tblcat: concatenate files of the same data format without header
repetition
* tblcmd: execute a program on the body of a file (e.g., sort, uniq),
without affecting the header
* tbldesc: for each column, summarize type (e.g., char, int,
float), percentage of undefined values, min/max/mean/median/std
etc. Can also provide correlation coefficients with a target
column.
  Example:
  Suppose file is
    name,house_nr,height,shoe_size
    arthur,42,6.1,11.5
    berta,101,5.5,8
    chris,333,5.9,10
    don,77,5.9,12.5
  Then 'tbldesc file' prints:
    summarizing file_desc (4 lines, target column: shoe_size)
    field  name       type   char%  uniq  min  max   avg   std    mse   corr    prob%
    1      name       char   100    4     [arthur; berta; chris; don]
    2      house_nr   int    0      4     42   333   138   114    172   -0.287  71.25
    3      height     float  0      3     5.5  6.1   5.85  0.218  4.89  0.812   18.82
    4      shoe_size  float  0      4     8    12.5  10.5  1.7    0.0   1.0     0.00
* tblmap: simple line-wise ("map") computation similar to awk.
  Example:
  Compute the ratio of columns 'sales' and 'clients' for lines where
  the column 'region' has value 'us':
    tblmap -s'region=="us"' -c'sales_per_client=sales/clients' <file>
* tblred: compute ("reduce") aggregations (e.g., sum, average) over groups
of keys.
  Example:
    tblred -k'region' 'sales_ratio=sales/sum(sales)'
  computes, for each line, the ratio of column 'sales' to the total
  sales over all lines with the same value of column 'region'.
* tbljoin: join files on common columns. In contrast to Unix 'join',
  the input files don't have to be pre-sorted, and multiple join
  columns can be specified.
  Example:
  Suppose file1 is
    name,street,house
    zorro,desert road,5
    john,main st,2
    arthur,pan-galactic bypass,42
    arthur,main st,15
  and file2 is
    name,street,phone
    john,main st,654-321
    arthur,main st,121-212
    john,round cir,123-456
  Then 'tbljoin file1 file2' gives
    name,street,house,phone
    arthur,main st,15,121-212
    john,main st,2,654-321
* tblhist: computation and plotting of the histogram of column
values
* tblsplit: split a file into several ones based on a column value
  Example:
  Suppose file is
    continent,country
    americas,us
    americas,mx
    europe,de
    europe,fr
  Then 'tblsplit -rk'continent' file' generates two files:
  file.select.continent=americas:
    country
    us
    mx
  and
  file.select.continent=europe:
    country
    de
    fr
* tbltex: formatting for LaTeX tables
* tbltranspose: transposition of rows and columns
* tbluniq: check for and cut out duplicate columns; also, discover
value dependencies.
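
The tbljoin example above can be reproduced with a short Python sketch
of a hash-based join, which is one way to avoid the pre-sorting that
Unix 'join' requires. This is not tbljoin's actual implementation. It
assumes an inner join on all column names shared by both headers and
sorted output, which is inferred from the example output only; the
function name join_tables is hypothetical.

```python
def join_tables(left_lines, right_lines, delimiter=","):
    """Inner-join two header-first tables on all shared column names.

    Hypothetical helper; the join semantics are an assumption inferred
    from the tbljoin example, not taken from the Tabulator sources.
    """
    left = [line.split(delimiter) for line in left_lines]
    right = [line.split(delimiter) for line in right_lines]
    left_header, right_header = left[0], right[0]

    # Join columns: every name that appears in both headers.
    keys = [c for c in left_header if c in right_header]
    left_key = [left_header.index(c) for c in keys]
    right_key = [right_header.index(c) for c in keys]
    left_rest = [i for i, c in enumerate(left_header) if c not in keys]
    right_rest = [i for i, c in enumerate(right_header) if c not in keys]

    # Index the right table by key so neither input needs to be sorted.
    lookup = {}
    for row in right[1:]:
        lookup.setdefault(tuple(row[i] for i in right_key), []).append(row)

    out = []
    for row in left[1:]:
        key = tuple(row[i] for i in left_key)
        for match in lookup.get(key, []):
            out.append(
                list(key)
                + [row[i] for i in left_rest]
                + [match[i] for i in right_rest]
            )
    out.sort()  # sorted output, as in the example

    header = (
        keys
        + [left_header[i] for i in left_rest]
        + [right_header[i] for i in right_rest]
    )
    return [delimiter.join(r) for r in [header] + out]


file1 = [
    "name,street,house",
    "zorro,desert road,5",
    "john,main st,2",
    "arthur,pan-galactic bypass,42",
    "arthur,main st,15",
]
file2 = [
    "name,street,phone",
    "john,main st,654-321",
    "arthur,main st,121-212",
    "john,round cir,123-456",
]
joined = join_tables(file1, file2)
print("\n".join(joined))
```

Run on the two sample files, the sketch prints the same three lines as
the tbljoin example.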
----------------------------------------------------------------------
6. Limitations
----------------------------------------------------------------------
* There is no special interpretation of block delimiters like `'`
or `"`; it is the user's responsibility to ensure that the column
delimiter cannot occur within column values.
* tbljoin, tblred, tblhist, tbluniq, tblcat, and tbltex have some
  restrictions when run as a filter, because they need to read their
  input more than once.
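
The first limitation can be illustrated with a small Python sketch. It
assumes fields are separated by plain string splitting with no quote
handling, as described above; the helper names field_counts and
delimiter_is_safe are hypothetical and not part of Tabulator.

```python
def field_counts(lines, delimiter=","):
    """Number of fields per line when splitting without quote handling."""
    return [len(line.split(delimiter)) for line in lines]


def delimiter_is_safe(lines, delimiter=","):
    """True if every line splits into as many fields as the header line."""
    counts = field_counts(lines, delimiter)
    return all(count == counts[0] for count in counts)


# The quoted value contains the delimiter, so the data line splits
# into three fields although the header has only two.
comma_rows = ["name,street", 'john,"main st, apt 2"']
print(field_counts(comma_rows))       # [2, 3]
print(delimiter_is_safe(comma_rows))  # False

# With a tab as the delimiter, the same value causes no problem.
tab_rows = ["name\tstreet", 'john\t"main st, apt 2"']
print(delimiter_is_safe(tab_rows, "\t"))  # True
```

A check of this kind before processing can show whether a different
column delimiter is needed for a given data set.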