----------------------------------------------------------------------
Welcome to Tabulator! 
----------------------------------------------------------------------
URL:
	http://tabulator.sourceforge.net
Author:
	 Stefan Schroedl <stefan.schroedl@gmx.de>
Date:
	2011/11/25 release 1.1
	2009/06/16 release 1.0

----------------------------------------------------------------------
1. What is Tabulator?
----------------------------------------------------------------------

Tabular text files (a.k.a. tab-delimited, CSV, or flat-file format)
can sometimes be a more convenient and efficient alternative to
relational databases; this is particularly the case for sequential
batch processing of large amounts of data, where index-based access
is not a priority.

Unix already provides many tools for processing such files, such as
'cut', 'paste', 'join', 'sort', etc. Tabulator is a collection of
command-line tools for Unix/Linux platforms that build on these
programs, but make them easier and more flexible to use. In
particular, they
   * allow columns to be referenced by name rather than position, as
     indicated by the first line in a file; this can make scripts
     more readable, and robust to changes in the input data format.
   * automate file format recognition (delimiter, compression, etc.).
   * check file format (e.g., consistent number of columns).
   * offer expanded functionality.
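
As a rough illustration of the first point (this is not Tabulator's
own code), the following Python sketch looks columns up by header name
using the standard csv module. The helper name 'select_columns' and
the sample data are made up for this example.

```python
import csv
import io

def select_columns(text, names, delimiter=","):
    """Return the requested columns, looked up by header name, as a list of rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [[row[name] for name in names] for row in reader]

# Two layouts of the same data: the column order differs, the names do not.
old_layout = "name,height,shoe_size\narthur,6.1,11.5\nberta,5.5,8\n"
new_layout = "shoe_size,name,height\n11.5,arthur,6.1\n8,berta,5.5\n"

# Selecting by name yields the same rows for both layouts.
print(select_columns(old_layout, ["name", "shoe_size"]))
print(select_columns(new_layout, ["name", "shoe_size"]))
```

A script that selected columns 1 and 3 by position would silently
return different data for the second layout.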

----------------------------------------------------------------------
2. Installation
----------------------------------------------------------------------

Installation is easy: just unpack the tarball and add the unpacked
directory to your PATH.

----------------------------------------------------------------------
3. License
----------------------------------------------------------------------

Tabulator is licensed under the GNU General Public License Version 3.0
(GPLv3); see the COPYING file for details.

----------------------------------------------------------------------
4. Implementation Notes
----------------------------------------------------------------------

The scripts have been developed over time to help me with various data
processing tasks, and were not designed from the outset to be released
in one package. Therefore, some scripts are implemented in Python, and
some in Perl; and there might be redundant functionality between
scripts.

----------------------------------------------------------------------
5. Documentation
----------------------------------------------------------------------

Here is a brief list of the programs together with their main
functionality. Each one provides more documentation and examples when
called with the '-m' or '-h' options. A common assumption is that
the first line in input files contains the column names.

   * tblcat: concatenate files of the same data format without header
     repetition 
 
   * tblcmd: execute a program on the body of a file (e.g., sort, uniq), 
     without affecting the header

   * tbldesc: for each column, summarize type (e.g., char, int,
     float), percentage of undefined values, min/max/mean/median/std
     etc. Can also provide correlation coefficients with a target
     column. 
     Example:
     Suppose file is
        name,house_nr,height,shoe_size
        arthur,42,6.1,11.5
        berta,101,5.5,8
        chris,333,5.9,10
        don,77,5.9,12.5

     Then 'tbldesc file' prints:
        summarizing file_desc (4 lines, target column: shoe_size)
        field name       type   char% uniq min  max  avg   std    mse   corr   prob%
        1     name       char   100   4    [arthur; berta; chris; don]
        2     house_nr   int    0     4    42   333  138   114    172   -0.287 71.25
        3     height     float  0     3    5.5  6.1  5.85  0.218  4.89  0.812  18.82
        4     shoe_size  float  0     4    8    12.5 10.5  1.7    0.0   1.0    0.00

   * tblmap: simple line-wise ("map") computation similar to awk.
     Example: 
        Compute ratio of columns 'sales' and 'clients' for lines where
        the column 'region' has value 'us': 
        tblmap -s'region=="us"' -c'sales_per_client=sales/clients' <file>

   * tblred: compute ("reduce") aggregations (e.g., sum, average) over groups
     of keys.
     Example: 
        tblred -k'region' 'sales_ratio=sales/sum(sales)' computes,
        for each line, the ratio of column 'sales' to the total sales
        of all lines with the same value of column 'region'.

   * tbljoin: join files on common columns. In contrast to Unix
     'join', the input files don't have to be pre-sorted, and
     multiple join columns can be specified.
     Example:
     Suppose file1 is
        name,street,house
        zorro,desert road,5
        john,main st,2
        arthur,pan-galactic bypass,42
        arthur,main st,15
     and file2 is
        name,street,phone
        john,main st,654-321
        arthur,main st,121-212
        john,round cir,123-456
     Then 'tbljoin file1 file2' gives
        name,street,house,phone
        arthur,main st,15,121-212
        john,main st,2,654-321

   * tblhist: computation and plotting of the histogram of column
     values

   * tblsplit: split a file into several ones based on a column value
     Example:
     Suppose file is
        continent,country
        americas,us
        americas,mx
        europe,de
        europe,fr
     Then 'tblsplit -rk'continent' file' generates two files:
        file.select.continent=americas:
        country
        us
        mx
     and
        file.select.continent=europe:
        country
        de
        fr

   * tbltex: formatting for LaTeX tables

   * tbltranspose: transposition of rows and columns

   * tbluniq: check for and cut out duplicate columns; also, discover
     value dependencies.
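
The tbljoin example above can be reproduced with a small hash join.
The following Python sketch is only an illustration of joining
unsorted inputs on several columns; it is not taken from tbljoin, and
it assumes (based on the example alone) that the join uses every
column name the two files share. The helper names
'join_on_shared_columns' and 'parse' are invented here.

```python
def join_on_shared_columns(rows1, rows2):
    """Inner-join two tables (header row first) on every column name they share."""
    header1, header2 = rows1[0], rows2[0]
    keys = [c for c in header1 if c in header2]
    rest2 = [c for c in header2 if c not in keys]
    idx1 = [header1.index(c) for c in keys]
    idx2 = [header2.index(c) for c in keys]
    rest_idx2 = [header2.index(c) for c in rest2]

    # Index the second table by key; neither input needs to be sorted.
    lookup = {}
    for row in rows2[1:]:
        lookup.setdefault(tuple(row[i] for i in idx2), []).append(row)

    joined = []
    for row in rows1[1:]:
        for match in lookup.get(tuple(row[i] for i in idx1), []):
            joined.append(row + [match[i] for i in rest_idx2])
    # Sorted only so that the result lines up with the listing above.
    return [header1 + rest2] + sorted(joined)

def parse(text):
    """Split comma-separated text into rows, with no special handling of quotes."""
    return [line.split(",") for line in text.strip().splitlines()]

file1 = parse("""
name,street,house
zorro,desert road,5
john,main st,2
arthur,pan-galactic bypass,42
arthur,main st,15
""")
file2 = parse("""
name,street,phone
john,main st,654-321
arthur,main st,121-212
john,round cir,123-456
""")

for row in join_on_shared_columns(file1, file2):
    print(",".join(row))
```

Only the rows that agree on both 'name' and 'street' survive, which is
why 'arthur,pan-galactic bypass' and 'john,round cir' are dropped.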
   
----------------------------------------------------------------------
6. Limitations
----------------------------------------------------------------------

   * There is no special interpretation of block delimiters like `'`
     or `"`; it is the user's responsibility to ensure that the column
     delimiter cannot occur within column values.

   * tbljoin, tblred, tblhist, tbluniq, tblcat, and tbltex have some
     restrictions when run as a filter (the input has to be read
     repeatedly).
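
The first limitation can be illustrated with a short Python sketch
(an illustration only, not Tabulator code; the helper name
'column_counts' and the sample lines are invented here). Splitting on
the delimiter with no special treatment of quotes turns a quoted value
that contains a comma into two columns, which shows up as an
inconsistent column count:

```python
def column_counts(lines, delimiter=","):
    """Count fields per line by plain splitting, with no special handling of quotes."""
    return [len(line.split(delimiter)) for line in lines]

lines = [
    "name,street,house",
    "john,main st,2",
    'arthur,"bypass, pan-galactic",42',  # delimiter inside a quoted value
]

counts = column_counts(lines)
print(counts)  # [3, 3, 4]

# A file is consistent only if every line has as many fields as the header.
bad_lines = [i + 1 for i, n in enumerate(counts) if n != counts[0]]
print(bad_lines)  # [3]
```

Choosing a delimiter that cannot occur in the data (for example, a tab)
avoids the problem.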