1) Context
I was recently performing I-V measurements of a MOS
(Metal-Oxide-Semiconductor) structure. A full set of measurements
contained a DC biasing voltage, an AC frequency, a small-signal
capacitance and a conductance. I had to change the measurement device
configuration a few times, so the sweep sometimes occurred first on
frequency, then on voltage, and sometimes in the reverse order. In
short, I had to deal with many input files with inconsistent column
order. The code to identify this order quickly became clumsy.
The idea of a dataframe is to implement a mix between a matrix and a
cell array. It's like a matrix, where each column contains elements of
the same type. Unlike a matrix, column types may differ. Also,
each column MUST have a name, and rows MAY have a name. Moreover, to
make it easy to interface with databases, each row must have a unique
identifier. The goal is to make it possible to use constructs like
y(:, ["Fr*"; "VB*"; "C";"G"])
where y is the dataframe and column selection is based on regular
expressions. This way, the translation between names and indexes uses
the full power of regexps.
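As a rough illustration of this name-to-index translation, here is a
small sketch. It is written in Python only to show the logic and is not
the package's Octave code. The function name select_columns is
hypothetical, and the rule that a pattern without regexp metacharacters
must match a name exactly is an assumption.

```python
import re

def select_columns(names, patterns):
    """Translate name patterns into 0-based column indexes, in pattern order."""
    selected = []
    for pat in patterns:
        # Assumption: a pattern without regexp metacharacters is an exact name.
        literal = re.escape(pat) == pat
        for idx, name in enumerate(names):
            hit = name == pat if literal else re.match(pat, name) is not None
            if hit and idx not in selected:
                selected.append(idx)
    return selected

# The column order differs from file to file; the patterns do not care.
print(select_columns(["VBias", "G", "Freq", "C"], ["Fr*", "VB*", "C", "G"]))
# → [2, 0, 3, 1]
```

Whatever the column order of the input file, the same four patterns
return the columns in the order frequency, bias, C, G.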
2) Implementation
A dataframe is a class containing the following members:
_cnt = [0 0] : row count, column count, ... nth dimension count
_name = cell(1, 2) : row names, column names, ...
_ridx = [] : a unique Id for each row
_data = cell(0, 0) : a container for each column
_type = cell(0, 0) : the type of each column
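To show how these five members relate to each other, here is a minimal
sketch in Python (not the package's Octave code). The class name
FrameSketch, the method append_column and the default row ids 1..n are
assumptions made for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class FrameSketch:
    cnt: list = field(default_factory=lambda: [0, 0])     # row count, column count
    name: list = field(default_factory=lambda: [[], []])  # row names, column names
    ridx: list = field(default_factory=list)              # one unique id per row
    data: list = field(default_factory=list)              # one container per column
    type_: list = field(default_factory=list)             # one type name per column

    def append_column(self, colname, values, typename):
        """Right-append a column, keeping the five members consistent."""
        if self.cnt[1] == 0:
            # The first column fixes the row count; row ids 1..n are an assumption.
            self.cnt[0] = len(values)
            self.ridx = list(range(1, len(values) + 1))
        elif len(values) != self.cnt[0]:
            raise ValueError("column length must equal the row count")
        self.data.append(list(values))
        self.type_.append(typename)
        self.name[1].append(colname)
        self.cnt[1] += 1

df = FrameSketch()
df.append_column("Freq", [1e3, 1e4], "double")
df.append_column("OK", ["A", "?"], "char")
print(df.cnt, df.name[1], df.type_)
```

Each column keeps its own container and type, which is what allows
columns of different types to sit side by side.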
The constructor can be used with
- no argument: convert the whole workspace to a dataframe (TBD)
- one null argument: return an empty dataframe
- one numeric or cell argument: transform it to a dataframe; tries to
infer column names from the name of the input argument.
- one char array with more than one line: use it as row names
- one single-line char array: take it as the name of a file to read
data from. The expected format is CSV. The reader tries to be careful
with quoted/unquoted strings and also tries to remove leading and
trailing spaces from string entries. It does not try to cope with
things such as a separator INSIDE quoted strings.
- supplemental arguments may occur either as (string, value) pairs
or as vectors. In the first case, the string names an optional
parameter whose value is given by the next argument. In the
second case, the argument is right-appended to the dataframe. Valid
optional parameters are
  - rownames: a character array with the row names
  - unquot: a logical indicating whether strings must be unquoted, default=true
  - seeked: a string which must occur in the first row to start
  considering values. Previous lines are skipped.
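The file-reading rules above can be sketched as follows. This is a
Python illustration, not the constructor's actual code. The function
name read_simple_csv is hypothetical, and reading 'seeked' as "skip
every line until one contains this string, then treat that line as the
first row" is an assumption.

```python
def read_simple_csv(text, sep=",", unquot=True, seeked=None):
    """Split CSV text into rows of trimmed string fields."""
    rows = []
    started = seeked is None
    for line in text.splitlines():
        if not started:
            # Assumption: skip lines until one contains the 'seeked' string;
            # that line becomes the first row.
            if seeked not in line:
                continue
            started = True
        if not line.strip():
            continue  # ignore blank lines
        fields = []
        for raw in line.split(sep):  # naive split: no separator inside quotes
            item = raw.strip()
            quoted = len(item) >= 2 and item[0] == item[-1] and item[0] in "\"'"
            if unquot and quoted:
                item = item[1:-1].strip()
            fields.append(item)
        rows.append(fields)
    return rows

sample = 'instrument log\n"Freq", "VBias" , C\n 1000, 0.5, 1e-12\n'
print(read_simple_csv(sample, seeked="Freq"))
# → [['Freq', 'VBias', 'C'], ['1000', '0.5', '1e-12']]
```

As in the description, a separator inside a quoted string would be
split in the wrong place; the sketch makes no attempt to handle it.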
3) Access (reading)
- like a single matrix: df(:, 3); df(3, :). If all the results are of
the same type, returns a matrix, otherwise a dataframe. This behavior
can be inhibited by having the last argument set to 'dataframe':
df(3, 3, 'dataframe') will return a one-by-one dataframe
- by column names:
df(:, ["Fr*"; "VB*"; "C";])
will try to match a column name beginning with "F" followed by an
optional 'r', thus 'F', 'Fréquence' and 'Freqs'; then a column name
starting with "V" with an optional "B", e.g. "VBias"; then a
column name which is the exact string 'C'.
- by rownames: same principle
- either member selector may also be logical:
df(df.OK=='A', ['C';'G'])
- as a struct: either use one of the column names (df.C), or use
one of the allowed accessors for internal fields: "rownames",
"colnames", "rowcnt", "colcnt", "rowidx", "types". Direct access to
the members, like y._type, is allowed, but should be restricted to
class members and friends. "types" accepts both numeric and string
arguments, the latter being converted to column order based upon
column names.
- as a cell: TODO: define how to fill the cell array with all the
fields.
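The rule "same type gives a matrix, otherwise a dataframe", combined
with a logical row selector, can be sketched like this. It is a Python
illustration under assumptions: the function name select, the dict
layout and the force_frame flag (standing in for the trailing
'dataframe' argument) are all hypothetical.

```python
def select(data, types, rows, cols, force_frame=False):
    """data: column name -> list of values; types: column name -> type name.
    rows: list of booleans (logical selector); cols: list of column names."""
    picked = {c: [v for v, keep in zip(data[c], rows) if keep] for c in cols}
    same_type = len({types[c] for c in cols}) == 1
    if same_type and not force_frame:
        # Homogeneous selection: collapse to a plain matrix (list of rows).
        return [list(r) for r in zip(*picked.values())]
    return picked  # mixed types, or frame requested: keep named columns

data = {"C": [1.0, 2.0, 3.0], "G": [0.1, 0.2, 0.3], "OK": ["A", "?", "A"]}
types = {"C": "double", "G": "double", "OK": "char"}
mask = [flag == "A" for flag in data["OK"]]  # plays the role of df.OK=='A'

print(select(data, types, mask, ["C", "G"]))
# → [[1.0, 0.1], [3.0, 0.3]]
print(select(data, types, mask, ["C", "OK"]))
# → {'C': [1.0, 3.0], 'OK': ['A', 'A']}
```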
4) Modifying
- as a matrix, using '()': use the same syntax as reading:
df(3, 'Fr*') = 200
df(df.OK=='?', ['C'; 'G']) = NaN;
Note that removing elements may only occur on a full row or column
basis. Removing a single element is not allowed.
- as a struct: either access a column name, as
df.C = [];
or access the internal fields through the entry points 'rownames'
and 'colnames', where care is taken to adapt the string widths in
order to make them compatible. The entry point "types", with
numeric or string arguments, casts whole column(s)
to a new type:
df.types{[3 5]} = 'uint16'
df.types{"Freq"} = "uint32"
- as a cell: TBD
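The restriction on removal (full rows or full columns only) can be
sketched as below. This is a Python illustration, not the class's
assignment code; the function name remove and the convention that None
stands for ':' are assumptions.

```python
def remove(table, rows=None, cols=None):
    """table: list of rows. rows/cols: sets of 0-based indexes; None means ':'.
    Exactly one of rows/cols must be given, so only whole rows or whole
    columns can disappear."""
    if (rows is None) == (cols is None):
        raise ValueError("removal must target full rows or full columns")
    if cols is None:
        # Like df(rows, :) = []
        return [r for i, r in enumerate(table) if i not in rows]
    # Like df(:, cols) = []
    return [[v for j, v in enumerate(r) if j not in cols] for r in table]

table = [[1, 2, 3], [4, 5, 6]]
print(remove(table, rows={0}))  # → [[4, 5, 6]]
print(remove(table, cols={1}))  # → [[1, 3], [4, 6]]
```

Asking for a row subset and a column subset at once would leave a hole
in the table, which is why the sketch rejects it.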
5) Other overloaded functions: display, size, numel, cat. The latter
has to be thoroughly tested. In particular, I've put the
restriction that horizontal cat requires that the row indexes are the
same for both operands. For vertical cat, how should we proceed? Require
uniqueness of row indexes, and sorting? Other?
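The horizontal cat restriction can be sketched as follows. This is a
Python illustration only; the function name hcat and the dict layout
('ridx' for the row indexes, 'cols' for a list of (name, values) pairs)
are assumptions, and vertical cat is left out since its rules are still
open.

```python
def hcat(left, right):
    """Concatenate two frames side by side, refusing mismatched row indexes."""
    if left["ridx"] != right["ridx"]:
        raise ValueError("horizontal cat requires identical row indexes")
    return {"ridx": list(left["ridx"]), "cols": left["cols"] + right["cols"]}

a = {"ridx": [1, 2], "cols": [("C", [1.0, 2.0])]}
b = {"ridx": [1, 2], "cols": [("G", [0.1, 0.2])]}
print([name for name, _ in hcat(a, b)["cols"]])  # → ['C', 'G']
```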
6) To be done:
- the 'load' function is in fact contained inside the constructor;
maybe we should have a specific load function?
- be able to load a dataframe from a URI specification
- write a simple 'save' function
- adding data to a dataframe: R doesn't seem to allow adding rows
to a data.frame, should we follow it?
- add test cases
- implement a 'factor' class for categorised data
- make all functions below statistics/ dataframe-compatible
Pascal Dupuis
Louvain-la-Neuve, July First, 2010.