Dataframe package
The dataframe package is part of the Octave Forge project. It is a data manipulation toolbox similar to R's data.frame and is maintained by Pascal Dupuis.
Introduction
This package makes it possible to handle complex data (complex both in the sense of complex numbers and of high complexity) as if they were ordinary arrays, except that each column may have a different type. It also provides a fairly complete interface to CSV files and copes with a number of oddities, such as CSV files that start with a header spread over a few lines. The resulting object mimics an array as far as it can, so that binary operators and the usual functions work as expected.
Meta-information is also handled. Rows and columns may have a name, and this name is searchable. If for whatever reason the ordering of a CSV file changes, searching by column names will return the expected information.
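The benefit of searching by name can be sketched in plain Octave, without the package. This is only an illustration of the idea, not the way dataframe implements it; the variables names_a, names_b, data_a, data_b and the helper pick are hypothetical, and the values are taken from the first row of the test file used in the example on this page.

```octave
% Two hypothetical layouts of the same record, with columns in a different order.
names_a = {"VBIAS", "Freq", "C"};
data_a  = [-6.0, 300000, 7.0215e-13];
names_b = {"Freq", "C", "VBIAS"};
data_b  = [300000, 7.0215e-13, -6.0];

% Look a column up by its name rather than by its position.
% strcmp returns a logical mask over the cell array of names.
pick = @(names, data, key) data(:, strcmp(names, key));

c_a = pick(names_a, data_a, "C");
c_b = pick(names_b, data_b, "C");
same = isequal(c_a, c_b);   % both lookups reach the same value
```

A positional access such as data_a(:, 3) and data_b(:, 3) would return two different quantities here, which is the situation name-based search avoids.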
Example
To get a first taste, let's load the test CSV file that comes with the package:
    >> experiment = dataframe('data_test.csv')
    warning: load: '/home/padupuis/matlab/dataframe/inst/data_test.csv' found by searching load path
    warning: fopen: '/home/padupuis/matlab/dataframe/inst/data_test.csv' found by searching load path
    ans = dataframe with 10 rows and 7 columns
    Src: data_test.csv
    Comment: #notice there is a extra separator
    Comment: # a comment line and an empty one
    Comment: # the next lines use \r\n \r and \f as linefeed
    Comment: # one empty input field
    _1  DataName   VBIAS    Freq    x_IBIAS_    C           GOUT        OK_
    Nr  char       double   double  double      double      double      char
     1  DataValue  -6.0000  300000  1.6272e-11  7.0215e-13  1.6044e-07  A
     2  DataValue  -5.8000  300000  1.5990e-11  6.9607e-13  1.5728e-07  E
     3  DataValue  -5.6000  300000  1.3790e-11  6.9048e-13  1.5489e-07  !
     4  DataValue  -5.4000  300000  1.4420e-11  6.8517e-13  1.5478e-07  ?
     5  DataValue  -5.2000  300000  1.2930e-11  6.7965e-13  1.5189e-07  C
     6  DataValue  -5.0000  300000  1.2610e-11  6.7444e-13  1.4931e-07  B
     7  DataValue  -4.8000  300000  1.4390e-11  6.7011e-13  1.4876e-07  A
     8  DataValue  -4.6000  300000  1.0890e-11  6.6416e-13  1.4890e-07  3
     9  DataValue  -4.4000  300000  NA          6.5859e-13  1.4558e-07  C
    10  DataValue  -4.2000  300000  1.0610e-11  6.5355e-13  1.4431e-07  B
These data were produced by performing a voltage sweep on a sensor while an impedance bridge measured the parallel capacitance and conductance at a given frequency.
The first lines contain some meta-information: the name of the source file and a few comments found in the CSV file. Their purpose is to annotate the results.
Then comes the content. Each column starts with a name, followed by a type. The content lines come next, each prefixed with an index. The columns hold first the control values (polarization voltage, applied frequency), then the measured values: DC current, capacitance, conductance. The last column is categorical: the user entered a code indicating whether the result makes sense or not.
Let us now select the control values:
    >> cv = experiment(1:3, ["Vbias"; "Freq"])
    cv = dataframe with 3 rows and 1 columns
    Src: data_test.csv
    Comment: #notice there is a extra separator
    Comment: # a comment line and an empty one
    Comment: # the next lines use \r\n \r and \f as linefeed
    Comment: # one empty input field
    _1  Freq
    Nr  double
     1  300000
     2  300000
     3  300000
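The rows were selected by range and the columns by name. The search criterion here is a string array, and all columns whose name matches are returned. Note that only Freq appears in the result: "Vbias" did not match the column named VBIAS, which is spelled in upper case.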
The result is returned as a dataframe. This can be changed:
    >> experiment.array(6, "OK_")
    ans = B
    >> class(ans)
    ans = char
When selecting vectors, this conversion to an array is automatic. The DC current is contained in elements 31 to 40 (the fourth column):
    >> experiment(31:40)
    ans =

     Columns 1 through 9:

       1.6272e-11   1.5990e-11   1.3790e-11   1.4420e-11   1.2930e-11   1.2610e-11   1.4390e-11   1.0890e-11   NA

     Column 10:

       1.0610e-11
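The range 31:40 follows from Octave's column-major linear indexing: in an array with 10 rows, column k occupies elements 10*(k-1)+1 through 10*k. The plain-Octave sketch below checks this arithmetic with the core function sub2ind for an ordinary 10-by-7 array. It assumes that the dataframe follows the same convention, as the output above suggests; the variable names are illustrative.

```octave
nrows = 10;  ncols = 7;   % size of the example dataframe
col   = 4;                % x_IBIAS_ is the fourth column

% Linear indices of all rows of column 4 in a column-major 10x7 array.
idx = sub2ind([nrows, ncols], 1:nrows, col * ones(1, nrows));

first = idx(1);     % first element of the fourth column
last  = idx(end);   % last element of the fourth column
```

Here first and last evaluate to 31 and 40, the bounds used in the selection above.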
Note that the access 'experiment("x_IBIAS")' is illegal because it is ambiguous: the name could refer to either a row or a column.
Access through the pseudo-structure (dot) syntax, as in experiment.array above, is valid in the following cases:
- choosing the output format
  - array, cell, dataframe (may be abbreviated as 'df')
- attribute selection
  - rownames, colnames, rowcnt, colcnt, rowidx, types, source, header, comment
- constructor call
  - new (no other dereferencing may occur)
- column selection
  - just provide one valid column name
For similarity with the R implementation, constructs such as x.as.array are also allowed.
A simple example:
    >> truc={"Id", "Name", "Type";1, "onestring", "bla"; 2, "somestring", "foobar";}
    truc =
    {
      [1,1] = Id
      [2,1] = 1
      [3,1] = 2
      [1,2] = Name
      [2,2] = onestring
      [3,2] = somestring
      [1,3] = Type
      [2,3] = bla
      [3,3] = foobar
    }
    >> tt=dataframe(truc)
    tt = dataframe with 2 rows and 3 columns
    _1  Id      Name        Type
    Nr  double  char        char
     1  1       onestring   bla
     2  2       somestring  foobar
The first row of the cell array is expected to contain the column names; the remaining rows hold the column contents. The type of each column is inferred automatically from the cell contents. Now let us select one column by its name:
    >> tt(:, 'Name')
    ans = dataframe with 2 rows and 1 columns
    _1  Name
    Nr  char
     1  onestring
     2  somestring
In this case, a sub-dataframe is returned. Struct-like indexing is also implemented:
    >> tt.Id
    ans =

       1   2

When the output is a vector that can be reduced to a simpler type, it is.
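The header-row convention used to build tt can be mimicked in plain Octave. The sketch below is a rough illustration, assuming that the class of the first content cell decides the column type; the inference performed by the package itself may well be more elaborate, and the variables names, body and types are illustrative.

```octave
truc = {"Id", "Name", "Type"; 1, "onestring", "bla"; 2, "somestring", "foobar"};

names = truc(1, :);       % first row: column names
body  = truc(2:end, :);   % remaining rows: column contents

% Class of the first content cell of each column, as a cell array of strings.
types = cellfun(@class, body(1, :), "UniformOutput", false);
```

For this input, types holds double, char and char, which agrees with the type row printed for tt.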