Dataframe package

The dataframe package is part of the Octave Forge project. It is a data manipulation toolbox similar to R's data.frame and is maintained by Pascal Dupuis.

Introduction

This package makes it possible to handle complex data (complex both in the sense of complex numbers and of high complexity) as if they were ordinary arrays, except that each column may have a different type. It also provides a fairly complete interface to CSV files and copes with a number of oddities, such as CSV files that start with a header spread over several lines. The resulting object mimics an ordinary array as far as it can, so that binary operators and the usual functions work as expected.

Meta-information is also handled. Rows and columns may have names, and these names are searchable. If for whatever reason the column order of a CSV file changes, searching by column name still returns the expected information.
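The benefit of name-based lookup can be sketched with two small dataframes that hold the same data in a different column order. This is a minimal illustration, assuming the dataframe package is installed; it relies only on the cell-array constructor and the struct-like column access described later on this page, and the variable names are hypothetical.

```octave
pkg load dataframe   % assumption: the package is installed

% Same content, columns in a different order (first row = column names)
a = dataframe ({"Id", "Name"; 1, "onestring"; 2, "somestring"});
b = dataframe ({"Name", "Id"; "onestring", 1; "somestring", 2});

% Selecting by name does not depend on the column position
ids_a = a.Id;
ids_b = b.Id;

% Expected to be true if both lookups return the same values
same_ids = isequal (ids_a(:), ids_b(:))
```

Code written against column names in this way should keep working when the column order of the input changes.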

Example

To get a first taste, let's load the test CSV file that comes with the package:

 >> experiment = dataframe('data_test.csv')
 warning: load: '/home/padupuis/matlab/dataframe/inst/data_test.csv' found by searching load path
 warning: fopen: '/home/padupuis/matlab/dataframe/inst/data_test.csv' found by searching load path
 ans = dataframe with 10 rows and 7 columns
 Src: data_test.csv
 Comment: #notice there is a extra separator
 Comment: # a comment line and an empty one
 Comment: # the next lines use \r\n \r and \f as linefeed
 Comment: # one empty input field
 _1  DataName   VBIAS   Freq   x_IBIAS_          C       GOUT  OK_
 Nr      char  double double     double     double     double char
  1 DataValue -6.0000 300000 1.6272e-11 7.0215e-13 1.6044e-07    A
  2 DataValue -5.8000 300000 1.5990e-11 6.9607e-13 1.5728e-07    E
  3 DataValue -5.6000 300000 1.3790e-11 6.9048e-13 1.5489e-07    !
  4 DataValue -5.4000 300000 1.4420e-11 6.8517e-13 1.5478e-07    ?
  5 DataValue -5.2000 300000 1.2930e-11 6.7965e-13 1.5189e-07    C
  6 DataValue -5.0000 300000 1.2610e-11 6.7444e-13 1.4931e-07    B
  7 DataValue -4.8000 300000 1.4390e-11 6.7011e-13 1.4876e-07    A
  8 DataValue -4.6000 300000 1.0890e-11 6.6416e-13 1.4890e-07    3
  9 DataValue -4.4000 300000         NA 6.5859e-13 1.4558e-07    C
 10 DataValue -4.2000 300000 1.0610e-11 6.5355e-13 1.4431e-07    B

These data were produced while performing a voltage sweep on a sensor, measuring the parallel capacitance and conductance with an impedance bridge at a given frequency.

The first lines contain some meta-information: the name of the source file and a few comments found in the CSV file. Their purpose is to annotate the results.

Then comes the content. Each column starts with a name, followed by a type. Below are the content lines, each with an index. The columns hold first the control values (polarization voltage, applied frequency), then the measured values (DC current, capacitance, conductance). The last column is categorical: the user entered a code indicating whether the result makes sense or not.

Let us now select the control values:

 cv = experiment(1:3, ["Vbias"; "Freq"])
 cv = dataframe with 3 rows and 1 columns
 Src: data_test.csv
 Comment: #notice there is a extra separator
 Comment: # a comment line and an empty one
 Comment: # the next lines use \r\n \r and \f as linefeed
 Comment: # one empty input field
 _1   Freq
 Nr double
  1 300000
  2 300000
  3 300000

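The selection is made by range for the rows and by name for the columns. The search criterion here is a string array, and all columns whose names match are returned. In the output above only Freq is returned: "Vbias" as typed does not match the column name VBIAS, which suggests that the match is case-sensitive.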

By default the result is returned as a dataframe. The output format can be changed:

>> experiment.array(6, "OK_")
ans = B
>> class(ans)
ans = char

When selecting vectors, this conversion to an array is automatic. The DC current is contained in elements 31 to 40 (fourth column):

 >> experiment(31:40)
ans =
Columns 1 through 9:
  1.6272e-11   1.5990e-11   1.3790e-11   1.4420e-11   1.2930e-11   1.2610e-11   1.4390e-11   1.0890e-11           NA
Column 10:
  1.0610e-11
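The range 31 to 40 follows from column-major linear indexing, which the example above indicates the dataframe shares with ordinary Octave arrays. The sketch below reproduces the arithmetic for an array of the same dimensions (10 rows, 7 columns); it uses only core Octave functions, and the variable names are illustrative.

```octave
% Column-major linear indexing for a 10x7 array, the size of the
% example dataframe.
nrows = 10;
ncols = 7;
col   = 4;                          % the DC current column

first_idx = (col - 1) * nrows + 1;  % linear index of the first element of column 4
last_idx  = col * nrows;            % linear index of the last element of column 4

% Cross-check against the built-in sub2ind
assert (first_idx == sub2ind ([nrows, ncols], 1, col));
assert (last_idx  == sub2ind ([nrows, ncols], nrows, col));

printf ("column %d spans elements %d to %d\n", col, first_idx, last_idx);
% prints: column 4 spans elements 31 to 40
```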

Note that the access 'experiment("x_IBIAS")' is illegal because it is ambiguous: does the name refer to a row or to a column?

Access in this pseudo-structure way is valid in the following cases:

choosing the output format: array, cell, dataframe (may be abbreviated as 'df')
attribute selection: rownames, colnames, rowcnt, colcnt, rowidx, types, source, header, comment
constructor call: new (no other dereferencing may occur)
column selection: just provide one valid column name

For similarity with the R implementation, constructs such as x.as.array are also allowed.
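The cases listed above might be exercised as follows. This is a minimal sketch, assuming the dataframe package is installed; the small dataframe tt and the variable names are illustrative, and the results are not printed here because their exact form may depend on the package version.

```octave
pkg load dataframe   % assumption: the package is installed

% Small illustrative dataframe; the first row holds the column names
tt = dataframe ({"Id", "Name"; 1, "onestring"; 2, "somestring"});

% Choosing the output format
ids_array = tt.array(:, "Id");   % plain array instead of a dataframe
ids_df    = tt.df(:, "Id");      % 'df' abbreviates 'dataframe'

% Attribute selection
names = tt.colnames;             % column names
kinds = tt.types;                % column types
nr    = tt.rowcnt;               % number of rows
nc    = tt.colcnt;               % number of columns

% Column selection: a single valid column name
name_col = tt.Name;
```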

A simple example:

truc={"Id", "Name", "Type";1, "onestring", "bla"; 2, "somestring", "foobar";}
truc =
{
  [1,1] = Id
  [2,1] =  1
  [3,1] =  2
  [1,2] = Name
  [2,2] = onestring
  [3,2] = somestring
  [1,3] = Type
  [2,3] = bla
  [3,3] = foobar
}
>> tt=dataframe(truc)
tt = dataframe with 2 rows and 3 columns
_1     Id       Name   Type
Nr double       char   char
 1      1  onestring    bla
 2      2 somestring foobar

The first row of the cell array is taken as the column names; the remaining rows are the column content. The type of each column is inferred automatically from the cell content. Now let us select one column by its name:

>> tt(:, 'Name')
ans = dataframe with 2 rows and 1 columns
_1       Name
Nr       char
1  onestring
2 somestring

In this case, a sub-dataframe is returned. Struct-like indexing is also implemented:

>> tt.Id
ans =
  1
  2

When the output is a vector that can be reduced to a simpler type, it is simplified automatically.