Dataframe package

Revision as of 08:52, 1 May 2015 by 88.177.161.164 (talk) (Example based on the 'data_test.csv')

Dataframe: a data manipulation toolbox similar to R's data.frame.

At a mature development stage. hg

  • Maintainer: Pascal Dupuis
  • Contributors:

This package makes it possible to handle complex data (both in the sense of complex numbers and of high complexity) as if they were ordinary arrays, except that each column may have a different type. It also provides a fairly complete interface to CSV files and copes with a number of oddities, for instance CSV files that start with a header spread over several lines. The resulting object mimics an array as far as it can, so that binary operators and the usual functions work as expected.

Meta-information is also handled. Rows and columns may have a name, and this name is searchable. If for whatever reason the ordering of a CSV file changes, searching by column names will return the expected information.
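The effect of searching by name can be sketched as follows. This is a minimal illustration rather than part of the package documentation: it assumes the package is installed and loadable with pkg load, it builds the dataframes from cell arrays (the constructor form used further down this page), and the column names V and F and their values are made up for the example.

```octave
pkg load dataframe   % assumes the dataframe package is installed

% Two hypothetical tables holding the same data; only the order of
% the V and F columns differs, as if the CSV layout had changed.
a = dataframe({"Id", "V", "F"; 1, -6.0, 300000; 2, -5.8, 300000});
b = dataframe({"Id", "F", "V"; 1, 300000, -6.0; 2, 300000, -5.8});

% Selecting by name does not depend on the column position.
va = a.V;
vb = b.V;
same = isequal(va, vb)
```

Code that refers to a column by position, such as the third column, would pick up different data from a and b, whereas the name-based selection is unaffected.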

To get a first taste, let's load the test CSV file that comes with the package:

 >> dataframe('data_test.csv')
 warning: load: '/home/padupuis/matlab/dataframe/inst/data_test.csv' found by searching load path
 warning: fopen: '/home/padupuis/matlab/dataframe/inst/data_test.csv' found by searching load path
 ans = dataframe with 10 rows and 7 columns
 Src: data_test.csv
 Comment: #notice there is a extra separator
 Comment: # a comment line and an empty one
 Comment: # the next lines use \r\n \r and \f as linefeed
 Comment: # one empty input field
 _1  DataName   VBIAS   Freq   x_IBIAS_          C       GOUT  OK_
 Nr      char  double double     double     double     double char
  1 DataValue -6.0000 300000 1.6272e-11 7.0215e-13 1.6044e-07    A
  2 DataValue -5.8000 300000 1.5990e-11 6.9607e-13 1.5728e-07    E
  3 DataValue -5.6000 300000 1.3790e-11 6.9048e-13 1.5489e-07    !
  4 DataValue -5.4000 300000 1.4420e-11 6.8517e-13 1.5478e-07    ?
  5 DataValue -5.2000 300000 1.2930e-11 6.7965e-13 1.5189e-07    C
  6 DataValue -5.0000 300000 1.2610e-11 6.7444e-13 1.4931e-07    B
  7 DataValue -4.8000 300000 1.4390e-11 6.7011e-13 1.4876e-07    A
  8 DataValue -4.6000 300000 1.0890e-11 6.6416e-13 1.4890e-07    3
  9 DataValue -4.4000 300000         NA 6.5859e-13 1.4558e-07    C
 10 DataValue -4.2000 300000 1.0610e-11 6.5355e-13 1.4431e-07    B

These data were produced by a voltage sweep on a sensor, with an impedance bridge measuring the parallel capacitance and conductance at a given frequency.

The first lines contain some meta-information: the name of the source file and a few comments found in the CSV file. Their purpose is to annotate the results.

Then comes the content. Each column starts with a name, followed by a type. Below these are the content lines, each starting with an index. The columns hold the control values first (polarization voltage, applied frequency), then the measured values: DC current, capacitance, conductance. The last column is categorical: the user entered a code indicating whether the result makes sense or not.


A simple example:

>> truc={"Id", "Name", "Type";1, "onestring", "bla"; 2, "somestring", "foobar";}
truc =
{
  [1,1] = Id
  [2,1] =  1
  [3,1] =  2
  [1,2] = Name
  [2,2] = onestring
  [3,2] = somestring
  [1,3] = Type
  [2,3] = bla
  [3,3] = foobar
}
>> tt=dataframe(truc)
tt = dataframe with 2 rows and 3 columns
_1     Id       Name   Type
Nr double       char   char
 1      1  onestring    bla
 2      2 somestring foobar

The first row of the cell array is expected to contain the column names; the remaining rows hold the column contents. The type of each column is inferred automatically from the cell contents. Now let us select one column by its name:

>> tt(:, 'Name')
ans = dataframe with 2 rows and 1 columns
_1       Name
Nr       char
 1  onestring
 2 somestring

In this case, a sub-dataframe is returned. Struct-like indexing is also implemented:

>> tt.Id
ans =
  1
  2

When the output is a vector that can be simplified to a plain array, it is: here tt.Id comes back as an ordinary numeric column vector rather than as a dataframe.
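The difference between the two selection styles can be checked directly. The following is a small sketch under the assumption that the package is installed and loadable with pkg load; it reuses the tt example from above, and the variable names sub, ids and total are chosen for illustration only.

```octave
pkg load dataframe   % assumes the dataframe package is installed

tt = dataframe({"Id", "Name", "Type"; 1, "onestring", "bla"; 2, "somestring", "foobar"});

sub = tt(:, 'Name');   % selection with (): a sub-dataframe, as shown above
ids = tt.Id;           % struct-like selection: simplified to a plain vector

class(sub)             % displays the class of the sub-dataframe
class(ids)             % displays the class of the simplified vector

total = sum(ids)       % ordinary functions apply to the simplified vector
```

Because ids is an ordinary array, it can be passed to any function that expects numeric input without further conversion.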