Dataframe package
The {{Forge|dataframe}} package is part of the [[Octave Forge]] project. It is a data manipulation toolbox similar to R's data.frame and is maintained by Pascal Dupuis.

== Introduction ==

This package lets you handle complex data (both in the sense of complex numbers and of high complexity) as if they were ordinary arrays, except that each column may have a different type. It also provides a fairly complete interface to CSV files and copes with a number of oddities, such as CSV files that start with a header spread over a few lines. The resulting object mimics an array as closely as it can, so that binary operators and the usual functions work as expected.

Meta-information is also handled. Rows and columns may have names, and these names are searchable. If the column ordering of a CSV file changes for whatever reason, searching by column name still returns the expected information.
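As a small illustration of name-based lookup, the sketch below builds two dataframes that hold the same data with the columns in a different order, then reads one column by name from each. It assumes the package is installed and can be loaded with pkg load dataframe. The variable names and sample values (a, b, Score) are made up for this illustration, and the code relies only on the cell-array constructor and the struct-like column access shown in the Example section.

```octave
pkg load dataframe   % assumption: the dataframe package is installed

% Same data, columns stored in a different order (hypothetical sample values).
a = dataframe({"Id", "Score"; 1, 10; 2, 20});
b = dataframe({"Score", "Id"; 10, 1; 20, 2});

% Struct-like access selects the column by name, not by position.
score_a = a.Score;
score_b = b.Score;

% Both lookups should yield the same values despite the different ordering.
same = isequal(score_a, score_b)
```

Code that addresses columns by name in this way does not need to change when the column order of the input changes.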
== Example ==

To get a first taste, let's load the test CSV file that ships with the package:
 >> experiment = dataframe('data_test.csv')
 warning: load: '/home/padupuis/matlab/dataframe/inst/data_test.csv' found by searching load path
 warning: fopen: '/home/padupuis/matlab/dataframe/inst/data_test.csv' found by searching load path
 ans = dataframe with 10 rows and 7 columns
 Src: data_test.csv
 Comment: #notice there is a extra separator
 Comment: # a comment line and an empty one
 Comment: # the next lines use \r\n \r and \f as linefeed
 Comment: # one empty input field
 _1  DataName   VBIAS   Freq   x_IBIAS_          C       GOUT  OK_
 Nr      char  double double     double     double     double char
  1 DataValue -6.0000 300000 1.6272e-11 7.0215e-13 1.6044e-07    A
  2 DataValue -5.8000 300000 1.5990e-11 6.9607e-13 1.5728e-07    E
  3 DataValue -5.6000 300000 1.3790e-11 6.9048e-13 1.5489e-07    !
  4 DataValue -5.4000 300000 1.4420e-11 6.8517e-13 1.5478e-07    ?
  5 DataValue -5.2000 300000 1.2930e-11 6.7965e-13 1.5189e-07    C
  6 DataValue -5.0000 300000 1.2610e-11 6.7444e-13 1.4931e-07    B
  7 DataValue -4.8000 300000 1.4390e-11 6.7011e-13 1.4876e-07    A
  8 DataValue -4.6000 300000 1.0890e-11 6.6416e-13 1.4890e-07    3
  9 DataValue -4.4000 300000         NA 6.5859e-13 1.4558e-07    C
 10 DataValue -4.2000 300000 1.0610e-11 6.5355e-13 1.4431e-07    B
These data were produced while performing a voltage sweep on a sensor, using an impedance bridge to measure the parallel capacitance and conductance at a given frequency.

The first lines contain some meta-information: the name of the source file and a few comments found in the CSV file. Their purpose is to annotate the results.

Then comes the content. Each column starts with a name, then a type. The content lines follow, each with an index: first the control values (polarization voltage, applied frequency), then the measured values (DC current, capacitance, conductance). The last column is categorical: the user entered a code indicating whether the result makes sense or not.

Let us now select the control values:
 >> cv = experiment(1:3, ["Vbias"; "Freq"])
 cv = dataframe with 3 rows and 1 columns
 Src: data_test.csv
 Comment: #notice there is a extra separator
 Comment: # a comment line and an empty one
 Comment: # the next lines use \r\n \r and \f as linefeed
 Comment: # one empty input field
 _1   Freq
 Nr double
  1 300000
  2 300000
  3 300000
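The selection is made by a range for the rows and by name for the columns. The search criterion here is a string array; all columns whose names match are returned. Note that only Freq appears in the result: "Vbias" does not match the column name VBIAS, which suggests that the name search is case-sensitive.

The result is returned as a dataframe. This can be changed: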
 >> experiment.array(6, "OK_")
 ans = B
 >> class(ans)
 ans = char
When selecting vectors, this conversion to an array is automatic. The DC current is contained in elements 31 to 40 (fourth column):
 >> experiment(31:40)
 ans =
  Columns 1 through 9:
    1.6272e-11 1.5990e-11 1.3790e-11 1.4420e-11 1.2930e-11 1.2610e-11 1.4390e-11 1.0890e-11 NA
  Column 10:
    1.0610e-11
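The index range follows from column-major storage: with 10 rows, column k occupies the linear indices (k-1)*10+1 through k*10. The sketch below checks this arithmetic on a plain numeric matrix standing in for the dataframe content. The matrix A and the variable names are illustrative only, and the code does not use the dataframe package.

```octave
n_rows = 10;   % number of rows in the example dataframe
col    = 4;    % x_IBIAS_ is the fourth column

first_idx = (col - 1) * n_rows + 1;   % 31
last_idx  = col * n_rows;             % 40

% Plain 10x7 matrix used as a stand-in for the dataframe content.
A = reshape(1:70, n_rows, 7);

% Linear indexing over that range returns the same values as column 4.
matches = isequal(A(first_idx:last_idx)(:), A(:, col))
```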
Note that the access 'experiment("x_IBIAS")' is illegal because it is ambiguous: it could refer to either row names or column names.

;Access in this pseudo-structure way is valid in the following cases:
;choosing the output format: array, cell, dataframe (may be abbreviated as 'df')
;attribute selection: rownames, colnames, rowcnt, colcnt, rowidx, types, source, header, comment
;constructor call: new (no other dereferencing may occur)
;column selection: just provide one valid column name

To be similar to the R implementation, constructs such as x.as.array are also allowed.
A simple example:
 >> truc={"Id", "Name", "Type";1, "onestring", "bla"; 2, "somestring", "foobar";}
 truc =
 {
   [1,1] = Id
   [2,1] = 1
   [3,1] = 2
   [1,2] = Name
   [2,2] = onestring
   [3,2] = somestring
   [1,3] = Type
   [2,3] = bla
   [3,3] = foobar
 }
 >> tt=dataframe(truc)
 tt = dataframe with 2 rows and 3 columns
 _1     Id       Name   Type
 Nr double       char   char
  1      1  onestring    bla
  2      2 somestring foobar
The first row of the cell array is intended to contain the column names; the remaining rows hold the column content. The type of each column is inferred automatically from the cell content. Now let us select one column by its name:
 >> tt(:, 'Name')
 ans = dataframe with 2 rows and 1 columns
 _1       Name
 Nr       char
  1  onestring
  2 somestring
In this case, a sub-dataframe is returned. Struct-like indexing is also implemented:
 >> tt.Id
 ans =
    1
    2
When the output is a vector that can be simplified to a plain array, it is.
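The pseudo-structure cases listed earlier can be tried on this small dataframe. The following is only a sketch: it assumes the package is installed and loaded with pkg load dataframe, the variable names (ids, names, n_rows, n_cols, name_2) are chosen for illustration, and it uses only the accessor names given in the list above. No output is shown for the attribute accessors, since their exact return types are not described on this page.

```octave
pkg load dataframe   % assumption: the dataframe package is installed

truc = {"Id", "Name", "Type"; 1, "onestring", "bla"; 2, "somestring", "foobar"};
tt = dataframe(truc);

% Column selection: provide one valid column name.
ids = tt.Id;

% Attribute selection, using names from the list of valid cases.
names  = tt.colnames;
n_rows = tt.rowcnt;
n_cols = tt.colcnt;

% Choosing the output format: request an array instead of a dataframe,
% mirroring the experiment.array(6, "OK_") call shown earlier.
name_2 = tt.array(2, "Name");
```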
[[Category:Octave Forge]]
Latest revision as of 15:14, 20 January 2023