IO package: Difference between revisions

Jump to navigation Jump to search
5,060 bytes added ,  17 June 2014
restrutured + how to use io package
(restrutured + how to use io package)
Line 1: Line 1:
The IO package is part of the octave-forge project and provides input/output from/in external formats.
The IO package is part of the octave-forge project and provides input/output from/in external formats.


== ODS support ==
=== About read/write support (TL;DR) ===
 
Most people need this package to read and write Excel files. But the io package can read/write Open/Libre Office, Gnumeric and some less important files too.
 
<pre><nowiki>
File extension      COM POI POI/OOXML JXL OXS UNO OTK JOD OCT
--------------------------------------------------------------
.xls (Excel95)        R                R      R
.xls (Excel97-2003)  +  +      +      +  +  +
.xlsx (Excel2007+)    ~          +        (+)  R          +
.xlsb, .xlsm          ~                    ?  R          R?
.wk1                  +                        R
.wks                  +                        R
.dbf                  +                        +
.ods                  ~                        +  +  +  +
.sxc                                            +      +
.fods                                          +
.uos                                            +
.dif                                            +
.csv                  +                        R
.gnumeric                                                  +
--------------------------------------------------------------
</nowiki></pre>
 
 
==== .xls ~= .xlsx ====
 
'''This is the most important information you have to keep in mind when you have to work with "Excel" files.'''
* .xls - is the outdated default binary file format from <= Office 2003 - '''avoid this format!'''
* .xlsx - is the new default file format since Office 2007. It consists of xml files which are stored in a .zip container. - '''always save or convert in this format!'''
** The ''(new)'' OCT interface can read ''(since version 1.2.5)'' and write ''(since version 2.2.0)'' .xlsx files dependency-free! No need of MS Windows+Office nor Java.
 
==== different interfaces ====
 
The io package comes with different interfaces to read/write different file formats.
# COM
## This ''(interface)'' is only available on MS Windows '''and''' with a MS Office installation.
# [POI POI/OOXML JXL OXS UNO OTK JOD]
## Say hello to the many java-based interfaces to read very slowly shitty formats
# OCT
## This is the new impressive and fast ''(mostly written in Octave itself! + two C files to bypass bottlenecks)'' interface which only supports .xlsx, .ods and .gnumeric files
 
So, if you want to read/write '''.xlsx''' files, you'll only need the io-package >=2.2.0.
 
But if you had to read/write '''.xls''' files, you'll need either
* MS Windows with MS Office backings - or
* need Octave build with --enable-java, + Java, and one or more of the Java interfaces!
 
==== force a drive ====
 
If you don't want that the io-autodetect take control, you can force the usage of a driver
 
Force native OCT interface - only for .xlsx, .ods, .gnumeric
<pre>OCT = xlsread ('file.xlsx', 1, [], 'OCT');</pre>
 
Force COM interface - may only work with .xls, .xlsx on Windows OS and available office installation.
<pre>COM = xlsread ('file.xlsx', 1, [], 'COM');</pre>
 
Force POI interface - may only work if you've did javaaddpath for the Apache POI .jar files - only .xls
<pre>POI = xlsread ('file.xls', 1, [], 'POI');</pre>
 
And so on ...
 
==== Java example ====
 
# Make sure you've setup everything with java correctly
# get e.g. apache poi jar library files and add them with javaaddpath
<pre><nowiki>
octave:1>    javaaddpath('~/poi_library/poi-3.8-20120326.jar');
octave:2>    javaaddpath('~/poi_library/poi-ooxml-3.8-20120326.jar');
octave:3>    javaaddpath('~/poi_library/poi-ooxml-schemas-3.8-20120326.jar');
octave:4>    javaaddpath('~/poi_library/xmlbeans-2.3.0.jar');
octave:5>    javaaddpath('~/poi_library/dom4j-1.6.1.jar');
octave:6>
octave:6> pkg load io
octave:7> chk_spreadsheet_support
ans =                    6
octave:8> javaclasspath
  STATIC JAVA PATH
 
      - empty -
 
  DYNAMIC JAVA PATH
 
      /home/markus/poi_library/poi-3.8-20120326.jar
      /home/markus/poi_library/poi-ooxml-3.8-20120326.jar
      /home/markus/poi_library/poi-ooxml-schemas-3.8-20120326.jar
      /home/markus/poi_library/xmlbeans-2.3.0.jar
      /home/markus/poi_library/dom4j-1.6.1.jar
 
</nowiki></pre>
 
The chk_spreadsheet_support output should be now > 0.
 
<pre><nowiki>
                  0 No spreadsheet I/O support found
                ---------- XLS (Excel) interfaces: ----------
                  1 = COM (ActiveX / Excel)
                  2 = POI (Java / Apache POI)
                  4 = POI+OOXML (Java / Apache POI)
                  8 = JXL (Java / JExcelAPI)
                  16 = OXS (Java / OpenXLS)
                --- ODS (OpenOffice.org Calc) interfaces ----
                  32 = OTK (Java/ ODF Toolkit)
                  64 = JOD (Java / jOpenDocument)
                ----------------- XLS & ODS: ----------------
                128 = UNO (Java / UNO bridge - OpenOffice.org)
</nowiki></pre>
 
And reading/writing .xls files should work.
 
== Detailed Information (TL) ==
 
The following might be more interesting if you're interested in how things work inside the io package.
 
=== ODS support ===
(ODS = Open Document Format spreadsheet data format, used by e.g., LibreOffice and OpenOffice.org)
(ODS = Open Document Format spreadsheet data format, used by e.g., LibreOffice and OpenOffice.org)


=== Files content ===
==== Files content ====
* '''odsread.m''' &mdash; no-hassle read script for reading from an ODS file and parsing the numeric and text data into separate arrays.
* '''odsread.m''' &mdash; no-hassle read script for reading from an ODS file and parsing the numeric and text data into separate arrays.
* '''odswrite.m''' &mdash; no-hassle write script for writing to an ODS file.
* '''odswrite.m''' &mdash; no-hassle write script for writing to an ODS file.
Line 24: Line 139:




=== Required support software ===
==== Required support software ====
For the OCT interface (since 1.2.4/1.2.5, read-only support!):
For the OCT interface (since 1.2.4/1.2.5, read-only support!):
* Nothing except unzip
* Nothing except unzip
Line 45: Line 160:
Alternatively, the io package contains a function script file "chk_spreadsheet_support.m" which can set up the Java classpath.
Alternatively, the io package contains a function script file "chk_spreadsheet_support.m" which can set up the Java classpath.


=== Usage ===
==== Usage ====


(see “help ods<function_filename>” in octave terminal.)
(see “help ods<function_filename>” in octave terminal.)
Line 64: Line 179:
If you use odsopen / ods2oct / … / oct2ods / …. / odsclose, DO NOT FORGET to invoke odsclose in the end. The file pointers can contain an enormous amount of data and may needlessly keep precious memory allocated. In case of the UNO interface, the hidden OpenOffice.org invocation (soffice.bin) can even block proper closing of Octave.
If you use odsopen / ods2oct / … / oct2ods / …. / odsclose, DO NOT FORGET to invoke odsclose in the end. The file pointers can contain an enormous amount of data and may needlessly keep precious memory allocated. In case of the UNO interface, the hidden OpenOffice.org invocation (soffice.bin) can even block proper closing of Octave.


=== Spreadsheet formula support ===
==== Spreadsheet formula support ====


When using the OTK or UNO interface you can:
When using the OTK or UNO interface you can:
Line 79: Line 194:
The only exception is if you select the UNO interface, as that invokes OpenOffice.org behind the scenes, and OOo obviously has a validator and evaluator built-in.
The only exception is if you select the UNO interface, as that invokes OpenOffice.org behind the scenes, and OOo obviously has a validator and evaluator built-in.


=== Gotchas ===
==== Gotchas ====
I know of one big gotcha: i.e. reading dates (& time). A less obvious one is Java memory pool allocation size.
I know of one big gotcha: i.e. reading dates (& time). A less obvious one is Java memory pool allocation size.


==== Date and time in ODS ====
===== Date and time in ODS =====
Octave (as does Matlab) stores dates as a number representing the number of days since January 1, 0 (and as an aside ignores a.o. Pope Gregorius' intervention in 1582 when 10 days were simply skipped).
Octave (as does Matlab) stores dates as a number representing the number of days since January 1, 0 (and as an aside ignores a.o. Pope Gregorius' intervention in 1582 when 10 days were simply skipped).


Line 97: Line 212:
While adding data and time values has been implemented in the write scripts, the wait is for clever solutions to distinguish dates from floats in octave cell arrays.
While adding data and time values has been implemented in the write scripts, the wait is for clever solutions to distinguish dates from floats in octave cell arrays.


==== Java memory pool allocation size ====
===== Java memory pool allocation size =====
The Java virtual machine (JVM) initializes one big chunk of your computer's RAM in which all Java classes and methods etc. are to be loaded: the Java memory pool. It does this because Java has a very sophisticated “garbage collection” system. At least on Windows, the initial size is 2MB and the maximum size is 64MB. On Linux this allocated size is much bigger. This part of memory is where the Java-based ODS octave routines (and the Java-based ods routines) live and keep their variables etc.
The Java virtual machine (JVM) initializes one big chunk of your computer's RAM in which all Java classes and methods etc. are to be loaded: the Java memory pool. It does this because Java has a very sophisticated “garbage collection” system. At least on Windows, the initial size is 2MB and the maximum size is 64MB. On Linux this allocated size is much bigger. This part of memory is where the Java-based ODS octave routines (and the Java-based ods routines) live and keep their variables etc.


Line 111: Line 226:
After processing a large chunk of spreadsheet information you might notice that octave's memory footprint does not shrink so it looks like Java's memory pool does not shrink back; but rest assured, the memory footprint is the allocated (reserved) memory size, not the actual used size. After the JVM has done its garbage collection, only the so-called “working set” of the memory allocation is really in use and that is a trimmed-down part of the memory allocation pool. On Windows systems it often suffices to minimize the octave terminal for a few seconds to get a more reasonable memory footprint.
After processing a large chunk of spreadsheet information you might notice that octave's memory footprint does not shrink so it looks like Java's memory pool does not shrink back; but rest assured, the memory footprint is the allocated (reserved) memory size, not the actual used size. After the JVM has done its garbage collection, only the so-called “working set” of the memory allocation is really in use and that is a trimmed-down part of the memory allocation pool. On Windows systems it often suffices to minimize the octave terminal for a few seconds to get a more reasonable memory footprint.


==== Reading cells containing errors ====
===== Reading cells containing errors =====
Spreadsheet cells containing erroneous stuff are transferred to Octave as NaNs. But not all errors can be catched. Cells showing #Value# in OpenOffice.org Calc due to e.g., invalid formulas, may have a 0 (zero) value stored in the value fields. It is impossible to catch this as there is no run-time formula evaluator (yet) in ODF Toolkit nor jOpenDocument (like there is in Apache POI for Excel).
Spreadsheet cells containing erroneous stuff are transferred to Octave as NaNs. But not all errors can be catched. Cells showing #Value# in OpenOffice.org Calc due to e.g., invalid formulas, may have a 0 (zero) value stored in the value fields. It is impossible to catch this as there is no run-time formula evaluator (yet) in ODF Toolkit nor jOpenDocument (like there is in Apache POI for Excel).


Line 120: Line 235:
* jOpenDocument doesn't set the so-called <office:value-type='string'> attribute in cells containing text; as a consequence ODF Toolkit will treat them as empty cells. OOo will read them OK.
* jOpenDocument doesn't set the so-called <office:value-type='string'> attribute in cells containing text; as a consequence ODF Toolkit will treat them as empty cells. OOo will read them OK.


=== Matlab compatibility ===
==== Matlab compatibility ====
AFAIK there's no similar functionality in Matlab (yet?), only for reading and then very limited.
AFAIK there's no similar functionality in Matlab (yet?), only for reading and then very limited.
odsread is fairly function-compatible to xlsread, however.
odsread is fairly function-compatible to xlsread, however.
Line 126: Line 241:
Same goes for odswrite, odsfinfo and xlsfinfo – however odsfinfo has better functionality IMO.
Same goes for odswrite, odsfinfo and xlsfinfo – however odsfinfo has better functionality IMO.


=== Comparison of interfaces ===
==== Comparison of interfaces ====
The OCT interface (present as of io-1.2.4) offers read support for ODS 1.2, complete with all the options of ODFtoolkit and UNO, but fairly slow.
The OCT interface (present as of io-1.2.4) offers read support for ODS 1.2, complete with all the options of ODFtoolkit and UNO, but fairly slow.


Line 147: Line 262:
However, UNO is not stable yet (see below).
However, UNO is not stable yet (see below).


=== Troubleshooting ===
==== Troubleshooting ====
Some hints for troubleshooting ODS support are given here.
Some hints for troubleshooting ODS support are given here.
Since April 2011 the function chk_spreadsheet_support() has been included in the io package. Calling it with arguments ('', 3) (empty string and debug level 3) will echo a lot of diagnostics to the screen. Large parts of the steps outlined below have been automated in this script.
Since April 2011 the function chk_spreadsheet_support() has been included in the io package. Calling it with arguments ('', 3) (empty string and debug level 3) will echo a lot of diagnostics to the screen. Large parts of the steps outlined below have been automated in this script.
Line 174: Line 289:
** The exact case (URE or ure, Basis or basis), name ("Basis3.2" or just "basis") and subdirectory tree (URE/java or URE/share/java) varies across OOo versions and -clones, so chk_spreadsheet_support.m can have a hard time finding all needed classes. In particularly bad cases, when chk_spreadsheet_support cannot find them, you might need to add one or more of these these classes manually to the javaclasspath.
** The exact case (URE or ure, Basis or basis), name ("Basis3.2" or just "basis") and subdirectory tree (URE/java or URE/share/java) varies across OOo versions and -clones, so chk_spreadsheet_support.m can have a hard time finding all needed classes. In particularly bad cases, when chk_spreadsheet_support cannot find them, you might need to add one or more of these these classes manually to the javaclasspath.


=== Development ===
==== Development ====
As with the Excel r/w stuff, adding new interfaces should be easy and straightforward. Add relevant stanzas in odsopen, odsclose, odsfinfo & getusedrange and add new subfunctions (for the real work) to getusedrange_<INTF>, oct2ods and ods2oct.
As with the Excel r/w stuff, adding new interfaces should be easy and straightforward. Add relevant stanzas in odsopen, odsclose, odsfinfo & getusedrange and add new subfunctions (for the real work) to getusedrange_<INTF>, oct2ods and ods2oct.


Line 191: Line 306:
But IMO data sets larger than 5.105 cells should not be kept in spreadsheets anyway. Use real databases for such data sets.
But IMO data sets larger than 5.105 cells should not be kept in spreadsheets anyway. Use real databases for such data sets.


=== ODFDOM versions ===
==== ODFDOM versions ====
I have tried various odfdom version. As to 0.8 & 0.8.5, while the API has been simplified enormously (finally one can address cells by spreadsheet address rather than find out yourself by parsing the table-column/-row/-cell structure), many irrecoverable bugs have been introduced :-((
I have tried various odfdom version. As to 0.8 & 0.8.5, while the API has been simplified enormously (finally one can address cells by spreadsheet address rather than find out yourself by parsing the table-column/-row/-cell structure), many irrecoverable bugs have been introduced :-((
In addition processing ODS files became significantly slower (up to 7 times!).
In addition processing ODS files became significantly slower (up to 7 times!).
Line 203: Line 318:
* oct2ods.m (revision 7159)
* oct2ods.m (revision 7159)


== XLS support ==
=== XLS support ===
=== Files content ===
==== Files content ====
* '''xlsread.m''' &mdash; All-in-one function for reading data from one specific worksheet in an Excel spreadsheet file. This script has Matlab-compatible functionality.
* '''xlsread.m''' &mdash; All-in-one function for reading data from one specific worksheet in an Excel spreadsheet file. This script has Matlab-compatible functionality.
* '''xlswrite.m''' &mdash; All-in-one function for writing data to one specific worksheet in an Excel spreadsheet file. This script has Matlab-compatible functionality.
* '''xlswrite.m''' &mdash; All-in-one function for writing data to one specific worksheet in an Excel spreadsheet file. This script has Matlab-compatible functionality.
Line 216: Line 331:
* '''spsh_chkrange.m''', '''spsh_prstype.m''', '''getusedrange.m''', '''calccelladdress.m''', '''parse_sp_range.m''' &mdash; Support files called by the scripts and not meant for direct invocation by users.
* '''spsh_chkrange.m''', '''spsh_prstype.m''', '''getusedrange.m''', '''calccelladdress.m''', '''parse_sp_range.m''' &mdash; Support files called by the scripts and not meant for direct invocation by users.


=== Required support software ===
==== Required support software ====
For the OCT interface (since 1.2.4/1.2.5, read-only support for OOXML (.xlsx)!):
For the OCT interface (since 1.2.4/1.2.5, read-only support for OOXML (.xlsx)!):
* Nothing except unzip
* Nothing except unzip
Line 251: Line 366:
NOTE: EXPERIMENTAL!! A working OpenOffice.org installation. The utility function chk_spreadsheet_support can be used to add the needed entries to the javaclasspath.
NOTE: EXPERIMENTAL!! A working OpenOffice.org installation. The utility function chk_spreadsheet_support can be used to add the needed entries to the javaclasspath.


=== Usage ===
==== Usage ====
'''xlsread''' and '''xlswrite''' are mere wrappers for '''xlsopen - xls2oct - xlsclose - parsecell''' and '''xlsopen - oct2xls - xlsclose''' sequences, resp. They exist for the sake of Matlab compatibility.
'''xlsread''' and '''xlswrite''' are mere wrappers for '''xlsopen - xls2oct - xlsclose - parsecell''' and '''xlsopen - oct2xls - xlsclose''' sequences, resp. They exist for the sake of Matlab compatibility.


Line 301: Line 416:
When using JExcelAPI (JXL), after writing into a worksheet you MUST save the file – adding data to the same or another worksheet is no more possible after the first call to oct2xls(). This is a limitation of JExcelAPI.
When using JExcelAPI (JXL), after writing into a worksheet you MUST save the file – adding data to the same or another worksheet is no more possible after the first call to oct2xls(). This is a limitation of JExcelAPI.


=== Spreadsheet formula support ===
==== Spreadsheet formula support ====
When using the COM, POI, JXL, and UNO interfaces you can:
When using the COM, POI, JXL, and UNO interfaces you can:
* (When reading, xls2oct) either read spreadsheet formula results, or the literal formula text strings (also works with OCT interface);
* (When reading, xls2oct) either read spreadsheet formula results, or the literal formula text strings (also works with OCT interface);
Line 317: Line 432:
Worse, older Excel versions feature less functions than newer versions. So be wary as this may make for interesting confusion.
Worse, older Excel versions feature less functions than newer versions. So be wary as this may make for interesting confusion.


=== Matlab compatibility ===
==== Matlab compatibility ====


'''xlsread''', '''xlswrite''' and '''xlsfinfo''' are for the most part Matlab-compatible. Some small differences are mentioned below.
'''xlsread''', '''xlswrite''' and '''xlsfinfo''' are for the most part Matlab-compatible. Some small differences are mentioned below.
Line 343: Line 458:
** When invoking Excel/COM interface, octave-forge's xlsfinfo also echoes the type of sheet (worksheet, chart), not just the sheet names. Using Java I haven't found similar functionality (yet).
** When invoking Excel/COM interface, octave-forge's xlsfinfo also echoes the type of sheet (worksheet, chart), not just the sheet names. Using Java I haven't found similar functionality (yet).


=== Comparison of interfaces & usage ===
==== Comparison of interfaces & usage ====
Using Excel itself (through '''COM''' / '''ActiveX''' on Windows systems) is probably the most robust and versatile and especially FAST option. There's one gotcha: in case of some type of COM errors Excel will keep running invisibly; you can only end it through Task Manager. A tiny problem is that one cannot find out easily through COM what file types are supported; xls, wks, wk1, xlsx, etc. Another -obvious- limitation is that COM Excel access only works on Windows systems where Excel is installed.
Using Excel itself (through '''COM''' / '''ActiveX''' on Windows systems) is probably the most robust and versatile and especially FAST option. There's one gotcha: in case of some type of COM errors Excel will keep running invisibly; you can only end it through Task Manager. A tiny problem is that one cannot find out easily through COM what file types are supported; xls, wks, wk1, xlsx, etc. Another -obvious- limitation is that COM Excel access only works on Windows systems where Excel is installed.


Line 365: Line 480:
So Java is a compromise between portability and rapid development time versus capacity (and speed). But IMO data sets larger than 5.10^5 cells should not be kept in spreadsheets anyway. Better use real databases for such data sets.
So Java is a compromise between portability and rapid development time versus capacity (and speed). But IMO data sets larger than 5.10^5 cells should not be kept in spreadsheets anyway. Better use real databases for such data sets.


=== Troubleshooting ===
==== Troubleshooting ====
Some hints for troubleshooting Excel support are contained in this thread: http://sourceforge.net/mailarchive/forum.php?thread_name=4C61B649.9090802%40hccnet.nl&forum_name=octave-dev dated August 10, 2010. A more structured approach is below.
Some hints for troubleshooting Excel support are contained in this thread: http://sourceforge.net/mailarchive/forum.php?thread_name=4C61B649.9090802%40hccnet.nl&forum_name=octave-dev dated August 10, 2010. A more structured approach is below.


Line 396: Line 511:
#: xls2 = xlsopen ('test.xls', 1, 'jxl'). If this works and xls2 is a struct with various fields containing objects, the JExcelAPI interface (JXL) works as well. Don't forget to do xls2 = xlsclose (xls2) to close the file.
#: xls2 = xlsopen ('test.xls', 1, 'jxl'). If this works and xls2 is a struct with various fields containing objects, the JExcelAPI interface (JXL) works as well. Don't forget to do xls2 = xlsclose (xls2) to close the file.


=== Development ===
==== Development ====
xlsopen/xlsclose and friends have been written so that adding other interfaces (Perl? native octave? ...?) should be very easily accomplished. Xlsopen.m merely needs two stanzas, xlsfinfo.m and getusedrange.m each need an additional elseif stanza, and xlsclose.m needs a small stanza for closing the pointer struct and writing to disk. The real work lies in creating the relevant xls2<...>2oct & oct2<...>2xls & <getusedrange_...> subfunction scripts in xls2oct.m, oct2xls.m and getusedrange.m, resp., but that shouldn't be really hard, depending on the interface support libraries' quality and documentation. Separating the file access functions and the actual reading/writing from/to the workbook in memory has made developer's life (I mean: my time developing this stuff) much easier.
xlsopen/xlsclose and friends have been written so that adding other interfaces (Perl? native octave? ...?) should be very easily accomplished. Xlsopen.m merely needs two stanzas, xlsfinfo.m and getusedrange.m each need an additional elseif stanza, and xlsclose.m needs a small stanza for closing the pointer struct and writing to disk. The real work lies in creating the relevant xls2<...>2oct & oct2<...>2xls & <getusedrange_...> subfunction scripts in xls2oct.m, oct2xls.m and getusedrange.m, resp., but that shouldn't be really hard, depending on the interface support libraries' quality and documentation. Separating the file access functions and the actual reading/writing from/to the workbook in memory has made developer's life (I mean: my time developing this stuff) much easier.


Line 405: Line 520:
*Support for "passing function handle" in xlsread.
*Support for "passing function handle" in xlsread.


== OCT interface ==
=== OCT interface ===


Since io package version 1.2.4, an interface called "OCT" was added. Except for unzip, it has no dependencies. It's still experimental but fast! Feel free to test it and give us a feedback.
Since io package version 1.2.4, an interface called "OCT" was added. Except for unzip, it has no dependencies. It's still experimental but fast! Feel free to test it and give us a feedback.
32

edits

Navigation menu