The {{Forge|io|IO package}} is part of the Octave Forge project and provides input from and output to external file formats.


<div class="tocinline">__TOC__</div>
 
=== About read/write support ===
 
Most people use this package to read and write Excel files, but the io package can also read and write OpenOffice.org/LibreOffice, Gnumeric and some less common file formats.
 
<pre><nowiki>
File extension       COM  POI  POI/OOXML  JXL  OXS  UNO  OTK  JOD  OCT
----------------------------------------------------------------------
.xls (Excel95)        R                    R         R
.xls (Excel97-2003)   +    +       +       +    +    +
.xlsx (Excel2007+)    ~            +           (+)   +              +
.xlsb, .xlsm          ~                         ?    R              R?
.wk1                  +                              R
.wks                  +                              R
.dbf                  +                              +
.ods                  ~                              +    +    +    +
.sxc                                                 +         +
.fods                                                +
.uos                                                 +
.dif                  +                              +
.csv                  +                              R
.gnumeric                                                           +
----------------------------------------------------------------------

R : only read;  + : full read/write; ~ : dependent on Excel version
</nowiki></pre>
 
 
==== xlswrite / odswrite versus xlsopen / odsopen ..... xlsclose / odsclose ====
 
Matlab users are used to xlsread and xlswrite, functions that can only read data from, or write data to, one sheet in a spreadsheet file at a time. For each operation, xlsread and xlswrite first have to read the entire spreadsheet file; for write operations, xlswrite then also has to write it out completely to disk.
There are faster ways, but then you'll have to dive into ActiveX/COM/Visual Basic programming.
 
If you want to move multiple pieces of data to/from a spreadsheet file, the io package offers a much more versatile scheme:
* First open the spreadsheet file using xlsopen (for Excel or Gnumeric files) or odsopen (for .ods or .gnumeric files).
'''NOTE''': the output of these functions is a file pointer handle that you should treat carefully!
* (for reading data) Read the data using:
<pre>raw_data = xls2oct (<fileptr> [,sheet#] [,cellrange] [,options])</pre>
* Next, optionally split the raw data into numerical and text data and optionally get the limits of where these came from:
<pre>[num, txt, lims] = parsecell (raw_data, <fileptr>.limits)</pre>
* (for writing data) Write the data using:
<pre><fileptr> = oct2xls (data, <fileptr> [,sheet#] [,cellrange] [,options])</pre>
* When you're finished, DO NOT FORGET to close the file pointer handle:
<pre><fileptr> = xlsclose (<fileptr>)</pre>
 
Mixing read and write operations in any order is permitted (the only exception is the JXL (JExcelAPI) interface).
The same goes for odsopen-ods2oct-oct2ods-odsclose sequences.
 
Obviously this is much more flexible (and FASTER) than xlsread and xlswrite. In fact, Octave's io package xlsread is a mere wrapper for an xlsopen-xls2oct-parsecell-xlsclose sequence. Similarly for xlswrite, odsread, and odswrite.
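
The open / write / read / close sequence described above can be sketched as follows. This is only a minimal sketch: it assumes the io package is installed and that the OCT interface can write .xlsx files (io >= 2.2.0); the scratch file name and the sample data are made up for illustration.

<pre><nowiki>
pkg load io

fname = [tempname() ".xlsx"];            % hypothetical scratch file
data  = {"name", "value"; "a", 1; "b", 2};

% Open for read/write (second argument = 1), forcing the OCT interface
xls = xlsopen (fname, 1, "OCT");
[xls, status] = oct2xls (data, xls, "Sheet1", "A1");
xls = xlsclose (xls);                    % the file is only written to disk here

% Reopen read-only and read the sheet back
xls = xlsopen (fname, 0, "OCT");
[raw, xls] = xls2oct (xls, "Sheet1");
[num, txt, lims] = parsecell (raw, xls.limits);
xls = xlsclose (xls);

delete (fname);
</nowiki></pre>

Several oct2xls or xls2oct calls can be placed between one xlsopen and one xlsclose, which is where the speed gain over repeated xlsread / xlswrite calls comes from.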
 
==== .xls ~= .xlsx ====
 
'''This is the most important information you have to keep in mind when you have to work with "Excel" files.'''
* .xls - is an outdated default binary file format from <= Office 2003 - '''try to avoid this format!'''
* .xlsx - is the new default file format since Office 2007. [https://en.wikipedia.org/wiki/OOXML It consists of xml files stored in a .zip container.] - '''always save in or convert to this format!'''
* The ''(new)'' OCT interface can read ''(since version 1.2.5)'' and write ''(since version 2.2.0)'' .xlsx files dependency-free! No need for MS Windows+Office or Java.
* Windows is notorious for hiding "known" file extensions. However in Windows Explorer it is easy to change this and have Windows show all file extensions.
 
 
==== different interfaces ====
 
The io package comes with different interfaces to read/write different file formats.
# COM
## This ''(interface)'' is only available on MS Windows '''and''' with an MS Office installation.
# [POI, POI/OOXML, JXL, OXS, UNO, OTK, JOD]
## These are java-based interfaces. They are generally slower than Octave's native OCT interface; OTOH they offer more flexibility. Generally the OCT interface offers sufficient flexibility and speed.
# OCT
## This is the new impressive and fast ''(mostly written in Octave itself! + two C files to bypass bottlenecks)'' interface which presently supports .xlsx, .ods and .gnumeric files.
(Note that .ods is a complicated file format with many gotchas that doesn't lend itself for fast file I/O. So unfortunately the fastest .ods interface is the Java-based jOpenDocument (JOD) (luckily it is GPL). However if speed is not an issue or if you hate Java, the OCT interface still performs fast enough.)
 
So, if you want to read/write '''.xlsx''' files, you'll only need the io-package >=2.2.0.
 
But if you have to read/write '''.xls''' files, you'll need either
* MS Windows with MS Office installed - or
* Octave built with --enable-java, + a Java JRE or -JDK, and one or more of the Java interfaces (i.e., the class libs)!
 
If you want to read/write .gnumeric files, the OCT interface is even the only option.
 
For some rarely used file formats you'll need LibreOffice + Octave built with Java enabled + a Java JRE or -JDK. Once you have those, you can also use formats like Unified Office Format, Data Interchange Format, SYLK, OpenDocument Flat XML, the old OpenOffice.org .sxc format and some others you may have heard of ;-)
 
 
==== force an interface ====
 
If you don't want the io package's automatic interface detection to take control, you can easily force the use of a specific interface. Examples:
 
Force native OCT interface - only for .xlsx, .ods, .gnumeric
<pre>OCT = xlsread ('file.xlsx', 1, [], 'OCT');</pre>
 
Force COM interface - may only work with .xls and .xlsx files, on Windows with MS Office installed.
<pre>COM = xlsread ('file.xlsx', 1, [], 'COM');</pre>
 
Force POI interface - may only work if you have run javaaddpath for the Apache POI .jar files - only .xls
<pre>POI = xlsread ('file.xls', 1, [], 'POI');</pre>
 
And so on ...
 
 
==== 32 vs. 64-bit issues ====
 
Generally, if you use a Java-based interface for spreadsheet I/O, it doesn't matter much whether you use Octave 32-bit or Octave 64-bit.
However, the UNO interface (for LibreOffice & OpenOffice.org) is more demanding: 64-bit Octave can only use UNO with a 64-bit LibreOffice, and 32-bit Octave only with a 32-bit LibreOffice.
 
 
==== Java example ====
 
# Again: You only need Java if you have to read/write .xls files! You don't need this for .xlsx files!
# Make sure you've set up everything with Java correctly.
# Get e.g. the Apache POI jar library files and add them with javaaddpath:
<pre><nowiki>
octave:1>    javaaddpath('~/poi_library/poi-3.8-20120326.jar');
octave:2>    javaaddpath('~/poi_library/poi-ooxml-3.8-20120326.jar');
octave:3>    javaaddpath('~/poi_library/poi-ooxml-schemas-3.8-20120326.jar');
octave:4>    javaaddpath('~/poi_library/xmlbeans-2.3.0.jar');
octave:5>    javaaddpath('~/poi_library/dom4j-1.6.1.jar');
octave:6>
octave:6> pkg load io
octave:7> chk_spreadsheet_support
ans =                    6
octave:8> javaclasspath
  STATIC JAVA PATH
 
      - empty -
 
  DYNAMIC JAVA PATH
 
      /home/markus/poi_library/poi-3.8-20120326.jar
      /home/markus/poi_library/poi-ooxml-3.8-20120326.jar
      /home/markus/poi_library/poi-ooxml-schemas-3.8-20120326.jar
      /home/markus/poi_library/xmlbeans-2.3.0.jar
      /home/markus/poi_library/dom4j-1.6.1.jar
 
</nowiki></pre>
 
An easier way is to collect all required Java class libs for spreadsheet I/O (the .jar files) in one subdir and have chk_spreadsheet_support.m sort it all out:
<pre><nowiki>
octave:8>  chk_spreadsheet_support ('/full/path/to/subdir/with/.jar/files')
</nowiki></pre>
 
For UNO (LibreOffice-behind-the-scenes) the call is a bit different:
<pre><nowiki>
octave:8>  chk_spreadsheet_support ('', 0, '/full/path/to/LibreOffice/installation')
</nowiki></pre>
 
(On Windows, the io package tries to automatically find all required Java class libs and LibreOffice. To help it, put the Java class libs in your user profile (home directory) in a subdir "java", e.g., C:\Users\Eddy\java. chk_spreadsheet_support searches that location automagically.
On Linux this automatic searching has been disabled as the io package took ages (well, minutes) to load...)
 
 
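Anyway, the chk_spreadsheet_support output should now be > 0. The return value is the sum of the values below for all interfaces that were found (e.g., 6 = POI + POI+OOXML):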
 
<pre><nowiki>
                  0 No spreadsheet I/O support found
                ---------- XLS (Excel) interfaces: ----------
                  1 = COM (ActiveX / Excel)
                  2 = POI (Java / Apache POI)
                  4 = POI+OOXML (Java / Apache POI)
                  8 = JXL (Java / JExcelAPI)
                  16 = OXS (Java / OpenXLS)
                --- ODS (OpenOffice.org Calc) interfaces ----
                  32 = OTK (Java/ ODF Toolkit)
                  64 = JOD (Java / jOpenDocument)
                ----------------- XLS & ODS: ----------------
                128 = UNO (Java / UNO bridge - OpenOffice.org)
</nowiki></pre>
 
And reading/writing .xls files should work.
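
Since the return value is a sum of powers of two, it can be decoded bit by bit. The snippet below is a small sketch of that idea; the names are simply copied from the table above, and newer io versions may define additional bits that are not listed here.

<pre><nowiki>
% Interface names in bit order: 1, 2, 4, ..., 128 (copied from the table above)
names = {"COM", "POI", "POI+OOXML", "JXL", "OXS", "OTK", "JOD", "UNO"};

retval = 6;        % example value; normally: retval = chk_spreadsheet_support

% bitget returns 1 for every bit position that is set in retval
mask  = logical (bitget (retval, 1:numel (names)));
found = names(mask);

printf ("Interfaces found: %s\n", strjoin (found, ", "));
</nowiki></pre>

For the example value 6, bits 2 and 3 are set, which corresponds to POI and POI+OOXML.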
 
POI is located differently on every system. To find the files and their names, search for libapache-poi-java.list.
In a terminal:
<pre><nowiki>
find / -name "libapache-poi-java*" 2>/dev/null
</nowiki></pre>
In Octave (POI path on Ubuntu):
<pre><nowiki>
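fid = fopen ("/var/lib/dpkg/info/libapache-poi-java.list");
if (fid < 0)
  error ("cannot open the libapache-poi-java file list");
endif
line = fgetl (fid);
while (ischar (line))                 % fgetl returns -1 at end of file
  if (! isempty (regexp (line, '\.jar$', 'once')))
    javaaddpath (line);               % only add .jar files, skip directories etc.
  endif
  line = fgetl (fid);
endwhile
fclose (fid);
pkg load io;
disp (chk_spreadsheet_support);       % should be 2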
</nowiki></pre>
 
== Detailed Information (TL) ==
 
The following sections may be of interest if you want to know how things work inside the io package.
 
=== ODS support ===
(ODS = Open Document Format spreadsheet data format, used by e.g., LibreOffice and OpenOffice.org)


==== Files content ====
* '''odsread.m''' &mdash; no-hassle read script for reading from an ODS file and parsing the numeric and text data into separate arrays.
* '''odswrite.m''' &mdash; no-hassle write script for writing to an ODS file.




==== Required support software ====
For the OCT interface (since 1.2.4/1.2.5, read-only support!):
* Nothing except unzip

Alternatively, the io package contains a function script file "chk_spreadsheet_support.m" which can set up the Java classpath.


==== Usage ====

(See “help ods<function_filename>” at the Octave prompt.)

If you use odsopen / ods2oct / … / oct2ods / … / odsclose, DO NOT FORGET to invoke odsclose in the end. The file pointers can contain an enormous amount of data and may needlessly keep precious memory allocated. In case of the UNO interface, the hidden OpenOffice.org invocation (soffice.bin) can even block proper closing of Octave.
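
One way to make sure odsclose is always invoked, even if an error occurs halfway, is Octave's unwind_protect block. The following is only a sketch under a few assumptions: the io package is installed, the OCT interface is used for writing .ods, and the scratch file name and data are made up for illustration.

<pre><nowiki>
pkg load io

fname = [tempname() ".ods"];          % hypothetical scratch file
ods = [];
unwind_protect
  ods = odsopen (fname, 1, "OCT");    % 1 = open for read/write
  [ods, status] = oct2ods ({"x", "y"; 1, 2}, ods, "Sheet1", "A1");
unwind_protect_cleanup
  % This part runs on success and on error alike,
  % so the file pointer is always released
  if (! isempty (ods))
    ods = odsclose (ods);
  endif
end_unwind_protect
</nowiki></pre>

The same pattern applies to xlsopen / xlsclose.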


==== Spreadsheet formula support ====

When using the OTK or UNO interface you can:

The only exception is if you select the UNO interface, as that invokes OpenOffice.org behind the scenes, and OOo obviously has a validator and evaluator built-in.


==== Gotchas ====
I know of one big gotcha: reading dates (& time). A less obvious one is the Java memory pool allocation size.


===== Date and time in ODS =====
Octave (like Matlab) stores dates as a number representing the number of days since January 1, 0 (and, as an aside, ignores among other things Pope Gregory's calendar reform of 1582, when 10 days were simply skipped).


While adding date and time values has been implemented in the write scripts, the wait is for clever solutions to distinguish dates from floats in Octave cell arrays.


===== Java memory pool allocation size =====
The Java virtual machine (JVM) initializes one big chunk of your computer's RAM in which all Java classes and methods etc. are to be loaded: the Java memory pool. It does this because Java has a very sophisticated “garbage collection” system. At least on Windows, the initial size is 2MB and the maximum size is 64MB. On Linux this allocated size is much bigger. This part of memory is where the Java-based ODS octave routines (and the Java-based ods routines) live and keep their variables etc.
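
To see what the memory pool looks like on your own system, you can ask the JVM directly. This is a small sketch that assumes an Octave built with Java support; it only uses the standard java.lang.Runtime class and does not depend on the io package.

<pre><nowiki>
% Requires an Octave built with Java support
rt = javaMethod ("getRuntime", "java.lang.Runtime");

max_mb   = javaMethod ("maxMemory",   rt) / 2^20;  % upper limit of the memory pool
total_mb = javaMethod ("totalMemory", rt) / 2^20;  % currently reserved by the JVM
free_mb  = javaMethod ("freeMemory",  rt) / 2^20;  % unused part of the reserved pool

printf ("JVM memory pool: max %.0f MB, reserved %.0f MB, free %.0f MB\n", ...
        max_mb, total_mb, free_mb);
</nowiki></pre>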


After processing a large chunk of spreadsheet information you might notice that octave's memory footprint does not shrink so it looks like Java's memory pool does not shrink back; but rest assured, the memory footprint is the allocated (reserved) memory size, not the actual used size. After the JVM has done its garbage collection, only the so-called “working set” of the memory allocation is really in use and that is a trimmed-down part of the memory allocation pool. On Windows systems it often suffices to minimize the octave terminal for a few seconds to get a more reasonable memory footprint.


===== Reading cells containing errors =====
Spreadsheet cells containing errors are transferred to Octave as NaNs. But not all errors can be caught. Cells showing #Value# in OpenOffice.org Calc due to e.g., invalid formulas, may have a 0 (zero) value stored in the value fields. It is impossible to catch this as there is no run-time formula evaluator (yet) in ODF Toolkit nor jOpenDocument (like there is in Apache POI for Excel).


* jOpenDocument doesn't set the so-called <office:value-type='string'> attribute in cells containing text; as a consequence ODF Toolkit will treat them as empty cells. OOo will read them OK.


==== Matlab compatibility ====
AFAIK there's no similar functionality in Matlab (yet?), only for reading and then very limited.
odsread is fairly function-compatible to xlsread, however.

Same goes for odswrite, odsfinfo and xlsfinfo – however odsfinfo has better functionality IMO.


==== Comparison of interfaces ====
The OCT interface (present as of io-1.2.4) offers read support for ODS 1.2, complete with all the options of ODFtoolkit and UNO, but fairly slow.


However, UNO is not stable yet (see below).


==== Troubleshooting ====
Some hints for troubleshooting ODS support are given here.
Since April 2011 the function chk_spreadsheet_support() has been included in the io package. Calling it with arguments ('', 3) (empty string and debug level 3) will echo a lot of diagnostics to the screen. Large parts of the steps outlined below have been automated in this script.

** The exact case (URE or ure, Basis or basis), name ("Basis3.2" or just "basis") and subdirectory tree (URE/java or URE/share/java) varies across OOo versions and -clones, so chk_spreadsheet_support.m can have a hard time finding all needed classes. In particularly bad cases, when chk_spreadsheet_support cannot find them, you might need to add one or more of these classes manually to the javaclasspath.


==== Development ====
As with the Excel r/w stuff, adding new interfaces should be easy and straightforward. Add relevant stanzas in odsopen, odsclose, odsfinfo & getusedrange and add new subfunctions (for the real work) to getusedrange_<INTF>, oct2ods and ods2oct.


But IMO data sets larger than 5·10<sup>5</sup> cells should not be kept in spreadsheets anyway. Use real databases for such data sets.


==== ODFDOM versions ====
I have tried various odfdom versions. As to 0.8 & 0.8.5, while the API has been simplified enormously (finally one can address cells by spreadsheet address rather than find out yourself by parsing the table-column/-row/-cell structure), many irrecoverable bugs have been introduced :-((
In addition processing ODS files became significantly slower (up to 7 times!).

* oct2ods.m (revision 7159)


=== XLS support ===
==== Files content ====
* '''xlsread.m''' &mdash; All-in-one function for reading data from one specific worksheet in an Excel spreadsheet file. This script has Matlab-compatible functionality.
* '''xlswrite.m''' &mdash; All-in-one function for writing data to one specific worksheet in an Excel spreadsheet file. This script has Matlab-compatible functionality.

* '''spsh_chkrange.m''', '''spsh_prstype.m''', '''getusedrange.m''', '''calccelladdress.m''', '''parse_sp_range.m''' &mdash; Support files called by the scripts and not meant for direct invocation by users.


==== Required support software ====
For the OCT interface (since 1.2.4/1.2.5, read-only support for OOXML (.xlsx)!):
* Nothing except unzip

For the Excel/COM interface:
* A windows computer with Excel installed. Note that 64-bit MS-Office has no support for COMx /ActiveX so you might have to resort to the Java interfaces below
* Octave Forge Windows-1.0.8 or later package WITH LATEST SVN PATCHES APPLIED. Currently (2013) windows-1.2.1 is the best option.


For the Java / Apache POI / JExcelAPI interfaces (general):
* Octave Forge java-1.2.8 package or later version on Linux
* Octave Forge java-1.2.8 with latest svn fixes on Windows/MingW
* Java JRE or JDK > 1.6.0 (hasn't been tested with earlier versions). Although not an Octave issue, as to security you'd better get the latest Java version anyway.


NOTE: EXPERIMENTAL!! A working OpenOffice.org installation. The utility function chk_spreadsheet_support can be used to add the needed entries to the javaclasspath.


==== Usage ====
'''xlsread''' and '''xlswrite''' are mere wrappers for '''xlsopen - xls2oct - xlsclose - parsecell''' and '''xlsopen - oct2xls - xlsclose''' sequences, resp. They exist for the sake of Matlab compatibility.

'''xlsfinfo''' can be used to find out which worksheet names exist in the file. For OOXML files the OCT interface suffices (specify "oct" for the REQINTF parameter). For other Excel file types you need either MS-Excel for Windows plus the windows package (specify "com" for REQINTF), or Apache POI plus Java support (specify 'poi' for REQINTF, case-insensitive; obviously the complete POI interface must have been installed).
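
As an illustration of the xlsfinfo call with a forced interface, the sketch below first writes a small scratch file and then lists its worksheets. It assumes the io package is installed and that the OCT interface can write .xlsx (io >= 2.2.0); the file name, sheet name and data are made up for this example.

<pre><nowiki>
pkg load io

fname = [tempname() ".xlsx"];                        % hypothetical scratch file
xlswrite (fname, {"a", 1; "b", 2}, "Data", "A1", "OCT");

% The second argument is REQINTF; "oct" forces the OCT interface
[ftype, sheets, fformat] = xlsfinfo (fname, "oct");
disp (sheets);                                       % sheet names and data ranges

delete (fname);
</nowiki></pre>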


Invoking '''xlsopen'''/..../'''xlsclose''' directly provides for much more flexibility, speed, and robustness than '''xlsread''' / '''xlswrite'''. Indeed, using the same file handle (pointer struct) you can mix reading & writing before writing the workbook out to disk using xlsclose.

When using JExcelAPI (JXL), after writing into a worksheet you MUST save the file – adding data to the same or another worksheet is no more possible after the first call to oct2xls(). This is a limitation of JExcelAPI.


==== Spreadsheet formula support ====
When using the COM, POI, JXL, and UNO interfaces you can:
* (When reading, xls2oct) either read spreadsheet formula results, or the literal formula text strings (also works with OCT interface);
* (When writing, oct2xls) either enter formulas in the worksheet as formulas, or enter them as literal text strings.


Worse, older Excel versions feature fewer functions than newer versions. So be wary as this may make for interesting confusion.


=== Matlab compatibility ===
==== Matlab compatibility ====


'''xlsread''', '''xlswrite''' and '''xlsfinfo''' are for the most part Matlab-compatible. Some small differences are mentioned below.
'''xlsread''', '''xlswrite''' and '''xlsfinfo''' are for the most part Matlab-compatible. Some small differences are mentioned below.
Line 323: Line 510:
* xlsread
* xlsread
** Matlab's xlsread supports invoking extra functions while reading ("passing function handle"); octave not. But this can be simulated outside xlsread.
** Matlab's xlsread supports invoking extra functions while reading ("passing function handle"); octave not. But this can be simulated outside xlsread.
** Matlab's xlsread flags some spreadsheet errors, octave-forge just returns blank cells.
** Matlab's xlsread flags some spreadsheet errors, Octave Forge just returns blank cells.
** Octave-forge returns info about the actual (rather than the requested) cell range where the data came from. Personally I find it very useful to know from what part of a worksheet the data originate so I've put quite some effort in it :-) Matlab can't, due to Excel automatically trimming returned arrays from empty outer columns and rows. Octave is more clever but the Visual Basic call used for determining the actually used range has some limitations:
** Octave Forge returns info about the actual (rather than the requested) cell range where the data came from. Personally I find it very useful to know from what part of a worksheet the data originate so I've put quite some effort in it :-) Matlab can't, due to Excel automatically trimming returned arrays from empty outer columns and rows. Octave is more clever but the Visual Basic call used for determining the actually used range has some limitations:
**# it relies on cached range values and thus may be out-of-date;
**# it relies on cached range values and thus may be out-of-date;
**# it counts empty formatted cells too. When using ActiveX/COM, if octave's xlsfinfo.m returns wrong data ranges it is most often an overestimation.
**# it counts empty formatted cells too. When using ActiveX/COM, if octave's xlsfinfo.m returns wrong data ranges it is most often an overestimation.
*:Matlab's xlsread ignores all non-numeric data values outside the smallest rectangle encompassing all numerical values. Octave's xlsread doesn't. This means that Matlab ignores all row/column headers, not very user-friendly IMO.
*:Matlab's xlsread ignores all non-numeric data values outside the smallest rectangle encompassing all numerical values. Octave's xlsread doesn't. This means that Matlab ignores all row/column headers, not very user-friendly IMO.
** When using the Java interface, reading and writing xls-files by octave-forge is platform-independent. On systems w/o installed Excel, Matlab can only read Excel 95 formatted .xls files (written using ML xlswrite's 'Basic" option) – and then differently than under Windows.....
** When using the Java interface, reading and writing xls-files by Octave Forge is platform-independent. On systems w/o installed Excel, Matlab can only read Excel 95 formatted .xls files (written using ML xlswrite's 'Basic" option) – and then differently than under Windows.....
** Matlab's xlsread returns strings for cells containing date values. This makes for endless if-then-elseif-else-end constructs to catch all expected date formats. Octave returns numerical data (where 0 = 1/1/1900 – you can easily transfer them into proper octave date values yourself using e.g. datestr(), see bottom of this document for more info). For dates before 1/1/1900, Octave returns dates as text strings.
** Matlab's xlsread returns strings for cells containing date values. This makes for endless if-then-elseif-else-end constructs to catch all expected date formats. Octave returns numerical data (where 0 = 1/1/1900 – you can easily transfer them into proper octave date values yourself using e.g. datestr(), see bottom of this document for more info). For dates before 1/1/1900, Octave returns dates as text strings.
** Matlab's xlsread invokes csvread if no Excel interface is present. Octave Forge's xlsread doesn't.
** Octave can read either formula results (evaluated formulas) or the formula text strings; Matlab can't.
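
As a rough sketch of the date conversion mentioned in the list above: assuming the numeric values follow Excel's usual 1900 date system (in which serial number 36526 is 1 January 2000), adding datenum (1899, 12, 30) gives Octave date numbers for dates from March 1900 onward. The sample values are made up for illustration; check the offset against a known date in your own file.

<pre>
% Example Excel serial dates (made-up values, not read from a file)
xl_dates = [36526; 43756.5];
% Assumed offset: Octave datenum of 30 December 1899 (Excel 1900 date system)
oct_dates = xl_dates + datenum (1899, 12, 30);
% Show the converted values as readable date strings
disp (datestr (oct_dates, 'yyyy-mm-dd HH:MM'))
</pre>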


* xlswrite
** Octave Forge's xlswrite works on systems without Excel support; Matlab's doesn't (properly).
**When specifying a sheet number larger than the number of existing sheets in an .xls file, Matlab's xlswrite adds empty sheets until the new sheet number is created; Octave's xlswrite only adds one sheet called "Sheet<number>" where <number> is the specified sheet number.
** Even better (IMO): while Matlab's xlswrite always creates Sheet1/Sheet2/Sheet3 when creating a new spreadsheet, Octave's xlswrite only creates the requested worksheet. (Did you know that you can instruct Excel to create spreadsheets with just one, or any number of, worksheets? Look in Tools | Options, General tab.)
** Octave also doesn't touch the "active sheet", but that's not automatically an advantage.
** If the specified write range is larger than the actual data array, Matlab's xlswrite adds #N/A cells to fill up the lowermost rows and rightmost columns; Octave Forge's xlswrite doesn't.
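
A minimal sketch of a write call with an explicit sheet and range, as discussed in the list above. The file name, sheet name and data are placeholders, and the 'OCT' interface is requested only so that the sketch needs neither Excel nor Java (this assumes an io version with OCT write support).

<pre>
pkg load io
data = magic (3);
% Write a 3x3 array into a range (B2:F8) that is larger than the data
status = xlswrite ('xlswrite_demo.xlsx', data, 'Results', 'B2:F8', 'OCT');
% Read the same range back to inspect which cells were actually filled
[num, txt, raw] = xlsread ('xlswrite_demo.xlsx', 'Results', 'B2:F8', 'OCT');
</pre>

Going by the last item in the list, the cells outside the 3x3 block should simply stay empty rather than being filled with #N/A.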


* xlsfinfo
** When invoking the Excel/COM interface, Octave Forge's xlsfinfo also echoes the type of sheet (worksheet, chart), not just the sheet names. Using Java I haven't found similar functionality (yet).


==== Comparison of interfaces & usage ====
Using Excel itself (through '''COM''' / '''ActiveX''' on Windows systems) is probably the most robust and versatile and especially FAST option. There's one gotcha: after some types of COM errors Excel will keep running invisibly; you can only end it through Task Manager. A tiny problem is that one cannot easily find out through COM which file types are supported (xls, wks, wk1, xlsx, etc.). Another, obvious, limitation is that COM Excel access only works on Windows systems where Excel is installed.


'''JExcelAPI''' (Java-based and therefore platform-independent) is proven technology, but switching between reading and writing is quite involved and memory-hungry when processing large spreadsheets. As the docs state, JExcelAPI is optimized for reading and it does do that well, but still slower than Excel/COM. It is worrying that upon a switch from reading to writing the existing spreadsheet is overwritten in place by a blank one, and that you can only get the contents back when writing out all of the changes. In addition, any change after the first write() is lost as a next write() doesn't seem to work; worse yet, you may completely lose the spreadsheet in question. The first is by JExcelAPI design, the second is probably a bug (in Octave Forge/Java or JExcelAPI? I don't know). Adding data to existing spreadsheets does work, but IMO undue user confidence is needed. JExcelAPI supports BIFF5 (Excel 95, only reading) and BIFF8 (Excel 97-2003). Upon overwriting, BIFF5 spreadsheets are converted silently to BIFF8. JExcelAPI, unlike Apache POI, doesn't evaluate functions while reading but instead relies on cached results (i.e. results computed by Excel itself). Depending on Excel settings ("Automatic calculation" ON or OFF) this may or may not yield the expected results.


'''Apache POI''' (Java-based and platform-independent too) is based on the OpenOffice.org I/O Excel r/w routines. It is more versatile than JExcelAPI: while it doesn't support BIFF5, it does support BIFF8 (Excel 97 – 2003) and OOXML (Excel 2007). It is slower than native JXL, let alone Excel & COM, but it features active formula evaluation, although at the moment (v. 3.8) not all Excel functions have been implemented. Obviously, as new functions are added in every new Excel release, it's hard for Apache POI to catch up. I've made the relevant subfunction (xls2jpoi2oct) fall back to cached formula results (and yield a suitable warning) for non-implemented Excel functions while reading Excel files.


All in all, of the three Java options I'd prefer Apache POI rather than OpenXLS or JExcelAPI. But the latter is indispensable for BIFF5 formats. Once UNO is stable it is to be preferred as it can read ALL file formats supported by OOo (viz. wk1, ods, xlsx, sxc, ...).
'''OCT''' offers read support for OOXML files (.xlsx) only, but it is by far the fastest read option; faster than Excel itself.


Some notes on the choice for Java:
So Java is a compromise between portability and rapid development time versus capacity (and speed). But IMO data sets larger than 5×10^5 cells should not be kept in spreadsheets anyway. Better use real databases for such data sets.


==== Troubleshooting ====
Some hints for troubleshooting Excel support are contained in this thread: http://sourceforge.net/mailarchive/forum.php?thread_name=4C61B649.9090802%40hccnet.nl&forum_name=octave-dev dated August 10, 2010. A more structured approach is below.


#: xls2 = xlsopen ('test.xls', 1, 'jxl'). If this works and xls2 is a struct with various fields containing objects, the JExcelAPI interface (JXL) works as well. Don't forget to do xls2 = xlsclose (xls2) to close the file.
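
The manual check above can be repeated for several interfaces in one go. This is only a sketch: the list of interface names is an assumption based on the abbreviations used on this page, and 'test.xls' is the file name from the step above.

<pre>
pkg load io
intfs = {'com', 'poi', 'jxl'};        % interfaces to try (assumed names)
works = false (1, numel (intfs));
for ii = 1:numel (intfs)
  try
    xls = xlsopen ('test.xls', 1, intfs{ii});
    works(ii) = isstruct (xls);       % a struct means the interface responded
    if (works(ii))
      xls = xlsclose (xls);           % close the file pointer again
    end
  catch
    works(ii) = false;                % xlsopen raised an error
  end
  printf ("%s: %d\n", intfs{ii}, works(ii));
end
</pre>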


==== Development ====
xlsopen/xlsclose and friends have been written so that adding other interfaces (Perl? native Octave? ...?) should be very easily accomplished. xlsopen.m merely needs two stanzas, xlsfinfo.m and getusedrange.m each need an additional elseif stanza, and xlsclose.m needs a small stanza for closing the pointer struct and writing to disk. The real work lies in creating the relevant xls2<...>2oct, oct2<...>2xls and <getusedrange_...> subfunction scripts in xls2oct.m, oct2xls.m and getusedrange.m, respectively, but that shouldn't be really hard, depending on the quality and documentation of the interface support libraries. Separating the file access functions from the actual reading/writing from/to the workbook in memory has made the developer's life (I mean: my time developing this stuff) much easier.


*Support for "passing function handle" in xlsread.


=== OCT interface ===


Since io package version 1.2.4, an interface called "OCT" has been available. Except for unzip, it has no dependencies. It's still experimental but fast! Feel free to test it and give us feedback.
Currently it supports reading and writing .xlsx, .ods and .gnumeric files (the latter in yet-to-be-released io-2.2.2).
If  
<pre>chk_spreadsheet_support == 0</pre>
it's used automatically (default interface). Otherwise you can force the usage like  


<pre>m = xlsread ('file.xlsx', 1, [], 'OCT');</pre>
 
Since io package version 2.2.0, the "OCT" interface has experimental write support for .xlsx and .ods formats, and since io-2.2.2 (expected mid-May 2014) also for gnumeric. If you can't wait for gnumeric I/O you can check out a snapshot from svn (see octave.sf.net, http://sourceforge.net/p/octave/code/HEAD/tree/trunk/octave-forge/main/io/).
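
Put together, the interface selection described in this section might look like the following sketch. 'file.xlsx' is the placeholder name used above; the check that the file exists is added here only so the snippet runs harmlessly without it.

<pre>
pkg load io
fname = 'file.xlsx';
% Per the text above, a return value of 0 means OCT is used by default
if (chk_spreadsheet_support () == 0)
  intf = [];                          % let xlsread pick the default
else
  intf = 'OCT';                       % force the OCT interface
end
if (exist (fname, 'file') == 2)
  m = xlsread (fname, 1, [], intf);
end
</pre>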


[[Category:Octave Forge]]