IO package: Difference between revisions

Jump to navigation Jump to search
3,574 bytes added ,  18 October 2019
m
Fix typo
m (Fix typo)
(17 intermediate revisions by 6 users not shown)
Line 1: Line 1:
The IO package is part of the octave-forge project and provides input/output from/in external formats.
The {{Forge|io|IO package}} is part of the Octave Forge project and provides input/output from/in external formats.


=== About read/write support (TL;DR) ===
<div class="tocinline">__TOC__</div>
 
=== About read/write support ===


Most people need this package to read and write Excel files. But the io package can read/write Open/Libre Office, Gnumeric and some less important files too.
Most people need this package to read and write Excel files. But the io package can read/write Open/Libre Office, Gnumeric and some less important files too.
Line 10: Line 12:
.xls (Excel95)        R                R      R
.xls (Excel95)        R                R      R
.xls (Excel97-2003)  +  +      +      +  +  +
.xls (Excel97-2003)  +  +      +      +  +  +
.xlsx (Excel2007+)    ~          +        (+)  R           +
.xlsx (Excel2007+)    ~          +        (+)  +           +
.xlsb, .xlsm          ~                    ?  R          R?
.xlsb, .xlsm          ~                    ?  R          R?
.wk1                  +                        R
.wk1                  +                        R
Line 19: Line 21:
.fods                                          +
.fods                                          +
.uos                                            +
.uos                                            +
.dif                                           +
.dif                 +                        +
.csv                  +                        R
.csv                  +                        R
.gnumeric                                                  +
.gnumeric                                                  +
--------------------------------------------------------------
--------------------------------------------------------------
R : only read;  + : full read/write; ~ : dependent on Excel version
</nowiki></pre>
</nowiki></pre>


==== xlswrite / odswrite versus xlsopen / odsopen ..... xlsclose / odsclose ====
Matlab users are used to xlsread and xlswrite, functions that can only read data from, or write data to, one sheet in a spreadsheet file at a time. For each operation, xlsread and xlswrite first have to read the entire spreadsheet file, for write operations xlswrite also has to finally write it out completely to disk.
There are faster ways, but then you'll have to dive into ActiveX/COM/VisualBasic programming.
If you want to move multiple pieces of data to/from a spreadsheet file, the io package offers a much more versatile scheme:
* First open the spreadsheet file using xlsopen (for Excel or gnumeric files) or odsopen (.ods or .gnumeric).
'''NOTE''': the output of these functions is a file pointer handle that you should treat carefully!
* (for reading data) Read the data using raw_data = xls2oct (<fileptr> [,sheet#] [,cellrange] [,options])
* Next, optionally split the data in numerical, text and raw data and optionally get the limits of where these came from:
[num, txt, raw, lims] = parsecell (data, <fileptr.lims>)
* (for writing data) Write the data using <fileptr> = oct2xls (data, <fileptr> [,sheet#] [,cellrange] [,options])
* When you're finished, DO NOT FORGET to close the file pointer handle:
<fileptr> = xlsclose (<fileptr>)</pre>
Mixing read and write operations in any order is permitted (the only exception: not with the JXL -JExcelAPI- interface).
The same goes for odsopen-ods2oct-oct2ods-odsclose sequences.
Obviously this is much more flexible (and FASTER) than xlsread and xlswrite. In fact, Octave's io package xlsread is a mere wrapper for an xlsopen-xls2oct-parsecell-xlsclose sequence. Similarly for xlswrite, odsread, and odswrite.


==== .xls ~= .xlsx ====
==== .xls ~= .xlsx ====
Line 31: Line 55:
* .xls - is an outdated default binary file format from <= Office 2003 - '''try to avoid this format!'''
* .xls - is an outdated default binary file format from <= Office 2003 - '''try to avoid this format!'''
* .xlsx - is the new default file format since Office 2007. [https://en.wikipedia.org/wiki/OOXML It consists of xml files stored in a .zip container.] - '''always save in or convert to this format!'''
* .xlsx - is the new default file format since Office 2007. [https://en.wikipedia.org/wiki/OOXML It consists of xml files stored in a .zip container.] - '''always save in or convert to this format!'''
** The ''(new)'' OCT interface can read ''(since version 1.2.5)'' and write ''(since version 2.2.0)'' .xlsx files dependency-free! No need of MS Windows+Office nor Java.
* The ''(new)'' OCT interface can read ''(since version 1.2.5)'' and write ''(since version 2.2.0)'' .xlsx files dependency-free! No need of MS Windows+Office nor Java.
* Windows is notorious for hiding "known" file extensions. However in Windows Explorer it is easy to change this and have Windows show all file extensions.
* Windows is notorious for hiding "known" file extensions. However in Windows Explorer it is easy to change this and have Windows show all file extensions.


==== different interfaces ====
==== different interfaces ====
Line 54: Line 79:


For some rarely used file formats you'll need LibreOffice + Octave built with Java enabled + a Java JRE or -JDK. But OK, once there you can enjoy formats then like Unified Office Format, Data Interchange Format, SYLK, OpenDocument Flat XML, the old OpenOffice.org .sxc format and some others you may have heard of ;-)
For some rarely used file formats you'll need LibreOffice + Octave built with Java enabled + a Java JRE or -JDK. But OK, once there you can enjoy formats then like Unified Office Format, Data Interchange Format, SYLK, OpenDocument Flat XML, the old OpenOffice.org .sxc format and some others you may have heard of ;-)


==== force an interface ====
==== force an interface ====
Line 69: Line 95:


And so on ...
And so on ...
==== 32 vs. 64-bit issues ====
Generally, if you use a Java-based interface for spreadsheet I/O, it doesn't matter much whether you use Octave 32-bit or Octave 64-bit.
However, the UNO interface (for LibreOffice & OpenOffice.org) is more pertinent: 64-bit Octave can only use a 64-bit LibreOffice with UNO, same for 32 bit.


==== Java example ====
==== Java example ====
Line 100: Line 133:
</nowiki></pre>
</nowiki></pre>


The chk_spreadsheet_support output should be now > 0.
An easier way is to collect all required Java class libs for spreadsheet I/O (the .jar files) in one subdir and have chk_spreadsheet_support .m sort it all out:
<pre><nowiki>
octave:8>  chk_spreadsheet_support ('/full/path/to/subdir/with/.jar/files')
</nowiki></pre>
 
For UNO (LibreOffice-behind-the-scenes) the call is a bit different:
<pre><nowiki>
octave:8>  chk_spreadsheet_support ('', 0, '/full/path/to/LibreOffice/installation')
</nowiki></pre>
 
(On Windows, the io package tries to automatically find all required Java class libs and LibreOffice. To help it, put the Java class libs in you user profile (home directory) in a subdir "java", e.g., C:\Users\Eddy\java. chk_spreadsheet_support searches that location automagically.
On Linux this automatic searching has been disabled as the io package took ages (well, minutes) to load...)
 
 
Anyway, the chk_spreadsheet_support output should be now > 0.


<pre><nowiki>
<pre><nowiki>
Line 118: Line 165:


And reading/writing .xls files should work.
And reading/writing .xls files should work.
POI is located differently on every system. To easily find where are the files and their names try to search for: libapache-poi-java.list.
in terminal:
<pre><nowiki>
find / -name "libapache-poi-java*" 2>/dev/null
</nowiki></pre>
in octave (poi path in ubuntu):
<pre><nowiki>
fid = fopen ("/var/lib/dpkg/info/libapache-poi-java.list");
line = fgetl (fid);
while (line != -1)
    javaaddpath(line);
    line = fgetl (fid);
endwhile
fclose (fid);
pkg load io;
disp(chk_spreadsheet_support); % should be 2
</nowiki></pre>


== Detailed Information (TL) ==
== Detailed Information (TL) ==
Line 344: Line 409:
For the Excel/COM interface:
For the Excel/COM interface:
* A windows computer with Excel installed. Note that 64-bit MS-Office has no support for COMx /ActiveX so you might have to resort to the Java interfaces below
* A windows computer with Excel installed. Note that 64-bit MS-Office has no support for COMx /ActiveX so you might have to resort to the Java interfaces below
* Octave-forge Windows-1.0.8 or later package WITH LATEST SVN PATCHES APPLIED. Currently (2013) windows-1.2.1 is the best option.
* Octave Forge Windows-1.0.8 or later package WITH LATEST SVN PATCHES APPLIED. Currently (2013) windows-1.2.1 is the best option.


For the Java / Apache POI / JExcelAPI interfaces (general):
For the Java / Apache POI / JExcelAPI interfaces (general):
* octave-forge java-1.2.8 package or later version on Linux
* octave Forge java-1.2.8 package or later version on Linux
* octave-forge java-1.2.8 with latest svn fixes on Windows/MingW
* octave Forge java-1.2.8 with latest svn fixes on Windows/MingW
* Java JRE or JDK > 1.6.0 (hasn't been tested with earlier versions). Although not an Octave issue, as to security you'd better get the latest Java version anyway.
* Java JRE or JDK > 1.6.0 (hasn't been tested with earlier versions). Although not an Octave issue, as to security you'd better get the latest Java version anyway.


Line 445: Line 510:
* xlsread
* xlsread
** Matlab's xlsread supports invoking extra functions while reading ("passing function handle"); octave not. But this can be simulated outside xlsread.
** Matlab's xlsread supports invoking extra functions while reading ("passing function handle"); octave not. But this can be simulated outside xlsread.
** Matlab's xlsread flags some spreadsheet errors, octave-forge just returns blank cells.
** Matlab's xlsread flags some spreadsheet errors, Octave Forge just returns blank cells.
** Octave-forge returns info about the actual (rather than the requested) cell range where the data came from. Personally I find it very useful to know from what part of a worksheet the data originate so I've put quite some effort in it :-) Matlab can't, due to Excel automatically trimming returned arrays from empty outer columns and rows. Octave is more clever but the Visual Basic call used for determining the actually used range has some limitations:
** Octave Forge returns info about the actual (rather than the requested) cell range where the data came from. Personally I find it very useful to know from what part of a worksheet the data originate so I've put quite some effort in it :-) Matlab can't, due to Excel automatically trimming returned arrays from empty outer columns and rows. Octave is more clever but the Visual Basic call used for determining the actually used range has some limitations:
**# it relies on cached range values and thus may be out-of-date;
**# it relies on cached range values and thus may be out-of-date;
**# it counts empty formatted cells too. When using ActiveX/COM, if octave's xlsfinfo.m returns wrong data ranges it is most often an overestimation.
**# it counts empty formatted cells too. When using ActiveX/COM, if octave's xlsfinfo.m returns wrong data ranges it is most often an overestimation.
*:Matlab's xlsread ignores all non-numeric data values outside the smallest rectangle encompassing all numerical values. Octave's xlsread doesn't. This means that Matlab ignores all row/column headers, not very user-friendly IMO.
*:Matlab's xlsread ignores all non-numeric data values outside the smallest rectangle encompassing all numerical values. Octave's xlsread doesn't. This means that Matlab ignores all row/column headers, not very user-friendly IMO.
** When using the Java interface, reading and writing xls-files by octave-forge is platform-independent. On systems w/o installed Excel, Matlab can only read Excel 95 formatted .xls files (written using ML xlswrite's 'Basic" option) – and then differently than under Windows.....
** When using the Java interface, reading and writing xls-files by Octave Forge is platform-independent. On systems w/o installed Excel, Matlab can only read Excel 95 formatted .xls files (written using ML xlswrite's 'Basic" option) – and then differently than under Windows.....
** Matlab's xlsread returns strings for cells containing date values. This makes for endless if-then-elseif-else-end constructs to catch all expected date formats. Octave returns numerical data (where 0 = 1/1/1900 – you can easily transfer them into proper octave date values yourself using e.g. datestr(), see bottom of this document for more info). For dates before 1/1/1900, Octave returns dates as text strings.
** Matlab's xlsread returns strings for cells containing date values. This makes for endless if-then-elseif-else-end constructs to catch all expected date formats. Octave returns numerical data (where 0 = 1/1/1900 – you can easily transfer them into proper octave date values yourself using e.g. datestr(), see bottom of this document for more info). For dates before 1/1/1900, Octave returns dates as text strings.
** Matlab's xlsread invokes csvread if no Excel interface is present. Octave-forge's xlsread doesn't.
** Matlab's xlsread invokes csvread if no Excel interface is present. Octave Forge's xlsread doesn't.
** Octave can read either formula results (evaluated formulas) or the formula text strings; Matlab can't.
** Octave can read either formula results (evaluated formulas) or the formula text strings; Matlab can't.


* xlswrite
* xlswrite
** Octave-forge's xlswrite works on systems w/o Excel support, Matlab's doesn't (properly).
** Octave Forge's xlswrite works on systems w/o Excel support, Matlab's doesn't (properly).
**When specifying a sheet number larger than the number of existing sheets in an .xls file, Matlab's xlswrite adds empty sheets until the new sheet number is created; Octave's xlswrite only adds one sheet called "Sheet<number>" where <number> is the specified sheet number.
**When specifying a sheet number larger than the number of existing sheets in an .xls file, Matlab's xlswrite adds empty sheets until the new sheet number is created; Octave's xlswrite only adds one sheet called "Sheet<number>" where <number> is the specified sheet number.
** Even better (IMO) while M's xlswrite always creates Sheet1/Sheet2/Sheet3 when creating a new spreadsheet, octave's xlswrite only creates the requested worksheet. (Did you know that you can instruct Excel to create spreadsheets with just one, or any number of, worksheets? Look in Tools | Options, General tab.)
** Even better (IMO) while M's xlswrite always creates Sheet1/Sheet2/Sheet3 when creating a new spreadsheet, octave's xlswrite only creates the requested worksheet. (Did you know that you can instruct Excel to create spreadsheets with just one, or any number of, worksheets? Look in Tools | Options, General tab.)
** Oh and octave doesn't touch the "active sheet" - but that's not automatically an advantage.
** Oh and octave doesn't touch the "active sheet" - but that's not automatically an advantage.
** If the specified write range is larger than the actual data array, Matlab's xlswrite adds #N/A cells to fill up the lowermost rows and rightmost columns; octave-forge's xlswrite doesn't.
** If the specified write range is larger than the actual data array, Matlab's xlswrite adds #N/A cells to fill up the lowermost rows and rightmost columns; Octave Forge's xlswrite doesn't.


* xlsfinfo
* xlsfinfo
** When invoking Excel/COM interface, octave-forge's xlsfinfo also echoes the type of sheet (worksheet, chart), not just the sheet names. Using Java I haven't found similar functionality (yet).
** When invoking Excel/COM interface, Octave Forge's xlsfinfo also echoes the type of sheet (worksheet, chart), not just the sheet names. Using Java I haven't found similar functionality (yet).


==== Comparison of interfaces & usage ====
==== Comparison of interfaces & usage ====
Using Excel itself (through '''COM''' / '''ActiveX''' on Windows systems) is probably the most robust and versatile and especially FAST option. There's one gotcha: in case of some type of COM errors Excel will keep running invisibly; you can only end it through Task Manager. A tiny problem is that one cannot find out easily through COM what file types are supported; xls, wks, wk1, xlsx, etc. Another -obvious- limitation is that COM Excel access only works on Windows systems where Excel is installed.
Using Excel itself (through '''COM''' / '''ActiveX''' on Windows systems) is probably the most robust and versatile and especially FAST option. There's one gotcha: in case of some type of COM errors Excel will keep running invisibly; you can only end it through Task Manager. A tiny problem is that one cannot find out easily through COM what file types are supported; xls, wks, wk1, xlsx, etc. Another -obvious- limitation is that COM Excel access only works on Windows systems where Excel is installed.


'''JExcelAPI''' (Java-based and therefore platform-independent) is proven technology but switching between reading and writing is quite involved and memory-hungry when processing large spreadsheets. As the docs state, JExcelAPI is optimized for reading and it does do that well - but still slower than Excel/COM. The fact that upon a switch from reading to writing the existing spreadsheet is overwritten in place by a blank one and that you can only get the contents back wen writing out all of the changes is worrying - and any change after the first write() is lost as a next write() doesn't seem to work, worse yet, you may completely loose the spreadsheet in question. The first is by JExcelAPI design, the second is probably a bug (in octave-forge/Java or JExcelAPI ? I don't know). Adding data to existing spreadsheets does work, but IMO undue user confidence is needed. JExcelAPI supports BIFF5 (only reading) and BIFF8 (Excel 95 and Excel 97-2003, respectively). Upon overwriting, BIFF5 spreadsheets are converted silently to BIFF8. JexcelAPI, unlike ApachePOI, doesn't evaluate functions while reading but instead relies on cached results (i.e. results computed by Excel itself). Depending on Excel settings ("Automatic calculation" ON or OFF) this may or may not yield incorrect (or expected) results.
'''JExcelAPI''' (Java-based and therefore platform-independent) is proven technology but switching between reading and writing is quite involved and memory-hungry when processing large spreadsheets. As the docs state, JExcelAPI is optimized for reading and it does do that well - but still slower than Excel/COM. The fact that upon a switch from reading to writing the existing spreadsheet is overwritten in place by a blank one and that you can only get the contents back wen writing out all of the changes is worrying - and any change after the first write() is lost as a next write() doesn't seem to work, worse yet, you may completely loose the spreadsheet in question. The first is by JExcelAPI design, the second is probably a bug (in Octave Forge/Java or JExcelAPI ? I don't know). Adding data to existing spreadsheets does work, but IMO undue user confidence is needed. JExcelAPI supports BIFF5 (only reading) and BIFF8 (Excel 95 and Excel 97-2003, respectively). Upon overwriting, BIFF5 spreadsheets are converted silently to BIFF8. JexcelAPI, unlike ApachePOI, doesn't evaluate functions while reading but instead relies on cached results (i.e. results computed by Excel itself). Depending on Excel settings ("Automatic calculation" ON or OFF) this may or may not yield incorrect (or expected) results.


'''Apache POI''' (Java-based and platform-independent too) is based on the OpenOffice.org I/O Excel r/w routines. It is a more versatile than JExcelAPI, while it doesn't support BIFF5 it does support BIFF8 (Excel 97 – 2003) and OOXML (Excel 2007). It is slower than native JXL let alone Excel & COM but it features active formula evaluation, although at the moment (v. 3.8) not all Excel functions have been implemented. Obviously, as new functions are added in every new Excel release it's hard to catch up for Apache POI. I've made the relevant subfunction (xls2jpoi2oct) fall back to cached formula results (and yield a suitable warning) for non-implemented Excel functions while reading Excel files.
'''Apache POI''' (Java-based and platform-independent too) is based on the OpenOffice.org I/O Excel r/w routines. It is a more versatile than JExcelAPI, while it doesn't support BIFF5 it does support BIFF8 (Excel 97 – 2003) and OOXML (Excel 2007). It is slower than native JXL let alone Excel & COM but it features active formula evaluation, although at the moment (v. 3.8) not all Excel functions have been implemented. Obviously, as new functions are added in every new Excel release it's hard to catch up for Apache POI. I've made the relevant subfunction (xls2jpoi2oct) fall back to cached formula results (and yield a suitable warning) for non-implemented Excel functions while reading Excel files.
Line 540: Line 605:
Since io package version 2.2.0, the "OCT" interface has experimental write support for .xlsx and .ods formats, since io-2.2.2 (expected mid-May 2014) also for gnumeric. If you can't wait for gnumeric I/O you can checkout a snapshot from svn (see octave.sf.net, http://sourceforge.net/p/octave/code/HEAD/tree/trunk/octave-forge/main/io/)
Since io package version 2.2.0, the "OCT" interface has experimental write support for .xlsx and .ods formats, since io-2.2.2 (expected mid-May 2014) also for gnumeric. If you can't wait for gnumeric I/O you can checkout a snapshot from svn (see octave.sf.net, http://sourceforge.net/p/octave/code/HEAD/tree/trunk/octave-forge/main/io/)


[[Category:Octave-Forge]]
[[Category:Octave Forge]]
21

edits

Navigation menu