Editing IO package


Latest revision Your text
Line 1: Line 1:
The {{Forge|io|IO package}} is part of the Octave Forge project and provides input/output from/in external formats.
The IO package is part of the octave-forge project and provides input/output from/in external formats.
 
<div class="tocinline">__TOC__</div>


=== About read/write support ===
=== About read/write support ===
Line 96: Line 94:
And so on ...
And so on ...


==== 32 vs. 64-bit issues ====
Generally, if you use a Java-based interface for spreadsheet I/O, it doesn't matter much whether you use 32-bit or 64-bit Octave, as long as Octave's bit width matches that of the Java JRE. If you want to use the UNO interface (for LibreOffice & OpenOffice.org), LibreOffice's bit width must also match that of Octave and the Java JRE.
So for spreadsheet I/O with Java-based add-on software such as Apache POI, 64-bit Octave requires a 64-bit Java JRE and, if so desired, a 64-bit LibreOffice. The add-on software itself (Java .jar files) is bit-width agnostic.
On Windows, Octave with the Octave Forge windows package loaded can invoke MS-Excel for spreadsheet I/O, but only 32-bit MS-Office; 64-bit MS-Office does not support ActiveX. In this case it doesn't matter much whether Octave is 32-bit or 64-bit.
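
The bit widths of Octave and the Java JRE can be compared from within Octave. The following is only a sketch: it assumes Octave was built with Java support and that the JVM sets the <code>sun.arch.data.model</code> system property (HotSpot-based JREs do; with other JVMs the property may be empty, in which case <code>os.arch</code> is the value to inspect).
<pre><nowiki>
% Sketch: compare Octave's architecture with that of the Java JRE it loads.
arch = computer ("arch");   % contains e.g. "x86_64" on 64-bit x86 builds
printf ("Octave architecture: %s\n", arch);

if (usejava ("jvm"))
  % Assumption: these system properties are set by the JVM in use
  jbits = javaMethod ("getProperty", "java.lang.System", "sun.arch.data.model");
  jarch = javaMethod ("getProperty", "java.lang.System", "os.arch");
  printf ("Java data model: %s-bit (os.arch = %s)\n", char (jbits), char (jarch));
else
  disp ("No Java virtual machine available in this Octave session.");
endif
</nowiki></pre>
If the two reports disagree (for example a 64-bit Octave next to a 32-bit JRE), the Java-based interfaces are unlikely to load.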


==== Java example ====
==== Java example ====
Line 134: Line 125:
</nowiki></pre>
</nowiki></pre>


An easier way is to collect all required Java class libs for spreadsheet I/O (the .jar files) in one subdir and have chk_spreadsheet_support .m sort it all out:
An easier way is to collect all required Java class libs fo spreadsheet I/O (the .jar files) in one subdir and have chk_spreadsheet_support .m sort it all out:
<pre><nowiki>
<pre><nowiki>
octave:8>  chk_spreadsheet_support ('/full/path/to/subdir/with/.jar/files')
octave:8>  chk_spreadsheet_support ('/full/path/to/subdir/with/.jar/files')
Line 166: Line 157:


And reading/writing .xls files should work.
And reading/writing .xls files should work.
POI is located differently on every system. To find the files and their names, search for libapache-poi-java.list.
In a terminal:
<pre><nowiki>
find / -name "libapache-poi-java*" 2>/dev/null
</nowiki></pre>
In Octave (POI path on Ubuntu):
<pre><nowiki>
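fid = fopen ("/var/lib/dpkg/info/libapache-poi-java.list");
line = fgetl (fid);
while (ischar (line))   % fgetl returns -1 (a number) at end of file
    if (! isempty (regexp (line, '\.jar$', "once")))
        javaaddpath (line);   % add only the .jar files, skip directories and docs
    endif
    line = fgetl (fid);
endwhile
fclose (fid);
pkg load io;
disp (chk_spreadsheet_support); % should be 2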
</nowiki></pre>


== Detailed Information (TL) ==
== Detailed Information (TL) ==
Line 410: Line 383:
For the Excel/COM interface:
For the Excel/COM interface:
* A windows computer with Excel installed. Note that 64-bit MS-Office has no support for COMx /ActiveX so you might have to resort to the Java interfaces below
* A windows computer with Excel installed. Note that 64-bit MS-Office has no support for COMx /ActiveX so you might have to resort to the Java interfaces below
* Octave Forge Windows-1.0.8 or later package WITH LATEST SVN PATCHES APPLIED. Currently (2013) windows-1.2.1 is the best option.
* Octave-forge Windows-1.0.8 or later package WITH LATEST SVN PATCHES APPLIED. Currently (2013) windows-1.2.1 is the best option.


For the Java / Apache POI / JExcelAPI interfaces (general):
For the Java / Apache POI / JExcelAPI interfaces (general):
* octave Forge java-1.2.8 package or later version on Linux
* octave-forge java-1.2.8 package or later version on Linux
* octave Forge java-1.2.8 with latest svn fixes on Windows/MingW
* octave-forge java-1.2.8 with latest svn fixes on Windows/MingW
* Java JRE or JDK > 1.6.0 (hasn't been tested with earlier versions). Although not an Octave issue, as to security you'd better get the latest Java version anyway.
* Java JRE or JDK > 1.6.0 (hasn't been tested with earlier versions). Although not an Octave issue, as to security you'd better get the latest Java version anyway.


Line 495: Line 468:


==== Spreadsheet formula support ====
==== Spreadsheet formula support ====
When using the OCT, COM, POI, JXL, and UNO interfaces you can:
When using the COM, POI, JXL, and UNO interfaces you can:
* (When reading, xls2oct) either read spreadsheet formula results, or the literal formula text strings (also works with OCT interface);
* (When reading, xls2oct) either read spreadsheet formula results, or the literal formula text strings (also works with OCT interface);
* (When writing, oct2xls) either enter formulas in the worksheet as formulas, or enter them as literal text strings.
* (When writing, oct2xls) either enter formulas in the worksheet as formulas, or enter them as literal text strings.
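The two modes above can be sketched as follows. This is an illustration only: it assumes the io package is loaded with a working POI interface, and that the options struct accepts a <code>formulas_as_text</code> field as described in the help texts of xls2oct and oct2xls. The file name and sheet name are placeholders.
<pre><nowiki>
pkg load io

% Write three numbers and a formula; formulas_as_text = 0 enters a real formula
wopts.formulas_as_text = 0;
xls = xlsopen ("formulas.xls", 1, "POI");    % 1 = open for writing
xls = oct2xls ({1; 2; 3; "=SUM(A1:A3)"}, xls, "Sheet1", "A1", wopts);
xls = xlsclose (xls);

% Read the same range back as literal formula text rather than results
ropts.formulas_as_text = 1;
xls = xlsopen ("formulas.xls", 0, "POI");    % 0 = open for reading only
raw = xls2oct (xls, "Sheet1", "A1:A4", ropts);
xls = xlsclose (xls);
</nowiki></pre>
Setting <code>ropts.formulas_as_text = 0</code> instead would return the evaluated results, subject to the evaluator limitations of the chosen interface.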
Line 507: Line 480:
Be aware that there's no formula evaluator in either JExcelAPI (JXL) or OpenXLS (OXS). So if you create formulas in your spreadsheet using oct2xls or xlswrite with 'JXL' or 'OXS', do not expect meaningful results when reading those files later on, unless you open them in Excel and write them back to disk.
Be aware that there's no formula evaluator in either JExcelAPI (JXL) or OpenXLS (OXS). So if you create formulas in your spreadsheet using oct2xls or xlswrite with 'JXL' or 'OXS', do not expect meaningful results when reading those files later on, unless you open them in Excel and write them back to disk.


While both Apache POI and JExcelAPI feature a formula validator, not all spreadsheet functions present in Excel have been implemented (yet). <br />
While both Apache POI and JExcelAPI feature a formula validator, not all spreadsheet functions present in Excel have been implemented (yet).
Worse, older Excel versions feature fewer functions than newer versions. So be wary as this may make for interesting confusion. <br />
Worse, older Excel versions feature fewer functions than newer versions. So be wary as this may make for interesting confusion.
Reading and writing "shared formulas" with the OCT interface isn't (yet) supported; see [https://savannah.gnu.org/bugs/?52875 bug #52875].


==== Matlab compatibility ====
==== Matlab compatibility ====
Line 518: Line 489:
* xlsread
* xlsread
** Matlab's xlsread supports invoking extra functions while reading ("passing function handle"); Octave's doesn't. But this can be simulated outside xlsread.
** Matlab's xlsread supports invoking extra functions while reading ("passing function handle"); Octave's doesn't. But this can be simulated outside xlsread.
** Matlab's xlsread flags some spreadsheet errors, Octave Forge just returns blank cells.
** Matlab's xlsread flags some spreadsheet errors, octave-forge just returns blank cells.
** Octave Forge returns info about the actual (rather than the requested) cell range where the data came from. Personally I find it very useful to know from what part of a worksheet the data originate so I've put quite some effort in it :-) Matlab can't, due to Excel automatically trimming returned arrays from empty outer columns and rows. Octave is more clever but the Visual Basic call used for determining the actually used range has some limitations:
** Octave-forge returns info about the actual (rather than the requested) cell range where the data came from. Personally I find it very useful to know from what part of a worksheet the data originate so I've put quite some effort in it :-) Matlab can't, due to Excel automatically trimming returned arrays from empty outer columns and rows. Octave is more clever but the Visual Basic call used for determining the actually used range has some limitations:
**# it relies on cached range values and thus may be out-of-date;
**# it relies on cached range values and thus may be out-of-date;
**# it counts empty formatted cells too. When using ActiveX/COM, if octave's xlsfinfo.m returns wrong data ranges it is most often an overestimation.
**# it counts empty formatted cells too. When using ActiveX/COM, if octave's xlsfinfo.m returns wrong data ranges it is most often an overestimation.
*:Matlab's xlsread ignores all non-numeric data values outside the smallest rectangle encompassing all numerical values. Octave's xlsread doesn't. This means that Matlab ignores all row/column headers, not very user-friendly IMO.
*:Matlab's xlsread ignores all non-numeric data values outside the smallest rectangle encompassing all numerical values. Octave's xlsread doesn't. This means that Matlab ignores all row/column headers, not very user-friendly IMO.
** When using the Java interface, reading and writing xls-files by Octave Forge is platform-independent. On systems w/o installed Excel, Matlab can only read Excel 95 formatted .xls files (written using ML xlswrite's 'Basic' option) – and then differently than under Windows.
** When using the Java interface, reading and writing xls-files by octave-forge is platform-independent. On systems w/o installed Excel, Matlab can only read Excel 95 formatted .xls files (written using ML xlswrite's 'Basic' option) – and then differently than under Windows.
** Matlab's xlsread returns strings for cells containing date values. This makes for endless if-then-elseif-else-end constructs to catch all expected date formats. Octave returns numerical data (where 0 = 1/1/1900 – you can easily transfer them into proper octave date values yourself using e.g. datestr(), see bottom of this document for more info). For dates before 1/1/1900, Octave returns dates as text strings.
** Matlab's xlsread returns strings for cells containing date values. This makes for endless if-then-elseif-else-end constructs to catch all expected date formats. Octave returns numerical data (where 0 = 1/1/1900 – you can easily transfer them into proper octave date values yourself using e.g. datestr(), see bottom of this document for more info). For dates before 1/1/1900, Octave returns dates as text strings.
** Matlab's xlsread invokes csvread if no Excel interface is present. Octave Forge's xlsread doesn't.
** Matlab's xlsread invokes csvread if no Excel interface is present. Octave-forge's xlsread doesn't.
** Octave can read either formula results (evaluated formulas) or the formula text strings; Matlab can't.
** Octave can read either formula results (evaluated formulas) or the formula text strings; Matlab can't.
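The date conversion mentioned in the xlsread notes above can be sketched as follows. It assumes the numbers returned for date cells are plain Excel serial dates in the 1900 date system; the offset is only valid for serials of 61 and higher (1 March 1900 onward), because Excel counts a non-existent 29 February 1900.
<pre><nowiki>
% Sketch: convert Excel serial dates (1900 date system) to Octave datenums.
% Offset 693960 equals datenum (1899, 12, 30).
xl2datenum = @(serial) serial + 693960;

serial = 36526;                               % example value read from a date cell
datestr (xl2datenum (serial), "yyyy-mm-dd")   % ans = 2000-01-01
</nowiki></pre>
The helper name <code>xl2datenum</code> is made up for this example and is not part of the io package.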


* xlswrite
* xlswrite
** Octave Forge's xlswrite works on systems w/o Excel support, Matlab's doesn't (properly).
** Octave-forge's xlswrite works on systems w/o Excel support, Matlab's doesn't (properly).
**When specifying a sheet number larger than the number of existing sheets in an .xls file, Matlab's xlswrite adds empty sheets until the new sheet number is created; Octave's xlswrite only adds one sheet called "Sheet<number>" where <number> is the specified sheet number.
**When specifying a sheet number larger than the number of existing sheets in an .xls file, Matlab's xlswrite adds empty sheets until the new sheet number is created; Octave's xlswrite only adds one sheet called "Sheet<number>" where <number> is the specified sheet number.
** Even better (IMO) while M's xlswrite always creates Sheet1/Sheet2/Sheet3 when creating a new spreadsheet, octave's xlswrite only creates the requested worksheet. (Did you know that you can instruct Excel to create spreadsheets with just one, or any number of, worksheets? Look in Tools | Options, General tab.)
** Even better (IMO) while M's xlswrite always creates Sheet1/Sheet2/Sheet3 when creating a new spreadsheet, octave's xlswrite only creates the requested worksheet. (Did you know that you can instruct Excel to create spreadsheets with just one, or any number of, worksheets? Look in Tools | Options, General tab.)
** Oh and octave doesn't touch the "active sheet" - but that's not automatically an advantage.
** Oh and octave doesn't touch the "active sheet" - but that's not automatically an advantage.
** If the specified write range is larger than the actual data array, Matlab's xlswrite adds #N/A cells to fill up the lowermost rows and rightmost columns; Octave Forge's xlswrite doesn't.
** If the specified write range is larger than the actual data array, Matlab's xlswrite adds #N/A cells to fill up the lowermost rows and rightmost columns; octave-forge's xlswrite doesn't.


* xlsfinfo
* xlsfinfo
** When invoking Excel/COM interface, Octave Forge's xlsfinfo also echoes the type of sheet (worksheet, chart), not just the sheet names. Using Java I haven't found similar functionality (yet).
** When invoking Excel/COM interface, octave-forge's xlsfinfo also echoes the type of sheet (worksheet, chart), not just the sheet names. Using Java I haven't found similar functionality (yet).


==== Comparison of interfaces & usage ====
==== Comparison of interfaces & usage ====
Using Excel itself (through '''COM''' / '''ActiveX''' on Windows systems) is probably the most robust and versatile and especially FAST option. There's one gotcha: in case of some type of COM errors Excel will keep running invisibly; you can only end it through Task Manager. A tiny problem is that one cannot find out easily through COM what file types are supported; xls, wks, wk1, xlsx, etc. Another -obvious- limitation is that COM Excel access only works on Windows systems where Excel is installed.
Using Excel itself (through '''COM''' / '''ActiveX''' on Windows systems) is probably the most robust and versatile and especially FAST option. There's one gotcha: in case of some type of COM errors Excel will keep running invisibly; you can only end it through Task Manager. A tiny problem is that one cannot find out easily through COM what file types are supported; xls, wks, wk1, xlsx, etc. Another -obvious- limitation is that COM Excel access only works on Windows systems where Excel is installed.


'''JExcelAPI''' (Java-based and therefore platform-independent) is proven technology but switching between reading and writing is quite involved and memory-hungry when processing large spreadsheets. As the docs state, JExcelAPI is optimized for reading and it does do that well - but still slower than Excel/COM. The fact that upon a switch from reading to writing the existing spreadsheet is overwritten in place by a blank one and that you can only get the contents back when writing out all of the changes is worrying - and any change after the first write() is lost as a next write() doesn't seem to work; worse yet, you may completely lose the spreadsheet in question. The first is by JExcelAPI design, the second is probably a bug (in Octave Forge/Java or JExcelAPI? I don't know). Adding data to existing spreadsheets does work, but IMO undue user confidence is needed. JExcelAPI supports BIFF5 (only reading) and BIFF8 (Excel 95 and Excel 97-2003, respectively). Upon overwriting, BIFF5 spreadsheets are converted silently to BIFF8. JExcelAPI, unlike Apache POI, doesn't evaluate functions while reading but instead relies on cached results (i.e. results computed by Excel itself). Depending on Excel settings ("Automatic calculation" ON or OFF) this may or may not yield incorrect (or expected) results.
'''JExcelAPI''' (Java-based and therefore platform-independent) is proven technology but switching between reading and writing is quite involved and memory-hungry when processing large spreadsheets. As the docs state, JExcelAPI is optimized for reading and it does do that well - but still slower than Excel/COM. The fact that upon a switch from reading to writing the existing spreadsheet is overwritten in place by a blank one and that you can only get the contents back when writing out all of the changes is worrying - and any change after the first write() is lost as a next write() doesn't seem to work; worse yet, you may completely lose the spreadsheet in question. The first is by JExcelAPI design, the second is probably a bug (in octave-forge/Java or JExcelAPI? I don't know). Adding data to existing spreadsheets does work, but IMO undue user confidence is needed. JExcelAPI supports BIFF5 (only reading) and BIFF8 (Excel 95 and Excel 97-2003, respectively). Upon overwriting, BIFF5 spreadsheets are converted silently to BIFF8. JExcelAPI, unlike Apache POI, doesn't evaluate functions while reading but instead relies on cached results (i.e. results computed by Excel itself). Depending on Excel settings ("Automatic calculation" ON or OFF) this may or may not yield incorrect (or expected) results.


'''Apache POI''' (Java-based and platform-independent too) is based on the OpenOffice.org I/O Excel r/w routines. It is more versatile than JExcelAPI: while it doesn't support BIFF5, it does support BIFF8 (Excel 97 – 2003) and OOXML (Excel 2007). It is slower than native JXL, let alone Excel & COM, but it features active formula evaluation, although at the moment (v. 3.8) not all Excel functions have been implemented. Obviously, as new functions are added in every new Excel release it's hard for Apache POI to catch up. I've made the relevant subfunction (xls2jpoi2oct) fall back to cached formula results (and yield a suitable warning) for non-implemented Excel functions while reading Excel files.
'''Apache POI''' (Java-based and platform-independent too) is based on the OpenOffice.org I/O Excel r/w routines. It is more versatile than JExcelAPI: while it doesn't support BIFF5, it does support BIFF8 (Excel 97 – 2003) and OOXML (Excel 2007). It is slower than native JXL, let alone Excel & COM, but it features active formula evaluation, although at the moment (v. 3.8) not all Excel functions have been implemented. Obviously, as new functions are added in every new Excel release it's hard for Apache POI to catch up. I've made the relevant subfunction (xls2jpoi2oct) fall back to cached formula results (and yield a suitable warning) for non-implemented Excel functions while reading Excel files.
Line 602: Line 573:
=== OCT interface ===
=== OCT interface ===


Since io package version 1.2.4, an interface called "OCT" was added. Except for unzip, it has no dependencies and it is faster than the Java-based interfaces.
Since io package version 1.2.4, an interface called "OCT" was added. Except for unzip, it has no dependencies. It's still experimental but fast! Feel free to test it and give us a feedback.
Currently it supports reading and writing .xlsx, .ods and .gnumeric files (the latter in yet-to-be-released io-2.2.2).
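
A minimal round trip with the OCT interface might look like the sketch below. It assumes an io package version with OCT write support for .xlsx and the zip/unzip tools the interface depends on; the temporary file name is illustrative.
<pre><nowiki>
pkg load io

% Sketch: write a small matrix with the OCT interface and read it back.
fname = [tempname() ".xlsx"];    % temporary file, removed at the end
data  = magic (4);

xlswrite (fname, data, "Sheet1", "A1", "OCT");
readback = xlsread (fname, 1, [], "OCT");

disp (isequal (data, readback));  % compare what was written with what was read
delete (fname);
</nowiki></pre>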
If  
If  
<pre>chk_spreadsheet_support == 0</pre>
<pre>chk_spreadsheet_support == 0</pre>
Line 610: Line 582:
<pre>m = xlsread ('file.xlsx', 1, [], 'OCT');</pre>
<pre>m = xlsread ('file.xlsx', 1, [], 'OCT');</pre>


About development: <br />
Since io package version 2.2.0, the "OCT" interface has experimental write support for .xlsx and .ods formats, since io-2.2.2 (expected mid-May 2014) also for gnumeric. If you can't wait for gnumeric I/O you can checkout a snapshot from svn (see octave.sf.net, http://sourceforge.net/p/octave/code/HEAD/tree/trunk/octave-forge/main/io/)
The OCT interface makes use of regular expressions for parsing the XML contents of OOXML, ODS and gnumeric formats. While frowned upon by XML gurus (see for example [https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 here] for some amusing postings), using regexps is much faster than any current XML parser. But the trade-off is that regexps are fragile, esp. with regard to the order in which XML tags appear in XML nodes. <br />
Just for reassurance: to date we haven't seen any problems with the OCT interface for reading and writing regular data.
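
To illustrate both the approach and its fragility, here is a small sketch. The XML fragment and the pattern are made up for this example; they are not the patterns the io package actually uses.
<pre><nowiki>
% Sketch: pull cell addresses and values out of an OOXML-like fragment.
xml = '<row><c r="A1" t="n"><v>42</v></c><c r="B1" t="n"><v>3.14</v></c></row>';
pat = '<c r="(\w+)"[^>]*><v>([^<]*)</v></c>';

tok   = regexp (xml, pat, "tokens");
addrs = cellfun (@(t) t{1}, tok, "UniformOutput", false);   % {"A1", "B1"}
vals  = cellfun (@(t) str2double (t{2}), tok);              % [42 3.14]

% Fragility: with the attributes of the first cell swapped, that cell is missed
xml2 = strrep (xml, 'r="A1" t="n"', 't="n" r="A1"');
tok2 = regexp (xml2, pat, "tokens");                        % only B1 is found
</nowiki></pre>
The second call shows the trade-off: a pattern written for one attribute order silently skips valid XML that uses another.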
 


[[Category:Octave Forge]]
[[Category:Octave-Forge]]