IO package: Difference between revisions

21,270 bytes added ,  27 November 2011
→‎XLS support: new section from the package doc
No edit summary
(→‎XLS support: new section from the package doc)
Line 198: Line 198:
* ods2oct.m (revision 7158)
* ods2oct.m (revision 7158)
* oct2ods.m (revision 7159)
* oct2ods.m (revision 7159)
== XLS support ==
=== Files content ===
* xlsread.m''' — All-in-one function for reading data from one specific worksheet in an Excel spreadsheet file. This script has Matlab-compatible functionality.
* '''xlswrite.m''' — All-in-one function for writing data to one specific worksheet in an Excel spreadsheet file. This script has Matlab-compatible functionality.
* '''xlsfinfo.m''' — All-in-one function for exploring basic properties of an Excel spreadsheet file. This script has Matlab-compatible functionality.
* '''xlsopen.m''' — Function for "opening" (= providing a handle to) an Excel spreadsheet file ("workbook"). This function sorts out which interface to use for .xls access (i.e.,COM; Java & Apache POI; JexcelAPI; OpenXLS; etc.), but it's choice can be overridden.
* '''xls2oct.m''' — Function for reading data from a specific worksheet pointed to in a struct created by xlsopen.m. xls2oct can be called multiple times consecutively using the same pointer struct, each time allowing to read data from different ranges and/or worksheets. Data are returned in the form of a 2D heterogeneous cell array that can be parsed by parsecell.m. xls2oct is a mere wrapper for interface-dependent scripts that do the actual low-level reading.
* '''oct2xls.m''' — Function for writing data to a specific worksheet pointed to in a struct created by xlsopen.m. octxls can be called multiple times consecutively using the same pointer struct, each time allowing to write data to different ranges and/or worksheets. oct2xls is a mere wrapper for interface-dependent scripts that do the actual low-level writing.
* '''xlsclose.m''' — Function for closing (the handle to) an Excel workbook. When data have been written to the workbook oct2xls will write the workbook to disk. Otherwise, the file pointer is simply closed and possibly used interfaces for Excel access (COM/ActiveX/Excel.exe) will be shut down properly.
* '''parsecell.m''' — Function for separating the data in raw arrays returned by xls2oct, into numerical/logical and text (cell) arrays.
* '''chk_spreadsheet_support.m''' — Internal function for (1) checking, (2) setting up, (3) debugging spreadsheet support. While not specifically meant for direct invocation from the Octave prompt (it is more useful during initialization of Octave itself) it can be very helpful when hunting down issues with spreadsheet support in Octave.
* '''spsh_chkrange.m''', '''spsh_prstype.m''', '''getusedrange.m''', '''calccelladdress.m''', '''parse_sp_range.m''' — Support files called by the scripts and not meant for direct invocation by users.
=== Required support software ===
For the Excel/COM interface:
* A windows computer with Excel installed
* Octave-forge Windows-1.0.8 or later package WITH LATEST SVN PATCHES APPLIED
For the Java / Apache POI / JExcelAPI interfaces (general):
* octave-forge java-1.2.8 package or later version on Linux
* octave-forge java-1.2.8 with latest svn fixes on Windows/MingW
* Java jre or jdk > 1.6.0 (hasn't been tested with earlier versions)
Apache POI specific:
* class .jars: '''poi-3.5-FINAL-<date>.jar & poi-ooxml-3.5-FINAL-<date>.jar''' (or later versions) in classpath
** Get it here: http://poi.apache.org/download.html
* for OOXML support (only available with Apache POI): '''poi-ooxml-schemas-<version>.jar''', '''xbean.jar''', '''dom4j-1.6.1.jar''' in javaclasspath.
** Get them here: http://poi.apache.org/download.html ("xmlbeans" and poi-ooxml-schemas) or http://sourceforge.net/projects/dom4j/files (dom4j-<version>)
JExcelAPI specific:
* class .jar: jxl.jar in classpath
** Get it here: http://sourceforge.net/projects/jexcelapi/files/
OpenXLS specific:
* class .jar: jxl.jar in classpath
** Get it here: http://sourceforge.net/projects/openxls/
These class libs must be referenced with full pathnames in your javaclasspath.
They had best be put in /<libdir>/java where <libdir> on Linux is usually /usr/lib; on MinGW it is usually /lib. The PKG_ADD command expects the class libs there; if they are elsewhere, add them in ./share/octave/<version>/m/startup/octaverc using appropriate javaaddpath statements or a chk_spreadsheet_support() call.
UNO specific (invoking OpenOffice.org (or clones) behind the scenes):
NOTE: EXPERIMENTAL!! A working OpenOffice.org installation. The utility function chk_spreadsheet_support can be used to add the needed entries to the javaclasspath.
=== Usage ===
'''xlsread''' and '''xlswrite''' are mere wrappers for '''xlsopen-xls2oct-xlsclose-parsecell''' and '''xlsopen-oct2xls-xlsclose''' sequences, resp. They exist for the sake of Matlab compatibility.
'''xlsfinfo''' can be used for finding out what worksheet names exist in the file. For OOXML files you either need MS-Excel 2007 for Windows (or later version) installed, and/or the input parameter REQINTF should be specified with a value of 'poi' (case-insensitive) and -obviously- the complete POI interface must have been installed.
Invoking '''xlsopen'''/..../'''xlsclose''' directly provides for much more flexibility, speed, and robustness than '''xlsread''' / '''xlswrite'''. Indeed, using the same file handle (pointer struct) you can mix reading & writing before writing the workbook out to disk using xlsclose.
And: xlsopen / xlsclose hide the gory interface details from the user.
Currently only .xls files (BIFF8) can be read/written; using JExcelAPI BIFF5 can be read as well. For OOXML files either Excel 2007 for Windows (or higher) and/or the complete Apache POI interface must be installed (and probably the REQINTF parameter specified with a value of 'poi').
When using '''xlsopen'''/.../'''xlsclose''' be sure to keep track of the file handle struct.
A possible scenario:
xlh = xlsopen (<excel_filename> , [rw], [<requested interface>])
# Set rw to 1 if you want to write to a workbook immediately.
# In that case the check for file existence is skipped and
# -if needed- a new workbook created.
# If you really want an other interface than auto-selected
# by xlsopen you can request that. But xlsopen still checks
# proper support for your choice.
# Read some data
[ rawarr1, xlh ] = xls2oct (xlh, <SomeWorksheet>, <Range>)
# Be sure to specify xlh as output argument as xls2oct keeps
# track of changes and the need to write the workbook to disk
# in the xlhstruct. And the origin range is conveyed through
# the xlh pointer struct.
# Separate data into numeric and text data
[ numarr1, txtarr1, lim1 ] = parsecell (rawarr1)
# Get more data from another worksheet in the same workbook
[ rawarr2, xlh ] = xls2oct (xlh, <SomeOtherWorksheet>, <Range>)
[ numarr2, txtarr2, lim2 ] = parsecell (rawarr2)
# <... Analysis and preparation of new data in cell array Newdata....>
# Add new data to spreadsheet
xlh = oct2xls (Newdata, xlh, <AnotherWorksheet>, <Range>)
# Close the workbook and write it to disk; then clear the handle
xlh = xlsclose (xlh)
clear xlh
When not using the COM interface, specify a value of 'POI' for parameter REQINTF when accessing OOXML files in xlsread, xlswrite, xlsopen, xlsfinfo (and be sure the complete Apache POI interface is installed). If you haven't got ActiveX installed (i.e., not having MS-Excel under Windows) specifying 'POI' may not be needed as in such cases Apache POI is the next default interface.
When using JExcelAPI (JXL), after writing into a worksheet you MUST save the file – adding data to the same or another worksheet is no more possible after the first call to oct2xls(). This is a limitation of JExcelAPI.
=== Spreadsheet formula support ===
When using the POI and JXL interfaces you can:
* (When reading, xls2oct) either read spreadsheet formula results (like in COM interface), or the literal formula text strings;
* (When writing, oct2xls) either enter formulas in the worksheet as formulas, or enter them as literal text strings. The former is also like in COM.
In short, you can enter spreadsheet formulas and in a later stage read them back, change them and re-enter them in the worksheet.
The behaviour is controlled by an option structure options which for now has only one (logical) field:
* options.formulas_as_text = 0 (the default) implies enter formulas as formulas and read back formula results
* options.formulas_as_text =1 (or any positive integer) means enter formulas as text strings and read them back as text strings.
Be aware that there's no formula evaluator in JExcelAPI (JXL). So if you create formulas in your spreadsheet using oct2xls or xlswrite with 'JXL', do not expect meaningful results when reading those files later on ''unless'' you open them in Excel and write them back to disk.
While both Apache POI and JExcelAPI feature a formula validator, not all spreadsheet functions present in Excel have been implemented (yet).
Worse, older Excel versions feature less functions than newer versions. So be wary as this may make for interesting confusion.
=== Matlab compatibility ===
'''xlsread''', '''xlswrite''' and '''xlsfinfo''' are for the most part Matlab-compatible. Some small differences are mentioned below. When using the Java interfaces octave supplies some formula manipulation support.
* xlsread
** Matlab's xlsread supports invoking extra functions while reading ("passing function handle"); octave not. But this can be simulated outside xlsread.
** Matlab's xlsread flags some spreadsheet errors, octave-forge just returns blank cells.
** Octave-forge returns info about the actual (rather than the requested) cell range where the data came from. Personally I find it very useful to know from what part of a worksheet the data originate so I've put quite some effort in it :-) Matlab can't, due to Excel automatically trimming returned arrays from empty outer columns and rows. Octave is more clever but the Visual Basic call used for determining the actually used range has some limitations:
**# it relies on cached range values and thus may be out-of-date;
**# it counts empty formatted cells too. When using ActiveX/COM, if octave's xlsfinfo.m returns wrong data ranges it is most often an overestimation.
*:Matlab's xlsread ignores all non-numeric data values outside the smallest rectangle encompassing all numerical values. Octave's xlsread doesn't. This means that Matlab ignores all row/column headers, not very user-friendly IMO.
** When using the Java interface, reading and writing xls-files by octave-forge is platform-independent. On systems w/o installed Excel, Matlab can only read Excel 95 formatted .xls files (written using ML xlswrite's 'Basic" option) – and then differently than under Windows.....
** Matlab's xlsread returns strings for cells containing date values. This makes for endless if-then-elseif-else-end constructs to catch all expected date formates. Octave returns numerical data (where 0 = 1/1/1900 – you can easily transfer them into proper octave date values yourself using e.g. datestr(), see bottom of this document for more info).
** Matlab's xlsread invokes csvread if no Excel interface is present. Octave-forge's xlsread doesn't.
* xlswrite
** Octave-forge's xlswrite works on systems w/o Excel support, Matlab's doesn't (properly).
**When specifying a sheet number larger than the number of existing sheets in an .xls file, Matlab's xlswrite adds empty sheets until the new sheet number is created; Octave's xlswrite only adds one sheet called "Sheet<number>" where <number> is the specified sheet number.
** Even better (IMO) while M's xlswrite always creates Sheet1/Sheet2/Sheet3 when creating a new spreadsheet, octave's xlswrite only creates the requested worksheet. (Did you know that you can instruct Excel to create spreadsheets with just one, or any number of, worksheets? Look in Tools | Options, General tab.)
** Oh and octave doesn't touch the "active sheet" - but that's not automatically an advantage.
** If the specified write range is larger than the actual data array, Matlab's xlswrite adds #N/A cells to fill up the lowermost rows and rightmost columns; octave-forge's xlswrite doesn't.
* xlsfinfo
** When invoking Excel/COM interface, octave-forge's xlsfinfo also echoes the type of sheet (worksheet, chart), not just the sheet names. Using Java I haven't found similar functionality (yet).
=== Comparison of interfaces & usage ===
Using Excel itself (through '''COM''' / '''ActiveX''' on Windows systems) is probably the most robust and versatile and especially FAST option. There's one gotcha: in case of some type of COM errors Excel will keep running invisibly; you can only end it through Task Manager. A tiny problem is that one cannot find out easily through COM what file types are supported; xls, wks, wk1, xlsx, etc. Another -obvious- limitation is that COM Excel access only works on Windows systems where Excel is installed.
'''JExcelAPI''' (Java-based and therefore platform-independent) is proven technology but switching between reading and writing is quite involved and memory-hungry when processing large spreadsheets. As the docs state, JExcelAPI is optimized for reading and it does do that well - but still slower than Excel/COM. The fact that upon a switch from reading to writing the existing spreadsheet is overwritten in place by a blank one and that you can only get the contents back wen writing out all of the changes is worrying - and any change after the first write() is lost as a next write() doesn't seem to work, worse yet, you may completely loose the spreadsheet in question. The first is by JExcelAPI design, the second is probably a bug (in octave-forge/Java or JExcelAPI ? I don't know). Adding data to existing spreadsheets does work, but IMO undue user confidence is needed. JExcelAPI supports BIFF5 (only reading) and BIFF8 (Excel 95 and Excel 97-2003, respectively). Upon overwriting, BIFF5 spreadsheets are converted silently to BIFF8. JexcelAPI, unlike ApachePOI, doesn't evaluate functions while reading but instead relies on cached results (i.e. results computed by Excel itself). Depending on Excel settings ("Automatic calculation" ON or OFF) this may or may not yield incorrect (or expected) results.
'''Apache POI''' (Java-based and platform-independent too) is based on the OpenOffice.org I/O Excel r/w routines. It is a more versatile than JExcelAPI, while it doesn't support BIFF5 it does support BIFF8 (Excel 97 – 2003) and OOXML (Excel 2007). It is slower than native JXL let alone Excel & COM but it features active formula evaluation, although at the moment (v. 3.7) not all Excel functions have been implemented. I've made the relevant subfunction (xls2jpoi2oct) fall back to cached formula results (and yield a suitable warning) for non-implemented Excel functions while reading Excel files.
'''OpenXLS''' (an open source version of Extentech's commercial Java-xls product) is still experimental. It seems to work faster than JExcelAPI, but it has other issues - i.e., it locks the .xls file and the unlocking mechanism is a bit wonky. Sometimes xls files keep being locked until Octave is shut down. Currently OXS write support is disabled (but the code is there).
'''UNO''' (invoking OpenOffice.org or clones behind the scenes, a la ActiveX) is experimental. It works FAST (i.e., once OOo itself is loaded which can take some time) and can process much larger spreadsheets than the other Java-based interfaces because the data are not entered in the JVM but in OOo's memory. A big stumbling block is that odsclose() on a UNO xls struct will kill ALL OpenOffice.org invocations, also those that were not related to Octave! This is due to UNO-Java limitations. The underlying issue is that when Octave starts an OpenOffice.org invocation, OpenOffice.org must be closed for Octave to be able to exit; otherwise Octave will wait for OOo to shut down before it can terminate itself. So Octave must kill OOo to be able to terminate. A way out hasn't been found yet.
All in all, of the three Java options I'd prefer Apache POI rather than OpenXLS or JexcelAPI. But the latter is indispensable for BIFF5 formats. Once UNO is stable it is to be preferred as it can read ALL file formats supported by OOo (viz. wk1, ods, xlsx, sxc, ...)
Some notes on the choice for Java:
# It saves a LOT of development time to use ready-baked Java classes rather than developing your own routines and thus effectively reinvent the wheel.
# A BIG advantage is that a Java-based solution is platform-independent ("portable").
# But Java is known to be not very conservative with resources, especially not when processing XML-based formats.
So Java is a compromise between portability and rapid development time versus capacity (and speed). But IMO data sets larger than 5.105 cells should not be kept in spreadsheets anyway. Better use real databases for such data sets.
=== Troubleshooting ===
Some hints for troubleshooting Excel support are contained in this thread: http://sourceforge.net/mailarchive/forum.php?thread_name=4C61B649.9090802%40hccnet.nl&forum_name=octave-dev dated August 10, 2010. A more structured approach is below.
Since April 2011 a special purpose setup file has been included in the io package (chk_spreadsheet_support.m). Large parts of the approach below (starting at Step 2) have been automated in this script. When running it with the second input argument (debug level) set to 3 a lot of useful diagnostic output will be printed to screen.
#Check if COM / ActiveXworks (only under Windows OS). Do a ''pkg list'' and see:
##If there's a windows package mentioned (then it's installed). If not, install it.
##If there's an asterisk on the windows package line (then the package is loaded). If not, do a pkg load windows
#Check if the ActiveX server works. Do:
#:exl = actxserver ('Excel.Application') ## Note the period between 'Excel' and 'Application'
##If a COM object is returned, ActiveX / COM / Excel works. Do:
##: exl.Quit(); delete (exl) ## to shut down the (hidden) Excel invocation.
##If you get an error message, your last resort is re-installing the windows package, or trying the Java-based interfaces.
#Check if java works. Do a ''pkg list'' and see:
##If there's a java package mentioned (then it's installed). If not, install it.
##If there's an asterisk on the java package line (then the package is loaded). If not, do a pkg rebuild -auto java
#Check Java memory settings. Try
#:javamem
##If it works, check if it reports sufficiently large max memory (had better be 200 MiB, the bigger the better)
##If it doesn't work, do:
##:rt = java_invoke ('java.lang.Runtime', 'getRuntime')
##:rt.gc
##:rt.maxMemory ().doubleValue () / 1024 / 1024 ## show MaxMemory in MiB.
##In case you have insufficient memory, see in "GOTCHAS", "Java memory pool allocation size", how to increase java's memory pre-reservation.
#Check if all classes (.jarfiles) are in class path. Do a 'javaclasspath' (under unix/linux, do 'tmp = javaclasspath; strsplit (tmp,":")' (w/o quotes). See above under "REQUIRED SUPPORT SOFTWARE" what classes should be mentioned.
** If classes (.jar files) are missing, download and put them somewhere and add them to the javaclass path with their fully qualified pathname (in quotes) using javaaddpath().
** Once all classes are present and in the javaclasspath, the xls interfaces should just work. The only remaining showstoppers are insufficient write privileges for the working directory, a wrecked up octave or some other problem outside octave.
#Try opening an xls file:
#: xls1 = xlsopen ('test.xls', 1, 'poi'). If this works and xls1 is a struct with various fields containing objects, the Apache POI interface (POI) works. Do an xls1 = xlsclose (xls1) to close the file.
#: xls2 = xlsopen ('test.xls', 1, 'jxl'). If this works and xls2 is a struct with various fields containing objects, the JExcelAPI interface (JXL) works as well. Don't forget to do xls2 = xlsclose (xls2) to close the file.
=== Development ===
xlsopen/xlsclose and friends have been written so that adding other interfaces (Perl? native octave? ...?) should be very easily accomplished. Xlsopen.m merely needs two stanzas, xlsfinfo.m and getusedrange.m each need an additional elseif stanza, and xlsclose.m needs a small stanza for closing the pointer struct and writing to disk. The real work lies in creating the relevant xls2<...>2oct & oct2<...>2xls & <getusedrange_...> subfunction scripts in xls2oct.m, oct2xls.m and getusedrange.m, resp., but that shouldn't be really hard, depending on the interface support libraries' quality and documentation. Separating the file access functions and the actual reading/writing from/to the workbook in memory has made developer's life (I mean: my time developing this stuff) much easier.
Some other options for development (who?):
*Speeding up, especially Java worksheet/cell access. For cracks, not me.
*Automatic conversion of Excel date/time values into octave ones and vice versa (adding or subtracting 636960). But then again Excel's dates are 01-01-1900 based (octave's 0-0-0000) and buggy (Excel thinks 1900 is a leap year), and I sometimes have to use dates from before 1900. Maybe as an option?
*Creating Excel graphs (a significant enterprise to write from scratch).
*Support for "passing function handle" in xlsread.


[[Category:Packages|Package documentation]]
[[Category:Packages|Package documentation]]