Friday, February 24, 2012

CGI, Spreadsheet::XLSX and Archive::Zip on Windows 7

I wanted to use Spreadsheet::XLSX to open a spreadsheet uploaded to a web server / CGI program running on Windows 7. On the face of it, it's easy - the CGI module returns a file handle to the uploaded file and Spreadsheet::XLSX accepts a file handle to the file to be opened. But all I got back were errors.

After quick investigation, I discovered that Archive::Zip was rejecting the file handles as not seekable. So, I tried providing different file handles. I tried converting the lightweight file handle from CGI to one compatible with IO::Handle (per the docs). I tried slurping the file into a scalar and creating a file handle from the scalar, using IO::Handle, File::Handle and open. Nothing worked.

So I had another look at Archive::Zip and, in particular the _isSeekable subroutine. No doubt the sub is the way it is for good reasons, but I couldn't understand why it was the way it was and there was not much explanation, beyond notes of some cases that claim to work but don't (or, perhaps, didn't).

One of my programming principles is to simply do what is to be done and deal with errors if and as they occur, rather than running tests to determine whether what is to be done will succeed and not try if success seems unlikely. The test might suggest failure when the operation would succeed (the case I am dealing with here) or they might suggest success when the operation will fail. Because of the latter case, it is necessary to detect and handle errors despite the test and if errors are detected and handled appropriately when they happen there is realitvely little value and some cost in running an unreliable (more or less) test first. The only justification for using an unreliable test up-front is if the actual failed operation is significantly expensive (time, CPU, memory or whatever the relevant resource might be) and the test is not. Very often this is not the case and the up front test is almost purly a disadvantage.

In this case, the tests being done concluded that a seekable file handle was not seekable, so Archive::Zip refused to open the archive. I knew the file handles I was passing were seekable: I tried seeking and it worked.

Rather than rewrite Archive::Zip to skip the test for seekability, which would have required a more thorough review and who knows how much rewrite, I rewrote the test. To me, it's simple: If you want to know if a file handle can seek, try seeking. If it works: it works!! If it fails, then seek doesn't work.

There are many ways to test whether seek is working. One is to simply trust its return status, but comments in the existing code suggest that some implementations are broken so badly that they return success when they have failed. A more robust test would require knowing the contents of the file in advance, seeking to specific locations, reading (I don't care about writing or alternating between reading and writing in this case) and getting expected data. But I didn't know the contents of the file. I could have tried seeking to a location then seeking back to that same location, re-reading and confirming that the same data was read. This would make the test more robust.

I simply called tell and seek on the file handle and checked that no errors were indicated and that reasonable results were returned. More thorough testing could be done, like checking that the same data is read whenever tell reports the same file position, but I haven't bothered.

I don't suggest at all that what I have written is a good, general solution to the problem for Archive::Zip. I have no idea what experience, no doubt time consuming and difficult, lead to the batter of tests and heuristics in _isSeekable. Nor do I care - what I have done works for my case.

If I were worried about (or experienced) more serious errors (die'ing) when trying to tell or seek, I would have wrapped the test in an eval block, but quick and dirty worked fine in this case.

In case this may be of some use to you, here is the diff:

psh.bat% diff -u Zip.pm /inetpub/xenwof/lib/Archive/Zip.pm
--- aZip.pm      Tue Jun 30 06:51:10 2009
+++ bZip.pm   Sat Feb 25 08:04:36 2012
@@ -379,28 +379,64 @@
 sub _isSeekable {
     my $fh = shift;
     return 0 unless ref $fh;
-    if ( _ISA($fh, 'IO::Scalar') ) {
-        # IO::Scalar objects are brokenly-seekable
-        return 0;
+#    IG 21 Feb 2012
+#    Try to make a better test for whether the filehandle is seekable
+#    as the current test (below) seems to give false negatives for
+#    file handles that are seekable.
+#
+#    if ( _ISA($fh, 'IO::Scalar') ) {
+#        # IO::Scalar objects are brokenly-seekable
+#        return 0;
+#    }
+#    if ( _ISA($fh, 'IO::String') ) {
+#        return 1;
+#    }
+#    if ( _ISA($fh, 'IO::Seekable') ) {
+#        # Unfortunately, some things like FileHandle objects
+#        # return true for Seekable, but AREN'T!!!!!
+#        if ( _ISA($fh, 'FileHandle') ) {
+#            return 0;
+#        } else {
+#            return 1;
+#        }
+#    }
+#    if ( _CAN($fh, 'stat') ) {
+#        return -f $fh;
+#    }
+#    return (
+#        _CAN($fh, 'seek') and _CAN($fh, 'tell')
+#        ) ? 1 : 0;
+#
+#        IG 21 Feb 2012
+#        What we want to be able to do is seek and tell on the
+#        file handle. Let's start with tell.
+#
+    my $pos = tell($fh);
+    return 0 if($pos == -1);
+
+    # OK - tell works
+    # try seek
+    unless(seek($fh, $pos + 10, 0)) {
+        return 0;   # seek failed
     }
-    if ( _ISA($fh, 'IO::String') ) {
-        return 1;
+
+    my $pos2 = tell($fh);
+
+    unless($pos2 == $pos + 10) {
+        return 0;   # No failure indication but wrong position
     }
-    if ( _ISA($fh, 'IO::Seekable') ) {
-        # Unfortunately, some things like FileHandle objects
-        # return true for Seekable, but AREN'T!!!!!
-        if ( _ISA($fh, 'FileHandle') ) {
-            return 0;
-        } else {
-            return 1;
-        }
+
+    unless(seek($fh, $pos, 0)) {
+        return 0;   # it seems we can seek forward but not back
     }
-    if ( _CAN($fh, 'stat') ) {
-        return -f $fh;
+
+    my $pos3 = tell($fh);
+
+    unless($pos3 == $pos) {
+        return 0;   # no failure indication but wrong position
     }
-    return (
-        _CAN($fh, 'seek') and _CAN($fh, 'tell')
-        ) ? 1 : 0;
+
+    return 1;       # it works well enough for me
 }

 # Print to the filehandle, while making sure the pesky Perl special global




No comments:

Labels