OSError Invalid argument when reading shape file #235

INRIX-Mark-Gershaft · 2022-02-22T21:15:31Z

On some shape files containing data from OSM (i.e. could be encoding or special characters in road name column) I am getting OSError Invalid argument.

reader = shapefile.Reader(shape_file, encodingErrors='replace')
fields = reader.fields[1:]
field_names = [field[0] for field in fields]
feature_count = 0
for sr in reader.iterShapeRecords():
    atr = dict(zip(field_names, sr.record))
    geom = sr.shape.__geo_interface__
   feature_count += 1

When I debugged what I see is that after it read last feature it is still not at the end of the file, thus in iterShapes() it thinks that there are more shapes and tries to read it:

while shp.tell() < self.shpLength:
            yield self.__shape()

Then in __shape() it parses recLength and it is garbage (some negative number), tries to compute next records' location and puts it somewhere beyond file's valid locations.

(recNum, recLength) = unpack(">2i", f.read(8))
next = f.tell() + (2 * recLength)

Since file has 600K features I don't really know how to help further.
What is strange though is that QGIS and GeoPandas are able to parse same file.

The text was updated successfully, but these errors were encountered:

karimbahgat · 2022-02-23T21:00:32Z

Interesting. First, can you report the pyshp version, and give the full python stacktrace/error message (not just the last line)?

Encoding errors are not typically OSError, so i think it's more likely what you say about byte unpacking/args when reading the .shp file.

In the for-loop, do you do any additional processing of the shapefile, or did you show the whole loop?

Best would be if there's any public links or github repos where i can access the file? Or depending on the filesize, if you can upload as an attachment or send as a file sharing link?

INRIX-Mark-Gershaft · 2022-02-25T18:57:13Z

No, this is just smoke-test script that opens each and every shape file we produce to see if it can be read and that number of features is not 0.
segments_shapefile.zip

Version installed:
pyshp 2.1.3

Stack trace (I ran from interpreter session which is why there is no script line):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Tools\Python\3.9.2\lib\site-packages\shapefile.py", line 1337, in iterShapeRecords
    for shape, record in izip(self.iterShapes(), self.iterRecords()):
  File "C:\Tools\Python\3.9.2\lib\site-packages\shapefile.py", line 1169, in iterShapes
    yield self.__shape()
  File "C:\Tools\Python\3.9.2\lib\site-packages\shapefile.py", line 1106, in __shape
    f.seek(next)
OSError: [Errno 22] Invalid argument

File is created using Java:
https://docs.geotools.org/stable/javadocs/org/geotools/data/shapefile/ShapefileDataStore.html

Just verified - still reproducible in latest version of PyShp

karimbahgat · 2022-03-17T10:53:32Z

Thanks for providing the file and the other details. After some testing, here is a rather lengthy response, which can be summarized as the file containing corrupted data towards the end of the file. It appears the error happens when trying to read beyond shape #135307, ie the error is with #135308:

>>> import shapefile
>>> r = shapefile.Reader('segments_shapefile.zip')
>>> for shape in r.iterShapes():
...      pass
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:/Users/kimok/Documents/GitHub\pyshp\shapefile.py", line 1429, in iterShapes  
    shape = self.__shape(oid=i, bbox=bbox)
  File "C:/Users/kimok/Documents/GitHub\pyshp\shapefile.py", line 1353, in __shape     
    f.seek(next)
  File "C:\Program Files\Python39\lib\tempfile.py", line 474, in func_wrapper
    return func(*args, **kwargs)
OSError: [Errno 22] Invalid argument
>>> shape
Shape #135307: POLYLINE
>>> r.shape(135307)
Shape #135307: POLYLINE
>>> r.shape(135308)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:/Users/kimok/Documents/GitHub\pyshp\shapefile.py", line 1385, in shape       
    i = self.__restrictIndex(i)
  File "C:/Users/kimok/Documents/GitHub\pyshp\shapefile.py", line 1249, in __restrictIndex
    raise IndexError("Shape or Record index out of range.")
IndexError: Shape or Record index out of range.

It looks like the shapefile has a length of 135308 but the reader is trying to read beyond the end of the file:

>>> r.numRecords # determined from dbf header
135308
>>> r.numShapes # determined from shx header
135308
>>> len(r._offsets) # index positions read from the shx file
135308

The error happens only with shapes() and .iterShapes() since these methods keep iterating until they have reached the actual end of the .shp file. Apparently, the .shp file contains 18464768 bytes, which is more than the end of the last 135307th shape located at byte position 18462948:

>>> # seek to end of last shape
>>> last_shape_offset = r._offsets[135307]
>>> r.shp.seek(last_shape_offset) # beginning of last shape
18462732
>>> (recNum, recLength) = unpack(">2i", r.shp.read(8))
>>> print(recNum, recLength)
135308 104
>>> r.shp.tell() + (2 * recLength) # end of last shape according to shape header
18462948
>>> r.shp.seek(0, 2) # actual end of file
18464768

Reading the actual contents of the last shape confirms that the end of the last shape should be at byte position 18462948:

>>> from struct import unpack
>>> r.shp.seek(last_shape_offset) # beginning of last shape
18462732
>>> (recNum, recLength) = unpack(">2i", r.shp.read(8))
>>> shapeType = unpack("<i", r.shp.read(4))[0]
>>> bbox = unpack("<4d", r.shp.read(32))
>>> nParts = unpack("<i", r.shp.read(4))[0]
>>> nPoints = unpack("<i", r.shp.read(4))[0]
>>> parts = unpack("<%si" % nParts, r.shp.read(nParts * 4))
>>> points = unpack("<%sd" % (2 * nPoints), r.shp.read(16*nPoints))
>>> r.shp.tell() # end of last shape according to data contents
18462948

Similarly, the shapefile header also agrees that the length of the file should be 18462948:

>>> # read the shapefile header's `File Length` entry
>>> r.shp.seek(24)
>>> unpack(">i", r.shp.read(4))[0] * 2 
18462948

Trying to read the beginning of the next shape results in unexpected data:

>>> (recNum, recLength) = unpack(">2i", r.shp.read(8))
>>> (recNum, recLength)                               
(50331648, -1049923325)

So this is an issue of the writer erroneously writing too much data, or the header containing incorrect header information, depending on how you view it. Since both the shx and dbf headers says the file length should be 135308, and since the beginning of the next shape contains unexpected data, I am inclined to believe that the extra data got corrupted in some way and that the headers represent the correct number of non-corrupted shapes/records. If I were to speculate I wonder if the java writer tried to write additional data, and when it failed for some reason it wrote the number of successfully written entries to the header, but did not remove the partially written extra data from the files.

Looking at how fiona/GDAL handle the file, it appears they also would rather believe the information in the shapefile header to determine the contents of the file:

>>> import fiona
>>> r = fiona.open('segments_shapefile/SegmentReference.shp')
>>> print(len(r))
135308
>>> r[135308]
ERROR 1: Attempt to read shape with feature id (135308) out of available range.

This does highlight however that pyshp's approach to reading corrupted files like this should be updated (see also #147, #223). It looks like a previous decision was made to believe the actual length of the shapefile rather than the number of shapes specified in the shapefile header, according to a comment in the code: Found shapefiles which report incorrect shp file length in the header. Can't trust that so we seek to the end of the file and figure it out. This is why shapes() and iterShapes() continue to read until the end of the file, which ends up failing when additional corrupted data has been written to the end of the file (as in this case). It should be noted that the dbf records() and iterRecords() do not read until the end of the file, but rather only according to the number of records stated in the dbf header.

In short, the question boils down to whether we believe the number of shapes as stated in the header, or based on the length of the file. Since the dbf reader goes by the header information, and since fiona/GDAL also appears to follow the header information, I'm inclined to think we should follow the header instead, potentially with a simple check to abort early if there's not enough data left in the file. Any opinion on this @GeospatialPython?

karimbahgat · 2022-03-17T12:37:14Z

One possible problem is that in the absence of the shx file, we don't know the number of shapes in the .shp file. A possible solution:

When the shx file is present, set numShapes based on the number of shapes defined in the header, and use that to iterate.
When a dbf file is present, use the number of dbf records in the header as the basis for iterating (a correct shapefile should have the same length). Once we have reached the end of the shp file, record the number of shapes by setting the numShapes attribute for metadata purposes.
When we only have the shp file, have a conditional clause that instead iterates incrementally until we reach the shapefile length as specified in the shapefile header (shpLength). Once we have reached the end of the shp file, record the number of shapes by setting the numShapes attribute for metadata purposes.
In all cases, check and raise an error if each read will exceed the actual length of the file (eg partially written shape, or incorrect shapefile file length header).

It would also be possible to skip through all shapes to count the number of shapes and using this to set the numShapes attribute when first loading the shapefile. Although we can probably do this pretty fast by only reading the first 8 bytes of each shape to skip to the start of the next shape, this might be an unnecessary amount of processing for a reader that should be as lazy as possible..

karimbahgat · 2022-04-30T09:06:54Z

This file should be read correctly in v2.3.0. More robust handling of shapefiles with corrupted data based on the discussion in this thread, believing the metadata rather than reading to the end of the file. See specific implementation details at d1a8fed and ff51716.

jjuch · 2024-12-19T14:52:59Z

Hi, I've encountered similar errors with the v2.3.0 version and in my case I could bring it back to an issue with unpacking the byte array containing the recNum and recLength in Reader.__len__ and Reader.__shape. The unpack function uses >2i formatting, but as they were stored as unsigned integers uint32, I had to update them to >2I formatting to have the correct value.

On the post of 17 March 2022, a similar problem is observed, where recLength is a negative value, causing an error.

I haven't done a thorough analysis on this issue, so I cannot propose a decent PR, but it could give insight to consider in the next version.

JamesParrott · 2024-12-19T17:02:52Z

Thanks for looking into this, and for the work-around @jjuch.

The difference between >2i and >2I in struct.Struct does indeed configure either signed or unsigned ints respectively (32 bit / 4 byte ones).

The original shapefile spec is quite clear that integers should be signed, by the way:

Numeric Types A shapefile stores integer and double-precision numbers. The remainder of this document
will refer to the following types:
Integer: Signed 32-bit integer (4 bytes)

Nonetheless, there are precedents in PyShp and beyond for being more tolerant than that of Shapefiles, that strictly speaking are non-compliant. Adding support to handle such files could well boil down to a single character.

Do you have an example shapefile, and do you know from where it originally came?

jjuch · 2024-12-20T09:21:04Z

Ok, my bad, it was another issue in the shape's content header, where the recLength was incorrect and it returned one of the vertices' coordinate as the next recLength, causing negative values. Thank you for the fast reply, though!

karimbahgat added enhancement invalid-file labels Mar 17, 2022

karimbahgat added this to the More robustness milestone Mar 17, 2022

karimbahgat closed this as completed Apr 30, 2022

This was referenced Apr 30, 2022

Handle shapefile headers which specify incorrect shp file length #147

Closed

Error in shapes() #223

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OSError Invalid argument when reading shape file #235

OSError Invalid argument when reading shape file #235

INRIX-Mark-Gershaft commented Feb 22, 2022

karimbahgat commented Feb 23, 2022

INRIX-Mark-Gershaft commented Feb 25, 2022 •

edited

Loading

karimbahgat commented Mar 17, 2022 •

edited

Loading

karimbahgat commented Mar 17, 2022 •

edited

Loading

karimbahgat commented Apr 30, 2022

jjuch commented Dec 19, 2024

JamesParrott commented Dec 19, 2024 •

edited

Loading

jjuch commented Dec 20, 2024 •

edited

Loading

OSError Invalid argument when reading shape file #235

OSError Invalid argument when reading shape file #235

Comments

INRIX-Mark-Gershaft commented Feb 22, 2022

karimbahgat commented Feb 23, 2022

INRIX-Mark-Gershaft commented Feb 25, 2022 • edited Loading

karimbahgat commented Mar 17, 2022 • edited Loading

karimbahgat commented Mar 17, 2022 • edited Loading

karimbahgat commented Apr 30, 2022

jjuch commented Dec 19, 2024

JamesParrott commented Dec 19, 2024 • edited Loading

jjuch commented Dec 20, 2024 • edited Loading

INRIX-Mark-Gershaft commented Feb 25, 2022 •

edited

Loading

karimbahgat commented Mar 17, 2022 •

edited

Loading

karimbahgat commented Mar 17, 2022 •

edited

Loading

JamesParrott commented Dec 19, 2024 •

edited

Loading

jjuch commented Dec 20, 2024 •

edited

Loading