Iterating over huge arrays

Tuesday September 5, 2006

I just released a new version of my arrayterator module. I cleaned it up a little and simplified the interface, enough to justify bumping the version number from 0.1 to 0.2.

The module solves an uncommon problem: sometimes I need to iterate over all the values of a huge multi-dimensional array stored on disk. The naïve solution is to read the whole array into memory and iterate over its flattened values:

>>> from pynetcdf import NetCDFFile as nc
>>> f = nc('huge.file')
>>> var = f.variables['some-var']
>>> array = var[:]
>>> for value in array.flat: pass

Of course, this will exhaust the computer's memory if the array is too big. My solution is to wrap var in my arrayterator class:

>>> from arrayterator import arrayterator
>>> array = arrayterator(var, nrecs=17)
>>> for value in array.flat: pass

This way the program reads at most 17 records from the file at a time. For a 4×10 array, for example, the iteration reads blocks of shape (4,4) from the variable. The block shape can vary, depending on the array shape and the desired number of records to read: for a 7×1 array and a buffer size of 4, two blocks of shape (4,1) and (3,1) are read, in that order.
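To make the block arithmetic concrete, here is a rough pure-Python sketch that reproduces the two examples above. The rule it uses (fill the leading axis first, then spend what is left on the following axes) is only my guess from those two examples and is not taken from the module's source; the names block_shape, iter_blocks and shapes are hypothetical.

```python
from itertools import product


def block_shape(shape, nrecs):
    """Return a block shape holding at most nrecs values.

    Assumed rule (inferred from the two examples, not taken from the module):
    fill the leading axis first, then spend what is left on the next axes.
    All dimensions are assumed to be at least 1.
    """
    remaining = max(nrecs, 1)
    block = []
    for dim in shape:
        count = min(dim, remaining)
        block.append(count)
        remaining //= count
    return tuple(block)


def iter_blocks(shape, nrecs):
    """Yield one tuple of slices per block, covering the whole array."""
    block = block_shape(shape, nrecs)
    starts_per_axis = [range(0, dim, step) for dim, step in zip(shape, block)]
    for starts in product(*starts_per_axis):
        yield tuple(
            slice(start, min(start + step, dim))
            for start, step, dim in zip(starts, block, shape)
        )


def shapes(shape, nrecs):
    """Shapes of the blocks produced by iter_blocks, in reading order."""
    return [tuple(s.stop - s.start for s in blk) for blk in iter_blocks(shape, nrecs)]
```

With these definitions, block_shape((4, 10), 17) gives (4, 4) and shapes((7, 1), 4) gives [(4, 1), (3, 1)], matching the numbers quoted above. Under the same assumed rule the last block of the 4×10 case would have shape (4, 2).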

A nice feature is that you can slice the arrayterator. The result is a new arrayterator that iterates over the requested subset, exactly as if you were iterating over the sliced array. And of course the wrapper supports any number of dimensions, not just 2 as in these examples.
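The slicing idea can be illustrated with a small one-dimensional sketch. This is not the arrayterator code: LazyView is a hypothetical class, it handles only a single axis and slices with a positive step, and it uses a plain list as a stand-in for the on-disk variable. The point is that slicing only records which positions are wanted, and data is read in blocks of at most nrecs values when the iteration actually runs.

```python
class LazyView:
    """Hypothetical 1-D wrapper: slicing is lazy, iteration reads in small blocks."""

    def __init__(self, source, nrecs, index=None):
        self.source = source  # anything supporting len() and slicing
        self.nrecs = nrecs    # maximum number of values per read
        # A range describes which positions of the source this view covers.
        self.index = range(len(source)) if index is None else index

    def __len__(self):
        return len(self.index)

    def __getitem__(self, key):
        if not isinstance(key, slice) or (key.step is not None and key.step < 0):
            raise TypeError("this sketch only supports slices with a positive step")
        # Slicing a range composes the slices without touching the data.
        return LazyView(self.source, self.nrecs, self.index[key])

    def __iter__(self):
        for start in range(0, len(self.index), self.nrecs):
            chunk = self.index[start:start + self.nrecs]
            # Only now is a block of at most nrecs values read from the source.
            yield from self.source[chunk.start:chunk.stop:chunk.step]


data = list(range(100))                      # stand-in for a variable on disk
view = LazyView(data, nrecs=4)[10:30:2][1:]  # slicing a view gives another view
```

Iterating over view yields the same values as data[10:30:2][1:], but never asks the source for more than four values at once.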

Roberto De Almeida
