Should array indices start at 0 or 1? My compromise of 0.5 was
rejected without, I thought, proper consideration.
--Stan Kelly-Bootle
Introduction
Library software like netCDF or
HDF5 provides access to multidimensional data by array indices, but we
would often rather access data by coordinates, such as points on the
Earth's surface or space-time bounding boxes.
The Climate and Forecast (CF)
Metadata Conventions provide agreed-upon ways to represent coordinate
systems, but do not address making access by coordinates practical and
efficient. In particular, use of coordinate systems such as those described
in the CF Conventions section "Two-Dimensional Latitude, Longitude, Coordinate
Variables" may not provide a direct way to find array
indices from latitude and longitude coordinates for various
reasons. Examples of such data include:
- use of curvilinear grids that follow coastlines
- satellite swaths georeferenced to lat-lon grids
- coordinate systems with no CF grid_mapping_name attribute
- ungridded "station" data, such as point observations and soundings
Using a real-world (well, OK, an idealized spherical Earth)
example, we'll see several ways to access data by coordinates, as well
as how the use of appropriate software and data structures can greatly
improve the efficiency of such access.
An example dataset
The example dataset we use is from the HYbrid Coordinate Ocean Model (HYCOM).
This dataset makes a good example for a couple of reasons:
- It's typical of many ocean, climate, and forecast model data
products, with multiple variables on the same grid, and conforming
to the CF Conventions for representing a coordinate system.
- It has a non-trivial geospatial coordinate system, represented by
2D latitude and longitude arrays that are auxiliary coordinates for
other variables. The grid appears to be a subset of a tri-polar grid, but none of the parameters needed to derive a formula mapping a specified coordinate to array indices are in the file.
Though not very relevant for the subject of this blog, the example
dataset also makes good use of compression, with fill values over land
for ocean data. The netCDF classic file is 102 MB, but the equivalent
netCDF-4 classic model file with compression is only
30 MB. Nothing in this post has anything to do with the netCDF-4
data model; it all applies to classic model netCDF-3 or netCDF-4
data.
Here's some output, relevant to the problem we'll present, from "ncdump -h" applied to the example file:
    dimensions:
        MT = UNLIMITED ; // (1 currently)
        Y = 850 ;
        X = 712 ;
        Depth = 10 ;
    variables:
        double MT(MT) ;
            MT:units = "days since 1900-12-31 00:00:00" ;
        double Date(MT) ;
            Date:units = "day as %Y%m%d.%f" ;
        float Depth(Depth) ;
            Depth:units = "m" ;
            Depth:positive = "down" ;
        int Y(Y) ;
            Y:axis = "Y" ;
        int X(X) ;
            X:axis = "X" ;
        float Latitude(Y, X) ;
            Latitude:units = "degrees_north" ;
        float Longitude(Y, X) ;
            Longitude:units = "degrees_east" ;
        float temperature(MT, Depth, Y, X) ;
            temperature:coordinates = "Longitude Latitude Date" ;
            temperature:standard_name = "sea_water_potential_temperature" ;
            temperature:units = "degC" ;
        float salinity(MT, Depth, Y, X) ;
            salinity:coordinates = "Longitude Latitude Date" ;
            salinity:standard_name = "sea_water_salinity" ;
            salinity:units = "psu" ;
Plotting the full latitude-longitude grid would result in a big blob of
over 600,000 points, but here's a sparser representation of the grid,
showing every 25th grid line parallel to the X and Y axes:
![Sparse grid lines near Alaska]()
Examples of coordinate queries
The temperature variable has auxiliary Longitude and Latitude
coordinates, and we want to access data corresponding to their
values, rather than X and Y array indices. Can we
efficiently determine values of netCDF variables such as temperature or salinity at, for example, the grid point closest to 50
degrees north latitude and -140 degrees east longitude?
More generally, how can we efficiently
- determine the value of a variable nearest a specified location and time?
- access all the data on a subgrid defined within a specified
space-time coordinate bounding box?
Why use an iPython notebook for the example?
For this example, the code for reading netCDF data and mapping from
coordinate space to array index space is presented in an iPython
notebook. Why not C, Java, Fortran, or
some other language supported directly by Unidata?
The point is clarity. This is a pretty long blog, which has
probably already collected several "TLDR" comments. The
accompanying example code has to be short, clear, and easy to
read.
The iPython notebook, which grew out of a session on reading netCDF
data in the Unidata 2013 TDS-Python Workshop, is available in two
forms: a
non-interactive view in HTML that you can read through and follow
along, or an actual iPython notebook, "netcdf-by-coordinates", and its associated data file from the 2013 Unidata
TDS-Python Workshop, with which you can run examples
and do timings on your own machine and even with your own
data!
If you choose the second method, you can see the results of
changing the examples or substituting your own data. To use the
notebook you'll need to have Python and some additional libraries
installed, as listed at the bottom of the README page on the referenced workshop site.
Just looking at the HTML version should be sufficient to see what's
going on.
Four Approaches
In the notebook, we implement and progressively improve several ways to access
netCDF data based on coordinates instead of array indices:
- slow, simple, and sometimes wrong
- fast, simple, and sometimes wrong
- not quite as fast, but correct
- fast, correct, and scalable for many queries on the same
set of data points
Naive, slow approach
Since our concrete problem is to find which
of over 600,000 points is closest to the (lat,
lon) point (50, -140), we should first say
what is meant by "close".
A naive metric for the distance squared
between (lat, lon) and (lat0, lon0) just uses the "Euclidean norm", because it's easy to
compute:

    dist_squared = (lat - lat0)^2 + (lon - lon0)^2
So we need to find indices x and y such that the
point
(Latitude[y, x],
Longitude[y, x])
is close to the desired point (lat0, lon0). The naive way to
do that is to use two nested loops, checking the distance squared for
all 600,000+ pairs of y and x values, one point at a time.
The code for this version is presented as the function naive_slow, along with a few subsequent lines that
call the function on our explicit example.
Note that the code cell examples all initially open the example
file, read the data needed to find the desired values, and close
the file after printing enough information to verify that we got
the right answer. The work of opening the file, reading the
data, and closing the file is repeated for each function so that
the cells are independent of each other and can be run in any order
or re-run without getting errors such as trying to read from closed
files or closing the same file twice.
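To make the approach concrete, here's a minimal sketch of what a naive_slow implementation looks like. The file name "example.nc" is a placeholder, and the netCDF4-python calls are one reasonable way to do the reading; the notebook's own cell may differ in detail:

```python
from netCDF4 import Dataset

def naive_slow(filename, lat0, lon0):
    """Return indices (iy, ix) of the grid point closest to (lat0, lon0),
    using a flat-Earth squared distance and explicit Python loops."""
    nc = Dataset(filename)
    latvals = nc.variables['Latitude'][:]
    lonvals = nc.variables['Longitude'][:]
    ny, nx = latvals.shape
    dist_sq_min = 1.0e30
    for iy in range(ny):
        for ix in range(nx):
            dist_sq = (latvals[iy, ix] - lat0)**2 + (lonvals[iy, ix] - lon0)**2
            if dist_sq < dist_sq_min:
                iy_min, ix_min, dist_sq_min = iy, ix, dist_sq
    nc.close()
    return iy_min, ix_min

# iy, ix = naive_slow('example.nc', 50.0, -140.0)  # placeholder file name
```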
Faster with NumPy
The first approach uses conventional
explicit loops, which makes it resemble Fortran or C code. It can
be sped up by a factor of over 700 just by using NumPy, a Python
library supporting multidimensional arrays and array-at-a-time
operations and functions that often replace explicit loops with whole
array statements. We call this version of the function naive_fast.
Like naive_slow, it suffers from the same flaw: a "flat" measure
of closeness.
Tunnel distance: still fast, but more correct
A problem with both of the previous versions is that
they treat the Earth as flat, with a degree of longitude
near the poles just as large as a degree of
longitude at the equator. They also treat the
distance between points on the edges, such as (0.0, -179.99) and
(0.0, +179.99), as large, even though these points
are just across the International Date Line from
each other.
One way to solve both these problems is to use a
better metric, for example the length of a
tunnel through the Earth between two points as
distance, which happens to be easy to compute by just using a little
trigonometry.
This results in a more correct solution that works
even if longitudes are given in a different range,
such as between 0 degrees and 360 degrees instead of
-180 to 180, because we just use trigonometric functions
of angles to compute tunnel
distance. Wikipedia has the simple trig
formulas we can use. This also gives us
the same answer as the naive approach for our
example point, but avoids erroneous answers we would
get by treating the Earth as flat in the "naive" approaches.
This approach, implemented in the function we call tunnel_fast (because we still use fast NumPy arrays
and functions instead of explicit loops), is still about 200 times as
fast as the loopy naive
approach.
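Here is a sketch of how tunnel_fast can compute that metric with NumPy, converting latitudes and longitudes to Cartesian coordinates on a unit sphere; the file name is again a placeholder, and details may differ from the notebook's cell:

```python
import numpy as np
from netCDF4 import Dataset

def tunnel_fast(filename, lat0, lon0):
    """Closest grid point to (lat0, lon0), using the squared length of a
    straight-line tunnel through a unit sphere as the distance measure."""
    nc = Dataset(filename)
    latvals = np.radians(nc.variables['Latitude'][:])
    lonvals = np.radians(nc.variables['Longitude'][:])
    lat0_rad, lon0_rad = np.radians(lat0), np.radians(lon0)
    # Differences of Cartesian coordinates on a unit sphere
    delX = np.cos(lat0_rad) * np.cos(lon0_rad) - np.cos(latvals) * np.cos(lonvals)
    delY = np.cos(lat0_rad) * np.sin(lon0_rad) - np.cos(latvals) * np.sin(lonvals)
    delZ = np.sin(lat0_rad) - np.sin(latvals)
    dist_sq = delX**2 + delY**2 + delZ**2
    iy_min, ix_min = np.unravel_index(dist_sq.argmin(), dist_sq.shape)
    nc.close()
    return iy_min, ix_min
```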
Using KD-trees for More Scalable Speedups
Finally, we use a
data structure specifically designed for quickly finding the closest
point in a large set of points to a query point: the KD-tree (also
called a "k-d tree"). Using a KD-tree is a two-step
process.
First you load the KD-tree with the set of points within which you
want to search for a closest point, typically the grid points or
locations of a set of ungridded data. How long this takes
depends on the number of points as well as their dimensionality, but
is similar to the time it takes to sort N numbers, namely O(N log(N)). For the example dataset with over 600,000
points in the grid, this takes under 3 seconds on our laptop test
platform, but it's noticeably longer than the setup time for the
minimum tunnel-distance search.
The second step provides a query point and returns the closest
point or points in the KD-tree to the query point, where how "closest" is defined can be varied. Here, we use 3D
points on a spherical Earth with unit radius. The combination of
setup and query is initially implemented in a kdtree_fast function.
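Here's a sketch of that kdtree_fast combination using scipy.spatial.cKDTree, with the setup and query steps marked by comments; the file name is a placeholder and the details may differ from the notebook's version:

```python
import numpy as np
from netCDF4 import Dataset
from scipy.spatial import cKDTree

def kdtree_fast(filename, lat0, lon0):
    """Closest grid point to (lat0, lon0), using a KD-tree built from the
    grid points expressed as Cartesian coordinates on a unit sphere."""
    nc = Dataset(filename)
    latvals = np.radians(nc.variables['Latitude'][:])
    lonvals = np.radians(nc.variables['Longitude'][:])
    # Setup: build the KD-tree from all the grid points (the expensive step)
    clat, clon = np.cos(latvals), np.cos(lonvals)
    slat, slon = np.sin(latvals), np.sin(lonvals)
    triples = np.column_stack(((clat * clon).ravel(),
                               (clat * slon).ravel(),
                               slat.ravel()))
    kdt = cKDTree(triples)
    # Query: find the nearest grid point to the query point (the cheap step)
    lat0_rad, lon0_rad = np.radians(lat0), np.radians(lon0)
    point = [np.cos(lat0_rad) * np.cos(lon0_rad),
             np.cos(lat0_rad) * np.sin(lon0_rad),
             np.sin(lat0_rad)]
    dist, flat_index = kdt.query(point)
    iy_min, ix_min = np.unravel_index(flat_index, latvals.shape)
    nc.close()
    return iy_min, ix_min
```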
Although you can't tell by running the setup and query together,
the kdtree_fast query is significantly faster than
the tunnel_fast search. Thus the kdtree_fast approach scales much better when we have
one set, S, of points to search and lots of query points for
which we want the closest point (or points) in S.
For example, we may want variable values on a 100 by 100 (lat, lon)
subgrid of the domain, and as we'll see, the KD-tree provides a much
faster solution than either the tunnel_fast or naive_fast methods. If the locations don't
change with data from multiple files, times, or layers, the same
KD-tree can be reused for many point queries, amortizing its setup
time and providing significant efficiency benefits. For popular spatial datasets on
the server side, the same KD-tree could be
reused for multiple datasets that use
the same coordinates.
The KD-tree data structure can also be used more flexibly to
provide fast queries for the M closest points to a specified query
point, which is useful for interpolating values instead of just using
the value of a variable at a single closest point.
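With scipy's cKDTree, for example, asking for the M closest points is just a matter of passing k to the query method; here's a tiny self-contained illustration with made-up 2D points:

```python
import numpy as np
from scipy.spatial import cKDTree

# A small, made-up set of 2D points, just to show a k-nearest-neighbor query
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
tree = cKDTree(points)

# Distances and indices of the 3 points closest to (0.6, 0.4), which could
# then feed, say, an inverse-distance-weighted interpolation
distances, indices = tree.query([0.6, 0.4], k=3)
print(distances, indices)
```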
The KD-tree data structure is like a Swiss-army knife for
coordinate data, analogous in some ways to regular expressions for
strings. We have shown use of KD-trees for 2D lat/lon data
and for 3D coordinates of points on a sphere with a tunnel distance
metric. KD-trees can also be used with vertical coordinates or time,
if you specify a reasonable way to define "close" between
points in space-time.
Implementations of KD-trees are freely available for
C, Java, Fortran, C++, and other
languages besides Python. A Java KD-tree implementation is used in ncWMS, the Web Map Service included in TDS
servers but developed at the University of Reading. A C/C++
implementation of KD-trees is used in Fimex, a
library from the Norwegian Meteorological Institute that supports
interpolation and subsetting of geospatial data built around the
Unidata Common Data Model. The module we used in the iPython notebook
example, scipy.spatial.cKDTree, is a C implementation callable from
Python, significantly faster than the pure
Python implementation in
scipy.spatial.KDTree.
The last section of the notebook re-implements our four functions as
four object-oriented Python classes, with the setup code in a class
constructor and a single query method. This makes it easier to
time the initialization and query parts of each approach separately,
which leads to simple formulas for the time required for N points and
a prediction for when the extra setup time for kdtree_fast pays off.
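As a rough sketch of that reorganization for the KD-tree case (the class name and its details here are illustrative rather than copied from the notebook):

```python
import numpy as np
from netCDF4 import Dataset
from scipy.spatial import cKDTree

class Kdtree_fast(object):
    """KD-tree nearest-grid-point search, with the setup in the constructor
    so that setup and query can be timed separately."""

    def __init__(self, filename, latvarname='Latitude', lonvarname='Longitude'):
        # Setup: read the 2D coordinate arrays and build the KD-tree once
        self.nc = Dataset(filename)
        latvals = np.radians(self.nc.variables[latvarname][:])
        lonvals = np.radians(self.nc.variables[lonvarname][:])
        self.shape = latvals.shape
        clat, clon = np.cos(latvals), np.cos(lonvals)
        slat, slon = np.sin(latvals), np.sin(lonvals)
        triples = np.column_stack(((clat * clon).ravel(),
                                   (clat * slon).ravel(),
                                   slat.ravel()))
        self.kdt = cKDTree(triples)

    def query(self, lat0, lon0):
        """Return (iy, ix) of the grid point closest to (lat0, lon0)."""
        lat0_rad, lon0_rad = np.radians(lat0), np.radians(lon0)
        point = [np.cos(lat0_rad) * np.cos(lon0_rad),
                 np.cos(lat0_rad) * np.sin(lon0_rad),
                 np.sin(lat0_rad)]
        _, flat_index = self.kdt.query(point)
        return np.unravel_index(flat_index, self.shape)
```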
And the Winner Is ...
The envelope, please. And the
winner is ... YMMV!
The reasons Your Mileage May Vary include:
- number of points in search set
- dimensionality of search set
- complexity of distance metric
- number of query points
- number of closest neighbors needed, e.g. for interpolation
- I/O factors, including caching, compression, and remote access effects
- accounting for non-spherical Earth shape
We've summarized our timing results for the four approaches to the
sample problem on the example dataset at the end of the iPython
notebook. Perhaps we can look at some of the factors listed above in
future blogs.
In case you didn't want to open the iPython notebook to see the
results, here are some timings on a fast laptop platform:
| Method | Setup (ms) | Query (ms) | Setup + 10000 queries (sec) |
|---|---|---|---|
| Naive_slow | 3.76 | 7790 | 77900 |
| Naive_fast | 3.8 | 2.46 | 24.6 |
| Tunnel_fast | 27.4 | 5.14 | 51.4 |
| Kdtree_fast | 2520 | 0.0738 | 3.3 |
Determining when it's worth it to use a KD-tree for this example resulted in:
- kdtree_fast outperforms naive_fast above 1050 queries
- kdtree_fast outperforms tunnel_fast above 490 queries
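Those break-even points follow directly from the timings in the table: the total time for N queries is setup + N × query, so kdtree_fast overtakes naive_fast when 2520 + 0.0738 N < 3.8 + 2.46 N, that is for N > (2520 - 3.8) / (2.46 - 0.0738) ≈ 1050, and overtakes tunnel_fast for N > (2520 - 27.4) / (5.14 - 0.0738) ≈ 490.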
The advantage of using KD-trees is much greater for larger search sets,
because KD-tree query complexity is O(log(N)) while the other
algorithms are O(N), analogous to the difference between binary
search and linear search.
If you're interested in more on this subject,
a good short paper on various ways to regrid data using KD-trees and
other methods is "Fast regridding of large, complex geospatial
datasets" by Jon Blower and A. Clegg.