Unidata Developer's Blog

GRIB renaming in 4.3


CDM version 4.3 is a complete overhaul of GRIB handling, including the use of external tables, which map the codes in a GRIB file to the parameter names, units, and everything else there is to know about the data. Because there are no machine-readable, canonical tables for software to use, these tables are maintained by hand, and parameter names are sometimes tweaked in the tables actually used by centers, including NCEP. Since we use those tables to name the CDM variables, we have a problem when names that are embedded in scripts or in IDV bundles change.

(There are a bunch of other complexities about creating GRIB variable names that you can read about in previous blog posts. Previous versions of the CDM GRIB reading code also made several important mistakes in how variables are named, which must be corrected, so a lot of old scripts have to break anyway.)

To solve this problem, the CDM experimented with building variable names only from information in the GRIB files themselves (resulting in lovely names like "VAR_58-0-2-11_L105") and encoding all the "human readable" information in the long_name attribute. While this solved the changing-name problem, nobody was very happy with it. After much discussion at the recent user committee meeting, we came up with this recommendation for the GRIB renaming issue:

Usercomm recognizes the tension between making GRIB variable names human readable vs keeping them stable, in the face of changing external GRIB parameter tables. We also recognize that the netCDF-Java library needs to fix incorrectly named GRIB variables. We recommend that:

1) GRIB variable names should be kept human readable.

2) A mapping between old and new names should be created so that users don't experience disruptive changes in IDV bundles and scripts.

3) A small, ongoing amount of "bundle breaking" is expected as GRIB parameter tables change.

4) The IDV should develop functionality to gracefully transition bundles and other scripts as GRIB variable names, URLs, and other external web resources change.

The net result is that GRIB variable names will stay human readable in 4.3  (assuming you think "wnd_vcmp_mn_trnc18_isobaric" makes sense), at the cost of occasional broken scripts when GRIB tables get modified. This is the state of the art. Hopefully we can continue to improve it.

      'Tis but thy name that is my enemy;
      Thou art thyself, though not a Montague.
      What's Montague? it is nor hand, nor foot,
      Nor arm, nor face, nor any other part
      Belonging to a man. O, be some other name!
      What's in a name? that which we call a rose
      By any other name would smell as sweet;
      So Romeo would, were he not Romeo call'd,
      Retain that dear perfection which he owes
      Without that title. Romeo, doff thy name,
      And for that name which is no part of thee
      Take all myself.  -- J.C.

 


An Essay on Domain Specific Models

A domain specific model is one which has constructs that are specific to a particular domain. Examples are the CDM Coordinate System and Scientific Feature Type models, which augment the CDM with features such as lat-lon based indexing, grids, and point data.

It is often the case that a domain specific model is implemented by providing an API that in turn is implemented with respect to some underlying, more generic data model. Again, CDM is an example, where the Coordinate System and Scientific Feature Type models are implemented using the underlying CDM Data Access Layer model.

DAP4 also provides a good, generic model capable of supporting domain specific models. In fact, it should be possible to implement the equivalent of the CDM Coordinate System and Scientific Feature Type models on top of DAP4.

Figure 1. Notional Domain Model over DAP4 Architecture

Figure 1 shows a notional architecture for using DAP4 as the basis for domain specific modeling.

The client is given an API supporting a domain specific model. It is assumed here that an instance of the domain model meta-data is represented as a traversable abstract syntax tree. So, the model API would provide the following operations (a notional sketch of such an API follows the list).

  1. Request the meta-data for a specific dataset. The result would be a reference to the root of the abstract tree for that meta-data. This is analogous to asking DAP2 for a DDS.
  2. Traverse the meta-data tree.
  3. Request some subset of the data associated with the dataset. The request would be in terms of the domain-specific model.
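Here is a minimal Java sketch of what such a client-side API might look like. All of the type and method names (DomainModelClient, DomainNode, DomainData) are illustrative assumptions, not actual DAP4 or CDM classes:

// Hypothetical client-side API for a domain specific model layered over DAP4.
public interface DomainModelClient {
    // 1. Request the meta-data for a dataset; returns the root of its abstract syntax tree.
    DomainNode getMetadata(String datasetUrl) throws java.io.IOException;

    // 3. Request some subset of the data, expressed in domain-model terms.
    DomainData getData(String datasetUrl, String domainConstraint) throws java.io.IOException;
}

// 2. The meta-data tree is traversed through the nodes themselves.
interface DomainNode {
    String getName();
    java.util.List<DomainNode> getChildren();
}

// Container for the returned data, already converted to the domain model.
interface DomainData {
}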

The domain-specific library implementing the API would be responsible for making requests to the server for information. Those requests would indicate to the server the domain model being used as well as the dataset being requested.

The reply from the server would be in the form of a DAP4 DDX and data with annotations (i.e. attributes and extra variables) sufficient to allow the client-side domain model library to convert the returned information to the domain model for presentation to the client.

The server side operation is similar. It is assumed that a request from a client contains sufficient identifying information to allow the server (a servlet container such as Tomcat) to forward the request to the servlet capable of interpreting such a domain specific model request.

The domain specific model servlet is capable of translating the dataset into DAP4 as the reply to the client's request. As expected by the client, the reply is annotated with sufficient information to allow the generic DAP4 reply to be converted to the domain specific model.

Dennis Heimbigner

HDF5 Dimension Scales


When we created the netCDF-4 file format on top of HDF5, we asked the HDF group to add shared dimensions. They said no, and instead added dimension scales, which at that point were in the HDF4 data model, but not in HDF5. In retrospect, I think we should have worked harder to come to a mutual agreement. The lack of shared dimensions in HDF5 makes HDF5 not a strict superset of netCDF-4.

In this post I'm going to review dimensions and dimension scales. I'll try to convince you that the lack of shared dimensions in HDF5 means that you really should use netCDF-4 instead of HDF5 for earth science data.

In the netCDF data model, a variable is a container for a multidimensional array of data. The shape of the array is defined by the list of dimensions for the variable. A dimension has a name and a length. When more than one variable uses the same dimension, we say that the dimension is shared.

What does it mean for a dimension to be shared? A variable can be thought of as representing a function on a set of points called the domain of the function. A variable's set of dimensions are simply a representation of its domain. From mathematics, we represent the function like

f : D -> R 

meaning for each point in the domain set, D, a function assigns a value from the range set, R. How does this relate to netCDF variables? Let's use this example:

  dimensions
    lat = 180;
    lon = 360;
    z = 56;
    time = 365; 

  variables:
    float data(time, z, lat, lon);

The data variable is defined for a set of index values defined by the dimensions of the variable.  We call this the variable's domain in index space. (Index space is an abstract lattice of points in n-dimensional space, where n is the number of dimensions.) What we are usually more interested in is mapping data to locations on the physical earth. To do so, we use coordinate functions. These are simply netCDF variables that play the role of assigning, for each data point, a location in coordinate space (in this example, physical space and time). In order to do this, these coordinate functions must have the same domain as the data variable, and so must have the same dimensions. So, in our example this might look like:

  dimensions
  lat = 180;
  lon = 360;
  z = 56;
  time = 365; 

variables:
  float data(time, z, lat, lon);
    data:coordinates = "lat lon z time";
  float lon(lon);
  float lat(lat);
  float z(z);
  int time(time); 

Here we use the CF convention's coordinates attribute to assign coordinate variables to the data variable. For some random point, say, time=5, z=12, lat=22, lon=123, the location of data(5,12,22,123) is at time(5), z(12), lat(22), lon(123). It's pretty clear that coordinate functions can only have dimensions that are shared with the variable. For example, suppose you had lat(sample). Since the sample dimension doesn't appear in the data, there is no way to assign a unique lat value to a data point. On the other hand, there is no problem with the coordinate functions using only a subset of dimensions (as in this example) or even a scalar variable, for example if all the data was at a single time coordinate.

To summarize, the essence of shared dimensions is that they indicate that two variables have the same domain, and this is needed to assign coordinates for sampled functions. If you like UML (and who doesn't?) here is the CDM's UML diagram for coordinate systems.

Ok, let's get back to the HDF5 data model. HDF5 variables (aka datasets) don't use shared dimensions, but define their shape with a dataspace object, which is defined separately for each variable. So there is no formal way in the HDF5 data model to indicate that two variables share the same domain. As we'll see, dimension scales help some, but not enough.

Each variable in HDF5 defines its shape with a dataspace, which is essentially a list of private dimensions for the variable.  A Dimension Scale is a special variable containing a set of references to dimensions in variables. Each referenced variable has a DIMENSION_LIST attribute that contains, for each dimension, a list of references to Dimension Scales. So we have a two-way, many-to-many linking between Dimension Scales and Dimensions:

  DimScale <------> Dimension

The HDF5 Dimension Scale API mostly just maintains this two way linking, plus allows the links to be named.

So it appears that by using Dimension Scales, we now have shared dimensions in HDF5: namely, all the dimensions that share the same Dimension Scale are ... the same!

Unfortunately nothing requires the "shared" dimensions to have the same length as the dimension scale, or to have the same length as any of the other dimensions associated with the dimension scale, or even that the dimension scale have the same rank as an associated dimension. The HDF5 dimension scale design doc is quite explicit that any other semantics are not part of the HDF5 data model, and must be added by other layers:

It is important to emphasize that the Dataspace of a Dataset has no intrinsic meaning except to define the layout in computer storage. Dimension Scales may be used to store application specific labels to the positions in the stored data array, i.e., to add application specific meaning to the dimensions of the dataspace. A Dimension Scale is an object associated with one dimension of a Dataspace. The meaning of the association is left to applications. The values of the Dimension Scale are set by the application to reflect semantics of the data, for example, to associate coordinates of a reference system with positions on the dimension.

All we get with dimension scales is a many-to-many association of a variable's private dimension with a specially marked variable called a Dimension Scale. It is up to the user (or a layer like netCDF-4) to add semantics and maintain consistency. This was a deliberate choice by the HDF Group, presumably in the name of generality.

Obviously, other application layers like netCDF-4 can layer shared dimensions on top of HDF5 Dimension Scales. The minimum requirements for shared dimensions are that:

  1. Dimensions are associated with only one Dimension Scale.
  2. A Dimension Scale is one dimensional.
  3. All dimensions have the same length as the shared Dimension Scale.

Those are the things that a program can check for. But the intention of the data writer is crucial, because the real requirement for shared dimensions is that the dimensions represent the domain of the function, and the dimension scale values represent the coordinates for that dimension.

But if the HDF5 data model does not include that meaning, if a data writer can make dimension scales mean anything they want, then in a strict sense, without knowing more, software can't assume shared dimensions, so you can't define CF-style coordinate functions. And that is why you should use the netCDF4 library — in order to get that essential functionality.
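As a concrete illustration of what that functionality buys you, here is a minimal netCDF-Java (CDM) sketch that opens a CF-style file like the example above and reports the coordinate axes the library finds for each gridded variable. The file name is a placeholder, and method details may vary a bit between CDM versions:

import ucar.nc2.dt.GridCoordSystem;
import ucar.nc2.dt.GridDatatype;
import ucar.nc2.dt.grid.GridDataset;

public class ShowCoordSystems {
  public static void main(String[] args) throws Exception {
    GridDataset gds = GridDataset.open("example.nc");  // placeholder file name
    try {
      for (GridDatatype grid : gds.getGrids()) {
        GridCoordSystem gcs = grid.getCoordinateSystem();
        System.out.println(grid.getFullName()
            + ": x=" + gcs.getXHorizAxis().getFullName()
            + " y=" + gcs.getYHorizAxis().getFullName()
            + " t=" + (gcs.getTimeAxis() == null ? "none" : gcs.getTimeAxis().getFullName()));
      }
    } finally {
      gds.close();
    }
  }
}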

The HDF5 design does seem to recognize the use case of a 1-D Dimension Scale whose length matches the dimension:

A simple case is where the Dimension Scale s is a (one dimensional) sequence of labels for the dimension ix of Dataset d. In this case, Dimension Scale is an array indexed by the same index as in the dimension of the Dataspace. For example, for the Dimension Scale s, associated with dimension ix, the ith position of ix is associated with the value s[i], so s[i] is taken as a label for ix[i].

In my next blog post I will show how netCDF-4 uses Dimension Scales, and what can be done in a practical sense with HDF5 data files.

Next: HDF5 Dimension Scales - Part 2

HDF5 Dimension Scales - Part 2


Last time we said that in order for an HDF5 Dimension Scale to represent a shared dimension, the following must be true:

  1. Dimensions are associated with only one Dimension Scale.
  2. A Dimension Scale is one dimensional.
  3. All dimensions have the same length as the shared Dimension Scale.

The netCDF-Java and C libraries look for Dimension Scales that satisfy these conditions in any HDF5 file, not just those written with the netCDF-4 library. So applications can create shared dimensions using the HDF5 API directly. That's good news for those who are not allowed to use the netCDF-4 library. (But the application has to be a bit smarter, so you might as well let netCDF-4 do the work for you if possible).
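For example, here is a short netCDF-Java sketch (the file name is made up) that opens an HDF5 file directly and prints the shared dimensions and variable shapes the CDM recovered from its dimension scales; exact method names may differ slightly across netCDF-Java versions:

import ucar.nc2.Dimension;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

public class ListSharedDims {
  public static void main(String[] args) throws Exception {
    NetcdfFile ncfile = NetcdfFile.open("swath.h5");  // any HDF5 file with 1D dimension scales
    try {
      for (Dimension d : ncfile.getDimensions())
        System.out.println("shared dimension: " + d.getShortName() + " = " + d.getLength());
      for (Variable v : ncfile.getVariables())
        System.out.println("variable: " + v.getNameAndDimensions());
    } finally {
      ncfile.close();
    }
  }
}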

You might hope that one could also use Dimension Scales to represent coordinate functions. For example, in our previous blog we had this example:

 dimensions
  lat = 180;
  lon = 360;
  z = 56;
  time = 365; 

variables:
  float data(time, z, lat, lon);
    data:coordinates = "lat lon z time";
  float lon(lon);
  float lat(lat);
  float z(z);
  int time(time); 

And the Dimension Scale attribute associated with the data variable looks just like the CF coordinates attribute:

 DIMENSION_LIST = "time", "z", "lat", "lon";

It turns out that the example above is really a special case of separable coordinate functions, where each coordinate variable is 1 dimensional and each has a unique dimension associated with it. This is an example of the gridded data type, and is familiar to most of us. Unfortunately it is the only one that many data writers understand.

The previous blog covered the general form of coordinate functions, namely that they must have the same domain as the data variables, meaning they have the same (or a subset of the) dimensions of the data variable. As a canonical example, consider swath data:

 dimensions
    scan = 3253;
    xscan = 980;
 variables:
    float data(scan, xscan);
      data:coordinates = "lat lon alt time";
    float lon(scan, xscan);
    float lat(scan, xscan);
    float alt(scan, xscan);
    int time(scan); 

Here we have 2D coordinate functions lon, lat and alt, which are not associated with just one dimension. The problem with HDF5 Dimension Scales is that they are associated with a single dimension of the variable, rather than the set of dimensions for the variable, i.e. the variable's domain. This is a fatal flaw if you are trying to use Dimension Scales to represent coordinate functions in a general way. You are going to need another layer to specify and implement the correct semantics for coordinate functions.

The netCDF-4 data model alone doesn't have this either, but if you add the CF coordinates attribute convention on top of netCDF, that is sufficient. Note that it's so simple even rocket scientists can do it:

 data:coordinates = "lat lon alt time";

Of course you can and should (are you listening satellite data providers?) use CF Conventions with HDF5, as long as you also implement shared dimensions (using Dimension Scales as described above). At this point the distinction between the netCDF-4 and HDF5 data files all but disappears, and finally, the taxpayers are getting their money's worth, and the heroic programmers sip well deserved Mai Tais on the tropical beach.

As long as we are living in alternate realities, what else might we ask for?

  1. The HDF5 library should provide shared dimensions. Using Dimension Scales as above would work fine. It's just a matter of saying that in an HDF5 file, if one uses unique, one dimensional Dimension Scales, then these represent shared dimensions. Such is the power of words that come from the right mouth.
  2. The HDF5 and netCDF-4 libraries should add generalized coordinate systems to their data model. The CF coordinates attribute would work fine. This would just be a matter of saying that coordinates is a reserved attribute, describing its meaning, and adding diagnostics to ensure that the dimensions of the coordinate functions satisfy the constraints as defined above.

The HDF5 dimension scale design doc brings up some use cases which aren't covered in the netCDF / CF coordinate data model. The first is that coordinates are often easily calculated, so it would be nice if they could be represented as an algorithm, instead of as a sampled function. Coordinates are often regularly spaced, so even a function as simple as a starting value and increment would very often be useful.

The other use case is when the sampling of coordinates stored are different from the sampling of the data values. The main use case is 2D satellite data, where the lat/lon points are stored every nth data point, presumably to save storage. The idea is to interpolate to the other points. The HDF-EOS library has this functionality. (Note that this can also be modeled as an algorithm.)
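As a rough sketch of what such an interpolation might look like (this is not existing CDM or HDF-EOS code, just a hypothetical illustration of rebuilding a 1D coordinate stored at every nth point):

public class CoordInterpolation {
  // Reconstruct a full-resolution coordinate from values stored every "stride" points,
  // interpolating linearly between stored samples and extrapolating past the last one.
  static double[] interpolate(double[] sampled, int stride, int fullLength) {
    double[] full = new double[fullLength];
    for (int i = 0; i < fullLength; i++) {
      int s = i / stride;                              // stored sample at or before i
      double frac = (i % stride) / (double) stride;
      if (s + 1 < sampled.length) {
        full[i] = sampled[s] + frac * (sampled[s + 1] - sampled[s]);
      } else {                                         // past the last stored sample
        double slope = sampled[sampled.length - 1] - sampled[sampled.length - 2];
        full[i] = sampled[sampled.length - 1]
                + slope * ((i - (sampled.length - 1) * stride) / (double) stride);
      }
    }
    return full;
  }

  public static void main(String[] args) {
    double[] lat = {10.0, 10.5, 11.0};                 // stored every 4th point (made-up values)
    for (double v : interpolate(lat, 4, 12)) System.out.printf("%.3f ", v);
    System.out.println();
  }
}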

We are going to try adding this functionality to the CDM library just as soon as I finish this Pina Colada and get back from scuba diving. Stay tuned.

Next: HDF5 Dimension Scales - Part 3

HDF5 Dimension Scales - Part 3


In this post I will show how the HDF5 library implements dimension scales, and in the next post I will show how the netCDF-4 file format implements shared dimensions. We will look at the low-level objects stored in the file. The intention is to document these details for software that wants to read or write this information outside of the netCDF-4 C library; none of this is needed to use the HDF5 or netCDF-4 APIs.

Let's first look at the HDF5 API for dimension scales. To create a dimension scale, use H5DSset_scale:

herr_t H5DSset_scale(hid_t dsid, char *dimname)

The dataset dsid is converted to a Dimension Scale dataset, as defined above. This creates the CLASS attribute, set to the value "DIMENSION_SCALE" and an empty REFERENCE_LIST attribute, as described in "HDF5 Dimension Scale Specification" (PDF, see section 4.2). If dimname is specified, then an attribute called NAME is created, with the value dimname.

hid_t dsid;          IN: the dataset to be made a Dimension Scale
char *dimname;       IN: the dimension name (optional), NULL if the dimension has no name.

The core of the functionality is in the H5DSattach_scale method:

herr_t H5DSattach_scale(hid_t did, hid_t dsid, unsigned int idx)

Define Dimension Scale dsid to be associated with dimension idx of Dataset did. Entries are created in the DIMENSION_LIST and REFERENCE_LIST attributes, as defined in section 4.2.

hid_t did;           IN: the dataset
hid_t dsid;          IN: the scale to be attached
unsigned int idx;    IN: the dimension of did that dsid is associated with.

If you look at a dimension scale with h5dump, you see something like this:

DATASET "time" {
 1# DATATYPE  H5T_STD_I32LE
   DATASPACE  SIMPLE { ( 100 ) / ( 100 ) }
 2# ATTRIBUTE "CLASS" {
      DATATYPE  H5T_STRING {
         STRSIZE 16;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "DIMENSION_SCALE"
      }
   }
 3# ATTRIBUTE "NAME" {
      DATATYPE  H5T_STRING {
         STRSIZE 5;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "time"
      }
   }
4# ATTRIBUTE "REFERENCE_LIST" {
      DATATYPE  H5T_COMPOUND {
         H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
         H5T_STD_I32LE "dimension";
      }
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): {
            DATASET 546 /data ,
            2
         },
      (1): {
            DATASET 1405 /_nc4_non_coord_sample ,
            0
         }
      }
   }
}

where the N# above are annotations that I've added:

  1. An integer dataset (aka variable) named "time" with 100 elements in it.
  2. An attribute named CLASS with value "DIMENSION_SCALE".
  3. An attribute named NAME with value "time".
  4. An attribute named REFERENCE_LIST with value a compound type with two elements.

These three attributes are added to the dataset by the H5DSset_scale method, turning it into a dimension scale. The values in the REFERENCE_LIST attribute are added by two calls to the H5DSattach_scale method. The first attaches the dimension scale to the second dimension of the dataset named "/data", and the next attaches it to the zeroth dimension of the dataset named "/_nc4_non_coord_sample".

If we h5dump one of the datasets in REFERENCE_LIST, we see for example:

DATASET "_nc4_non_coord_sample" {
   DATATYPE  H5T_STD_I32LE
   DATASPACE  SIMPLE { ( 100, 345 ) / ( 100, 345 ) }
     ATTRIBUTE "DIMENSION_LIST" {      
      DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): (DATASET 830 /time ), (DATASET 1114 /sample )
      }
   }

This tells us that the _nc4_non_coord_sample dataset has type integer and shape (100, 345). It has an attribute called DIMENSION_LIST whose value is an array of references to other datasets, namely, the "time" and the "sample" dataset. These are none other than our dimension scales. The DIMENSION_LIST attribute, like the REFERENCE_LIST attribute, is maintained by the HDF5 dimension scale API.

In summary, the HDF5 dimension scale API allows you to create associations between a specially marked dataset called a dimension scale, and one of the dimensions of any other arbitrary dataset. Because there are no restrictions on this association, we can't rely on this raw interface alone for defining shared dimensions. But next time we will see how the netCDF-4 format builds on this interface to implement shared dimensions.

Next: NetCDF-4 Dimensions and HDF5 Dimension Scales

NetCDF-4 Dimensions and HDF5 Dimension Scales


Last time we looked in detail at the internals of dimension scales in an HDF5 file. Now let's see what netCDF-4 shared dimensions look like at the file format level. Recall that in the netCDF data model, one defines Dimension objects, and uses those objects to define the shapes of variables; so dimensions are not optional in netCDF. Let's use the following example to illustrate the different ways shared dimensions are implemented using dimension scales:

   dimension:
      nvec = 3;
      time = 100;
      sample = 345;
      ship = 14;
      ship_strlen = 80;
   variable:
     float data(ship, sample, time, nvec);
     int time(time);
     int sample(time, sample);
     char ship (ship, ship_strlen);

Let's go through each dimension in this example:

  1. nvec is a dimension with no coordinate variable, used, perhaps, for a vector component.
  2. time is both a dimension and a coordinate variable.
  3. sample is a dimension and a data variable. It's not a coordinate variable because coordinate variables must be 1-dimensional, with one exception described next.
  4. ship is a dimension and a coordinate variable. It's a coordinate variable because:
    1. it's a char variable,
    2. it's two dimensional, and
    3. the inner dimension does not have a coordinate variable.
    That it is two-dimensional is really an artifact of the fact that the netCDF classic model doesn't have a string data type. In the extended model, one could use string ship(ship).
  5. ship_strlen is the string length of the ship variable. It really shouldn't be a shared dimension, but is an artifact of the fact that the netCDF data model does not have anonymous (i.e. non-shared) dimensions. I'm hoping that anonymous dimensions will be added to the extended netCDF model in the future. The CDM data model does have anonymous dimensions, in which case the CDL would look like: char ship(ship, 80). But the (variable length) string form is preferable if you aren't reading punch cards with Fortran 4.

We've already looked in detail at how dimension scales are represented by examining h5dump output. Now we look at the HDF5 objects resulting from the above example, but using a more compact notation:

 float nvec(3);
  :REFERENCE_LIST = null
  :CLASS = "DIMENSION_SCALE"
  :NAME = "This is a netCDF dimension but not a netCDF variable.         3"

float data(14,345,100,3);
  :DIMENSION_LIST = "ship", "sample", "time", "nvec"

int time(100);
  :CLASS = "DIMENSION_SCALE"
  :NAME = "time"
  :REFERENCE_LIST = null, null

float sample(345);
  :REFERENCE_LIST = null, null
  :CLASS = "DIMENSION_SCALE"
  :NAME = "This is a netCDF dimension but not a netCDF variable.       345"

int _nc4_non_coord_sample(100,345);
  :DIMENSION_LIST = "time", "sample"

char ship(14,80);
  :REFERENCE_LIST = null
  :CLASS = "DIMENSION_SCALE"
  :NAME = "ship"
  :_Netcdf4Coordinates = 3, 4

float ship_strlen(80);
  :CLASS = "DIMENSION_SCALE"
  :NAME = "This is a netCDF dimension but not a netCDF variable.        80"

(Note that the above is not CDL, but just a shorthand notation for the objects in the HDF5 files. Note also that the REFERENCE_LIST attributes above are not actually null; that's a limitation of my dump output. Since we don't need the contents I haven't bothered to show them.)

Looking at each one of these in turn:

  1. nvec is a dimension scale, because CLASS = "DIMENSION_SCALE". It defines the nvec dimension, but because there is no associated coordinate variable, the netCDF-4 library sets the attribute NAME to start with "This is a netCDF dimension but not a netCDF variable". This tells the library not to expose the dataset nvec as a netCDF variable.
  2. data is a data variable. The DIMENSION_LIST attribute unambiguously lists the dimensions that it uses.
  3. time is a dimension scale, and since it is also a coordinate variable, the library does expose the dataset time as a netCDF variable.
  4. sample is a dimension scale, but not a coordinate variable, so the NAME attribute starts with "This is a netCDF dimension but not a netCDF variable".
  5. sample is also a data variable with a name that conflicts with a dimension name. The netCDF-4 library modifies the HDF5 dataset name by prepending the string _nc4_non_coord_, and removes this string when constructing the netCDF variable sample.
  6. ship is a dimension and a 2D char coordinate variable. Since a 2D coordinate can only be a char coordinate, where the second dimension represents the string length, we know that the dimension that it represents must be the first one, with length 14. I'm not yet clear how the _Netcdf4Coordinates attribute is used, or if it's needed.
  7. ship_strlen is a dimension scale but not a coordinate variable.

So there you have the sanctum sanctorum of netCDF-4 dimension scales, as far as I have explored. If there are no further questions, you may resume your grooming, hurling fruit, and shrieking from the treetops.

WRF goes CF

The WRF model has an idiosyncratic way of storing coordinate information in its netCDF output files, such that one needs WRF-specific software to display WRF data. (Note that some users just deal with the GRIB output files from the WRF post-processor, which is a different animal).

Rich Signell made an NcML file that makes WRF files CF compliant. (See here for more info). What Rich is doing is adding coordinate system information, using CF attribute conventions. With these modifications, any CF-aware software can geolocate the data and, for example, display it with a map background, or overlay different data sources together (an IDV specialty).

The CDM has a Coordinate System Builder plug-in layer to identify coordinate systems, and it has such a plug-in for WRF netCDF output files. So any application that uses the CDM (like the IDV and ToolsUI) can geolocate WRF data without the need for the CF modifications.

Rich's NcML uses the 2D lat/lon coordinates that are in the WRF output file. The CDM synthesizes projection coordinates using the projection parameters that are stored in the global attributes. Here are pictures of the two results:
Figure 1: Lat/Lon Coordinates

Figure 2: Projection Coordinates

The first figure is on a lat/lon coordinate system, and one can see that the data covers a curved region in lat/lon coordinates. The second uses the data projection and therefore covers a rectangular region on that projection. We call this the natural projection of the data. It's often (but not always) also the coordinate system in which the numerical model does its computations. Please note that the data in the two figures is identical; we are just seeing differences in the way it is displayed.

The WRF group has expressed interest in making WRF output data CF compliant. Which way should they do it, with 2D lat/lon coordinates, or with 1D projection coordinates? If it were up to me, I'd use the 1D projection coordinates, because the lat/lon coordinates are derived from the projection coordinates, and because having the actual projection gives you more information than just having the lat/lon values of your grid. (One can also draw pictures faster when the pixels are square.) To be CF-compliant, however, one has to have lat/lon coordinates, even if they are 2D, so that software that doesn't know how to deal with the projection can still work. So the best thing to do would be to add both.

 

WRF does CF - Part Two


The previous blog showed two different ways to add coordinate information to WRF netCDF output files, one using Rich Signell's CF modifications, and the other using the CDM library's internal WRFConvention Java code. Let's look at the details of both with an eye towards advising the WRF group on how to make their files CF compliant.

1. Horizontal Coordinates

Rich's solution uses the existing 2D lat/lon fields for the horizontal coordinate reference system. There are three such sets: XLAT/XLONG on the center points of the grid, XLAT_U/XLONG_U for the U wind component, and XLAT_V/XLONG_V for the V wind component. These already have the correct CF units degree_north and degree_east, so the only thing one has to do is to reference these coordinates in the corresponding data variables. The NcML needed to do this looks like:

 <variable name="U" >
  <attribute name="coordinates" value="XLONG_U XLAT_U ZNU XTIME"/>
 </variable>
 <variable name="V" >
  <attribute name="coordinates" value="XLONG_V XLAT_V ZNU XTIME"/>
 </variable>
 <variable name="W" >
  <attribute name="coordinates" value="XLONG XLAT ZNW XTIME"/>
 </variable>
 ... 

Here we are taking 3 existing variables and adding the coordinates attribute to each. When added to the existing attributes, these variables now look like:

   float U(Time=1, bottom_top=27, south_north=60, west_east_stag=74);
     :FieldType = 104; // int
     :MemoryOrder = "XYZ";
     :description = "x-wind component";
     :units = "m s-1";
     :stagger = "X";
     :coordinates = "XLONG_U XLAT_U ZNU XTIME";

   float V(Time=1, bottom_top=27, south_north_stag=61, west_east=73);
     :FieldType = 104; // int
     :MemoryOrder = "XYZ";
     :description = "y-wind component";
     :units = "m s-1";
     :stagger = "Y";
     :coordinates = "XLONG_V XLAT_V ZNU XTIME";

   float W(Time=1, bottom_top_stag=28, south_north=60, west_east=73);
     :FieldType = 104; // int
     :MemoryOrder = "XYZ";
     :description = "z-wind component";
     :units = "m s-1";
     :stagger = "Z";
     :coordinates = "XLONG XLAT ZNW XTIME";

The CF coordinates attribute simply lists the coordinates for the variable. To be fully compliant, one must add a coordinates attribute to each data variable. So the above adds the correct horizontal coordinates to these 3 variables (we will cover the Z and time coordinates below), and the same needs to be done for the rest.

The other possibility is to add the projection and projection coordinates to the file. The projection information is in the global attributes, and the CDM WRF Convention Builder (ucar.nc2.dataset.conv.WRFConvention.java on our GitHub source site) takes this approach. This code actually uses a variation of CF called the  _Coordinate Attribute Conventions:

   char Lambert;
     :grid_mapping_name = "lambert_conformal_conic";
     :latitude_of_projection_origin = 34.83000564575195; // double
     :longitude_of_central_meridian = -98.0; // double
     :standard_parallel = 30.0, 60.0; // double
     :earth_radius = 6371229.0; // double
     :_CoordinateTransformType = "Projection";
 1)  :_CoordinateAxisTypes = "GeoX GeoY";

   double x(west_east=73);
     :units = "km";
     :long_name = "synthesized GeoX coordinate from DX attribute";
 2)  :_CoordinateAxisType = "GeoX";
 3)  :_CoordinateAliasForDimension = "west_east";

   double x_stag(west_east_stag=74);
     :units = "km";
     :long_name = "synthesized GeoX coordinate from DX attribute";
 2)  :_CoordinateAxisType = "GeoX";
 3)  :_CoordinateAliasForDimension = "west_east_stag";

   double y(south_north=60);
     :units = "km";
     :long_name = "synthesized GeoY coordinate from DY attribute";
 2)  :_CoordinateAxisType = "GeoY";
 3)  :_CoordinateAliasForDimension = "south_north";

   double y_stag(south_north_stag=61);
     :units = "km";
     :long_name = "synthesized GeoY coordinate from DY attribute";
 2)  :_CoordinateAxisType = "GeoY";
 3)  :_CoordinateAliasForDimension = "south_north_stag";

The Lambert variable is a dummy "container variable" for the projection. The next 4 variables are projection coordinates synthesized by the code from the WRF global attributes. The above attributes have the following meanings:

  1. The _CoordinateAxisTypes means use this projection on any variable that has a GeoX and GeoY coordinate.

  2. The _CoordinateAxisType unambiguously identifies the type of the projection coordinate, in this case either GeoX or GeoY which just mean the projection x or y coordinate.

  3. The _CoordinateAliasForDimension means associate this coordinate with any variable that uses the named dimension. This is a convenience so we don't have to annotate all the data variables by hand.

Since we actually want to use CF Conventions instead of the _Coordinate Conventions, here's the equivalent projection coordinates using CF Conventions:

   char Lambert;
     :grid_mapping_name = "lambert_conformal_conic";
     :latitude_of_projection_origin = 34.83000564575195; // double
     :longitude_of_central_meridian = -98.0; // double
     :standard_parallel = 30.0, 60.0; // double
     :earth_radius = 6371229.0; // double
   double x(west_east=73);
     :units = "km";
     :long_name = "synthesized GeoX coordinate from DX attribute";
     :axis = "X";
   double x_stag(west_east_stag=74);
     :units = "km";
     :long_name = "synthesized GeoX coordinate from DX attribute"; 
     :axis = "X";
   double y(south_north=60);
     :units = "km";
     :long_name = "synthesized GeoY coordinate from DY attribute";
     :axis = "Y";
   double y_stag(south_north_stag=61);
     :units = "km";
     :long_name = "synthesized GeoY coordinate from DY attribute";
     :axis = "Y";

Each data variable then needs a coordinates attribute and a grid_mapping attribute:

   float U(Time=1, bottom_top=27, south_north=60, west_east_stag=74);
     :FieldType = 104; // int
     :MemoryOrder = "XYZ";
     :description = "x-wind component";
     :units = "m s-1";
     :stagger = "X";
     :grid_mapping = "Lambert";
     :coordinates = "y x_stag z Time";
   float V(Time=1, bottom_top=27, south_north_stag=61, west_east=73);
     :FieldType = 104; // int
     :MemoryOrder = "XYZ";
     :description = "y-wind component";
     :units = "m s-1";
     :stagger = "Y";
     :grid_mapping = "Lambert";
     :coordinates = "y_stag x z Time";
   float W(Time=1, bottom_top_stag=28, south_north=60, west_east=73);
     :FieldType = 104; // int
     :MemoryOrder = "XYZ";
     :description = "z-wind component";
     :units = "m s-1";
     :stagger = "Z";
     :grid_mapping = "Lambert";
     :coordinates = "y x z_stag Time";
   ...

You have to go through each data variable and add the grid_mapping attribute and the correct coordinates attribute (the z and time coordinates are covered below). The main difference between the _Coordinate conventions and CF is that in CF, you must annotate each data variable separately.

2. Vertical Coordinates

Rich's NcML adds these variables to the file:  

  float ZNU(bottom_top=27);
     :FieldType = 104; // int
     :MemoryOrder = "Z  ";
     :description = "eta values on half (mass) levels";
     :units = "layer";
     :stagger = "";
     :positive = "down";
     :standard_name = "atmosphere_sigma_coordinate";
     :formula_terms = "ptop: P_TOP sigma: ZNU ps: PSFC";

   float ZNW(bottom_top_stag=28);
     :FieldType = 104; // int
     :MemoryOrder = "Z  ";
     :description = "eta values on full (w) levels";
     :units = "level";
     :stagger = "Z";
     :positive = "down";
     :standard_name = "atmosphere_sigma_coordinate";
     :formula_terms = "ptop: P_TOP sigma: ZNW ps: PSFC";

These are CF Dimensionless Vertical Coordinates. The CF spec explains in some detail how to use them, and application code can use these formulas to calculate pressure or height coordinates.
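For reference, the CF formula for atmosphere_sigma_coordinate is p(n,k,j,i) = ptop + sigma(k) * (ps(n,j,i) - ptop). Here is a minimal sketch of how application code might apply it for a single time step; the arrays and sample values are made up and assumed to have already been read from the file:

public class SigmaToPressure {
  // p[k][j][i] = ptop + sigma[k] * (ps[j][i] - ptop), per the CF sigma-coordinate formula.
  static double[][][] pressure(double ptop, double[] sigma, double[][] ps) {
    int nz = sigma.length, ny = ps.length, nx = ps[0].length;
    double[][][] p = new double[nz][ny][nx];
    for (int k = 0; k < nz; k++)
      for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i++)
          p[k][j][i] = ptop + sigma[k] * (ps[j][i] - ptop);
    return p;
  }

  public static void main(String[] args) {
    double[] znu = {0.9975, 0.9915};            // sample eta values (made up)
    double[][] psfc = {{101300.0, 101250.0}};   // surface pressure in Pa (made up)
    double ptop = 5000.0;                       // P_TOP in Pa (made up)
    System.out.println(pressure(ptop, znu, psfc)[0][0][0]);
  }
}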

The CDM code doesn't know enough to turn the z coordinates into CF Dimensionless Vertical Coordinates; instead it just creates generic Z coordinates:

  double z(bottom_top=27);
     :units = "";
     :long_name = "eta values from variable ZNU";
     :_CoordinateAxisType = "GeoZ";
     :_CoordinateAliasForDimension = "bottom_top";
   double z_stag(bottom_top_stag=28);
     :units = "";
     :long_name = "eta values from variable ZNW";
     :_CoordinateAxisType = "GeoZ";
     :_CoordinateAliasForDimension = "bottom_top_stag";
   double soilDepth(soil_layers_stag=4);
     :units = "units";
     :long_name = "soil depth";
     :_CoordinateAxisType = "GeoZ";
     :_CoordinateAliasForDimension = "soil_layers_stag"; 

However it includes the soilDepth z coordinate, which the NcML missed. So it looks like the right way to add CF vertical coordinates to this WRF file is:

 float ZNU(bottom_top=27);
     :FieldType = 104; // int
     :MemoryOrder = "Z  ";
     :description = "eta values on half (mass) levels";
     :units = "layer";
     :stagger = "";
     :positive = "down";
     :standard_name = "atmosphere_sigma_coordinate";
     :formula_terms = "ptop: P_TOP sigma: ZNU ps: PSFC";

   float ZNW(bottom_top_stag=28);
     :FieldType = 104; // int
     :MemoryOrder = "Z  ";
     :description = "eta values on full (w) levels";
     :units = "level";
     :stagger = "Z";
     :positive = "down";
     :standard_name = "atmosphere_sigma_coordinate";
     :formula_terms = "ptop: P_TOP sigma: ZNW ps: PSFC";

   double soilDepth(soil_layers_stag=4);
     :units = "units";
     :long_name = "soil depth";

and also put the appropriate z coordinate name into the data variables' coordinates attribute.

3. Time coordinate

The existing time coordinate in the WRF file looks like:

   float XTIME(Time=1);
     :FieldType = 104; // int
     :MemoryOrder = "0  ";
     :description = "minutes since simulation start";
     :units = "";
     :stagger = "";
  data:
  {720.0}

For CF, we want to turn this into a udunits compatible date variable, by adding a units attribute:

 <variable name="XTIME">
  <attribute name="units" value="minutes since 2000-01-24 12:00:00"/>
 </variable>

Coordinate variables are preferred, so I would recommend renaming the variable to Time:

   float Time(Time=1);
     :units = "minutes since 2000-01-24 12:00:00";
     :FieldType = 104; // int
     :MemoryOrder = "0  ";
     :description = "minutes since simulation start";
     :stagger = "";

 

4. Udunit compatible units

CF requires that data variables' units be udunit compatible. The CDM checks the units of the data variables and converts, if possible. This particular file had only a few such corrections:

 1.  "-" indicating dimensionless units must be an empty string in udunits 

 2.  "W m{-2}" on the NOAHRES variable gets converted to "W m-2"

 

In conclusion, we have outlined the changes needed to make WRF output into CF-compatible netCDF. Next time we will look at some of the features of WRF that aren't well handled in CF and/or by the CDM library.


Integrating a SAX Parser with the Bison Parser Generator


A SAX (Simple API for XML) parser is a particular mechanism for parsing XML documents. Using a SAX parser has the advantage over the DOM-based parser in that it is not necessary to build the explicit DOM tree. On the other hand, it can be difficult to build a SAX parser because it requires management of complex state.

Combining SAX parsing with a GNU Bison generated parser is appealing because it allows the Bison parser to manage all of the state. Additionally, the .y file encapsulates the equivalent of a DTD but in a much more readable form. The combination makes using SAX parsing a lot simpler.

The SAX parser operates by generating events representing "tokens" from the XML document. Consider for example, this document.

<element1>
<element2>
</element2>
</element1>
The SAX parser would typically generate the following events:
  1. startDocument
  2. startElement for element1
  3. startElement for element2
  4. endElement for element2
  5. endElement for element1
  6. endDocument
In practice, the set of possible events is more extensive, although for any given class of XML document, many of these events can be ignored.

The controlling SAX parser invokes callback procedures in a user-supplied handler class. For each event type, a specific method is called in the handler class. For an example of such a handler class, see the DefaultHandler documentation:

http://www.saxproject.org/apidoc/org/xml/sax/helpers/DefaultHandler.html

The Gnu Bison parser generator (http://www.gnu.org/software/bison/) has a notion called "push-parsing" where the parser is fed tokens one by one. This is essentially the same model as a SAX parser where the SAX events serve the role of tokens and the SAX parser is, in effect, a push-parser in Bison terms.

The current Bison parsers (versions 2.6.3 and earlier) do not support Java push parsing. However, the author has contributed a Java push parser skeleton to the Bison project that supports Java push parsing. This skeleton is not yet in any distributed release, but will appear at some point.

In the meantime, the required skeleton can be obtained at this URL.

http://www.unidata.ucar.edu/staff/dmh/lalr1.push
and can be used with Bison version 2.6.4 using the "bison -S" flag.

The primary interface to the Bison generated push-parser uses this method call.

public int push_parse(int yylextoken, Object yylexval)
where the yylextoken is the integer representing some Bison parser token and yylexval is some state information about that token.

SAX Parser Operation

The notional architecture stack for the SAX plus Bison parser looks like the following.
-------------------
Main
-------------------
SAX Parser
-------------------
Event Handler
-------------------
Bison push-parser
-------------------

Using a SAX parser with a Bison Java push-parser operates by creating a SAX parser and giving it an event handler specific to the Bison generated parser. The SAX parser is invoked, and as the SAX parser processes the XML document, it creates events and the handler is called for each event.
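A minimal sketch of that wiring, using the standard javax.xml.parsers API; the DefaultHandler here is a stand-in for the Bison-feeding handler described below, which is an assumption of this sketch rather than code from the original post:

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class Main {
  public static void main(String[] args) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();
    // In the real code this would be the handler whose startElement/endDocument
    // methods call bisonparser.push_parse(...), as shown later in this post.
    DefaultHandler handler = new DefaultHandler();
    saxParser.parse(new File(args[0]), handler);  // SAX events drive the handler
  }
}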

In turn, the handler does two things.

  1. The handler converts each event into a Bison token and an associated state object.

  2. The Bison parser's push_parse method is invoked with that Bison token and that object.

The push_parse method can return one of three values:

  • YYABORT – this signals that the parse failed. In turn, the handler should throw an exception to cause the SAX parser to stop.

  • YYMORE – this signals that the parse needs more input to continue.

  • YYACCEPT – this signals that the parse is complete. If the SAX parser continues to send events after the parser has accepted, then this should be treated as a parse error.

Any exceptions generated by either the event handler or the Bison push parser should be rewrapped as a SAXException or an IOException so that they propagate up the stack correctly.

Mapping Events to Tokens plus State

Consider this Bison grammar.
%token A_ _A B_ _B ATTR1 ATTR2
%start a
%%
a: A_ b _A ;

b: B_ attrlist _B

attrlist: /*empty*/ | attrlist ATTR1 | attrlist ATTR2 ;

The token A_ is intended to come from an occurrence of the element <A>. Similarly, the token _A is intended to come from an occurrence of the element </A>.

An example of an associated XML document might be as follows.

<A>
<B attr1="value">
</B>
</A>

Now consider the startElement event handling method. It might look like this.

01 public void
02 startElement(String nsuri, String name, String qualname, Attributes attributes)
03     throws SAXException
04 {
05     if(parseaccepted)
06         throw new SAXException("Events occur after parse is complete");
07     int token = 0;
08     if(name.equals("A"))
09         token = A_;
10     else if(name.equals("B"))
11         token = B_;
12     if(token == 0)
13         throw new SAXException("Unexpected element");
14     Lval state = new Lval();
15     state.name = name;
16     state.qualname = qualname;
17     state.nsuri = nsuri;
18     switch (bisonparser.push_parse(token,state)) {
19     case YYMORE: break;
20     case YYACCEPT: parseaccepted = true; break;
21     case YYABORT:
22         throw new SAXException("Parser aborted");
23     }
24     if(token == B_) {
25         // pass any attributes as tokens
26         int nattr = attributes.getLength();
27         for(int i=0;i<nattr;i++) {
28             token = 0;
29             String attributename = attributes.getLocalName(i);
30             if(attributename.equals("attr1"))
31                 token = ATTR1;
32             else if(attributename.equals("attr2"))
33                 token = ATTR2;
34             if(token == 0)
35                 throw new SAXException("Unexpected attribute");
36             String value = attributes.getValue(i);
37             state = new Lval();
38             state.attributename = attributename;
39             state.value = value;
40             switch (bisonparser.push_parse(token,state)) {
41             case YYMORE: break;
42             case YYACCEPT: parseaccepted = true; break;
43             case YYABORT:
44                 throw new SAXException("Parser aborted");
45             }
46         }
47     }
48 }

The first action is to test a flag (parseaccepted). If that flag is set, then the Bison parser has indicated that it thinks the parse is complete. Since that appears to differ from the "opinion" of the SAX parser, it is necessary to signal an exception (lines 5-6).

The next action (lines 8-11) is to convert the element name to a specific Bison token. The linear search shown can be improved using, say, a hash table to map the element name to a specific Bison token integer. If there is no name match, then an exception is thrown (lines 12-13).

To go along with the token, a state object (of class Lval, not shown) is created and any arguments to startElement that are of interest are copied into that Lval object (lines 14-17).
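The Lval class itself is not shown in the post; a minimal hypothetical sketch consistent with the fields used in the example (name, qualname, nsuri, attributename, value) might be:

// Hypothetical semantic-value class passed to push_parse(); only the fields
// referenced in the handler example are included.
public class Lval {
    // set for element tokens (A_, _A, B_, _B)
    public String name;
    public String qualname;
    public String nsuri;
    // set for attribute tokens (ATTR1, ATTR2)
    public String attributename;
    public String value;
}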

Now there is a token and associated state, so the Bison push_parse method can be called (line 18).

In lines 19-23, the return value from the push_parse method is tested. If YYABORT is returned, then an exception should be thrown. If YYACCEPT is returned, then the parseaccepted flag is set to prevent further calls to the event handler. If YYMORE is returned, then all is well, and the parse will continue.

There is an additional complication: the element may have associated attributes passed in as an argument to startElement. For each such attribute, a token and state object must be created and passed into the Bison push parser.

As with the element name, the attribute name is used to determine the corresponding Bison token (lines 29-33). If no match is found, then an exception is thrown (lines 34-35). Similarly, the attribute state is constructed from information taken from the attribute object (lines 36-39). Finally, the push parser is invoked with the attribute token and state as arguments (line 40) and the result checked (lines 41-44).

Other event handlers are similar in structure, although usually simpler. It is instructive to look at the endDocument event handling method.

01 public void endDocument()
02     throws SAXException
03 {
04     if(parseaccepted)
05         throw new SAXException("Events occur after parse is complete");
06     switch (bisonparser.push_parse(EOF,null)) {
07     case YYACCEPT: parseaccepted = true; break;
08     case YYMORE:
09         throw new SAXException("End Document: parser is asking for more input");
10     case YYABORT:
11         throw new SAXException("Parser aborted");
12     }
13 }

Since there should be no more events, the push parser is invoked with the EOF token to tell the Bison parser that this is the end of the token stream (line 6). The state is null because there are no arguments to endDocument.

The return result from the push parser must be YYACCEPT. Any other value indicates an error and an exception is thrown (lines 8-11).

At this point the originally invoked SAX parser should return with an indication of success (or it throws an exception).

Grammar File Format

The grammar file (the .y file) will require some initial declarations to support the use of a Bison push parser.
01 %define api.push-pull push
02 %code lexer {
03 public Object getLVal() {return null;}
04 public int yylex() {return 0;}
05 public void yyerror(String s) {System.err.println(s);}
06 }

Line 1 tells Bison that it should produce a push parser. Specifically, this will cause the generation of this method.

public int push_parse(int yylextoken, Object yylexval)

Lines 2-6 require some explanation. Technically, the Bison parser has no need for access to a lexer because any output from a lexer is passed into the Bison parser through the push_parse method arguments. However, because of a quirk in the way Bison constructs its parsers, the parser does need access to the lexer's yyerror procedure, so we do need to supply at least a stub lexer to the parser. This is handled by defining the lexer using the Bison %code lexer {...} mechanism. The only method in the lexer interface that will be called is yyerror, so the others are stubs. The actual action of yyerror can, of course, be defined as desired.

Invoking bison

Given a .y grammar file as specified above, the only special flag that is needed is -S. So bison is invoked as follows.
    bison -Ljava -S lalr1.push <.y file name>

Conclusion

Now that Bison can generate Java push parsers, SAX parsing becomes a lot simpler. Traditional SAX parsers require maintenance of complex state to track the progress of the parse. Here, the Bison generated parser does all that automatically. The result is a much more compact definition and implementation of a SAX parser.

Chunking Data: Why it Matters


What is data chunking? How can chunking help to organize large multidimensional datasets for both fast and flexible data access?  How should chunk shapes and sizes be chosen?  Can software such as netCDF-4 or HDF5 provide better defaults for chunking? If you're interested in those questions and some of the issues they raise, read on ...

Is anyone still there?  OK, let's start with a real-world example of the improvements possible with chunking in netCDF-4.  You may be surprised, as I was, by the results.  Maybe looking at examples in this post will help provide some guidance for effective use of chunking in other similar cases.

First let's consider a single large 3D variable from the NCEP North American Regional Reanalysis representing air temperature (if you must know, it's at the 200 millibars level,  at every 3 hours, on a 32.463 km resolution grid,  over 33 years from 1979-01-01 through 2012-07-31):

dimensions:
	y = 277 ;
	x = 349 ;
	time = 98128 ;
variables:
	float T(time, y, x);

Of course the file has lots of other metadata specifying units, coordinate system, and data provenance, but in terms of size it's mostly just that one  big variable: 9.5 billion values comprising 38 GB of data.

This file is an example of PrettyBigData (PBD).  Even though you can store it on a relatively cheap flash drive, it's too big to deal with quickly. Just copying it to a 7200 rpm spinning disk takes close to 20 minutes. Even copying to fast solid state disk (SSD) takes over 4 minutes. For a human-scale comparison, it's close to the storage used for a blu-ray version of a typical movie, about 50 GB.  (As an example of ReallyBigData (RBD), a data volume beyond the comprehension of ordinary humans, consider the 3D, 48 frame per second version of "The Hobbit, Director's Cut".)

Access Patterns make a difference

For now, let's ignore issues of compression, and just consider putting that file on a server and permitting remote access to small subsets of the data.  Two common access patterns are:

     
  1. Get a 1D time-series of all the data from a specific spatial grid point.
  2. Get a 2D spatial cross section of all the data at a specific time.

The first kind of access is asking for the 1D array of values on one of the red lines, pictured on the left, below; the second is asking for the 2D array of values on one of the green planes pictured on the right.

With a conventional contiguous (index-order) storage layout, the time dimension varies most slowly, y varies faster, and x varies fastest.  In this case, the spatial access is fast (0.013 sec) and the time series access is slow (180 sec, which is 14,000 times slower). If we instead want the time series to be quick, we can reorganize the data so x or y is the most slowly varying dimension and time varies fastest, resulting in fast time-series access (0.012 sec) and slow spatial access (200 sec, 17,000 times slower). In either case, the slow access is so slow that it makes the data essentially inaccessible for all practical purposes, e.g. in analysis or visualization. 
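To make the two access patterns concrete, here is a netCDF-Java sketch that reads one time series and one spatial slice from a file laid out like the example above. The file name and the particular indices are made up; Variable.read(origin, shape) is the CDM's array-subsetting call:

import ucar.ma2.Array;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

public class AccessPatterns {
  public static void main(String[] args) throws Exception {
    NetcdfFile ncfile = NetcdfFile.open("narr_T200.nc");  // placeholder file name
    try {
      Variable t = ncfile.findVariable("T");              // float T(time, y, x)

      // 1D time series at one (y, x) grid point: all 98128 times at y=133, x=174
      Array series = t.read(new int[] {0, 133, 174}, new int[] {98128, 1, 1});

      // 2D spatial slice at one time: the full 277 x 349 grid at time index 5000
      Array slice = t.read(new int[] {5000, 0, 0}, new int[] {1, 277, 349});

      System.out.println("series size = " + series.getSize()
          + ", slice size = " + slice.getSize());
    } finally {
      ncfile.close();
    }
  }
}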

But what if we want both kinds of access to be relatively fast?  Well, we could punt and make two versions of the file, each organized appropriately for the kind of access desired.  But that solution doesn't scale well.  For N-dimensional data, you would need N copies of the data to support optimal access along any axis, and N! copies to support optimal access to any cross-section defined by some subset of the N dimensions.

A better solution, known since at least 1983, is the use of chunking, storing multidimensional data in multi-dimensional rectangular chunks to speed up slow accesses at the cost of slowing down fast accesses.  Programs that access chunked data can be oblivious to whether or how chunking is used.  Chunking is supported in the HDF5 layer of netCDF-4 files, and is one of the features, along with per-chunk compression, that led us to jointly propose using HDF5 as a storage layer for netCDF-4 in 2002.

Benefits of Chunking

I think the benefits of chunking are under-appreciated.

Large performance gains are possible with good choices of chunk shapes and sizes.  Chunking also supports efficiently extending multidimensional data along multiple axes (in netCDF-4, this is called "multiple unlimited dimensions") as well as efficient per-chunk compression, so reading a subset of a compressed variable doesn't require uncompressing the whole variable.

So why isn't chunking more widely used?  I think reasons include at least the following:

     
  1. Advice for how to choose chunk shapes and sizes for specific patterns of access is lacking.
  2. Default chunk shapes and sizes for libraries such as netCDF-4 and HDF5 work poorly in some common cases.
  3. It's costly to rewrite big datasets that use conventional contiguous layouts to use chunking instead.  For example, even if you can fit the whole variable, uncompressed, in memory, chunking a 38GB variable can take 20 or 30 minutes.

This series of posts and better guidance in software documentation will begin to address the first problem.  HDF5 already has a start with a white paper on chunking.

The second reason for under-use of chunking is not so easily addressed.  Unfortunately, there are no general-purpose chunking defaults that are optimal for all uses.  Different patterns of access lead to different chunk shapes and sizes for optimum access.  Optimizing for a single specific pattern of access can degrade performance for other access patterns.

Finally, the cost of chunking data means that you either need to get it right when the data is created, or the data must be important enough that the cost of rechunking for many read accesses is justified.  In the latter case, you may want to consider acquiring a computing platform with lots of memory and SSD, just for the purpose of rechunking important datasets.

What a difference the shape makes

We have claimed that good choices of chunk shapes and sizes can make large datasets useful for access in multiple ways. For the specific example we've chosen, how well do the netCDF-4 library defaults work, and what's the best we can do by tailoring the chunking to the specific access patterns we've chosen, 1D time series at a point and 2D spatial access at a specific time?

Here's a table of timings for various shapes and sizes of chunks, using conventional local 7200 rpm spinning disk with 4096-byte physical disk blocks, the kind of storage that's still prevalent on desk-top and departmental scale platforms:

Storage layout, chunk shapes                     | Read time series (sec) | Read spatial slice (sec) | Performance bias (slowest/fastest)
Contiguous favoring time range                   | 0.013                  | 180                      | 14,000
Contiguous favoring spatial slice                | 200                    | 0.012                    | 17,000
Default (all axes equal) chunks, 4673 x 12 x 16  | 1.4                    | 34                       | 24
36 KB chunks, 92 x 9 x 11                        | 2.4                    | 1.7                      | 1.4
8 KB chunks, 46 x 6 x 8                          | 1.4                    | 1.1                      | 1.2

We've already seen the timings in the first two rows of this table, showing huge performance bias when using contiguous layouts. The third row shows the current netCDF-4 default for chunking this data, choosing chunk sizes close to 4 MB and trying to equalize the number of chunks along any axis.  This turns out not to be particularly good for trying to balance 1D and 2D accesses.  The last two rows show results of smaller chunk sizes, using shapes that provide a better balance between time series and spatial slice accesses for this dataset.

I think the last row of this table supports the main point to be made in this first posting on chunking data. By creating or rewriting important large multidimensional datasets using appropriate chunking, you can tailor their access characteristics to make them more useful.  Proper use of chunking can support more than one common query pattern.

That's enough for now.  In part 2, we'll discuss how to determine good chunk shapes, present a general way to balance access times for 1D and 2D accesses in 3D variables, say something about generalizations to higher dimension variables, and provide examples of rechunking times using the nccopy and h5repack utilities.

In later posts, we plan to offer opinions on related issues, possibly including

  • chunk size tradeoffs: small chunks vs. large chunks
  • chunking support in tools
  • chunking and compression
  • complexity of the general rechunking problem
  • problems with big disk blocks
  • worst case performance
  • space-filling curves for improving access locality

In the meantime, Keep On Chunkin' ...

A note about timings

The times we quoted above are averaged from multiple consecutive runs on a desktop Linux system (2.27GHz Xeon CPU, 4 cores, 24 GB of memory, 7200 rpm disk on a SATA-3 bus).  Each timing uses a different set of chunks, so we are not exploiting chunk caching.  Before reading a subset of data from a file, we run a command to flush and clear all the disk caches in memory, so running the timing repeatedly yields nearly the same time.  The command to clear disk caches varies from system to system.  On Linux, it requires privileges to set the SUID bit on a shell script:

#!/bin/sh
# Flush and clear the disk caches.
sync
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
On OSX, the only privilege required is knowledge of the "purge" command, which we wrap in a shell script to make our benchmarks work the same on OSX as on Linux:
#!/bin/sh
# Flush and clear the disk caches.
purge

 

Chunking Data: Choosing Shapes


In part 1, we explained what data chunking is about in the context of scientific data access libraries such as netCDF-4 and HDF5, presented a 38 GB 3-dimensional dataset as a motivating example, discussed benefits of chunking, and showed with some benchmarks what a huge difference chunk shapes can make in balancing read times for data that will be accessed in multiple ways.

In this post, I'll continue looking at that example dataset to see how we derived good chunk shapes, generalize to other datasets, look at how long it can take to rechunk a multidimensional dataset, and consider the use of Solid State Disk (SSD) for both accessing and rechunking data.

Chunk shapes and sizes: You can't always get what you want

Tiling of two-dimensional data is often used to explain chunking, because it's easy to understand and to illustrate with simple figures. Common patterns of subgrid access for a 2D array include:

  1. accessing data by rows
  2. accessing data by columns
  3. accessing a rectangular subgrid of data from somewhere in the middle of the array

If data is stored contiguously by rows, then accessing by rows is very fast, accessing by columns is slow, and the speed of accessing subgrids depends on details of how many rows and columns are involved and the shape of the subgrid.  That's all still true if you interchange the words "row" and "column".

For 2-dimensional data, if you want to support equally frequent access by either rows or columns, then a natural solution is chunking the data into rectangular chunks (also known as tiles) so that reading a row requires the same number of disk accesses as reading a column. You can treat each chunk as if it were a single disk block that must be read completely to access any of its data values. An optimum solution is to make the chunks similar in shape to the entire array, so that the same number of chunks are required to read an entire row or an entire column.

For example, if you have a 277 x 349 array of values (the shape of a horizontal slice in the example dataset in part 1), you could use chunks of size 28 x 35, so that 10 chunks would be adequate to read all the values in any row or any column.  Notice there's a little overhang in the last column and last row of chunks, which really cover 280 x 350 values, but that's typically handled in the library and not something you have to worry about.  To get all the values in a row, you're really reading more data than you want, since each chunk has values from 28 rows, but that's OK because it's typically the number of disk accesses that make I/O slow, not the number of values read.  Also, if you're reading successive rows, caching the 10 chunks for a row in memory means you only have to read each chunk once.

If your chunks are smaller than a disk block you will still have to read a whole disk block, so it doesn't make much sense to choose chunks significantly smaller than a disk block. The number of bytes in a 28 x 35 block of floats (4 bytes each) is 3920 bytes, which is actually pretty close to the 4096-byte blocks used in many desktop file systems, so 28 x 35 is a pretty good shape and size for the parameters of this problem (we're ignoring compressed chunks for now).
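
As a quick sanity check on those numbers, here's a small Python snippet (mine, not from the original post) that computes the chunk counts and chunk size for this 2D example:

# 2D example: 277 x 349 array of 4-byte floats, chunked as 28 x 35
ny, nx = 277, 349
cy, cx = 28, 35
val_size = 4

chunks_per_column = -(-ny // cy)   # ceiling division: 10 chunks span a column
chunks_per_row = -(-nx // cx)      # 10 chunks span a row
chunk_bytes = cy * cx * val_size   # 3920 bytes, just under a 4096-byte disk block

print(chunks_per_column, chunks_per_row, chunk_bytes)   # 10 10 3920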

Two-dimensional examples are often where the discussion of chunk shapes and sizes ends. However, the 2D case is too simple to generalize to 3D or higher, because equalizing access time along each axis is not necessarily the most common access pattern to be optimized in higher dimensions.  The real-world example presented in part 1 involved a (time, y, x) floating-point variable of shape 98128 x 277 x 349, in which we wanted to balance access to time series of shape 98128 x 1 x 1 and to horizontal slices of shape 1 x 277 x 349 so that neither type of access was ridiculously slow. If we just use chunk shapes of similar shape to the 3D variable, we might try 9813 x 28 x 35 chunks.  For that chunk shape, a time series will need 10 chunks but a horizontal slice will need 100 chunks, and take 10 times longer if the number of disk accesses is the predominant cost (as it usually is).

A little algebra can be applied to this 3D case, setting the number of chunks accessed for a 1D time series equal to the number of chunks accessed for a 2D horizontal slice, leading to chunks of shape

98128/N^2  by  277/N  by  349/N

where N^4 is the total number of chunks used to partition the array. It turns out not to matter how the chunks are distributed along the x and y axes, as long as their product, the number of chunks in a horizontal slice, is the same as the number of chunks in a time series. So the most general formula for optimal chunk shapes for this access pattern has an arbitrary positive constant C, with chunk shapes

98128/N^2  by  C*277/N  by  (1/C)*349/N

Here's source code for a little Python function, chunk_shape_3D, that computes a good shape for this access pattern of equally frequent 1D and 2D accesses of a 3D variable. You provide a variable's shape as a list of dimension sizes, a desired chunk size that the resulting chunks should be close to without exceeding, and the external size in bytes of each element of the variable.  The function returns a "good" chunk shape. The function handles some details not described here, such as converting ideal shapes with fractional dimensions to practical shapes with whole-number dimensions. 
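
The source code is linked from the original post; since the link isn't reproduced here, the following is only a rough sketch of the same idea (my approximation, not the actual chunk_shape_3D function; its rounding details differ, so it won't exactly reproduce the table below):

def chunk_shape_3D(var_shape, chunk_size=4096, val_size=4):
    """Rough sketch: candidate chunk shape for a (time, y, x) variable that
    balances 1D time-series and 2D horizontal-slice access, with chunks no
    larger than chunk_size bytes (val_size bytes per value)."""
    n_t, n_y, n_x = var_shape
    vals_per_chunk = chunk_size / float(val_size)
    # Ideal fractional shape is n_t/N^2 by n_y/N by n_x/N, where N^4 is the
    # total number of chunks, so solve for N from the desired chunk volume.
    num_chunks = (n_t * n_y * n_x) / vals_per_chunk      # this is N^4
    n = num_chunks ** 0.25
    c_y = max(1, int(n_y / n))
    c_x = max(1, int(n_x / n))
    c_t = max(1, int(vals_per_chunk / (c_y * c_x)))      # spend the rest of the budget on time
    return (c_t, c_y, c_x)

# For the 98128 x 277 x 349 variable with 4-byte values and 8 KB chunks:
print(chunk_shape_3D((98128, 277, 349), chunk_size=8192))   # e.g. (58, 5, 7)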

Examples of chunk shapes returned by the function are given in the following table for our 98128 x 277 x 349 variable, for various power-of-two chunk sizes that include the common cases of size 4 KB, 8KB, 1MB, and 4 MB, and assuming the variable values are 4 bytes each.


Desired chunk size (bytes) | Actual chunk size (bytes) | Chunk shape (values)
4096                       | 3960                      | 33 x 5 x 6
8192                       | 7728                      | 46 x 6 x 7
16384                      | 16384                     | 64 x 8 x 8
1048576                    | 1032000                   | 516 x 20 x 25
4194304                    | 4189920                   | 1032 x 29 x 35

 

A Generalization

Generalizing the 3-dimensional case to n-dimensional variables, I've written another little Python function named chunk_shape that returns good shapes for balanced access to 1D data along a distinguished dimension (such as for time series at an (n-1)-dimensional point) and (n-1)-dimensional slabs at specific values of the distinguished 1D dimension.

The complexity of the algorithm is exponential in the rank of the variable, so don't expect an answer in your lifetime for variables with 100 dimensions.  It seems to determine good shapes fast enough for variables with a reasonable number of dimensions, but I'm always interested in suggestions for improvements.

Rechunking? How Long Will That Take?

If you have some important datasets that are heavily accessed in complementary ways, such as the 1D and (n-1)D pattern presented here, and you're convinced that rechunking the data might be a good idea, the good news is that tools for rechunking are available: nccopy for netCDF-4 data and h5repack, which works for both HDF5 and netCDF-4 data.  Note that you can specify chunking (and compression) for classic-model netCDF data, so you don't have to make use of features of the netCDF-4 data model to get the benefits of chunking.
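
As a concrete example of what such a rechunking invocation might look like, here's a small Python wrapper that shells out to nccopy with its -c chunkspec option (just a sketch; the file names are hypothetical):

import subprocess

# Rechunk a (time, y, x) variable into 1032 x 29 x 35 chunks using nccopy's
# "-c dim/len,dim/len,..." chunking option.  File names are hypothetical.
subprocess.run(["nccopy", "-c", "time/1032,y/29,x/35",
                "contiguous.nc", "chunked.nc"], check=True)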

The bad news is that rechunking BigData, or even only PrettyBigData, can take a lot of time. The problem is analogous to transposing a matrix that's too big to fit in memory all at once. But don't despair. The time it takes to rechunk a big dataset is often only a small multiple of time to copy the data from one disk file to another, especially when you have a lot of memory. Maybe some day rechunking datasets will be a cloud service, which would be especially convenient if your big datasets are already in the cloud.

Another issue to consider is compression. If your data is compressed in netCDF-4 or HDF5, it has to be chunked, because a chunk is the atomic unit of compression as well as disk access. Rechunking compressed data means reading each compressed chunk, uncompressing it, rechunking the uncompressed data, recompressing the new chunks, and writing them out to disk. Believe it or not, that can be faster than doing the same rechunking with uncompressed data, due to savings in disk I/O for data that is very compressible.

Here are some benchmarks for rechunking our example dataset, using nccopy or h5repack: 

Source chunks   | Destination chunks | nccopy: disk, SSD (minutes) | h5repack: disk, SSD (minutes)
1032 x 29 x 35  | 516 x 20 x 25      | 7, 4                        | 99, 38
1032 x 29 x 35  | 64 x 8 x 8         | 10, 10                      | 134, 43
1032 x 29 x 35  | 46 x 6 x 7         | 11, 10                      | -, 46
1032 x 29 x 35  | 33 x 5 x 6         | 12, 14                      | -, 49

Though nccopy is faster than h5repack for netCDF files, it could probably be sped up significantly through use of  parallel-netcdf or HDF5 parallel I/O, both of which are available in netCDF-4.  Might be a good project for a summer intern ...

If you're going to be a frequent rechunker, you'd be wise to get a machine with lots of memory and maybe lots of SSD instead of spinning disk. But there are some surprises with SSD, as we'll see in the next section.

Running these tests has helped to provide some tips on rechunking that weren't obvious to me.  First, if you want to rechunk data quickly, what form of source is best, in terms of chunking and compression?  Possibilities include unchunked contiguous data, a few large chunks, or lots of small chunks, and in each case, is it better to use compressed or uncompressed data as the source?  Benchmarks with the 38 GB example dataset suggest a few answers.

First, rechunking from a contiguous layout is slow unless you can read the entire input file into memory.  Typically that's not practical, but you may have to start with contiguous data to get a chunked dataset with a few large chunks that you can use as source for experimenting with creating files with better chunk shapes.

It doesn't seem to matter much whether the input or output data is compressed, as I/O time will dominate the rechunking, as long as you have enough memory dedicated to chunk cache to hold both the compressed input and compressed output in memory.  If you lack that much memory, use of chunk cache may have to be tuned for optimum rechunking.  This is still somewhat of a dark art, but nccopy lets you specify how much memory to use for chunk caches as a command-line option.

Even with SSD, rechunking data takes a relatively long time.  Is it ever worth it?  I think it is, for important datasets that will be written once but read many times.  It's similar to justification for developing the new zlib-compatible zopfli compression algorithm, which is 100x slower than zlib for compression, but compresses 5% better, so it saves time on every access and is a win after about 20 accesses.  With the huge bias in access times discussed in part 1, rechunking is a win if it replaces only a few accesses in the "slow" order.

If Memory Serves: What's Going On with SSD?

If you do I/O intensive manipulations such as rechunking and if you can afford it, equip your machines with SSD in addition to spinning disk (but make sure your SSDs are designed to deal with power faults). How does using SSD compare with using conventional spinning disks?  We've tried them for:

  • accessing contiguous data
  • accessing chunked data
  • rechunking data

Serial access with SSD can easily be 4 or 5 times faster than spinning disks, but that's not the only speedup SSD provides. Using SSD with traditional contiguous storage can make chunking the data unnecessary, because random access is so much faster in SSD than spinning disk. Here's an example of average time to access time series and horizontal slabs on our example dataset:


Storage layout, chunk shapes       | Read time series (sec) | Read spatial slice (sec) | Performance bias (slowest/fastest)
Contiguous favoring time range     | 0.00003                | 0.00004                  | 1.3
Contiguous favoring spatial slice  | 53                     | 0.003                    | 18,000
1032 x 29 x 35                     | 1.2                    | 1.0                      | 1.2
64 x 8 x 8                         | 0.5                    | 0.3                      | 1.5
46 x 6 x 7                         | 0.6                    | 0.2                      | 2.4
33 x 5 x 6                         | 0.6                    | 0.3                      | 2.4

There's a bit of a mystery here. Why does using SSD leave only one form of access to unchunked data slow (the 53-second time-series read above), when spinning disk was slow for two forms of access to contiguous layouts? Stay tuned, or register your explanation as feedback below.

 And remember to keep your APIs separated from your IPAs ...

Introducing IDV 5.0 - Lynx!


Java3D has enabled beautiful 3D visualizations in the IDV. Unfortunately, the project is very lightly maintained and has put the IDV in quite the situation - if Java3D no longer works, then the IDV no longer displays. This has caused quite a bit of hand-wringing here at Unidata. Developer Julien Chastang was once quoted as saying "If we cannot find a replacement for Java3D, I'm going back to paper...origami, perhaps."

With that said, the IDV team is pleased to announce the IDV 5.0 release (code name Lynx). Yes, we understand we just released IDV 4.0u1 last week, but we've had a major breakthrough in the visualization technology we use on the backend which warrants a major version increase (talk about API breaks!).  We've successfully bolted a new backend onto the VisAD layer that will live on forever, while still preserving all of the functionality our users are used to. For example, with the Java3D backend, our users would see the following 3D global image:

 

 IDV 4.0u1 with Java3D

 

Now, with the Lynx release, we get beautiful images such as this without relying on Java3D (same visualization as above): 

 

IDV 5.0 Lynx preview

 

Note that we are still working on the scaling algorithm, but this should be fixed in 5.0u1. The new zoom capabilities afforded by the new backend are fantastic:

 

 

 

For users that struggled with the colorbars in previous versions of the IDV, we feel they will agree that the Lynx backend results are far superior. Jeff Weber (Unidata TV personality) commented that the new colorscale is much more intuitive: "The new visualization backend is much easier to use. For example, see that little pocket of moisture over the desert southwest? Not the '=~~=', but the '??=z?7'? This is much better than a magenta color. I really think our users will dig this new interface! Think of all the new meteorological jargon this will create. How many new acronyms and initialisms will be birthed from IDV Lynx?! I think we just kicked it to the next level!"

It should be noted that lead IDV developer Yuan Ho was a bit more reserved in his enthusiasm for the new release, but he said "while this new backend is great, I feel its greatest strength is that it will usher in a new era of teletype-driven analysis. Retro is the thing, and at least we are on the bleeding edge of the movement." Developer Sean Arms loves the new IDV, but is really stuck on the code name Lynx: "Cats are so overdone these days. Since this release is dubbed as being 'retro', I felt like this release should be named IDV Hipster...I mean, come on, our community was using teletype before it was cool...most students have probably never heard of it. I could see IDV tee-shirts based on this hipster theme...it's golden! But, you know, Lynx are pretty cool cats, I guess...but Ligers are better."

Speaking of retro, we would also welcome feedback on a new feature that we would like to include in the IDV 5.1 release (currently scheduled to be released next year, 04-01-2014) - we are calling it GarpDV, a GARP-inspired GUI front-end for the IDV. Thoughts, comments, and suggestions should be directed to support-idv-AT-unidata.ucar-DOT-edu.

Happy April 1st!

The IDV Development Team 

An Essay on Domain Specific Models

A domain specific model is one which has constructs that are specific to a particular domain. Examples are the CDM Coordinate System and Scientific Feature Type models, which augment CDM with features such as lat-lon based indexing, grids, and point data.

It is often the case that a domain specific model is implemented by providing an API that in turn is implemented with respect to some underlying, more generic data model. Again, CDM is an example, where the Coordinate System and Scientific Feature Type models are implemented using the underlying CDM Data Access Layer model.

DAP4 also provides a good, generic model capable of supporting domain specific models. In fact, it should be possible to implement the equivalent of the CDM Coordinate System and Scientific Feature Type models on top of DAP4.

Figure 1. Notional Domain Model over DAP4 Architecture

Figure 1 shows a notional architecture for using DAP4 as the basis for domain specific modeling.

The client is given an API supporting a domain specific model. It is assumed here that an instance of the domain model meta-data is represented as a traversable abstract syntax tree. So, the model API would provide the following operations.

  1. Request the meta-data for a specific dataset. The result would be a reference to the root of the abstract tree for that meta-data. This is analogous to asking DAP2 for a DDS.
  2. Traverse the meta-data tree.
  3. Request some subset of the data associated with the dataset. The request would be in terms of the domain-specific model.

The domain-specific library implementing the API would be responsible for making requests to the server for information. Those requests would indicate to the server the domain model being used as well as the dataset being requested.
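
To make the shape of such an API concrete, here is a purely notional Python sketch of the three client-side operations listed above (all names are invented for illustration; nothing here corresponds to an actual CDM or DAP4 interface):

from abc import ABC, abstractmethod

class DomainModelAPI(ABC):
    """Notional client-side interface for a domain specific model layered
    over a generic protocol such as DAP4.  All names are illustrative."""

    @abstractmethod
    def get_metadata(self, dataset_url):
        """Return the root of the abstract syntax tree describing the
        dataset's metadata (analogous to asking DAP2 for a DDS)."""

    @abstractmethod
    def children(self, node):
        """Traverse the metadata tree by returning the children of a node."""

    @abstractmethod
    def get_data(self, dataset_url, domain_query):
        """Request a subset of the data, with the request expressed in terms
        of the domain specific model (e.g. a space-time bounding box)."""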

The reply from the server would be in the form of a DAP4 DDX and data with annotations (i.e. attributes and extra variables) sufficient to allow the client-side domain model library to convert the returned information to the domain model for presentation to the client.

The server side operation is similar. It is assumed that a request from a client contains sufficient identifying information to allow the server (a servlet server such as Tomcat) to forward the request to the servlet capable of interpreting such a domain specific model request.

The domain specific model servlet is capable of translating the dataset into DAP4 as the reply to the client's request. As expected by the client, the reply is annotated with sufficient information to allow the generic DAP4 reply to be converted to the domain specific model.

Dennis Heimbigner


Testing 13.2.1 Unified Grib Decoder on CONDUIT GFS


Last month we received the first version of AWIPS II which included the new unified grib decoder (13.1.2). The install procedure for 13.1.2 was more complicated than usual - we needed the full 13.1.1 installation plus a 13.1.2 "update" - so around 8 GB of RPMs to manage. 

If you're unfamiliar with what the unified grib decoder is, here's a quick rundown: before 13.1.2, the D2D perspective (for WFOs) and the National Centers Perspective (for NCEP centers) required separate data decoders and database tables for grib messages. D2D used a decoder called grib, while NCP used a decoder called ncgrib. If you didn't want to bog down your system, you could only run one at a time, meaning: depending on your server configuration, gridded model data would only be visible in one perspective, not both.

This first version of the unified grib decoder had a number of problems, likely from the complicated installation procedure (we do it a little differently here compared to an operational forecast office).

Now with the full AWIPS II 13.2.1 release from the NWS, I can finally test our method for increasing the number of decoder threads on CONDUIT data. This is the same method used to handle ingest of the entire NEXRAD3 feed (~190 radar sites), and is described in detail on the page linked at the bottom of this post.

As delivered, AWIPS II will take roughly 4 hours to decode the high-resolution CONDUIT 0.5 degree GFS with 4 threads, and the message broker bottleneck would get worse with products simultaneously being decoded by other plugins (obs, satellite, radar, etc.)

Unidata's solution to this is to increase the allocated number of threads for the grid (formerly called grib) decoder, a task which involves either re-building the EDEX core RPMs, or editing the already-installed plugin-specific jar archives.    I managed to incorporate the new decoder thread settings into an add-on RPM which I have added to the Unidata AWIPS II release (not available to the public at this time, sorry).

Initial results are promising: the total time to ingest and decode the 0.5 degree GFS on CONDUIT (25k files, ~3.6 GB) was just over 1 hour.


I let this run for a few days to make sure the Qpid message queue remained active, as in the past I've noticed that high-volume grib message decoding would sometimes bottleneck the system even though the dataflow throughput was low.  Here, it seems, the system can more than handle the 0.5/2.5 degree CONDUIT GFS.

Of course ingesting CONDUIT grids alone is one thing, ingesting them alongside point, satellite and radar data is another. That's the next step.

For more info on modifying the number of decoder threads, see: http://www.unidata.ucar.edu/staff/mjames/awips2/docs/threads.html

-Michael


Update on 13.2.1 Grib Decoder Threads


Since the last update, which involved testing only the ingest and decoding of CONDUIT 0.5/2.5 degree GFS, I've opened up the NGRID and NEXRAD3 feeds, as well as text and satellite products from the WMO and NIMAGE feeds, respectively.

 The goal is to compare the speed of the grid decoder on high-resolution CONDUIT GFS runs alone versus running in parallel with the full nationwide NEXRAD3 feed and other products.

 So far so good.  I was using 8 grid decoder threads up until roughly April 26 0000 UTC, after which I increased the count to 12.  While there is a noticeable decrease in the decoding latency (maxing out in the 1000-2000 second range rather than 2000-3000 seconds), this could just as well be caused by a reduction in file size of the GFS given the absence of activity in the atmosphere since April 26 compared to the previous week.

  

 Regardless of the reason, we have shown that the payoff from increasing the thread count to 12 from 8 is diminishing: we're being limited more by the Raw Data Store write times than we are by the time it takes the grid decoder to process files.

Even so, the total decoding time is good: if 1500 seconds (25 minutes) is the longest latency time for the high-resolution GFS, we're in pretty good shape.  For comparison, using GEMPAK grib2 decoders with the LDM, the typical setup for GEMPAK users these days, the entire 0.5 degree GFS run takes about 70 minutes to be fully decoded and available on disk.  For AWIPS II EDEX, this time is roughly 90 minutes.  That is encouraging. 

Displaying Ensemble Grids


What types of Ensemble data are there?

  • control (initial analysis and/or forecast)
  • member (perturbations of the control run)
  • average (computed from members)
  • spread (computed from members)
  • probabilistic (computed from members)

Special ensemble functions

GEMPAK provides a special set of functions, all of which are named beginning with ENS_, to do specific calculations over multiple members of an ensemble. The constitution of the ensemble is specified as a GDFILE entry by listing file names and aliases, separated by commas and enclosed in curly brackets {}. For specific functions available see the GPARM online documentation.

Example

The GFS model provides deterministic output at the 72 hour forecast time for 6 hour accumulated precipitation (P06M) and boundary layer CAPE as shown in GDPLOT2 using:

GDFILE   = gfs004
GDATTIM  = f072
GLEVEL   = 0 ! 180:0
GVCORD   = none ! pvbl
SCALE    = 0
GDPFUN   = p06m ! cape
TYPE     = f ! c
CONTUR   = 3/3
CINT     = 300
LINE     = 2/1/2
FINT     = .25;2.5;6.35;12.7;19.05;25.4;31.75;38.1;44.45;50.8;63.5;76.2;101.6;127;152.4;177.8 
FLINE    = 0;21-30;14-20;5 
GAREA    = us
PROJ     = STR/90;-100;0  

The figure above shows several large areas of precipitation with low CAPE values. We also see several areas with large CAPE values and little precipitation.

By utilizing the ensemble members, we can quantify the probability of precipitation exceeding .25mm (red contour lines), and CAPE values exceeding 500 J kg^-1 (yellow shading) using 20 members of the global ensemble forecast system (gefs) in GDPLOT2 using:

GDFILE   = {gefs}
GDATTIM  = f072
GLEVEL   = 0 ! 180:0
GVCORD   = none ! pvbl
SCALE    = 0
GDPFUN   = ens_prob(gt(p06m,.25)) ! ens_prob(gt(cape,500))
TYPE     = c ! f
CONTUR   = 3/3
CINT     = 0.2
LINE     = 2/1/2
FINT     =  ! .5;1.2
FLINE    =  ! 0;5/7 

Observing the plot above, we can visually detect several regions where precipitation probability and CAPE values might suggest likely areas of thunderstorm activity where the two contour regions intersect. We can quantify the combined probability by using the logical operator AND() to compute the combined probability of both conditions as shown below:

GDFILE   = {gefs}
GDATTIM  = f072
GLEVEL   = 0
GVCORD   = none
SCALE    = 0
GDPFUN   = ens_prob(and(gt(p06m,.25),gt(cape@180:0%pvbl,500)))
TYPE     = f
CONTUR   = 3/3
CINT     = 
LINE     = 
FINT     = .1/.1
FLINE    = 0;23-13/7

Creating and Using a Region-Of-Interest Mask


A vgf file is created using the interactive product generation tools within NMAP2. A closed line is drawn enclosing the region of interest, and a text label is grouped with the line assigning a value of 1 to the contour.

Once the region of interest is defined, a grid file can be created using the GRPHGD standalone program (It can also be created using the graph-to-grid option in NMAP2).

Create grid from VGF file with GRPHGD

GEMPAK-GRPHGD>
GDOUTF   = 2007071900_cmask.grd
GUESS    =  
PROJ     = MER
GRDAREA  =  
KXKY     = 10;10
MAXGRD   = 200
CPYFIL   = #A218
ANLYSS   =  
CNTRFL   = cmask.info
GDATTIM  = 070719/0000f000
GFUNC    = cmask
GLEVEL   = 0
GVCORD   = none
KEYCOL   =  
KEYLINE  =  
OLKDAY   =  
GGLIMS   =  
HISTGRD  = NO
BOUNDS   =  
TYPE     = C
GAMMA    = 0.3
SEARCH   = 20
NPASS    = 2
QCNTL    =  
GUESFUN  =  
CATMAP   =  
DISCRETE =  
DLINES   = yes;no|-0.5
GGVGF    = conusoutline_cmask.vgf
EDGEOPTS =
GEMPAK-GRPHGD>r

Check with GDINFO

GEMPAK-GDINFO>
GDFILE   = 2007071900_cmask.grd
LSTALL   = YES
OUTPUT   = T
GDATTIM  = all
GLEVEL   = 0
GVCORD   = none
GFUNC    = cmask
GEMPAK-GDINFO>r

GRID FILE: cmask.gem                                                                                           

GRID NAVIGATION: 
    PROJECTION:          LCC                 
    ANGLES:                25.0   -95.0    25.0
    GRID SIZE:          614 428
    LL CORNER:              12.19   -133.46
    UR CORNER:              57.33    -49.42

Number of grids in file:     1

 NUM       TIME1              TIME2           LEVL1 LEVL2  VCORD PARM
   1     070719/0000F000                          0         NONE CMASK       
GEMPAK-GDINFO>

The parameter CMASK is created where values greater than 1 are enclosed by the contour. DLINES defines the epsilon (here -0.5 is used since the contour was drawn counterclockwise) added to values on either side of the single contour to define them as greater or less than the contour value of 1.0.

The grid point values of SGT(cmask,1) (greater than 1.0) are shown here:

Plot with GDPLOT2

The resultant grid can be used with the MASK() function and logical operators to define mask or clipping regions of interest. As an example, the 24 hour precipitation in the top panel is masked by the region of interest in the lower panel so that only data within the region of interest will be considered.

GDFILE  = nam12 + 2007071900_cmask.grd                                                                                 
GDATTIM  = f030
GLEVEL   = 0
GVCORD   = none
PANEL    = t     ! b
SKIP     = 0
SCALE    = 0
GDPFUN   = p24i  ! mask(p24i,sgt(cmask^070719/0000f000+2,1))
TYPE     = f
CONTUR   = 3/3
CINT     = 0
LINE     = 2/1/2
FINT     = .01;.1;.25;.5;.75;1;1.25;1.5;1.75;2;2.5;3;4;5;6;7;8;9
FLINE    = 0;21-30;14-20;5
HILO     =  
HLSYM    =  
CLRBAR   = 1
WIND     = BM1
REFVEC   =  
TITLE    = 1
TEXT     = 0.7/2/SW
CLEAR    = YES
GAREA    = us
IJSKIP   =  
PROJ     = STR/90;-100;0

Using GDCSV to find local maxima

The HIGH() function can be used to obtain local maxima from the grid region of interest. The GWFS() gaussian weighted smoothing function can be used to reduce higher frequency features in the grid and focus on broader areas of interest. By using the GDCSV program, the locations of the HIGH() output are written to a file for use in determining local mesoscale model domain centers. By masking the region, the model domains are ensured to be within the desired region.

GEMPAK-GDCSV>l
GDATTIM  = f030
GDFILE   = nam12 + 2007071900_cmask.grd
GLEVEL   = 0
GAREA    = grid
PROJ     = def
GVCORD   = none
GFUNC    = high(mask(gwfs(p24i,40),sgt(cmask^070719/0000f000+2,1)),30)
SCALE    = 0
OUTPUT   = f/p24i_highs.dat
GEMPAK-GDCSV>r

OUTPUT is to the text file p24i_highs.dat. You can use the sort command to return the top two local maxima:

sort -t, -k 5bnr p24i_highs.dat | head -2

 508,     244,        42.3889,       -72.2385,        2.53370
 338,     236,        42.9560,       -96.3721,        0.58850

These two values are now scriptable (use | head -1 for the first and | head -2 | tail -1 for the second) both for plotting regional areas of interest and for WRF domains. A WRF lesson is outside the scope of this document, but as an example of what is run at Unidata, here are the primary and secondary regions.

Plotting again with GDPLOT2 and GPANOT

The locations of the 2 greatest local maxima (42.3889,-72.2385 and 42.9560,-96.3721) are shown along with the precipitation forecast field. The primary and secondary domains can be drawn using GPANOT to overlay the boxes.

First plot masked precipitation, same as above, but change PANEL = 0 and CLEAR = n

GDFILE   = nam12 + 2007071900_cmask.grd                                                                                 
GDATTIM  = f030
GLEVEL   = 0
GVCORD   = none
PANEL    = 0
SKIP     = 0
SCALE    = 0
GDPFUN   = mask(p24i,sgt(cmask^070719/0000f000+2,1))
TYPE     = f
CONTUR   = 3/3
CINT     = 0
LINE     = 2/1/2
FINT     = .01;.1;.25;.5;.75;1;1.25;1.5;1.75;2;2.5;3;4;5;6;7;8;9
FLINE    = 0;21-30;14-20;5
HILO     =  
HLSYM    =  
CLRBAR   = 1
WIND     = BM1
REFVEC   =  
TITLE    = 1
TEXT     = 0.7/2/SW
CLEAR    = YES
GAREA    = us
IJSKIP   =  
PROJ     = STR/90;-100;0
CLEAR    = n
r

NetCDF4 use of dimension scales


In a previous blog, I described the internals of an HDF5 file that uses netCDF4 shared dimensions. While that description remains valid, I've discovered some holes in my implementation, and I have some new thoughts on how shared dimensions could be done in a simpler way. Please turn off your cell phones.

My current algorithm for finding shared dimensions goes like this (a rough sketch in code follows the list):

  1. Find all variables (datasets in HDF5 parlance) with attribute CLASS = "DIMENSION_SCALE". These are the dimension scales, and they correspond 1-1 with netCDF4 dimensions. So for each dimension scale, make a dimension using the variable's name as the dimension name, and using the variable's shape(0) as dimension length.
  2. Find all variables with attribute CLASS = "DIMENSION_LIST". These are the data variables, and the DIMENSION_LIST contains a list of the shared dimension names used by that data variable.
  3. When a dimension scale is also a coordinate variable, some special processing has to happen, because the "DIMENSION_LIST" is not present.  There are two cases:
    1. The dimension scale has rank 1 (is 1 dimensional). This is the easy and common case, since it means that it has one dimension with the same name as itself.
    2. The dimension scale has rank 2. These are type char coordinate variables. Its first dimension has the same name as itself, but its second dimension is tricky to find. We know the length of it, so the algorithm I'm using is to look through the dimensions and match on length. If it's unique, use that dimension. If not unique, then create an anonymous dimension. I may modify this later to find the real shared dimension.
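
As a rough illustration of steps 1 and 2 above (my sketch using h5py, not the netCDF-Java implementation; the special handling of 2D char coordinate variables in step 3 is left out), the dimension scales and data variables in a netCDF-4 file can be listed like this:

import h5py

def list_dims_and_vars(path):
    """Print HDF5 datasets that act as netCDF-4 dimensions (CLASS ==
    "DIMENSION_SCALE") and those that act as data variables (have a
    DIMENSION_LIST attribute).  Only covers steps 1 and 2 above."""
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            if not isinstance(obj, h5py.Dataset):
                return
            if obj.attrs.get("CLASS") in (b"DIMENSION_SCALE", "DIMENSION_SCALE"):
                print("dimension:", name, "length", obj.shape[0])
            elif "DIMENSION_LIST" in obj.attrs:
                # Each entry is a list of object references to dimension scales.
                dims = [f[refs[0]].name for refs in obj.attrs["DIMENSION_LIST"]]
                print("variable: ", name, "dims", dims)
        f.visititems(visit)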

If you actually read that last part, you may note that the complication is mostly with finding the second dimension of a 2D dimension scale. There are a lot of ironies with this. The first is that this case probably should just be handled by creating a separate 1D dimension scale that is not also a variable, and a data variable that is not a dimension. The second is that the 2D char coordinate variables are usually really string valued coordinates, and if there were string types in the classic model, then we wouldn't need the second troublesome dimension. The third irony is that this second dimension never needs to be shared and should be anonymous, which netCDF4 does not currently have.

Another wrinkle is that the netCDF4 library uses "dimension ids" to identify the dimensions used, using _Netcdf4Dimid and/or _Netcdf4Coordinates internal attributes. However, AFAICT the ids are not well defined, in the sense that they are not stored in the file format, but are the index into a list of dimensions that relies on some (apparently undocumented) ordering that the HDF5 C library imposes. So unless I am misunderstanding this, I can't use those ids in a pure Java library that doesn't have access to the HDF5 C library. If I am correct, then this is an example where the file format and the reference library have gotten confused, a mistake we library writers often make.

After working with this issue again, I realized that in principle an easier thing to do is to just put a DIMENSION_LIST on any variable that is supposed to be a data variable. For completeness, here is a proposal to implement netCDF4 shared dimensions with HDF5 dimension scales:

  1. an object may have one or both "DIMENSION_SCALE" and "DIMENSION_LIST" attributes. 
  2. an object that has the "DIMENSION_SCALE" attribute defines a dimension of length shape(0).  
  3. an object that has the "DIMENSION_LIST" attribute defines a data variable.
  4. a dimension must exist in the same group or a parent group of a variable that uses it.

I think this would cover the matter, and is significantly simpler than what is implemented now. 

However, this proposal doesn't try to capture the creation order of the dimensions, which is one of the purposes of the current  _Netcdf4Dimid and _Netcdf4Coordinates internal attributes. The netCDF java library currently ignores creation order, while the netCDF C library preserves creation order. We are debating whether that's an acceptable state of affairs.


Accessing netCDF Data by Coordinates


Should array indices start at 0 or 1? My compromise of 0.5 was rejected without, I thought, proper consideration. --Stan Kelly-Bootle

Introduction

Library software like netCDF or HDF5 provides access to multidimensional data by array indices, but we would often rather access data by coordinates, such as points on the Earth's surface or space-time bounding boxes.

The Climate and Forecast (CF) Metadata Conventions provide agreed-upon ways to represent coordinate systems, but do not address making access by coordinates practical and efficient. In particular, use of coordinate systems such as those described in the CF Conventions section "Two-Dimensional Latitude, Longitude, Coordinate Variables" may not provide a direct way to find array indices from latitude and longitude coordinates for various reasons. Examples of such data include:

  • use of curvilinear grids that follow coastlines
  • satellite swaths georeferenced to lat-lon grids
  • coordinate systems with no CF grid_mapping_name attribute
  • ungridded "station" data, such as point observations and soundings

Using a real-world (well, OK, an idealized spherical Earth) example, we'll see several ways to access data by coordinates, as well as how the use of appropriate software and data structures can greatly improve the efficiency of such access.

An example dataset

The example dataset we use is from the HYbrid Coordinate Ocean Model (HYCOM).

This dataset makes a good example for a couple of reasons:

  • It's typical of many ocean, climate, and forecast model data products, with multiple variables on the same grid, and conforming to the CF Conventions for representing a coordinate system.
  • It has a non-trivial geospatial coordinate system, represented by 2D latitude and longitude arrays that are auxiliary coordinates for other variables. The grid appears to be a subset of a tri-polar grid, but none of the parameters needed to derive a formula for which elements correspond to a specified coordinate are in the file.

Though not very relevant for the subject of this blog, the example dataset also makes good use of compression, with fill values over land for ocean data.  The netCDF classic file is 102MB, but the equivalent netCDF-4 classic model file with compression is only 30MB. Nothing in this post has anything to do with the netCDF-4 data model; it all applies to classic model netCDF-3 or netCDF-4 data.

Here's some output, relevant to the problem we'll present, from "ncdump -h" applied to the example file:

dimensions:
   MT = UNLIMITED ; // (1 currently)
   Y = 850 ;
   X = 712 ;
   Depth = 10 ;
variables:
   double MT(MT) ;
           MT:units = "days since 1900-12-31 00:00:00" ;
   double Date(MT) ;
           Date:units = "day as %Y%m%d.%f" ;
   float Depth(Depth) ;
           Depth:units = "m" ;
           Depth:positive = "down" ;
   int Y(Y) ;
           Y:axis = "Y" ;
   int X(X) ;
           X:axis = "X" ;
   float Latitude(Y, X) ;
           Latitude:units = "degrees_north" ;
   float Longitude(Y, X) ;
           Longitude:units = "degrees_east" ;
   float temperature(MT, Depth, Y, X) ;
           temperature:coordinates = "Longitude Latitude Date" ;
           temperature:standard_name = "sea_water_potential_temperature" ;
           temperature:units = "degC" ;
   float salinity(MT, Depth, Y, X) ;
           salinity:coordinates = "Longitude Latitude Date" ;
           salinity:standard_name = "sea_water_salinity" ;
           salinity:units = "psu" ;

Plotting all of the latitude-longitude grid would result in a big blob of 600,000 pixels, but here's a sparser representation of the grid, showing every 25th line parallel to the X and Y axes:

Sparse grid lines near Alaska

Examples of coordinate queries

The temperature variable has auxiliary Longitude and Latitude coordinates, and we want to access data corresponding to their values, rather than X and Y array indices. Can we efficiently determine values on netCDF variables such as temperature or salinity at, for example, the grid point closest to 50 degrees north latitude and -140 degrees east longitude?

More generally, how can we efficiently

  • determine the value of a variable nearest a specified location and time?
  • access all the data on a subgrid defined within a specified space-time coordinate bounding box?

Why use an iPython notebook for the example?

For this example, the code for reading netCDF data and mapping from coordinate space to array index space is presented in an iPython notebook. Why not C, Java, Fortran, or some other language supported directly by Unidata?

The point is clarity. This is a pretty long blog, which has probably already collected several "TLDR" comments. The accompanying example code has to be short, clear, and easy to read.

The iPython notebook, which grew out of a session on reading netCDF data in the Unidata 2013 TDS-Python Workshop, is available in two forms: a non-interactive view in HTML that you can read through and follow along, or an actual iPython notebook, "netcdf-by-coordinates", and associated data file from the 2013 Unidata TDS-Python Workshop, with which you can run examples and do timings on your own machine and even with your own data!

If you choose the second method, you can see the results of changing the examples or substituting your own data.  To use the notebook you'll need to have Python and some additional libraries installed, as listed at the bottom of the README page on the referenced workshop site.

Just looking at the HTML version should be sufficient to see what's going on.

Four Approaches

In the notebook, we implement and progressively improve several ways to access netCDF data based on coordinates instead of array indices:

  • slow, simple, and sometimes wrong
  • fast, simple, and sometimes wrong
  • not quite as fast, but correct
  • fast, correct, and scalable for many queries on the same set of data points

Naive, slow approach

Since our concrete problem is to find which of over 600,000 points is closest to the (lat, lon) point (50, -140), we should first say what is meant by "close".

A naive metric for distance squared between (lat, lon) and (lat0, lon0) just uses the "Euclidean norm", because it's easy to compute:

dist_squared = (lat - lat0)^2 + (lon - lon0)^2

So we need to find indices x and y such that the point (Latitude[y, x], Longitude[y, x]) is close to the desired point (lat0, lon0). The naive way to do that is to use two nested loops, checking distance squared for all 600,000+ pairs of y and x values, one point at a time.

The code for this version is presented as the function naive_slow and a few subsequent lines that call the function on our explicit example.
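
Since the notebook itself isn't reproduced here, the following is a rough reconstruction of what such a naive_slow function looks like (mine, not the notebook's exact code; it uses the netCDF4 module, and the variable names match the ncdump output above):

import netCDF4

def naive_slow(filename, lat0, lon0):
    """Loop over every (y, x) grid point, minimizing a flat-Earth
    "distance squared"; simple and slow, and sometimes wrong."""
    nc = netCDF4.Dataset(filename)
    lats = nc.variables["Latitude"][:]
    lons = nc.variables["Longitude"][:]
    ny, nx = lats.shape
    best = None
    for y in range(ny):
        for x in range(nx):
            d2 = (lats[y, x] - lat0) ** 2 + (lons[y, x] - lon0) ** 2
            if best is None or d2 < best[0]:
                best = (d2, y, x)
    nc.close()
    return best[1], best[2]

# e.g. iy, ix = naive_slow("hycom_example.nc", 50.0, -140.0)   # file name hypothetical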

Note that the code cell examples all initially open the example file, read the data needed to find the desired values, and close the file after printing enough information to verify that we got the right answer. The work of opening the file, reading the data, and closing the file is repeated for each function so that the cells are independent of each other and can be run in any order or re-run without getting errors such as trying to read from closed files or closing the same file twice.
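
Here is a minimal sketch of that nested-loop search, assuming the 2D Latitude and Longitude coordinate variables shown in the CDL above; the file name "example.nc" is a placeholder, and the code illustrates the approach rather than reproducing the notebook's cells.

    import netCDF4

    def naive_slow(latvar, lonvar, lat0, lon0):
        # Check every grid point with two nested loops, keeping the (iy, ix)
        # that minimizes the flat-Earth squared distance to (lat0, lon0).
        lats = latvar[:]
        lons = lonvar[:]
        ny, nx = lats.shape
        best_iy, best_ix, best_dist = 0, 0, float("inf")
        for iy in range(ny):
            for ix in range(nx):
                dist_sq = (lats[iy, ix] - lat0)**2 + (lons[iy, ix] - lon0)**2
                if dist_sq < best_dist:
                    best_iy, best_ix, best_dist = iy, ix, dist_sq
        return best_iy, best_ix

    nc = netCDF4.Dataset("example.nc")   # placeholder for the workshop data file
    iy, ix = naive_slow(nc.variables["Latitude"], nc.variables["Longitude"], 50.0, -140.0)
    print(nc.variables["Latitude"][iy, ix], nc.variables["Longitude"][iy, ix])
    nc.close()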

Faster with NumPy

The first approach uses conventional explicit loops, which makes it resemble Fortran or C code. It can be sped up by a factor of over 700 just by using NumPy, a Python library supporting multidimensional arrays and array-at-a-time operations and functions that often replace explicit loops with whole-array statements. We call this version of the function naive_fast.

Like naive_slow, it suffers from flaws in the use of a "flat" measure of closeness.
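
A sketch of the vectorized version, under the same assumptions as the previous example, might look like this: NumPy computes all the squared distances in one array expression, and argmin finds the smallest.

    import numpy as np

    def naive_fast(latvar, lonvar, lat0, lon0):
        # Same flat-Earth metric, but computed for the whole grid at once.
        lats = latvar[:]
        lons = lonvar[:]
        dist_sq = (lats - lat0)**2 + (lons - lon0)**2
        # argmin gives a flat index; unravel_index turns it back into (iy, ix)
        return np.unravel_index(dist_sq.argmin(), lats.shape)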

Tunnel distance: still fast, but more correct

A problem with both of the previous versions is that they treat the Earth as flat, with a degree of longitude near the poles just as large as a degree of longitude at the equator. They also treat the distance between points at the edges of the longitude range, such as (0.0, -179.99) and (0.0, +179.99), as large, even though these points are just across the International Date Line from each other.

One way to solve both problems is to use a better metric, for example the length of a tunnel through the Earth between two points, which happens to be easy to compute with a little trigonometry.

This results in a more correct solution that works even if longitudes are given in a different range, such as 0 to 360 degrees instead of -180 to 180, because we use only trigonometric functions of angles to compute the tunnel distance. Wikipedia has the simple trig formulas we can use. For our example point this gives the same answer as the naive approach, but it avoids the erroneous answers we would get by treating the Earth as flat in the "naive" approaches.

The function for this approach, which we call tunnel_fast because it still uses fast NumPy arrays and functions instead of explicit loops, times at about 200 times as fast as the loopy naive approach.
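
As a sketch (not the notebook's exact code), the tunnel distance can be computed by converting each (lat, lon) pair to Cartesian coordinates on a unit sphere and minimizing the squared chord length:

    import numpy as np

    def tunnel_fast(latvar, lonvar, lat0, lon0):
        # Convert degrees to radians, then to (x, y, z) on a unit sphere;
        # the straight-line (tunnel) distance is the length of the
        # difference vector between the two 3D points.
        rad = np.pi / 180.0
        lats = latvar[:] * rad
        lons = lonvar[:] * rad
        lat0_r, lon0_r = lat0 * rad, lon0 * rad
        dx = np.cos(lat0_r) * np.cos(lon0_r) - np.cos(lats) * np.cos(lons)
        dy = np.cos(lat0_r) * np.sin(lon0_r) - np.cos(lats) * np.sin(lons)
        dz = np.sin(lat0_r) - np.sin(lats)
        dist_sq = dx**2 + dy**2 + dz**2
        return np.unravel_index(dist_sq.argmin(), lats.shape)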

Using KD-trees for More Scalable Speedups

Finally, we use a data structure specifically designed for quickly finding the closest point in a large set of points to a query point: the KD-tree (also called a "k-d tree"). Using a KD-tree is a two-step process.

First you load the KD-tree with the set of points within which you want to search for a closest point, typically the grid points or the locations of a set of ungridded data. How long this takes depends on the number of points as well as their dimensionality, but it is similar to the time it takes to sort N numbers, namely O(N log(N)). For the example dataset with over 600,000 points in the grid, this takes under 3 seconds on our laptop test platform, but it's noticeably longer than the setup time for the minimum-tunnel-distance search.

The second step provides a query point and returns the closest point or points in the KD-tree to the query point, where how "closest" is defined can be varied. Here, we use 3D points on a spherical Earth with unit radius. The combination of setup and query is initially implemented in a kdtree_fast function.
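
Here's a sketch of those two steps with scipy.spatial.cKDTree, again assuming the 2D Latitude and Longitude variables; the grid points and the query point are both mapped to (x, y, z) on a unit sphere so that the tree's Euclidean metric is the tunnel distance. This is an illustration of the approach, not necessarily the notebook's exact code.

    import numpy as np
    from scipy.spatial import cKDTree

    def kdtree_fast(latvar, lonvar, lat0, lon0):
        rad = np.pi / 180.0
        lats = latvar[:] * rad
        lons = lonvar[:] * rad
        # Setup: load the tree with every grid point as a 3D unit-sphere point
        clat = np.cos(lats)
        triples = np.column_stack(((clat * np.cos(lons)).ravel(),
                                   (clat * np.sin(lons)).ravel(),
                                   np.sin(lats).ravel()))
        tree = cKDTree(triples)
        # Query: transform the query point the same way, then ask for its nearest neighbor
        lat0_r, lon0_r = lat0 * rad, lon0 * rad
        query = [np.cos(lat0_r) * np.cos(lon0_r),
                 np.cos(lat0_r) * np.sin(lon0_r),
                 np.sin(lat0_r)]
        dist, flat_index = tree.query(query)
        return np.unravel_index(flat_index, lats.shape)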

Although you can't tell by running the setup and query together, the kdtree_fast query is significantly faster than the tunnel_fast search. Thus the kdtree_fast approach scales much better when we have one set, S, of points to search and lots of query points for which we want the closest point (or points) in S.

For example, we may want variable values on a 100 by 100 (lat, lon) subgrid of the domain, and as we'll see, the KD-tree provides a much faster solution than either the tunnel_fast or naive_fast methods. If the locations don't change across data from multiple files, times, or layers, the setup time for the KD-tree can be reused for many point queries, providing significant efficiency benefits. For popular spatial datasets on the server side, the same KD-tree could even be reused for multiple datasets that share the same coordinates.

The KD-tree data structure can also be used more flexibly to provide fast queries for the M closest points to a specified query point, which is useful for interpolating values instead of just using the value of a variable at a single closest point.
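
As a toy illustration with synthetic points (not the example grid, and not the notebook's code), asking for the M closest points just means passing k to the query:

    import numpy as np
    from scipy.spatial import cKDTree

    # 1000 random 3D points standing in for grid locations
    points = np.random.rand(1000, 3)
    tree = cKDTree(points)
    # k=4 returns the four closest points, e.g. for inverse-distance interpolation
    dists, indices = tree.query([0.5, 0.5, 0.5], k=4)
    # simple inverse-distance weights (assumes no exact hit, so no zero distance)
    weights = (1.0 / dists) / (1.0 / dists).sum()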

The KD-tree data structure is like a Swiss-army knife for coordinate data, analogous in some ways to regular expressions for strings. We have shown use of KD-trees for 2D lat/lon data and for 3D coordinates of points on a sphere with a tunnel distance metric. KD-trees can also be used with vertical coordinates or time, if you specify a reasonable way to define "close" between points in space-time.

Implementations of KD-trees are freely available for C, Java, Fortran, C++, and other languages besides Python. A Java KD-tree implementation is used in ncWMS, the Web Map Service included in TDS servers but developed at the University of Reading. A C/C++ implementation of KD-trees is used in Fimex, a library from the Norwegian Meteorological Institute that supports interpolation and subsetting of geospatial data built around the Unidata Common Data Model. The module we used in the iPython notebook example, scipy.spatial.cKDTree, is a C implementation callable from Python that is significantly faster than the pure Python implementation in scipy.spatial.KDTree.

The last section of the notebook re-implements our four functions as four object-oriented Python classes, with the setup code in a class constructor and a single query method. This makes it easier to time the initialization and query parts of each approach separately, which leads to simple formulas for the time required for N points and a prediction of when the extra setup time for kdtree_fast pays off.
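
A sketch of that object-oriented pattern for the KD-tree case, with illustrative names rather than the notebook's exact classes, might look like:

    import numpy as np
    from scipy.spatial import cKDTree

    class KdtreeIndex(object):
        # Setup in the constructor: build the tree once from the 2D coordinate arrays.
        def __init__(self, latvar, lonvar):
            rad = np.pi / 180.0
            lats = latvar[:] * rad
            lons = lonvar[:] * rad
            self.shape = lats.shape
            clat = np.cos(lats)
            self.tree = cKDTree(np.column_stack(((clat * np.cos(lons)).ravel(),
                                                 (clat * np.sin(lons)).ravel(),
                                                 np.sin(lats).ravel())))

        # Query method: can be called many times without repeating the setup cost.
        def query(self, lat0, lon0):
            rad = np.pi / 180.0
            lat0, lon0 = lat0 * rad, lon0 * rad
            _, flat_index = self.tree.query([np.cos(lat0) * np.cos(lon0),
                                             np.cos(lat0) * np.sin(lon0),
                                             np.sin(lat0)])
            return np.unravel_index(flat_index, self.shape)

Timing the constructor and the query method separately is then just a matter of wrapping each call in a timer.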

And the Winner Is ...

The envelope, please. And the winner is ... YMMV!

The reasons Your Mileage May Vary include:

  • number of points in search set
  • dimensionality of search set
  • complexity of distance metric
  • number of query points
  • number of closest neighbors needed, e.g. for interpolation
  • I/O factors, including caching, compression, and remote access effects
  • accounting for non-spherical Earth shape

We've summarized our timing results for the four approaches to the sample problem on the example dataset at the end of the iPython notebook. Perhaps we can look at some of the factors listed above in future blogs.

In case you didn't want to open the iPython notebook to see the results, here are some times on a fast laptop platform:

Method         Setup (ms)   Query (ms)   Setup + 10000 queries (sec)
Naive_slow          3.76      7790                         77900
Naive_fast          3.8          2.46                         24.6
Tunnel_fast        27.4          5.14                         51.4
Kdtree_fast      2520            0.0738                        3.3

Determining when it's worth it to use a KD-tree for this example resulted in:

  • kdtree_fast outperforms naive_fast above 1050 queries
  • kdtree_fast outperforms tunnel_fast above 490 queries

The advantage of using KD-trees grows with the number of points in the search set, because KD-tree query complexity is O(log(N)) while the other algorithms are O(N); it's the same difference as between binary search and linear search.

If you're interested in more on this subject, a good short paper on various ways to regrid data using KD-trees and other methods is Fast regridding of large, complex geospatial datasets by Jon Blower and A. Clegg.
