Unidata Developer's Blog

High Performance Netcdf-4 Proposal


This document outlines a proposal to create an alternate Netcdf-4 file format targeted at high-performance, read-only access. For the purposes of this document, this format will be called NCX.

Limitations of the Existing Netcdf-4 format

The Netcdf-4 file format currently uses the existing HDF5 file format to store its data. From a high-performance point of view, the HDF5 format is limited in a number of ways.

  1. It does not support multi-threaded access; currently all API calls must be serialized using a single global lock.
  2. MPIO support is provided, but is totally embedded in the HDF5 library. There is no ability for user control and optimization.
  3. The HDF5 file format is completely fixed and opaque and there is limited support for performance-specific organizations. The two exceptions are:
    • Chunking parameterization is allowed to control how data is co-located.
    • Compression (on a per-chunk basis) allows data to be compressed thus supporting faster reads.

Rationale for a New NCX Format

What is being proposed is a new format for read-only access to "Netcdf4-like" files that provides the following capabilities.

  1. A simple-as-possible file format with a specification independent of any implementation.
  2. Keeping the existing Netcdf-4 data-model.
  3. Some ability to re-arrange the data in the file to support specific access patterns. This would include keeping the HDF5 chunking and compression concepts.
  4. Support for community-developed tools that can re-organize the data in the file.

In addition, NCX is intended to be sufficiently simple that multiple, independent implementations can be constructed in a variety of programming languages. This is in contrast to the situation with HDF5, where the file format is so complex that only one complete implementation exists: the one provided by the HDF Group.

A Draft File Format

The NCX format proposed in this section is preliminary. Alternative proposals are encouraged.

The basic format builds on the concept of a single-file file system format (aka SFFS).

The basic idea is that a single file is organized to contain a file system, including a root plus inodes plus data blocks, all within a single file that is treated as if it were a heap.

The SFFS approach has a number of useful properties.

Simplicity: The basic SFFS layout is relatively simple. As with an on-disk file system, it uses a superblock plus a set of inodes each of which points to a tree of data blocks. Such an organization avoids the complexity of e.g. the HDF5 b-trees while providing a very general data layout.

Dynamicity: As with a normal file system, a file in the SFFS can be extended (or shortened) in size dynamically at the end of the file.

Annotation: Since the SFFS simulates a file system, it is possible to add information about data already stored in the SFFS. In effect, one can create a file that provides "annotations" about other files in the SFFS; these annotations can include, for example, indices pointing into an existing file.

Capability for Reorganization: As long as the basic inode structure is maintained, it is possible to move chunks of data around to support better IO performance. One could even redivide the existing data into larger or smaller data chunks.

Mapping Netcdf-4 to an SFFS

Meta-data: The meta data about the netcdf-4 file can itself be contained in a single, virtual file in the SFFS.

Primitive-Typed Variables: Consider a variable of a fixed-size primitive type: signed or unsigned integers of various sizes, enums, or chars. Assume the dimensions are all fixed size (not unlimited).

Such a variable can be easily laid out in a contiguous format, possibly using hdf5 style chunking and compression.

Unlimited Dimensions Case 1 (Initial Unlimited): Extending the previous case, a primitive-typed variable might have one or more unlimited dimensions. For the case of a single, initial unlimited dimension, the variable can be stored exactly as if it had no unlimited dimension. This is because it is possible to dynamically extend a file in the SFFS to accommodate changes in the size of the unlimited dimension.

Unlimited Dimensions Case 2 (Multiple Unlimited): Consider the following.

    dimensions: d1=..., d2=..., d3=..., du=UNLIMITED;
    variables: int v(d1,d2,du,d3);

For this case, we have a number of options. One option (assuming read-only as we are) is to start the file containing v with n intra-file offsets pointing to the subparts of the variable defined by the unlimited dimension. That is, for this example, we have an initial index of d1 x d2 offsets, where each offset points to the start of each of the subarrays of size du x d3. This case generalizes to multiple unlimited dimensions in the obvious(?) way.
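To make the layout concrete, here is a minimal Python sketch of how a reader might consult such an offset index; the 8-byte big-endian offsets and 4-byte elements are illustrative assumptions, not part of the proposal.

    import struct

    def read_subarray(f, i, j, d2, du, d3, itemsize=4):
        """Read the du x d3 subarray of v(d1,d2,du,d3) selected by (i, j).

        Assumes the variable's file begins with a table of d1*d2 8-byte
        offsets, each pointing at one contiguous du*d3 subarray.
        """
        f.seek((i * d2 + j) * 8)              # locate the (i,j)-th index entry
        (offset,) = struct.unpack(">q", f.read(8))
        f.seek(offset)                        # jump to the subarray itself
        return f.read(du * d3 * itemsize)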

Note how this differs from the netcdf-3 case where all variables with an unlimited dimension are co-mingled. However also note that we could re-organize this in a variety of ways to support parallel IO for specific access patterns.

String Typed Variables: This is fairly easy: each string is stored with a preceding count, the strings are stored linearly, and some form of index points to the offset of each string.

Even simpler, and again possible because the file is read-only, is to store every string in a slot the size of the maximum string. This produces internal fragmentation, but allows us to treat strings as fixed-size objects.

Opaque Typed Variables: This is essentially the same situation as strings.

VLEN Typed Variables: One approach is to treat each vlen object as a separate file of its own length. Another approach is to use the String approach because we know the maximum size of all the vlens.

Compound Typed Variables: Again we have some options: we could store each compound object in field order (as with a C struct), with each field following the next.

Alternately, we could store in the equivalent of "column order", where all instances of the first field (assuming an array of compounds) are stored one after another. Then all instances of the second field are stored, and so on.

Misc. Notes

Why Read-Only?

The "process" implied here is as follows.

  1. The data file is created using the existing read-write model of the netcdf-c library.
  2. A special program (e.g. nccopy) is used to take the original file as a whole and convert it to the NCX format.

The point is that when the NCX file is created, the whole of the dataset is available. This means, for example, that specialized layouts for variable-length data (strings, vlens, unlimited dimensions) can be achieved because the totality of the data is known. If an attempt were made to write the original dataset piecemeal using the NCX format, the whole of the dataset would not be available, hence certain kinds of layout optimization would not be possible.

Use of Docker

I considered using Docker (especially docker commit) as an alternative. This has the advantage that one could even include programs in the "file". However, security considerations make this approach untenable until Docker sandboxing is completely reliable and trusted.


Implementing Thread-safe Access to the netCDF-C Library


Thread-Safe Access to the netcdf-c API

Initial Draft: 2017-2-21
Last Revised: 2017-5-30
Author: Dennis Heimbigner, Unidata


Introduction

This document proposes an architecture for implementing thread-safe access to the netcdf-c library. Here, the term "thread-safe" means that multiple threads can access the netcdf-c library safely (i.e. without interference or deadlock or race conditions). This does not mean that the library is itself multi-threaded. Rather, access to the library is serialized so that only one thread at a time is executing the library code.

It is proposed that thread-safe operation be implemented such that all calls to the netcdf-c API are protected by a binary semaphore using a lock-unlock protocol. This means that all calls to the API are "serialized": each API call completes before any other call to the API can execute. As a result, in a multi-threaded environment, all threads can safely access the netcdf-c library.

This approach comes with some caveats.

  1. If two different threads attempt to access the same file, then interference is still possible.
  2. Using thread-safe access simultaneously with MPI parallelism may not be safe; this is still unresolved.

Architectural Considerations

At the moment, the implementation of the netcdf-c API resides in files in the libdispatch directory. Basically, all the code in libdispatch falls into the following categories.

  1. Dispatch functions -- These functions directly invoke methods in the dispatch table and typically have this form.

    int nc_xxx(...)
    {
        NC* ncp;
        int stat = NC_check_id(ncid,&ncp);
        if(stat != NC_NOERR) return stat;
        return ncp->dispatch->XXX(...);
    }
    
  2. Auxiliary functions -- These functions just invoke some other function in the API, but possibly with some special values for the arguments of the called function. Here is an example.

    int nc_inq_varname(int ncid, int varid, char *name)
    {
           return nc_inq_var(ncid, varid, name, NULL, NULL, NULL, NULL);
    }
    
  3. Complex functions -- These functions do complex computation including calling a variety of internal functions.

  4. Internal functions -- All other code in libdispatch is considered internal.

Functions in classes 1 and 3 are considered to be part of the API core. The following figure shows the notional relationship between the function classes.

Locking Regime

The simplest approach to thread-safety is to surround all calls to API functions with a LOCK/UNLOCK protocol. This is how the HDF5 library operates, for example.

Our proposal is to implement locking using a single, global binary semaphore. This is extremely simple and is well-supported under all versions of *nix (using libpthreads) as well as Windows (built-in).

One consequence of this decision is that there must be no recursive calls to locked functions; any such call will cause a deadlock. This means specifically that core functions and internal functions must not invoke core functions (directly or transitively).

An example of adding locking to a core function is shown below.

    int nc_xxx(...)
    {
        NC* ncp;
        int stat = NC_NOERR;
        LOCK();
        if((stat=NC_check_id(ncid,&ncp)) != NC_NOERR) goto done;
        stat = ncp->dispatch->XXX(...);
    done:
        UNLOCK();
        return stat;
    }

The done label is used to provide a single exit to ensure that UNLOCK is always invoked before exiting the function.

Note that we do not need to add locking to our class 2 (Auxiliary) functions since they just invoke a core function (class 1 or 3) that does the actual locking. Because of this, it will pay to try to convert as many API calls as possible into auxiliary functions. Currently, a number of core functions could be converted with small effort by revising the set of core functions.

Note also that we assume that all internal functions will be invoked either by other internal functions or by core API functions that use a locking protocol. Hence these internal functions do not need to use a locking protocol. In fact, if they did, it could cause a deadlock.

Problem 1: Mostly Auxiliary Functions

It turns out that there are a few functions that are mostly auxiliary functions except that they invoke some internal functions to get information not available through the standard netcdf-c API. One example is the NCDEFAULTgetvars function, which invokes two internal functions: NCisrecvar and NC_getshape.

The solution is to "expose" these internal functions in the core API by providing wrappers for them that use the locking regime. Using this approach, it should be possible to increase the number of auxiliary functions that do not need to directly use locking.

Note that exposing these functions does not mean that they are part of the public netcdf-c library API; only that they are accessible to our external functions.

Problem 2: Internal Functions calling Core Functions

This is the big problem in implementing thread-safety. It turns out that some internal code invokes core API functions. This mostly occurs inside the libdap2 and libdap4 code. This is a problem because it violates the no-recursive-call rule and will lead to deadlock.

The simplest solution to this problem is to change the internal code so that it no longer calls the core API. Instead, the direct calls can, in most cases, be changed to call directly into the dispatch layer. The cost is increased complexity in the internal code. To some degree, this complexity can be mitigated by using macros to hide it. In a few cases, some extra internal functions may have to be introduced into the libdispatch code to make this change possible or to simplify the required changes.

Steps to Implementing Proposed Architecture

The key to implementing the proposed architecture is to slowly refactor the code in libdispatch to properly segregate the auxiliary functions from the core API from the internal code.

The following sequence of actions is proposed.

  1. Create two new files: libdispatch/daux.c and libdispatch/dapi.c.
  2. Move auxiliary functions into daux.c and the core api functions into dapi.c.
  3. Add extra functions in dapi.c to expose functions like NC_getshape (see above).
  4. Move, where possible, code from dapi.c to daux.c using the exposed functions in #3.
  5. Identify the recursive calls in internal code. This can be accomplished by temporarily renaming the functions in dapi.c and dextend.c and then recompiling. That should flush out all such recursive calls.
  6. Convert the calls identified in #5 to call through the dispatcher instead.
  7. Add locking to dapi.c.
  8. Test and fix the resulting code to look for missed recursive calls.

Conclusion

Assuming the above approach is correct, we should be able to make the netcdf-c library thread-safe with a straightforward, if tedious, sequence of changes.

The Death of Server-Side Computing


(OK, that is hyperbole, but...)

Initial Draft: 2017-5-28
Last Revised: 2017-6-5
Author: Dennis Heimbigner, Unidata


Introduction

For a number of years, the Unidata Thredds group has been in the process of "implementing" server-side computation Real-Soon-Now (as the saying goes).

Server-side computing embodies the idea that it is most efficient to physically co-locate a computation with the datasets on which it operates. As a rule, this has meant having a server execute the computation because the dataset was controlled by that server. Server-side computing servers for the atmospheric community have existed in various forms for a while now: GRADS, DAP2 servers, and ADDE, for example.

One -- and perhaps The -- major stumbling block to server-side computing is defining and implementing the programming language in which the computation is coded. In practice, server-side systems have developed their own language for this purpose. This is a problem primarily because it is very difficult to define and implement a programming language. Often the "language" started out as some form of constraint expression (e.g. DAP2, DAP4, and ADDE). Over time, it would accrete other capabilities: conditionals, loops, etc. In time, it grew into a badly designed but more complete programming language. Since it was rarely implemented by language/compiler experts, it usually was quirky and presented a significant learning curve for users.

The advantage to using such a home grown language was that it could be tailored to the dataset models supported by the server. It also allowed for detailed control of programs. This made certain other issues easier: access controls and resource controls, for example.

The author recognized the language problem early on and was reluctant to go down that path. Since the author has been the primary "pusher" for server-side computing at Unidata, this reluctance has delayed implementation for an extended period.

The Alternative: Jupyter

Fortunately, about three years ago, project Jupyter [1] was created as an offshoot of the IPython Notebook system. It provided a multi-user, multi-language compute engine in which small programs could be executed. With the advent of Jupyter, IPython then refactored its computation part to use Jupyter.

From the point of view of Unidata, Jupyter provides a powerful alternative to traditional server-side computing. It supports multiple, "real" programming languages. It is a server itself, so it can be co-located with an existing Thredds server. And, most importantly, it is designed to execute small programs written in any of its supported languages.

In the rest of this document, the term "program" will, as a rule, refer to programs executing within a Jupyter server.

The Language: Python

In order to avoid the roll-your-own language problem, it was decided to adopt wholesale an existing modern programming language. This meant that the language was likely to be complete right from the start. Further, the learning curve would be reduced because a significant amount of supporting documentation and tutorials would be available.

We have chosen Python as our preferred language. We made this choice for several reasons.

  1. Python is rapidly being adopted by the atmospheric sciences community as its language of choice.
  2. There is a very active community that is developing packages for use by the scientific community and more specifically for the atmospheric sciences community. Examples are numerous, including numpy, scipy, metpy, and siphon.
  3. It is one of the languages supported by Jupyter.

To the extent that Jupyter supports other languages, it would be possible to write programs in those languages. However, I would not expect Unidata to expend any significant resources on those other languages. The one possible exception is if/when Jupyter supports Java.

The Notional Architecture

[Figure 1: Notional architecture]

The notional architecture we now espouse is shown in Figure 1. Basically, a standard Thredds server runs alongside a Jupyter server. A program executing in the Jupyter server has access to the data on the Thredds server either using the file system or using some streaming protocol (e.g. DAP2). File access is predicated on the assumption that the two servers are co-located and share a common file system.

The Thredds server currently requires and uses some form of servlet engine (e.g. Tomcat). We exploit that to provide a front-end servlet to act as intermediary between a user and the Jupyter server (see below).

So now, instead of sending a program to the Thredds server, it is sent to the Jupyter server for execution. That executing program is given access to the Thredds server using a variety of packages (e.g. Siphon [2]). Once its computation is completed, its resulting products can be published within a catalog on Thredds to make them accessible to user programs. Once in the catalog, the product can be accessed by external clients using existing streaming protocol services. In some cases, it may also be possible to access the product using a shared file system.

This discussion assumes the existence of a single Jupyter server, but it will often be desirable to allow multiple such servers. Examples of the utility of multiple servers will be discussed in subsequent sections.

Accessing the Jupyter Server

Access to the Jupyter server will be supported using several mechanisms. Each mechanism has a specific use case.

IPython Access

Though not shown in Figure 1, it is assumed that existing IPython access to Jupyter is available. This path is, of course, well documented elsewhere in the IPython+Jupyter literature.

Web-based Access

Another use-case is to provide access for scientists with limited programming skills or for other users requiring simple and occasional computations.

The servlet box in Figure 1 illustrates this. For this case, client web browsers would carry out forms-based computations via the front-end servlet running under Apache Tomcat (or some other servlet engine).

Programmatic Access

Scientists will still write standalone programs that need to process computed data. Others will write value-added wrapper programs to provide, for example, additional capabilities such as plotting or other graphical presentation.

These use cases will require the ability to upload and execute programs from client-side programs. The simplest approach here is to build on the web-based version. That is, the client-side program would also access the servlet, but using a modified and streamlined interface.

Asynchronous Operation

Some computations will take a significant amount of time to complete. Submitting such a computation through the Thredds server interface is undesirable because it requires either blocking the client for long periods of time or complicating the Thredds server to make it support asynchronous execution. The latter usually involves returning some kind of token (aka future) to the client that it can interrogate to see if the computation is complete. Or alternatively, providing some form of server-to-client event notification mechanism. In any case, such mechanisms are complicated to implement.

Direct client to Jupyter communication (see previous section) can provide a simple and effective alternative to direct implementation of asynchronous operation. Specifically, the client uploads the program via IPython or via a web browser to the Jupyter server. As part of its operation the program uploads its final product(s) to some catalog in the Thredds server. The client is then responsible for detecting that the product had been uploaded, which then enables further processing of that product as needed.

Thredds Value Added

Given the approach advocated in this document, on what should Unidata focus to support it?

Accessing Thredds Data

First and foremost, we want to make it easy, efficient, and fast for programs to access the data within a co-located Thredds server.

Thredds currently provides a significant number of "services" [3] through which metadata and data can be extracted from a Thredds server. These include at least the following: DAP2 (OpenDAP), DAP4, HTTPServer, WCS, WMS, NetcdfSubset, CdmRemote, CdmrFeature, ISO, NCML, and UDDC.

The cost of accessing data via some commonly supported protocols, such as DAP2 or CdmRemote, is relatively independent of co-location, so using such protocols is probably not the most efficient method.

File Download

The most efficient inter-server communication is via a shared file system accessible both to the Thredds server and the Jupyter server.

As of Thredds 5 it is possible to materialize both datasets and (some kinds of) streams as files: typically netcdf-3 (classic) or netcdf-4 (enhanced). One defines a directory into which downloads are stored. A special kind of request is made to a Thredds server that causes the result of the query to be materialized in the specified directory. The name of the materialized file is then returned to the client.

Siphon

The Siphon project [2,4] is designed to wrap access to a Thredds server using a variety of Thredds services. As such, it will feature prominently in our system. Currently, siphon supports the reading of catalogs, and data access using the Thredds netcdf subset service (NCSS), CdmRemote, and Radar Data.
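As a rough sketch of that style of access (the catalog URL and variable name are placeholders, and the calls reflect the Siphon API as I understand it, not a vetted recipe):

    from siphon.catalog import TDSCatalog
    from siphon.ncss import NCSS

    # Placeholder catalog URL; point this at a real Thredds catalog.
    cat = TDSCatalog('http://thredds.example.edu/thredds/catalog/grib/catalog.xml')
    ds = list(cat.datasets.values())[0]

    # Use the dataset's NetCDF Subset Service (NCSS) access URL.
    ncss = NCSS(ds.access_urls['NetcdfSubset'])
    query = ncss.query().variables('Temperature_isobaric').accept('netcdf4')
    data = ncss.get_data(query)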

Operators

The raison d'etre of server side computation is to input datasets, apply operators to them and produce new product datasets. In order to simplify this process, it is desirable to make available many high-level operators so that a computation can be completed by the composition of operators.

Often, server-side computation is illustrated using simple operations such as sum and average. But these kinds of operators are likely to only have marginal utility; they may be useful, but will not be the operators doing the heavy lifting of server side computation.

Accumulating useful operators is possibly another place where Unidata can provide added value. Unidata can both provide a common point of access, as well as providing some form of vetting for these operators.

One example is Pynco [5]. This is a Python wrapping of the netCDF Operators (NCO) [6]. NCO is currently all command line, so Pynco wraps them to allow programmatic invocation of the various operators.

As part of the operator support, Unidata might wish to create a repository (using conda channels or Github?) to which others can contribute.

Publication (File Upload)

When a program is executed within Jupyter, it will produce results that need to be communicated to others -- especially the client originating the computation. The obvious way to do this is to use the existing Thredds publication facilities, namely catalogs.

As of Thredds 5, it is possible to add a directory to some top-level catalog. Uploading a file into that directory causes it to appear in the specified catalog. Uploading can be accomplished either by file system operations or via a browser forms page.

Specialized Capabilities

Another way to add value is to make libraries available that support specialized kinds of computations.

GPU Support

The power of Graphics Processing Units (GPUs) has significantly increased over the last few years. Libraries now exist for performing computations on GPUs. To date, using a GPU on atmospheric data is uncommon. It should be possible to improve the situation by making operators available that use a GPU underneath to carry out the computation.

Machine Learning Support

Artificial Intelligence, at least in the form of machine learning, is another example of a specialized capability. Again, use of AI to process atmospheric data is currently not common. It should be possible to build quite sophisticated subsystems supporting the construction of AI systems for doing predictions and analyses on such data.

Access Controls

There is a clear danger in providing a Jupyter server open to anyone to use. Such a server is a potential exploitable security hole if it allows the execution of arbitrary code. Further, there are resource issues when anyone is allowed to execute a program on the server.

Much of the support for access controls will depend on the evolving capabilities implemented by the Jupyter project. But we can identify a number of access controls that will be needed to protect a Jupyter server.

Sandboxing

The most difficult problem is to prevent the execution of arbitrary code on the Jupyter server. Effectively, such code must be sandboxed to control what facilities are made available to executing programs. Two sub-issues arise.

  1. Some Python packages must be suppressed. Arbitrary file operations and sub-process execution are two primary points of concern.

  2. Various packages must be usable: numpy, metpy, siphon, for example. They are essential for producing the desired computational products. However, the security of the system depends on the security of those packages. If they contain accessible security flaws, then security as a whole is compromised.

Authentication

Strong authentication mechanisms will need to be implemented so that only authorized users can utilize the resources of a Jupyter server. Jupyter authentication may need to be coordinated with the Thredds server so that some programs executed on Jupyter can have access to otherwise protected datasets on the Thredds server. This is one case where multiple Jupyter servers (and even multiple Thredds servers) may be needed to support specialized access to controlled datasets by using isolated Jupyter servers.

Resource Control Mechanisms

Uncontrolled execution of code can potentially be a significant performance problem. Additionally, it can result in significant costs (in dollars) being charged to the server's owner.

For many situations, it will be desirable to have clients stand up their own Jupyter server co-located with some Thredds server in a way that allows the client to pay for the cost of the Jupyter server. Cloud computing is the obvious approach. Clients will pay for their own virtual machine that is as "close" to the Thredds server as their cloud system will allow. The client can then use their own Jupyter server on their own virtual machine to do the necessary computations, for which they will be charged.

Planned Activities

A preliminary demonstration of communication between Thredds and Jupyter was created by Ryan May under the auspices of the ODSIP grant [7] funded by the NSF Earthcube program.

We anticipate starting from the ODSIP base demonstration and extending it over time. Subject to revision, the current plan involves the following steps.

Step 1. Servlet-Based Access

The first step is to build the servlet front-end. This may be viewed as a stripped-down mimic of IPython. This servlet will support both forms-based access as well as programmatic access.

Step 2. Operators

An initial set of operator libraries will need to be collected so that testing, experimentation, and tutoring can proceed. This will be an ongoing process. One can hope that some form of repository can be established and that a critical mass of operators will begin to form.

Step 3. Configuration

The next step is to make it possible, though not necessarily easy, for others to stand up their own Jupyter + Thredds. One approach would be to create a set of Docker instructions for this purpose. This would allow others to directly instantiate the Docker container as well as provide a recipe for non-Docker operation.

Step 4. Examples

As with any new system, external users will have difficulties in using it. So a variety of meaningful examples will need to be created to allow at least cutting-edge users to begin to experiment with the system. Again, this will be an on-going activity.

Step 5. Access Controls

At least an initial access control regime cannot be delayed for very long. Some external users can live without this in the short term. But for more widespread use, the users must have some belief in the security of the systems that they create. As with operators, this will be an on-going process.

Step 6. Workshops and Tutorials

At some point, this approach must be presented to the larger community. For Unidata, this is usually done using our Workshops. Additionally, video tutorials and presentations will need to be created.

References

[1] https://en.wikipedia.org/wiki/IPython#Project_Jupyter
[2] https://github.com/Unidata/siphon
[3] https://www.unidata.ucar.edu/software/thredds/v4.3/tds/tds4.3/reference/Services.html
[4] https://unidata.github.io/siphon/
[5] https://github.com/nco/pynco
[6] http://nco.sourceforge.net/
[7] https://www.nsf.gov/awardsearch/showAward?AWD_ID=1343761

MetPy Mondays #1 - Conda Installation


Welcome to MetPy Mondays! Each Monday the MetPy and Python developers at Unidata will bring you a bite sized tutorial (always less than 10 minutes) with tips, tricks, and tutorials on getting up and running with Unidata Python software. Posts could be screencasts, text blogs, or even snippets talking with the developers. Please check back every Monday for more content and suggest topics you would like to hear about.

This week we're going to start off by getting your Python environment up and running. We will install Miniconda, a product of Continuum Analytics. Conda is a Python environment and package manager that makes managing various versions of Python and packages a breeze.

MetPy Mondays #2 - Conda Forge


This week we continue setting up our Python environments by learning about Conda channels. We’ll add the Conda Forge channel and see how to install and update packages. We'll round out the screencast by using Conda to install the most recent version of Unidata's MetPy package.

MetPy Mondays #3 - Conda Environments


Have you ever had to manually change your path to switch between Python 2.7 and Python 3? Have you broken your research environment by installing a new package to try? Have you ever wanted to take a snapshot and back up your current Python environment? If so, you'll love Conda environments! This week we show you how to set up your own Python environments and switch between them. We also cover how to create a file defining your environment so others can recreate it. It's another video MetPy Monday!

MetPy Mondays #4 - Units in MetPy


I remember the first class I took in which the professor required that we include units by every quantity in every step of every calculation we did… or it was wrong. I thought this policy was a bit harsh, but after one or two assignments, I was getting the hang of it. By the end of the semester I realized that it was insane to work any other way. In science, we are dealing with physical quantities that represent things in the real world – and things in the real world have units. Keeping track of units throughout a calculation caught many errors I made while solving and rearranging equations. If keeping track of units on paper is a good idea, why is computing any different? In this MetPy Monday, we'll look at how MetPy uses units and how to convert between different units.

To get started, open up a terminal and activate an environment with MetPy installed. Check out the Conda Environments MetPy Monday if you need help with this. I'll be using the Unidata workshop environment. I'll also be working in a Jupyter notebook.

MetPy relies on a Python library called pint to handle our units. By using some special code under the hood, MetPy will even tell you if you pass a function values with incorrect units. But before we get to that, we need to understand how to attach units to a number. First, we need to import the units registry from metpy.units:
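In code, that import is:

    from metpy.units import units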

Let's create a temperature value and assign it the units of degrees Celsius. The first part of this assignment looks perfectly normal, but then we multiply the value by the appropriate units we want to attach. Printing the value out, we can see that units are indeed attached.
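Something along these lines, where the value itself is illustrative:

    temperature = 25 * units.degC
    print(temperature)        # -> 25 degree_Celsius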

Converting between units is as simple as using the .to command:
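For example, converting the temperature above to Fahrenheit:

    print(temperature.to(units.degF))    # -> 77.0 degree_Fahrenheit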

Easy! But what about quantities with more complex units? The units library is capable of parsing a string for the correct units. Let's create u and v wind components with different units:
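A sketch of that step, attaching one unit by parsing a string and the other by attribute:

    u = 10 * units('m/s')
    v = 15 * units.knots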

Units can parse m/s as meters/second and assign the unit. We can now call some of MetPy's calculation routines and let them worry about making sure everything is the correct unit before performing any calculations.
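A hedged sketch of such a call; the function name here follows recent MetPy releases (older versions used a different name for the wind speed routine):

    import metpy.calc as mpcalc

    speed = mpcalc.wind_speed(u, v)
    print(speed)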

Nice! The u and v components were converted to a common unit before the calculation was performed. We can also convert the result to whatever we desire, even the absurd.
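For instance, pint happens to know about furlongs and fortnights:

    print(speed.to('furlong/fortnight'))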

What if we pass the incorrect dimensions to a calculation? In other words, there is no unit conversion to make things work out correctly and something has gone horribly wrong. Maybe we accidentally reassigned a variable somewhere and what we think is the u component of the wind is really a temperature.
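Something like the following, where the calculation fails because a temperature cannot be combined with a wind component:

    temperature = 25 * units.degC
    mpcalc.wind_speed(temperature, v)    # raises an error about incompatible dimensions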

MetPy gives us an error message telling us that there are incorrect dimensions for the calculation. Using units can make your calculations more reliable and means you have to write fewer lines of code. Never again will you write a function to convert degrees Fahrenheit to degrees Celsius, nor should you! Thanks for joining us on this week's MetPy Monday!

MetPy Mondays #5 - Python Resources


Don't know where to start on your Python journey? We’ve developed a lot of material here at Unidata to help you out! This week, we’ll go over the resources and show you how to use the example galleries and documentation pages to explore the capabilities of MetPy and the other Unidata Python products.


MetPy Mondays #6 - Making a Basic Map with Cartopy


One of the most common tasks we do as geoscientists is make maps. Maps are a great way to look at massive amounts of data and synthesize it. In Python, Cartopy is the most current mapping package available and is what we use in all of our MetPy Gallery and Python Gallery examples. In this week’s MetPy Monday, we’ll look at the fundamentals of mapping with Cartopy and create a couple of simple base maps that data can be plotted on.

Event Notification for Thredds Servers


Initial Draft: 2017-08-05
Last Revised: 2017-08-05
Author: Dennis Heimbigner, Unidata


Introduction

Periodically some of the Thredds servers run by Unidata get seriously overloaded. One cause is that external users poll the Thredds server to see what has changed. If the polling rate is too high then the performance of the Thredds server can seriously deteriorate.

I am proposing here to mitigate this problem by allowing Thredds servers to generate events that signal changes that might be of interest to users. Then, instead of polling, these users can watch for specific change events and use that information to update their local databases (or whatever).

The cost tradeoff for Unidata is the cost of periodic "hammering" versus the maintenance of an event server to distribute change events to users.

Looking ahead, it is also possible that this proposal can facilitate inter-server communications. This means that multiple Thredds servers could communicate useful information. This is speculative for now, but should be kept in mind.

Architecture

I am proposing a pretty standard publish-subscribe system for use by a Thredds server. In this architecture, there are hooks in various places in the Thredds code that send short messages to a separate "broker" server.

On the client (user) side, each client registers with the broker to tell it the kinds of messages in which it is interested.

So the flow is:

  1. The server generates a change message
  2. The message is received by the broker
  3. The broker forwards the message to all clients that are registered as interested in that kind of message.

Requirements

In order to be useful to Unidata and its community, I require certain capabilities for the event system.

Topic-based Messages

In event systems, there are typically two ways to identify messages: by queue and by topic.

A topic based message is one that has an associated structured string used to classify the message. Often, the structure of the string is a tree represented by the format field.field.field... where each field is some identifier. This format can be used, for example, to mark the message as referring to some file in a tree structured file system. Thus a file /f1/f2/f3 might be mapped to the topic f1.f2.f3.

A queue-based identification is one in which a message is sent to a specific named queue. It is isomorphic to a topic system as far as a sender is concerned, because each distinct topic string can be the name of a queue. I will not consider queue-based systems further.

Topic Wildcards

On the client side, the client must be able to register for messages by specifying a pattern indicating the message topics in which it is interested. It is desirable to allow a client to register for a number of different topics by specifying a pattern containing wildcards (as is common in e.g. Unix file specifications).

If, for example, our client was interested in events about all files within /f1/f2, it should be possible to specify a topic pattern such as "f1.f2.*".

Durability

Suppose a client is not active or not registered with the broker at the time an event is received by the broker. If later the client registers, it will not see that previously generated message. This is a problem because a client will be forced to again access the Thredds server to see what happened while it was offline.

To deal with this problem, I require that our broker support "durable" messages. In the event community, this means that the broker will cache messages for some period of time. When a client registers, it will receive any cached messages that match its pattern. Supporting durability is tricky because of issues such as message retention policies and message duplication. Nonetheless, this is an essential requirement in order to avoid polling as much as possible.

Persistence (Optional)

In this context, persistence means that cached messages are maintained even if the broker crashes. It is closely related to durability if the durable messages are stored in some kind of file-based database system. I do not require persistence, although it may come for free with some brokers.

Transactions (Not Required)

Transactions in event systems are similar to those of database systems. It essentially means that a message is guaranteed to be delivered or it appears as if it is not delivered at all. It is unlikely that transactions will be required for our event system.

Initial Event Set

The initial set of events of interest will be catalog change events, specifically these:

  • Insert - when a catalog has a new entry
  • Delete - when a catalog has an existing entry removed
  • Modify - when a catalog has an existing entry modified

Modification is tricky since it is effectively detected by looking at the modification date of a file, which may not reflect an actual change. Note also that it may turn out that modifications will actually look like a deletion followed by an insertion.

Note that modification false positives are generally acceptable if not too frequent. This is because the cost to the client is that it retrieves a catalog entry that has in fact not been modified.

False negatives are much less acceptable. This will occur if an insertion, deletion, or modification occurs but no corresponding event is generated. If a client cannot trust the server in this regard, then the event system will not be used.

Note also that change events should also apply to the creation of new catalogs as well as the files within a catalog.

Topic Space Design

Topic space design is an important aspect of this proposal. That is, we need to set up a topic tree such that clients can specify what they want with a reasonable degree of specificity.

From LDM, we know that this is important in order to avoid clients just subscribing to everything. I would hope that using a wildcard system -- as opposed to e.g. regular expressions -- is sufficiently simple that clients will not be tempted to ask for everything. Realistically, this is probably a forlorn hope because it is likely that pollers want to know everything about a specific server.

My initial thought is that the root of our topic space is "Unidata.Thredds". From there, I would like to specify a particular server via a DNS name plus port. There is a problem since DNS names contain dots. It may be necessary to use an encoding, URL %xx escapes for example, to change the dots and colons in the DNS name and port. So one might say "Unidata.Thredds.motherlode%2eucar%2eedu%3a8080".

From there, the obvious choice is to encode the catalog path as the rest of the topic. So, for example:

Unidata.Thredds.thredds%2eucar%2eedu%3a8080.catalog.grib.NCEP.GEFS.Global%2e1p0deg%2eEnsemble.members-analysis.GEFS%2eGlobal%2e1p0deg%2eEnsemble%2eana%2e20170723%2e0000.grib2.*

The character escaping issue needs some thought.
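As a rough sketch of the kind of encoding involved (the escaping scheme here is hypothetical, not a decision), a helper along these lines could build a topic from a server address and catalog path:

    def make_topic(host, port, path):
        """Build a topic like Unidata.Thredds.<host:port>.<path segments>.

        Hypothetical encoding: '%', '.', and ':' inside a segment are
        %xx-escaped so that '.' remains the topic-level separator.
        """
        def escape(segment):
            return (segment.replace('%', '%25')
                           .replace('.', '%2e')
                           .replace(':', '%3a'))

        server = escape('%s:%d' % (host, port))
        segments = [escape(s) for s in path.strip('/').split('/')]
        return '.'.join(['Unidata', 'Thredds', server] + segments)

    # make_topic('motherlode.ucar.edu', 8080, 'catalog/grib/NCEP/GEFS')
    # -> 'Unidata.Thredds.motherlode%2eucar%2eedu%3a8080.catalog.grib.NCEP.GEFS'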

Implementation

Currently, Thredds does not immediately detect changes to its underlying file system. Rather, it dynamically rebuilds some catalogs when accessed. The dynamically generated catalog will, of course, reflect changes to that catalog since it was last accessed.

John Caron left nascent code (CatalogWatcher) in Thredds to actually operate at the time a change occurred (as opposed to when a catalog is retrieved). This is more or less the eager vs lazy issue.

So, in order to make this proposal work, I need to do (at least) the following:

  • Complete the existing code pieces such as CatalogWatcher
  • Add code to convert a file system reference to a catalog system URL
  • Add an event sender to Thredds

This appears to be a straightforward set of modifications. Of course Murphy's law says that I am forgetting something.

Relation to LDM

My current broker of choice is Apache ActiveMQ. But I cannot help but notice that LDM functionality is related to this proposal. So it is fair to ask if LDM could be adapted to serve as the broker. Here are some issues that would need to be considered. Note that my knowledge of the current capabilities of LDM may be out of date.

  1. Message Size: LDM ships files, not short messages. In this sense, LDM is overkill.
  2. Volume: my speculation is that the volume of small messages would not be all that large; it might even be similar to the number of distinct files shipped by LDM.
  3. Multiple Languages: whatever broker we use, it must be possible to write clients in a variety of programming languages: C, Python, Java at least.

While it would be nice to make use of other Unidata technology, I currently do not plan to pursue this path. As noted, my current target is an ActiveMQ broker with JMS publishers and subscribers.

Performance Costs

There is a performance cost in having the Thredds server generate events. But it is difficult to assess these costs without knowing how many events are being generated.

It should be possible to experimentally measure the number of events by adding counters to the CatalogWatcher code and running it for an extended period of time. Periodically, the counters can be dumped to the logs.

Ideally we would have R x 3 x 24 x 31 counters so we could compute the number of events per hour, day, week, and month per catalog root (assuming R roots). This is not a large amount of space. So at the end of every hour, day, and month, a log entry would be generated with a terse listing of the counters. The counters would then be reset.

The big problem is that these counters need to be collected on one of the Unidata production systems (aka motherlode) in order to get realistic numbers. It is not clear if this is possible.

Note that the measurements would be taken over all the catalog roots. It is possible that other parts of the file system such as the GRIB indices would also need to be included.

Security

The question is: how open should our broker be to arbitrary clients? The short answer is that it probably should have the same access controls as the associated Thredds server. Again, it is not clear what the requirements here should be.

Summary

The original motivation for this proposal was to try to mitigate clients hammering our production servers. The idea is that if they have access to change events, they do not need to poll our servers (or at least not quite so often).

Will it work? It depends on several factors:

  1. Is it easy for clients to use?
  2. Are they willing to change?
  3. Is it sufficiently reliable that clients will not feel they are losing events?

Appendix A. Miscellaneous Notes

Multiple Servers Per Broker

The above discussion assumes that the server-broker association is one to one. Other arrangements are possible, such as multiple servers sharing a single broker or a server using multiple brokers. The relative merits of these alternatives are unclear to me, but the possibility is worth noting.

Thredds Persistence

I could implement durability in the Thredds server by keeping a queue of changed directories and allowing clients to ask for everything since a given time. In effect, I would be subsuming the broker as part of Thredds. This is possible and maybe an acceptable approach.

MetPy Mondays #7- Contouring a Field on a Map


Last week we looked at how to create a simple base map with Cartopy. In this week’s MetPy Monday, we learn about contouring a field on the map and some of the idiosyncrasies of cyclic points. In the end, we will have a plot of the globe with the Coriolis parameter contoured. You can use this functionality to create height maps and more!

We’ll start off with importing the tools we will use: matplotlib, MetPy calculations, MetPy units, and numpy. We’re also using the magic %matplotlib inline so figures show up in the notebook instead of in separate windows.

Next, I create an array of latitudes from -90 to 90 degrees and then use MetPy’s calculation module to calculate the Coriolis parameter at each of these latitudes.
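A minimal sketch of that step (the array length and variable names are my own):

    import numpy as np
    import metpy.calc as mpcalc
    from metpy.units import units

    # Latitudes from -90 to 90 degrees and the Coriolis parameter at each one.
    lats = np.linspace(-90, 90, 181)
    f = mpcalc.coriolis_parameter(lats * units.degrees)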

To verify things to ourselves, I made a quick plot of the Coriolis parameter as a function of latitude. We see the non-linear behavior we expect, with the absolute value of the parameter increasing towards the poles.

Next, I create an array of longitudes from 0-359 degrees and broadcast the Coriolis parameter we calculated into that shape. The wrinkle comes from the fact that longitudes are cyclic. We roll from 359 degrees back to 0 and start going around the globe again. The contouring algorithm isn’t equipped to understand this by default. If you just contour what we have now, there is a break at 0 degrees longitude in every contour. We can use the cartopy utility add_cyclic_point to create a cyclic coordinate system that will contour correctly. We pass the data and coordinates to the function and get back data and coordinates with a cyclic element.
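Continuing the sketch above, with add_cyclic_point doing the wrap-around (the broadcasting details are one way to do it, not necessarily the original's):

    from cartopy.util import add_cyclic_point

    # Longitudes 0-359 degrees; broadcast the per-latitude Coriolis parameter
    # into a 2-D (nlat x nlon) field, dropping units for plotting.
    lons = np.arange(0, 360)
    f_2d = np.broadcast_to(f.magnitude[:, np.newaxis], (lats.size, lons.size))

    # Add a cyclic longitude point so every contour closes around the globe.
    f_cyclic, lons_cyclic = add_cyclic_point(f_2d, coord=lons)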

Now we’re ready to contour! We go about making the base map in the now familiar way. (If you need a refresher, check out MetPy Monday #6.) We use matplotlib’s contour method to calculate and draw the contours, drawing 20 contour lines in total. Don’t forget to specify the transform so everything plots on the map! I grab the contours as CS and set their label properties to be inline and a sensible single point after the decimal. By default, the contours of negative values are dashed. I didn’t like the look of that, so I set the contour.negative_linestyle parameter to solid.
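Roughly like this, continuing the sketch above (the projection and label format are my own choices; the original styled things its own way):

    import matplotlib.pyplot as plt
    import cartopy.crs as ccrs

    plt.rcParams['contour.negative_linestyle'] = 'solid'   # no dashed negatives

    fig = plt.figure(figsize=(10, 6))
    ax = plt.axes(projection=ccrs.PlateCarree())
    ax.coastlines()

    # Draw 20 contour lines; the transform tells cartopy the data are
    # in plain latitude/longitude coordinates.
    cs = ax.contour(lons_cyclic, lats, f_cyclic, 20,
                    transform=ccrs.PlateCarree())
    ax.clabel(cs, inline=True, fmt='%.1e')
    plt.show()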

The resulting map looks pretty good for only a few lines of code! It’s worth spending some time exploring the matplotlib documentation for contour and contourf (filled contours). There are a lot of customizations that can be done to make your map look however you wish.

My notebook is available here. Thanks for following along on another MetPy Monday!

Adding Compression to NetCDF-3 (Classic)


Initial Draft: 2017-08-14
Last Revised: 2017-08-14
Author: Dennis Heimbigner, Unidata


Introduction

This document proposes adding compression to the netcdf-3 (aka classic) file format. The proposal has numerous limitations because of the nature of the existing netcdf-3 format. It is also not clear if the effort involved would be worth it.

The biggest problems with compression are these.

  1. A variable whose data is to be compressed needs to be divided into some set of same-size blocks. This is to avoid having to uncompress the whole variable before accessing any part of it.
  2. Compressing each of the blocks of a variable produces a new set of blocks of varying sizes. Some forms of compression (e.g. scale+offset) do produce compressed blocks all of the same size, but this is rare. Most standard compression algorithms produce varying size blocks.

The algorithm and data format proposed here requires re-writing an existing netcdf-3 file to move it to the new format. In effect, the re-written file becomes archival (read-only). Some special cases would exist but I will defer their consideration.

Revised NetCDF-3 File Format

The key is to assume that a variable's data has been divided into equal size blocks (except possibly the last one). The compressor will be applied to each block to produce a new block to be written to a specific location in the output file.

The obvious problem with this is that because the compressed blocks are of varying size, it is not easy to provide direct access to the start of the i'th block.

The solution proposed here is to create an index table with file offsets pointing to the start of each compressed block. This index is a map from block# =>start of block. This index table is easily constructed in one of two ways.

  1. Create the index and store as part of the metadata for the variable in the file. This can then be read at one shot into memory (if not too large), or it can be accessed as needed if the index is itself too large to read into memory.
  2. Do an initial pass of the data in the variable. If we assume that each block is preceded by its size, then a series of probes into the variable's data will allow us to construct the index by random-access reads to a small number of points in the file.

Building such a structure is pretty straightforward. Each input block is read, compressed, and then written to the end of the output file. The file offset of that block is stored in the index table. This process, unfortunately, cannot be easily parallelized.
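A minimal Python sketch of that construction, assuming option 1 above (offsets kept in an in-memory table that would later be stored with the variable's metadata); the block size and the use of zlib are illustrative choices only:

    import zlib

    def compress_variable(infile, outfile, block_size=1 << 20):
        """Compress a variable's data block by block and return the index table.

        Reads fixed-size blocks from infile, appends each compressed block to
        outfile, and records the output offset of every block.
        """
        index = []                            # block number -> offset in outfile
        while True:
            block = infile.read(block_size)
            if not block:
                break
            index.append(outfile.tell())      # offset of this compressed block
            outfile.write(zlib.compress(block))
        return index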

Reading Compressed Data

Reading from a compressed variable requires the following steps.

  1. Compute the set of blocks that need to be accessed to cover the requested data.
  2. Locate and read each block using the index table.
  3. Uncompress the just read block.
  4. Extract the data of interest.

I speculate that using caching on blocks might provide some speed-up, but that requires testing.
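Continuing the sketch above, the read path might look roughly like this (again assuming zlib and an in-memory index; data_end is the file offset just past the last compressed block):

    import zlib

    def read_range(datafile, index, data_end, start, count, block_size=1 << 20):
        """Return count uncompressed bytes beginning at uncompressed offset start."""
        result = b''
        first = start // block_size
        last = (start + count - 1) // block_size
        for b in range(first, last + 1):
            begin = index[b]
            end = index[b + 1] if b + 1 < len(index) else data_end
            datafile.seek(begin)
            block = zlib.decompress(datafile.read(end - begin))
            # Keep only the part of this block that overlaps the request.
            lo = max(start - b * block_size, 0)
            hi = min(start + count - b * block_size, len(block))
            result += block[lo:hi]
        return result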

The Unlimited Dimension

The above algorithm is relatively simple for variables that do not have an UNLIMITED (aka record) first dimension. So, how do we handle variables with an initial UNLIMITED dimension? Internally, the netcdf-c library assumes that all variables that use the unlimited dimension are co-mingled. That is, the data for all variables for unlimited-dimension == 1 are written, then all the data for all variables for unlimited-dimension == 2 are written, and so on. Also note that this record data is assumed to be at the end of the file so that it can be extended when the unlimited dimension is increased. Note also that there can be holes in that data if various variables are not written when the unlimited dimension increases.

If we assume the output, compressed file is read-only, then we know the actual size of the UNLIMITED dimension. Thus we can treat it as a normal variable of some known size. This represents a significant change to the netcdf-3 file format vis-a-vis the unlimited dimension. This needs further thought.

Block versus Chunks

I deliberately used the term "block" instead of "chunk" in this discussion because the two are not the same. In order to understand "block", we need to look at a conceptual layout of a variable's data in linear order. Assuming the standard row-major order, we envision laying out the individual elements of the variable sequentially using that ordering. Given that layout, we then divide that sequential order into equal size blocks, except possibly the last (short) block. Blocking can be specified with a single number indicating the block size.

This differs significantly from netcdf-4/HDF5 "chunking" because a chunk is a subset of the variable along each dimension. Specifying the chunks for a rank R variable requires specifying R numbers: one for each dimension.

This difference comes from different goals. The netcdf-4 chunk is intended to address two issues: (1) speed up by access pattern and (2) division of the data for compression.

The netcdf-3 block, as described here, is purely to divide the data for compression purposes. The access pattern is still the same as with an ordinary netcdf-3 file.

Evaluation

The clear question is: is all of this worth it? The answer depends on a number of factors.

  1. Read-only: Is the read-only/archival assumption acceptable? It assumes that there are a large number of datasets that are written once and then never again. Note that infrequent modifications are possible by copying the archived file to a standard format netcdf-3 file, modifying it, and then re-compressing it.

  2. Implementation: How hard is it to implement? The answer is that as described above and assuming read-only is acceptable, the implementation is not all that difficult.

  3. Compression Only: Is it acceptable to provide compression but not improved support for access patterns? As noted, this is the big difference from netcdf-4 chunking, which provides both.

Extensions

Once one thinks of adding index tables to netcdf-3, one can extend the idea more generally. Note that if we treated a netcdf-3 file as a big malloc style heap, many desirable capabilities become possible for netcdf-3 that are currently only available for netcdf-4. See especially reference #2.

References

  1. http://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf_compression
  2. http://www.unidata.ucar.edu/blogs/developer/en/entry/high-performance-netcdf-4-proposal

MetPy Mondays #8 - Interactive Dewpoint Calculator


I recently installed a weather station in my back yard. Every day I look at the display and see the temperature, wind, rain, and humidity, but the dewpoint is not displayed by default! As it turns out, dewpoint is a tricky thing to directly measure. The only way to directly measure it is with a fogged mirror sensor. Otherwise a hygrometer or psychrometer can be used to measure humidity or wet-bulb temperature, and then the dewpoint can be calculated. MetPy has the calculation functions to do both of these conversions. In this week’s MetPy Monday I’ll show you how to use the Jupyter Notebook’s interactive widgets to make a dewpoint calculator with slider widgets. This is a great way to get students to interact with formulas and get an intuitive sense of how they work!
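As a rough sketch of the idea (not the notebook from the screencast; the MetPy function name follows recent releases), ipywidgets' interact can drive a MetPy dewpoint calculation from sliders:

    from ipywidgets import interact
    import metpy.calc as mpcalc
    from metpy.units import units

    def show_dewpoint(temperature_c=25.0, relative_humidity=50.0):
        """Print the dewpoint for the slider-chosen temperature and humidity."""
        td = mpcalc.dewpoint_from_relative_humidity(
            temperature_c * units.degC, relative_humidity / 100.)
        print('Dewpoint:', td)

    # Sliders for temperature (degrees C) and relative humidity (percent).
    interact(show_dewpoint, temperature_c=(-40.0, 45.0), relative_humidity=(1.0, 100.0))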

MetPy Monday #9 - 2017 Total Solar Eclipse

Last Monday was a big day for folks in the geoscience and astrosciences — the 2017 total solar eclipse! Many of those on the Unidata team made the drive to be in the path of totality, where the sun was completely blocked for a period of up to two and a half minutes. In this MetPy Monday post, we will take a look at some animations made in Python and posted by the team just after the eclipse.

The eclipse ran roughly from 15-21 UTC, beginning in the northern Pacific Ocean and ending in the low latitudes of the Atlantic Ocean. At 21:30 UTC, a set of automated scripts began to run, pulling down all sixteen channels of GOES-16 imagery for the CONUS, as well as all ASOS observations from the IEM archive. The scripts then produced animations of these data that were tuned and posted on the Unidata YouTube channel that evening. All of the code behind gathering the data and making the animations is enough to fill a few months' worth of MetPy Monday posts, so I want to give you a high-level overview of how we accomplished this. The code is publicly available in a GitHub repository for you to download, adapt, and play with.

Let’s start by looking at the simplest of the animations: the animation of the path of the eclipse. We used the NASA-produced eclipse shapefiles in conjunction with Cartopy to make an animation of the umbra path with time stamps. We first make the basic map as we’ve discussed in past MetPy Mondays. We then use the shapereader module from cartopy.io to read in the shapefiles for the umbra path, center path, and individual umbra footprints. We also plot the static features like the state outlines and the umbra path/center.

Next we run through a loop that iterates over the umbra footprints. The NASA shapefile has an umbra footprint for every second of the event, but we plot only one every fifteen seconds to speed up the animation. We catch the artist that matplotlib returns from each plotting command and, together with a text time stamp, append them to a list of artists that we animate using matplotlib’s ArtistAnimation class.
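The loop boils down to something like the rough sketch below. The shapefile name, figure setup, and frame label are placeholders rather than the exact code used for the post (that lives in the linked GitHub repository), and saving the movie assumes FFmpeg is available to matplotlib.

```python
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import matplotlib.pyplot as plt
from cartopy.io import shapereader
from matplotlib.animation import ArtistAnimation

fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(1, 1, 1, projection=ccrs.LambertConformal())
ax.add_feature(cfeature.STATES, edgecolor='gray')           # static background

# Placeholder path to the NASA one-second umbra footprint shapefile
umbra_polys = list(shapereader.Reader('umbra17_1s.shp').geometries())

artists = []
for i, poly in enumerate(umbra_polys[::15]):                # every 15th footprint
    patch = ax.add_geometries([poly], crs=ccrs.PlateCarree(),
                              facecolor='black', alpha=0.5)
    label = ax.text(0.02, 0.02, f'frame {i}', transform=ax.transAxes)
    artists.append([patch, label])

anim = ArtistAnimation(fig, artists, interval=50, blit=False)
anim.save('eclipse_path.mp4')
```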

To make the satellite animations, we used the Siphon package to download data from the GOES-16 Advanced Baseline Imager (ABI). In this case we archived all data locally since we wanted to preserve the data from the event, but in a more real-time setting the data could be opened without the need to save it to disk. There are a few more details on plotting satellite imagery in our Python workshop satellite notebooks, but in the end it comes down to using an imshow command and using a list of artists, as before, to produce the animation. We overlaid the umbra path and center line to help guide the eye during the animation.
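For a single frame, the Siphon-plus-imshow workflow looks roughly like this. The catalog URL and the variable name are assumptions about the test server's layout at the time, so treat them as placeholders.

```python
import matplotlib.pyplot as plt
from siphon.catalog import TDSCatalog

# Placeholder catalog URL for one ABI channel on the THREDDS test server
cat = TDSCatalog('http://thredds-test.unidata.ucar.edu/thredds/catalog/'
                 'satellite/goes16/GOES16/CONUS/Channel02/current/catalog.xml')
ds = cat.datasets[0].remote_access()          # open the newest dataset remotely

data = ds.variables['Sectorized_CMI'][:]      # variable name is an assumption
fig, ax = plt.subplots(figsize=(10, 8))
ax.imshow(data, origin='upper', cmap='Greys_r')
plt.show()
```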

Personally, I found the temperature change during the eclipse to be the most interesting animation. Earlier this summer I looked at data from an eclipse in 1994 and found that, even in hourly observations, the temperature drop was very clear and propagated along the path of totality very nicely. I was excited about this eclipse since it cut across the entirety of the CONUS and intersected a lot of stations. My home weather station even observed a roughly 10 C temperature drop during the 95% coverage it experienced.

To get the ASOS observations, we used a modified version of the fetch script from the IEM archive GitHub repository. Each station required an individual HTTP GET request, so the data gathering ran for quite a while. The data were then put into a single file that we parse during the animation process.

Since stations could report at slightly different times, we wrote a small helper that groups stations with data reported in some interval, in this case 10 minutes on either side of the plot time. This does of course introduce some smearing across time, but gives us a more robust plot with as much relevant data as possible plotted at once. We used Pandas to do the data munging as it was the fastest to develop with and had a lot of useful functionality already built in. The same animation technique was used, but this time with a colored scatter plot.
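The grouping helper amounts to a time-window filter like the sketch below. It assumes a Pandas DataFrame with a datetime64 'time' column; the column name and window default are illustrative, not the actual archive schema.

```python
import pandas as pd

def obs_near(df, plot_time, window_minutes=10):
    """Return rows reported within +/- window_minutes of plot_time."""
    window = pd.Timedelta(minutes=window_minutes)
    mask = (df['time'] >= plot_time - window) & (df['time'] <= plot_time + window)
    return df[mask]

# Usage: pick the observations to scatter-plot for one animation frame.
# frame_obs = obs_near(obs, pd.Timestamp('2017-08-21 18:00'))
```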

Finally, notice the MetPy logo in the corner of the animations? That’s a sneak peek at MetPy 0.6, which is slated for release next month. There are a lot of exciting new features in 0.6. One of the new capabilities is the ability to easily add the MetPy or Unidata logos to your plots!

What interesting data visualizations did you see after the eclipse? We've seen everything from traffic maps to Twitter traffic to Google search trends. On next week’s MetPy Monday we will dig into using widgets to explore live GOES-16 data available on Unidata’s THREDDS Data Server!

MetPy Monday #10 - Harvey and Irma

[Images: Seismic noise from Hurricane Irma; Hurricane Irma]

Wow! We’ve had a very active couple of weeks in the Atlantic, and I wanted to break from the planned series of MetPy Monday posts with a bit of timely data analysis and some interesting animations. The new (and still experimental/non-operational) GOES-16 satellite has provided us with some incredible views of hurricanes Harvey and Irma, and likely will for Jose as well.

The first animation I wanted to share is an RGB composite. This one is a bit tricky to put together for multiple reasons. First off, the GOES-16 imager (the ABI) has 16 channels, but none of them is a true “green” channel! Thanks to advice from community member Pete Pokrandt, I was able to make a “fake” green channel by combining the red, blue, and “veggie” bands. The second challenge was that the red channel is sampled at a different resolution than everything else. You can find the script I used here and see exactly how the resampling and linear combination of channels was done.
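In condensed form, the idea looks something like the sketch below: block-average the 0.5 km red band down to the 1 km grid of the blue and veggie bands, then take a weighted combination. The weights shown are a commonly used approximation, not necessarily the exact ones in the linked script.

```python
import numpy as np

def synthesize_green(red, blue, veggie):
    # Red (band 2) is sampled at 0.5 km; average 2x2 blocks down to the
    # 1 km grid of the blue (band 1) and veggie (band 3) channels.
    red_1km = red.reshape(red.shape[0] // 2, 2,
                          red.shape[1] // 2, 2).mean(axis=(1, 3))
    green = 0.45 * red_1km + 0.45 * blue + 0.10 * veggie
    return red_1km, green

# red_1km, green = synthesize_green(red, blue, veggie)
# rgb = np.dstack([red_1km, green, blue])   # ready for imshow
```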

Next, I wanted to combine a seismic dataset with the GOES-16 imagery. Seismologists have long been able to see seismic noise from storms and even back-project the source location. Dr. Charles Ammon at Penn State has been making graphics showing this for decades. I wanted to adapt that idea into a movie with imagery and a single seismic station. Using the ObsPy library, I was able to download the seismic data from a seismometer in San Juan, Puerto Rico. I downsampled everything so that the movie was a reasonable length, filtered low-frequency drift and noise out of the seismometer data, and created an animation!
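Fetching and conditioning the trace with ObsPy looks roughly like the sketch below. The network/station/channel codes and the filter and decimation settings are illustrative guesses, not the exact parameters used for the Irma animation.

```python
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

client = Client('IRIS')
start = UTCDateTime('2017-09-06T00:00:00')
st = client.get_waveforms(network='IU', station='SJG', location='00',
                          channel='LHZ', starttime=start, endtime=start + 86400)

st.detrend('linear')                      # remove long-period drift
st.filter('highpass', freq=0.05)          # strip very low-frequency noise
st.decimate(factor=4)                     # downsample for a shorter movie
trace = st[0]                             # the conditioned trace to animate
```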

Again, you can see the full script here. The biggest challenge came from the fact that, due to the experimental nature of the data, the projection metadata was still being modified. Hence, there is a block that checks the projection name and adjusts the map projection accordingly. Hopefully we will be able to make that step much easier in the future. You’ll also notice that the frames of the animation are written out as images; I then assembled them into the movie with FFMPEG afterwards. This is a slightly different technique from the matplotlib animation module used for the RGB animation.
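The frames-then-FFMPEG workflow boils down to something like this; the frame source, file names, and encoding options are all illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

frames = [np.random.rand(100, 100) for _ in range(5)]   # placeholder frame data

for i, frame in enumerate(frames):
    fig, ax = plt.subplots(figsize=(12, 8))
    ax.imshow(frame, cmap='Greys_r')
    fig.savefig(f'frame_{i:04d}.png', dpi=100)
    plt.close(fig)

# Then stitch the PNGs together from the shell, for example:
#   ffmpeg -framerate 15 -i frame_%04d.png -c:v libx264 -pix_fmt yuv420p irma.mp4
```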

While I don’t want to dive into the details of any specific part of these scripts, I did want to share them and the resulting animations with you. Working with MetPy and Siphon to make these animations and products for events and MetPy Monday posts helps us find the rough edges and things that we’d like to smooth out for our users. Next on our list is making it easier to plot hurricane paths, recon flight data, and buoy data. To help with this, we’ll be archiving data from Harvey and Irma as case studies and working to make more interesting visualizations to share with you.


MetPy Mondays #11 - Plotting GOES-16 Data with Widgets

Everyone has been really excited about exploring the incredibly high-resolution GOES-16 imagery that is now available in an experimental capacity. We host some of this data on our THREDDS test server, and it can be ingested with Siphon and plotted in Python! In this week’s MetPy Monday we’ll go over how to use interactive widgets to select the region and channel to plot and produce images from the data.

In a past MetPy Monday we used the notebook’s interactive widgets to create a simple calculator utility. Everything I’m doing is available in a notebook from our Unidata Python Workshop and can be downloaded here. There’s a lot going on in it, but I want to focus on the interactivity this week.

To start out with, we need to import a slew of tools. We’re going to use datetime for generating datetime objects to request data and make nice timestamps, cartopy for mapping, the interactive widgets for UI, matplotlib for the image rendering, netCDF4 to read the data, and siphon to request data.

It’s nice to break the problem up into smaller chunks that become functions. Our first function takes the datetime object, channel, dataset index, and region. It then generates a request to the THREDDS data server and returns a dataset.

The next function is what we’ll actually call. It calls open_dataset to request the data, pulls out the important variables, and creates the map. Here you can see we use the projection variable from the satellite data to specify the globe that cartopy will use for its mapping. We also use that projection information to set up a Lambert Conformal projection. Finally, we get state boundaries much like we did in the mapping with cartopy tutorial.

Next, the function sets up a figure and axis object to plot on and removes any old images plotted there. It also adds the state boundaries and country borders to the map. We use matplotlib’s imshow function to plot the data and let cartopy take care of all the projection issues. We clip the map to the extents of the data by finding the maximum and minimum values of the x and y coordinates.

Finally, we dress up the plot a bit by adding annotations of the image time, channel number, and a note that these data are still experimental. We use the string parse function (strptime) to create a datetime object from the dataset’s timestamp and then the string format function (strftime) to write it back out as a string in a more readable format. Matplotlib’s text function takes care of plotting the text, and we apply some path effects to outline the white text in black, which helps keep it readable on different backgrounds. Then we show the figure.

With all the heavy lifting done, all that’s left is creating the interactive parts of the notebook. First, we create a dictionary with the name of the channels on the satellite as the key and the channel number as the value.

We then populate a channel dropdown selector with that dictionary and create a region selector. The GOES satellite transmits images for full disk, CONUS, and two mesoscale regions at different time resolutions. With the widgets made, we just need to hook them up to the plotting function!

We get the current time, then wrap the plotGOES16channel function with interact, just like in the dew point example.
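Stripped of the plotting details, the wiring looks something like the sketch below. The channel mapping is abbreviated, and plot_goes16_channel is a placeholder standing in for the notebook's plotting function (plotGOES16channel); its real body does the data request, mapping, and annotation described above.

```python
from datetime import datetime
from ipywidgets import Dropdown, fixed, interact

channels = {'Red Visible (0.64 um)': 2,
            'Clean IR Longwave Window (10.3 um)': 13}     # abbreviated mapping

channel_widget = Dropdown(options=list(channels.items()), description='Channel:')
region_widget = Dropdown(options=['Mesoscale-1', 'Mesoscale-2', 'CONUS', 'FullDisk'],
                         description='Region:')

def plot_goes16_channel(date, idx, channel, region):
    # Placeholder: the real function requests the dataset from THREDDS, builds
    # the cartopy map, calls imshow, and annotates the timestamp and channel.
    print(f'Would plot channel {channel} over {region} for {date} (index {idx})')

# Hook the widgets to the function; the date and dataset index are held fixed.
interact(plot_goes16_channel, date=fixed(datetime.utcnow()), idx=fixed(-1),
         channel=channel_widget, region=region_widget)
```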

The output we get is a map with two drop-down menus allowing us to select and view our data in real time! With a relatively small amount of code, we were able to create a tool that could be used in the classroom or in weather discussions. Try using widgets in your classroom and let us know what you’re doing with notebooks and Unidata technologies!

MetPy Mondays #12 - Citing MetPy

When writing research papers, we all cite the papers we read and the datasets we use, but do you cite the software you use? Just like papers, software can be cited if it was used in your research. Citing MetPy has never been easier! We have a Digital Object Identifier (DOI), a prebuilt BibTeX entry, and an AMS citation. Using the Crosscite tool you can also generate a citation in just about any format imaginable! You can also tell the software engineers how you use MetPy with our Say Thanks page.

Contributor License Agreement for Unidata Projects

Unidata hosts a variety of Open Source software projects on GitHub. We use the Open Source model because we believe strongly that broad participation in all aspects of Unidata's work is essential to achieving the Unidata community's goals. Developing software that focuses on community needs is one of our main objectives, and participation by community members in all aspects of the development process — from coding to testing, documenting, and commenting — is incredibly valuable.

As community participation in Unidata's Open Source efforts grows, we are facing increasingly complex situations surrounding contributions made to Unidata-hosted projects. As a result, we have decided to begin requiring that community members who wish to contribute code to Unidata projects on GitHub agree to the Unidata Contributor License Agreement (CLA). This agreement is based on a template from the Harmony Agreements project, whose goal is to standardize CLAs within the Open Source community.

What is a Contributor License Agreement?

A Contributor License Agreement (CLA) defines the terms under which intellectual property is contributed to a project. In Unidata's case, the CLA covers software code contributed to Unidata-hosted projects distributed under Open Source licenses. While the Unidata CLA is a legal document, and you should read it carefully before agreeing to it, the two main ideas are:

  • You retain the ownership of the copyright to your contribution
  • You grant Unidata the right to use your contribution in perpetuity

We feel it is important to use a CLA on Unidata-hosted projects for several reasons. Most important among them is the idea that once a Unidata software product has been released to users, the users can be confident that they have the right to continue using the software. Adding some formality to the software contribution process helps the Unidata community make well-informed choices about the benefits of using Open Source software.

The CLA process Unidata is adopting is becoming more and more common in the Open Source community. We don't think that asking contributors to sign the agreement will present a significant barrier to participation, and we believe that adding this layer of formality will benefit the Unidata community as a whole.

How it Works

When contributing via a Pull Request on GitHub, CLA Assistant will present the following message as a comment on the pull request:

[Image: pull request comment showing the yellow "CLA not signed yet" badge]

Contributors can click on the yellow "CLA not signed yet" badge, which will take them to a copy of the CLA. Contributors are asked to provide a little bit of information about themselves (for legal purposes):

[Image: the CLA signing page]

Once the "I Agree" button is clicked, the browser will return to the original pull request page, but now the comment has been updated:

[Image: pull request comment showing the signed CLA status]

Contributors will only be asked to electronically sign once (unless the CLA is updated), and the agreement applies to all GitHub repositories hosted under the Unidata organization.

For more information about CLAs, see these resources:

About the Contributor License Agreement

Unidata's CLA comes from Project Harmony, which is a community-centered group focused on contributor agreements for free and open source software.

The document you are reading now is not a legal analysis of the CLA. If you want one of those, please talk to your lawyer. This is a description of the purpose of the CLA.

Why is a signed CLA required?

The license agreement is a legal document in which you state you are entitled to contribute the code/documentation to Unidata and are willing to have it used in distributions and derivative works. This means that should there be any kind of legal issue in the future as to the origins and ownership of any particular piece of code, we have the necessary forms on file from the contributor(s) saying they were permitted to make this contribution.

The CLA also ensures that once a contribution has been made, contributors cannot try to withdraw permission for its use at a later date. People and companies can therefore use Unidata open source projects, confident that they will not be asked to stop using pieces of the code at a later date.

Lastly, the CLA gives Unidata permission to change the license under which the project, including the various contributions from many developers, is distributed in the future. The CLA states that this license must be one that has been approved by the Open Source Initiative, which includes both copyleft and permissive licenses. This gives Unidata the freedom to adjust licenses in the future if needed (for example, if some clause of the current license is found to be invalid, or to switch to a more standard license), so long as the license remains open source.

Am I giving away the copyright to my contributions?

No. This is a pure license agreement, not a copyright assignment. You still maintain the full copyright for your contributions. You are only providing a license to Unidata to distribute your code without further restrictions. This is not the case for all CLAs, but it is the case for the one we are using.

Can I withdraw permission to use my contributions at a later date?

No. This is one of the reasons we require a CLA. No individual contributor can hold such a threat over the entire community of users. Once you make a contribution, you are saying we can use that piece of code forever.

Can I submit patches without having signed the CLA?

No. We will be asking all new contributors and patch submitters to sign before they submit anything.

This CLA explanation is based on the Django Contributor License Agreement Frequently Asked Questions (copyright Django Software Foundation, CC-BY). The content has been modified slightly to reflect situations specific to Unidata.

MetPy Mondays #13 - Temperature Units

One of the most common support questions we get about MetPy is why temperature calculations fail. As it turns out, temperature units are a bit strange: unlike most units, they involve both a scaling factor and an offset from absolute zero. Learn how to properly handle temperature in your calculations with this week's MetPy Monday!
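Here is a short illustration of why offset units need care, using MetPy's pint-based unit registry; the numbers are arbitrary.

```python
from metpy.units import units

t = units.Quantity(25.0, 'degC')

# Converting between absolute temperature scales applies the offset:
print(t.to('kelvin'))        # 298.15 kelvin

# A temperature *difference* is a different animal: 5 delta_degC, not 5 degC.
dt = units.Quantity(5.0, 'delta_degC')
print((t + dt).to('degC'))   # 30.0 degree_Celsius
```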

MetPy Mondays #14 - Adding a logo and timestamp

This week, we'll talk about some handy functions in MetPy that let you decorate your plots with the MetPy or Unidata logos. We'll also add time of creation timestamps to help you keep track of what's what in that pile of PNG files in your manuscript folder!
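A bare-bones sketch of those decorations is below; it assumes the add_metpy_logo and add_timestamp helpers in metpy.plots and sticks to their defaults to avoid guessing at keyword options.

```python
import matplotlib.pyplot as plt
from metpy.plots import add_metpy_logo, add_timestamp

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot([0, 1, 2], [0, 1, 4])            # stand-in for a real plot

add_metpy_logo(fig)                      # stamp the MetPy logo onto the figure
add_timestamp(ax)                        # annotate the axes with the creation time
plt.show()
```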
