(OK, that is hyperbole, but...)
Initial Draft: 2017-5-28
Last Revised: 2017-6-5
Author: Dennis Heimbigner, Unidata
For a number of years, the Unidata Thredds group has been in the
process of "implementing" server-side computation
Real-Soon-Now (as the saying goes).
Server-side computing embodies the idea that it is most
efficient to physically co-locate a computation with the
datasets on which it is operating. As a rule, this meant having
a server execute the computation because the dataset was
controlled by that server. Server-side computing servers for
the atmospheric community have existed in various forms for a
while now: GrADS, DAP2 servers, and ADDE, for example.
One -- and perhaps The -- major stumbling block to server-side
computing is defining and implementing the programming language
in which the computation is coded. In practice, server-side
systems have developed their own language for this purpose.
This is a problem primarily because it is very difficult to
define and implement a programming language. Often the
"language" started out as some form of constraint expression
(e.g. DAP2, DAP4, and ADDE). Over time, it would accrete other
capabilities: conditionals, loops, etc. In time, it grew into a
badly designed but more complete programming language. Since it
was rarely implemented by language/compiler experts, it was
usually quirky and presented a significant learning curve for users.
The advantage of using such a home-grown language was that it
could be tailored to the dataset models supported by the
server. It also allowed for detailed control of programs. This
made certain other issues easier: access controls and resource
controls, for example.
The author recognized the language problem early on and was
reluctant to go down that path. Since the author was also the
primary "pusher" for server-side computing at Unidata, this
reluctance delayed implementation for an extended period.
Fortunately, about three years ago, project Jupyter [1]
was created as an offshoot of the IPython Notebook
system. It provided a multi-user, multi-language compute engine
in which small programs could be executed. With the advent
of Jupyter, IPython itself was refactored to use Jupyter as
its computation engine.
From the point of view of Unidata, Jupyter provides a powerful
alternative to traditional server-side computing. It supports
multiple, "real" programming languages. It is a server itself, so
it can be co-located with an existing Thredds server. And, most
importantly, it is designed to execute small programs written in any
of its supported languages.
In the rest of this document, the term "program" will, as a
rule, refer to programs executing within a Jupyter server.
In order to avoid the roll-your-own language problem, it was decided
to adopt wholesale an existing modern programming language. This meant
that the language was likely to be complete right from the start. Further,
the learning curve would be reduced because a significant amount
of supporting documentation and tutorials would be available.
We have chosen Python as our preferred language. We made this
choice for several reasons.
- Python is rapidly being adopted by the atmospheric sciences
community as its language of choice.
- There is a very active community developing packages for
scientific computing in general and for the atmospheric
sciences in particular.
Examples are numerous, including numpy, scipy, metpy, and siphon.
- It is one of the languages supported by Jupyter.
To the extent that Jupyter supports other languages, it would be
possible to write programs in those languages. However, I would not expect
Unidata to expend any significant resources on those other languages.
The one possible exception is if/when Jupyter supports Java.
![Figure 1: Notional Architecture]()
The notional architecture we now espouse is shown in Figure 1.
Basically, a standard Thredds server runs alongside a
Jupyter server. A program executing in the Jupyter server has
access to the data on the Thredds server either using the file
system or using some streaming protocol (e.g. DAP2). File access
is predicated on the assumption that the two servers are
co-located and share a common file system.
The Thredds server currently requires and uses some form of
servlet engine (e.g. Tomcat). We exploit that to provide a
front-end servlet to act as intermediary between a user and the
Jupyter server (see below).
So now, instead of sending a program to the Thredds server, it
is sent to the Jupyter server for execution. That executing
program is given access to the Thredds server using a variety of
packages (e.g. Siphon [2]). Once its computation is completed,
its resulting products can be published within a catalog on
Thredds to make them accessible to user programs.
Once in the catalog, that product can be accessed by external
clients using existing streaming protocol services. In some cases,
it may also be possible to access that product
using a shared file system.
This discussion assumes the existence of a single Jupyter server,
but it will often be desirable to allow multiple such servers.
Examples of the utility of multiple servers will be discussed
in subsequent sections.
Access to the Jupyter server will be supported using several
mechanisms. Each mechanism has a specific use case.
IPython Access
Though not shown in Figure 1, it is assumed that existing
IPython access to Jupyter is available. This path is, of course,
well documented elsewhere in the IPython+Jupyter literature.
Web-based Access
Another use-case is to provide access for scientists with
limited programming skills or for other users requiring
simple and occasional computations.
The servlet box in Figure 1 illustrates this.
For this case, client web browsers would carry out forms-based
computations via the front-end servlet running under Apache
Tomcat (or some other servlet engine).
Programmatic Access
Scientists will still write standalone programs that need to
process computed data. Others will write value-added wrapper
programs to provide, for example, additional capabilities such
as plotting or other graphical presentation.
These use cases will require the ability to upload and execute
programs from client-side programs. The simplest approach here
is to build on the web-based version. That is, the client-side
program would also access the servlet, but using a modified
and streamlined interface.
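
For concreteness, the following is a hedged sketch of what such a
programmatic submission might look like. The endpoint URL and form
fields are purely illustrative assumptions, since the front-end
servlet does not yet exist.

```python
# Hypothetical sketch only: the "/compute/submit" endpoint and its
# form fields are assumptions for illustration.
import requests

# Read the program to be executed on the Jupyter server.
with open("compute_product.py") as f:
    program = f.read()

# Submit the program to the (hypothetical) front-end servlet.
resp = requests.post("http://localhost:8080/compute/submit",
                     data={"language": "python", "program": program})
resp.raise_for_status()
print(resp.text)  # e.g. an identifier or status for the submitted program
```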
Some computations will take a significant amount of time to
complete. Submitting such a computation through the Thredds
server interface is undesirable because it requires
either blocking of the client for long periods of time or
complicating the Thredds server to make it support
asynchronous execution. The latter usually involves
returning some kind of token (aka future) to the client
that it can interrogate to see if the computation is
complete. Alternatively, the server could provide some form of
server-to-client event notification mechanism. In either case,
such mechanisms are complicated to implement.
Direct client-to-Jupyter communication (see the previous section)
can provide a simple and effective alternative to implementing
asynchronous operation in the Thredds server. Specifically, the client
uploads the program via IPython or via a web browser to the
Jupyter server. As part of its operation the program
uploads its final product(s) to some catalog in the Thredds server.
The client is then responsible for detecting that the product
has been uploaded, which then enables further processing of that
product as needed.
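
The client side of this pattern can be quite simple. The following
sketch polls a Thredds catalog until the product appears; the catalog
URL and product name are illustrative assumptions.

```python
# Illustrative client-side polling loop for the asynchronous pattern
# described above. The catalog URL and product name are assumptions.
import time
from siphon.catalog import TDSCatalog

CATALOG_URL = "http://localhost:8080/thredds/catalog/results/catalog.xml"

def wait_for_product(name, interval=30, timeout=3600):
    """Poll the results catalog until the named product appears."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        catalog = TDSCatalog(CATALOG_URL)   # re-read the catalog
        if name in catalog.datasets:
            return catalog.datasets[name]   # the published product
        time.sleep(interval)
    raise RuntimeError("product %r never appeared" % name)
```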
Given the approach advocated in this document,
on what should Unidata focus to support it?
Accessing Thredds Data
First and foremost, we want to make it easy, efficient, and fast
for programs to access the data within a co-located Thredds server.
Thredds currently provides a significant number of "services" [3]
through which metadata and data can be extracted from a Thredds server.
These include at least the following: DAP2 (OpenDAP), DAP4, HTTPServer,
WCS, WMS, NetcdfSubset, CdmRemote, CdmrFeature, ISO, NCML, and UDDC.
The cost of accessing data via some commonly supported
protocols, such as DAP2 or CdmRemote, is relatively independent of
co-location, so using such protocols is probably not the most
efficient method.
File Download
The most efficient inter-server communication is via a shared file
system accessible both to the Thredds server and the
Jupyter server.
As of Thredds 5 it is possible to materialize both datasets and
(some kinds of) streams as files: typically netcdf-3 (classic)
or netcdf-4 (enhanced). One defines a directory into which
downloads are stored. A special kind of request is made to a
Thredds server that causes the result of the query to be
materialized in the specified directory. The name of the
materialized file is then returned to the client.
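
As a rough illustration, such a request might look like the sketch
below. The endpoint name and parameters are assumptions for
illustration only; the actual request form should be taken from the
TDS 5 documentation.

```python
# Hypothetical sketch: ask the Thredds server to materialize the
# result of a query as a file in the configured download directory.
# The "/thredds/download" endpoint and its parameters are assumptions.
import requests

resp = requests.get("http://localhost:8080/thredds/download/grib/GFS/Best",
                    params={"var": "Temperature_isobaric",
                            "format": "netcdf4"})
resp.raise_for_status()
path = resp.text.strip()  # name of the materialized file
# A co-located Jupyter program can now open `path` directly from the
# shared file system.
```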
Siphon
The Siphon project [2,4] is designed to wrap access to a Thredds
server using a variety of Thredds services. As such, it will
feature prominently in our system. Currently, siphon supports
the reading of catalogs, and data access using the Thredds
netcdf subset service (NCSS), CdmRemote, and Radar Data.
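
For illustration, a minimal NCSS request through Siphon might look
like the following; the server URL and variable name are placeholders.

```python
# Sketch of data access through Siphon's netCDF subset service (NCSS)
# interface. The server URL and variable name are illustrative.
from datetime import datetime
from siphon.ncss import NCSS

ncss = NCSS("http://localhost:8080/thredds/ncss/grib/GFS/Best")
query = ncss.query()
query.lonlat_box(north=50, south=30, east=-80, west=-110)
query.time(datetime.utcnow())
query.variables("Temperature_isobaric").accept("netcdf")
data = ncss.get_data(query)   # parsed result of the subset request
```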
Operators
The raison d'être of server-side computation is to input
datasets, apply operators to them, and produce new product
datasets. In order to simplify this process, it is desirable
to make available many high-level operators so that
a computation can be completed by the composition of operators.
Often, server-side computation is illustrated using simple
operations such as sum and average. But these kinds of operators
are likely to only have marginal utility; they may be useful,
but will not be the operators doing the heavy lifting of
server-side computation.
Accumulating useful operators is possibly another place where
Unidata can provide added value. Unidata can both provide a
common point of access and provide some form of vetting
for these operators.
One example is Pynco [5]. This is a Python wrapping of
the netCDF Operators (NCO) [6]. The NCO operators are command-line
tools, so Pynco wraps them to allow programmatic invocation of the
various operators.
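
For example, computing a record average with Pynco might look like
the following sketch; the file names are illustrative.

```python
# Sketch of invoking an NCO operator through Pynco.
from nco import Nco

nco = Nco()
# ncra: record-average one or more input files into an output file.
nco.ncra(input="gfs_run_*.nc", output="gfs_mean.nc")
```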
As part of the operator support, Unidata might wish
to create a repository (using conda channels or Github?) to
which others can contribute.
Publication (File Upload)
When a program is executed within Jupyter,
it will produce results that need to be communicated to others --
especially the client originating the computation.
The obvious way to do this is to use the existing Thredds
publication facilities, namely catalogs.
As of Thredds 5, it is possible to add a directory to some top-level
catalog. Uploading a file into that directory causes it to appear
in the specified catalog. Uploading can be accomplished either
by file system operations or via a browser forms page.
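
Under the shared file system assumption, a program can publish
simply by copying its product into the catalog-backed directory.
The path below is an assumption for illustration.

```python
# Minimal sketch of publication by file system operation, assuming
# the directory below is backed by a Thredds catalog and is visible
# to both servers. The path is illustrative.
import shutil

shutil.copy("product.nc", "/shared/thredds/published/product.nc")
# The product then appears as a dataset in the corresponding catalog.
```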
Another way to add value is to make libraries available
that support specialized kinds of computations.
GPU Support
The power of Graphics Processing Units (GPUs) has significantly
increased over the last few years. Libraries now exist for
performing computations on GPUs. To date, using a GPU on
atmospheric data is uncommon. It should be possible to improve
the situation by making operators available that use a GPU
underneath to carry out the computation.
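
As a hedged sketch of what such an operator might look like, the
following uses CuPy, one of several GPU array libraries; the function
itself is a toy stand-in for a more substantial computation.

```python
# Illustrative GPU-backed operator using CuPy.
import cupy as cp

def gpu_mean(field):
    """Average a numpy array on the GPU and return a host-side float."""
    device_field = cp.asarray(field)      # copy host -> GPU
    return float(cp.mean(device_field))   # reduce on GPU, copy back
```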
Machine Learning Support
Artificial Intelligence, at least in the form of machine learning,
is another example of a specialized capability. Again,
use of AI to process atmospheric data is currently not common.
It should be possible to build quite sophisticated subsystems
supporting the construction of AI systems for doing predictions
and analyses on such data.
There is a clear danger in providing a Jupyter server
open to anyone to use. Such a server is a potentially
exploitable security hole if it allows the execution of arbitrary
code. Further, there are resource issues when anyone is allowed
to execute a program on the server.
Much of the support for access controls will depend on the
evolving capabilities implemented by the Jupyter project. But
we can identify a number of access controls that will be needed
to protect a Jupyter server.
Sandboxing
The most difficult problem is to prevent the execution of
arbitrary code on the Jupyter server. Effectively, such code
must be sandboxed to control what facilities are made
available to executing programs. Two sub-issues arise.
Some Python packages must be suppressed. Arbitrary file operations
and sub-process execution are two primary points of concern.
Conversely, various packages must remain usable: numpy, metpy, and
siphon, for example. They are essential for producing the desired
computational products. However, the security of the system then
depends on the security of those packages. If they contain
exploitable security flaws, then security as a whole is compromised.
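
As a toy illustration of the package-suppression idea only (a real
sandbox requires much stronger, process- or operating-system-level
isolation; the blocked set below is an assumption):

```python
# Toy illustration of suppressing packages via an import hook.
# This is NOT a real sandbox: note that even imports made internally
# by trusted packages pass through this hook, which is one reason a
# hook-based approach is inadequate in practice.
import builtins

BLOCKED = {"subprocess", "ctypes", "socket"}
_real_import = builtins.__import__

def guarded_import(name, globals=None, locals=None, fromlist=(), level=0):
    if name.split(".")[0] in BLOCKED:
        raise ImportError("module %r is suppressed in this sandbox" % name)
    return _real_import(name, globals, locals, fromlist, level)

builtins.__import__ = guarded_import
```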
Authentication
Strong authentication mechanisms will need to be implemented
so that only authorized users can utilize the resources of a Jupyter
server.
Jupyter authentication may need to be coordinated with
the Thredds server so that some programs executed on Jupyter
can have access to otherwise protected datasets on the Thredds server.
This is one case where multiple Jupyter servers (and even multiple
Thredds servers) may be needed to support specialized access
to controlled datasets by using isolated Jupyter servers.
Uncontrolled execution of code can potentially be a significant
performance problem. Additionally, it can result in significant
costs (in dollars) being charged to the server's owner.
For many situations, it will be desirable to force clients to
stand up their own Jupyter server co-located with some Thredds
server in a way that makes the client pay for the cost of
the Jupyter server. Cloud computing is the obvious
approach. Clients will pay for their own virtual machine that is
as "close" to the Thredds server as their cloud system will
allow. The client can then use their own Jupyter server on
their own virtual machine to do the necessary computations,
for which they will be charged.
A preliminary demonstration of communication between Thredds and Jupyter
was created by Ryan May under the auspices of the ODSIP grant [7]
funded from the NSF Earthcube program.
We anticipate starting from the ODSIP base demonstration and
extending it over time. Subject to revision, the current plan
involves the following steps.
Step 1. Servlet-Based Access
The first step is to build the servlet front-end.
This may be viewed as a stripped-down mimic of IPython.
This servlet will support both forms-based access as well as
programmatic access.
Step 2. Operators
An initial set of operator libraries will need to be
collected so that testing, experimentation, and tutorials
can proceed. This will be an ongoing process. One can hope
that some form of repository can be established and that
a critical mass of operators will begin to form.
Step 3. Configuration
The next step is to make it possible, though not necessarily easy,
for others to stand up their own Jupyter + Thredds. One approach
would be to create a set of Docker instructions for this purpose.
This would allow others to directly instantiate the Docker container
as well as provide a recipe for non-Docker operation.
Step 4. Examples
As with any new system,
external users will have difficulties in using it.
So a variety of meaningful examples will need to be
created to allow at least cutting-edge users to begin
to experiment with the system. Again, this will be
an on-going activity.
Step 5. Access Controls
At least an initial access control regime cannot be delayed for very long.
Some external users can live without this in the short term.
But for more widespread use, the users must have some belief in the
security of the systems that they create. As with operators, this will
be an on-going process.
Step 6. Workshops and Tutorials
At some point, this approach must be presented to the larger community.
For Unidata, this is usually done using our Workshops. Additionally,
video tutorials and presentations will need to be created.
[1] https://en.wikipedia.org/wiki/IPython#Project_Jupyter
[2] https://github.com/Unidata/siphon
[3] https://www.unidata.ucar.edu/software/thredds/v4.3/tds/tds4.3/reference/Services.html
[4] https://unidata.github.io/siphon/
[5] https://github.com/nco/pynco
[6] http://nco.sourceforge.net/
[7] https://www.nsf.gov/awardsearch/showAward?AWD_ID=1343761