An Earth Scientist’s Python Toolbox Pt. 2

In my last post I talked about the core packages for doing science with Python. Those tools are the bare necessities; beyond them, there are many packages that make life easier. In this post I’ll give a rundown of some of the more common ones.

pandas/xarray

I group these two together because they fill much the same role. What they offer is labeled data manipulation. This might sound like a minor thing, but it’s a godsend to researchers. We are frequently dealing with data we didn’t produce ourselves, and we want to abstract away the little details. Do we care that the data we want are at indices 151-180? No, we just want to pull out June 1 to June 30. Do we care that going south to north is axis 0 and west to east is axis 1? No, but without labeled data it can be hard to tell whether we’re sampling a point in Washington or Georgia.

pandas and xarray provide similar interfaces but have very different underlying data models. pandas came first and is ideal for tabular data–the sort of data you would store in a CSV file. Its primary classes are Series and DataFrame, a Series being a single column of a table and a DataFrame being the whole table. xarray extends pandas’ idea of labeled data to ndarrays–the sort of data you would store in a netCDF file. xarray’s classes are DataArray and Dataset, where a DataArray is a single variable from a Dataset.

I spend most of my time with weather model output and other gridded datasets, and xarray is my go-to package for data I/O and manipulation. Here’s an example of using xarray to discover some things about your data in a Jupyter notebook.

[Screenshots: exploring an xarray Dataset in a Jupyter notebook]

The basic workflow for using computers to learn something from data is to, first, get the data into your program; second, manipulate the data; and third, output a figure. In this example the first step was easy because I used a built-in dataset, but xarray makes it easy to open any netCDF file with its open_dataset function. You can even open a whole folder’s worth of files with open_mfdataset. Next, printing the Dataset object gives loads of useful information about what is there. I wanted to know how long a period the Dataset covered, so I printed the first and last times. I didn’t like the 0-360 longitude convention, so I converted it to -180 to 180. Then I performed a groupby operation to split the data into smaller pieces that I could analyze. xarray makes it so easy! Before this was available you had to reach for the standard library’s itertools.groupby and write your own function to decide which season each time belongs to. That’s just tedious, and a scientist would rather be doing research. After grouping, I plotted the pieces with xarray’s basic plotting wrapper. Of course, the labels aren’t the best, but it’s perfect for a prototype or first draft.
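Since the original screenshots are gone, here’s a rough sketch of that workflow using xarray’s built-in tutorial data (the dataset and exact calls in my notebook may have differed):

import xarray as xr

# Load a built-in example dataset (2 m air temperature over North America)
ds = xr.tutorial.open_dataset("air_temperature")
print(ds)                                      # dims, coords, and variables at a glance
print(ds.time[0].values, ds.time[-1].values)   # how long a period the Dataset covers

# Convert longitudes from 0-360 to -180 to 180 and re-sort
ds = ds.assign_coords(lon=((ds.lon + 180) % 360) - 180).sortby("lon")

# Split into seasons, average each, and plot with the built-in wrapper
seasonal = ds.air.groupby("time.season").mean("time")
seasonal.plot(col="season", col_wrap=2)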

Dask

When discussing xarray, you have to bring up Dask. That’s because xarray.open_mfdataset loads the data into Dask arrays instead of NumPy arrays, and if you don’t know that in advance it can throw you off. Dask takes n-dimensional data and breaks it into chunks so that operations on it can be automatically parallelized. This is extremely useful for dealing with large datasets, which is why xarray defaults to Dask when opening many files at once. The problem comes down to computer memory. Let’s say you have a global dataset at 0.25° resolution, so 1440*720 points. If the data are stored as 32-bit floats, that’s 1440*720*4 bytes ~ 4 MB per field. No big deal. Now say you have 10 years of this data at 6-hour intervals and you want to compute the mean. Multiply the ~4 MB by 10*365*4 time steps and you get ~60.5 GB. That’s a problem.

Enter Dask. It bills itself as “Scalable analytics in Python”. Dask breaks your data into chunks and uses lazy evaluation to build up an execution graph that maximizes opportunities for parallelism, and it uses out-of-core algorithms so your computer never tries to load that entire 60 GB array at once. What that means for you, the researcher, is that you don’t have to worry about how your program manages that enormous dataset, because Dask handles it automatically. With xarray, Dask is even more of a breeze, since xarray chooses the chunk sizes for you.

There is one thing to watch out for, however. Dask exists to make things doable, not to make things fast. It comes with a hefty overhead that can get ridiculous at times. I’ve spent half an hour waiting on a computation before getting exasperated and cancelling, then moved the same computation from Dask to NumPy and had it finish in under a minute. I’m not sure exactly what the problem is (it’s probably related to chunk sizes), and I don’t know enough to fix it, though there are tips out there on how to avoid the performance hits. The way I deal with it is by switching from a Dask array to a NumPy array as soon as I can. Time is often the largest dimension in my data, so I do whatever operation aggregates over time first (mean, median, whatever), and then call the DataArray’s compute method to load the result into memory. Once time is aggregated in my example a few paragraphs up, I’m back to the ~4 MB array and I can continue as normal with NumPy.
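Here’s a minimal sketch of that strategy, assuming a hypothetical stack of yearly netCDF files containing a variable named t2m:

import xarray as xr

# open_mfdataset returns Dask-backed variables; nothing is read into memory yet
ds = xr.open_mfdataset("t2m_*.nc")
print(ds.t2m.chunks)   # the chunk layout xarray/Dask chose

# Aggregate over time while everything is still lazy, then pull the small
# result into memory as a plain NumPy-backed DataArray (~4 MB in the example above)
mean_t2m = ds.t2m.mean(dim="time").compute()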

Numba and Cython

I group these together not because they are similar projects but because they perform a similar function: accelerating calculations that require looping. Looping in Python can be ridiculously slow. Take, for example, this naive attempt to compute the sum of an array in Fortran and in Python. The Fortran loop is three orders of magnitude faster than the equivalent Python loop. For computationally intensive code this is unacceptable.

[Screenshot: naive array-sum loops in Fortran and Python, with timings]
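For reference, here’s the sort of naive Python loop I’m talking about (the screenshot showed the Fortran and Python versions side by side with timings):

import numpy as np

def slow_sum(arr):
    """Sum an array element by element with a plain Python loop."""
    total = 0.0
    for value in arr:
        total += value
    return total

data = np.random.rand(1000000)
slow_sum(data)   # orders of magnitude slower than data.sum()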

In the vast majority of cases you can use NumPy’s built-in functions to get around this problem (for this example, numpy.sum). But there are cases where there’s nothing in NumPy or SciPy to help you and you have to turn to compiled code. There are lots of options here: you could write a C extension, or you could use Fortran and f2py. But the easiest paths (in my opinion) are Cython and Numba. Cython is a Python-like language that gives you the expressiveness of Python while adding static typing and compilation. Numba provides decorators that let you just-in-time compile your Python functions. Here are examples of using them on the same function as above.

[Screenshot: the same loop accelerated with Cython and with Numba, with timings]

Both loops are just as fast as the Fortran subroutine! The Cython function has some nice things like static typing, and some ugliness with the cimport statements. Also, Cython functions have to live in a separate file and need a setup.py to compile (unless you’re using the IPython magic). Numba makes it really easy: all we had to do was add a decorator and we had near-Fortran speeds. Of course, this is a simple example, and in the real world your mileage will vary. Numba especially can be hit-or-miss, since it’s so high-level and a relatively young project. My strategy when optimizing scientific code is to (a) try to do it in NumPy/SciPy, and if it can’t be done, (b) feed it to Numba’s JIT compiler, and if that throws an error, (c) write it in Cython.
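Here’s a sketch of the Numba half of that example (the Cython version needs its own .pyx file and a setup.py, so I won’t reproduce it here):

import numpy as np
from numba import njit

@njit
def fast_sum(arr):
    """Same loop as before, just-in-time compiled to machine code."""
    total = 0.0
    for value in arr:
        total += value
    return total

data = np.random.rand(1000000)
fast_sum(data)   # the first call compiles; later calls run at near-Fortran speed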

Conclusion

Python has a robust ecosystem of packages that make life easier for a researcher. Beyond those mentioned here, there are loads of narrowly-focused packages like wrf-python, geopandas, and the scikits that solve more problems for you. As always, you can check out the Jupyter Notebook used in this post here.


An Earth Scientist’s Python Toolbox Pt. 1

In my last post, I talked about package management in Python. Today I’m going to talk about the must-have modules and what they’re used for. The community is of course rapidly evolving, so who knows how long a shelf life this post will have. Python has seen a meteoric rise in the data science community in the last few years, and that has been matched by astounding innovation in the tools available. I’ll start with the packages that I think everyone in this field should be using, and in a future post I’ll move on to ones that are a little more niche.

The Scientific Core

NumPy

NumPy, as its home page says, is “the fundamental package for scientific computing with Python”. As you might have heard, everything in Python is an object. What this means is that every value you have stored in a variable is actually a pointer to some unknown-to-you place in memory. This is all well and fine until you have to access a large number of values at once. Python’s list objects are contiguous sequences in memory, but each slot in the sequence is a pointer to an arbitrary place elsewhere in memory. This hurts performance, especially for n-dimensional arrays: a 3D array built out of Python lists is really a list of lists of lists, each of those sub-lists sits somewhere arbitrary in memory with elements pointing somewhere else again, and it becomes a nightmare fast. There’s also a memory cost, since each of these Python objects carries some metadata overhead.

NumPy comes to the rescue with contiguous n-dimensional arrays. It accomplishes this by restricting the data types you can use to standard booleans, integers, floats, and doubles (among other fancier types) and by storing your values in contiguous buffers that are operated on by compiled C code. This solves both the overhead and pointer issues. The NumPy developers did a fantastic job of putting together a fully featured array manipulation package. They have everything from statistics to quadrature, but the feature I want to point out here is boolean indexing. This nifty trick lets you conditionally change values within an array. As an example, let’s say you have an array defined as pictured below.

[Screenshot: an array of sine values containing round-off errors where zeros are expected]

That’s great and all, but let’s say you want to get rid of those pesky round-off errors. A brute-force attempt would be to loop through each element, devising some test to see whether that element is 0 within some margin of error. That sounds complicated, and it would violate a fundamental rule of numerical computing in Python–avoid for-loops at all (reasonable) costs. The NumPy way is to simply write this.

[Screenshot: zeroing the near-zero entries with np.isclose and boolean indexing]
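Since the screenshots are gone, here’s a self-contained sketch of both steps; the exact array in my notebook may have been different:

import numpy as np

# Sampling a sine wave leaves tiny round-off values where exact zeros belong
sined = np.sin(np.linspace(0, 2 * np.pi, 9))
print(sined)                       # entries like 1.2246468e-16 where 0 is expected

sined[np.isclose(sined, 0)] = 0    # the boolean mask picks out the near-zero entries
print(sined)                       # clean zeros, ready to use as a kernel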

NumPy has a built-in function to determine whether two values are close to one another (and you can tweak its tolerances, of course). That function outputs a boolean array with the same shape as the input. I used that boolean array to index the sined array and assigned the value 0 to every element where the mask was true. The result is a nice, neat array that you could use in something like a convolution kernel.

Matplotlib

Matplotlib is the de facto standard plotting library for Python. It was initially designed as a MATLAB clone, but it has since evolved to be so much more. Unfortunately, its origins have left it a bit crufty, and this has earned it many detractors. Several packages have been developed that simplify Matplotlib’s syntax, but I’ve found them lacking for the 2D data I need to plot. There are two ways to use Matplotlib: the vestigial MATLAB-ish way and the Pythonic way. I prefer the Pythonic method, which takes advantage of object-oriented programming and is far more readable. The basic flow of a script that produces plots looks like this:

import matplotlib.pyplot as plt
fig, ax0 = plt.subplots()
""" Work Done Here """
plt.show()

I like to retrieve the figure and axes objects using the subplots function since it’s a one-liner, and I like to keep references to them in variables because it makes it much easier to track what belongs to which figure when working with multiple figures. I also work with subplots very often, and it’s only a small modification to specify how I want the grid to look (e.g., fig, axarr = plt.subplots(ncols=2, nrows=2) produces references for a 2×2 grid of subplots).

The Matplotlib documentation is very extensive, so please check it out to see all that it can do. The axes API is the page I end up using the most, since it describes all the different plot styles you can make with Matplotlib. A lot of Matplotlib’s extremely extensive plot customizations (all the ones that have to do with axes) are also described there.

As an example of the power of Matplotlib, here’s how I plot a set of trigonometric functions.

[Screenshot: several trigonometric functions plotted on one axes with custom styling]

Obviously this isn’t the simplest possible plot, but I think it does a good job of showing off some of the styling tools Matplotlib gives you. Note that I used the boolean indexing I described in the NumPy section to create a “masked” array. It’s a super useful tool!
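Here’s a rough sketch along the lines of that figure (the functions and styling in the actual notebook differed):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 4 * np.pi, 500)
sin, cos, tan = np.sin(x), np.cos(x), np.tan(x)
tan[np.abs(tan) > 5] = np.nan      # "mask" the asymptotes via boolean indexing

fig, ax0 = plt.subplots(figsize=(8, 4))
ax0.plot(x, sin, label="sin(x)")
ax0.plot(x, cos, label="cos(x)", linestyle="--")
ax0.plot(x, tan, label="tan(x)", linewidth=0.8)
ax0.set_xlabel("x (radians)")
ax0.legend(loc="upper right")
plt.show()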

SciPy

The SciPy package contains lots of material that is used throughout the sciences but is too specific to put in NumPy. It is here that you find all sorts of useful algorithms that you wouldn’t want to implement yourself. SciPy is divided into 19 submodules, most of which I have never touched, but some of them are very important for my data analysis: interpolation, image processing, and signal processing.

scipy.interpolate is full of algorithms like splines, and my go-to function is griddata. It lets you interpolate unstructured data onto a grid, and you can choose between nearest-neighbor, linear, and cubic interpolation. It’s great for taking randomly spaced points–say, MesoWest station data–and putting them onto a regular grid.
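Here’s a minimal sketch of how griddata gets used, with made-up station observations standing in for real MesoWest data:

import numpy as np
from scipy.interpolate import griddata

# Fake scattered "station" locations and temperatures
lons = np.random.uniform(-114.0, -111.0, 200)
lats = np.random.uniform(40.0, 42.0, 200)
temps = 25.0 - 0.5 * (lats - 40.0) + np.random.randn(200)

# Interpolate onto a regular 0.1-degree grid
grid_lon, grid_lat = np.meshgrid(np.arange(-114.0, -111.0, 0.1),
                                 np.arange(40.0, 42.0, 0.1))
gridded = griddata((lons, lats), temps, (grid_lon, grid_lat), method="linear")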

scipy.signal has a plethora of tools used in signal processing. In meteorology we often use bandpass filters to get rid of noise, and all the tools needed to design one are available in scipy.signal. There’s even a recipe for a Butterworth bandpass filter available.
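Here’s a sketch in that spirit, with made-up numbers for a 6-hourly time series; the recipe mentioned above is more complete:

import numpy as np
from scipy.signal import butter, filtfilt

fs = 4.0                         # samples per day (6-hourly data)
nyq = fs / 2.0                   # Nyquist frequency
low, high = 1 / 10.0, 1 / 2.0    # keep periods between ~2 and ~10 days

# 4th-order Butterworth bandpass, then zero-phase filtering
b, a = butter(4, [low / nyq, high / nyq], btype="bandpass")
signal = np.random.randn(4 * 365 * 10)   # fake 10-year series
filtered = filtfilt(b, a, signal)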

I probably use scipy.ndimage the most out of all the SciPy modules. It provides tools for working with images in n dimensions, and, guess what, our meteorological model data are basically n-dimensional images. I generally use scipy.ndimage for its filtering functions, since the high-resolution models tend to be very noisy in some fields. I’m curious about that sine-y convolution kernel we made back in the NumPy section, so I’ll give you an example of how to use scipy.ndimage by applying it to an image of Red Castle I took this summer (shrunk and desaturated with GIMP).

[Screenshot: the Red Castle photo before and after convolution with the sine kernel]
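Here’s a sketch of the convolution step, assuming the photo has been saved as a grayscale image (the file name is made up):

import numpy as np
import matplotlib.pyplot as plt
from scipy import ndimage

image = plt.imread("red_castle.png")   # hypothetical file name
if image.ndim == 3:                    # collapse RGB(A) to grayscale if needed
    image = image[..., :3].mean(axis=-1)

# The "sine-y" kernel from the NumPy section, reshaped to 3x3
kernel = np.sin(np.linspace(0, 2 * np.pi, 9))
kernel[np.isclose(kernel, 0)] = 0
kernel = kernel.reshape(3, 3)

filtered = ndimage.convolve(image, kernel)
plt.imshow(filtered, cmap="gray")
plt.show()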

That’s actually really cool looking. (Side note: convolutions like these are building blocks of the algorithms that let phone cameras identify faces. Facial features have distinctive signatures after a convolution, and the algorithms exploit that.)

Conclusion

This post covers the bedrock of scientific computing with Python. On top of these fundamental packages there are many tools that have propelled Python to the front of data science, including pandas, xarray, numba, scikit-learn, and many others. In my next post I’ll touch on those that I find myself using most often. Check out today’s notebook on GitHub.

About Packages


The first thing my hypothetical You did in my last post was go to the Python website and download CPython from there. That was You’s first mistake. While Python has a philosophy of being batteries-included, for scientific programming you’re going to have to go to the metaphorical store and pick up some extras. All the extra batteries come in the form of packages that community developers create and share. The most important ones will each get their own post, but before I can get into that we have to discuss package management.

The standard, general-purpose Python tool is pip. It draws from packages that have been uploaded to the Python Package Index. Using pip can be very uncomfortable because many of the packages there don’t install with all of the required libraries. Programming in the atmospheric sciences means you’ll be using lots of HDF5, netCDF4, and GRIB2 files, which require C/Fortran libraries to be installed on the system before they can be read. Some packages on pip that read these formats will pull in those libraries as dependencies, but others leave you to fend for yourself.

The better way to do it is to use conda. Conda is a package manager supported by Anaconda, and it has become the standard for data science. It offers several advantages over pip. The first is that it has a robust virtual environment system. Say you need to use a package that requires an older version of Python (this happened to me the other day with pyNGL). Without conda it would be painful to download that version of Python and all the other packages you need, somehow get all the path variables to work, and then put everything back to normal afterwards. With conda it’s as simple as

conda create -n py2 python=2 pyngl
source activate py2

to create and activate a virtual environment that has the pyNGL package in it. And when you’re finished with it, it’s as easy as source deactivate to go back to your normal environment.

The next advantage to conda is that it installs in your home directory. This is a huge boon when working on a system on which you do not have admin rights. It makes it easier to install and stay up-to-date on packages, and it makes it easier to guarantee that you know what versions of software you are using and reproduce results. A downside to this is that it can lead to some library duplication, which can eat up storage space.

Furthermore, conda is not just for Python. Say you have some data to wrangle and can’t find good Python tools to attack it, or you just really want to use ggplot2: you can use conda as a package manager for R as well.
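For example (the r-base and r-ggplot2 package names here are the ones provided on the conda R channels; your channel setup may differ):

conda create -n r_env r-base r-ggplot2
source activate r_env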

I like to use the conda-forge channel for my packages. Conda-forge is a community-managed package repository, and it includes many more packages than the standard Anaconda channel–including some that are critical to my workflow, like wrf-python. Conda works best when all of your packages come from the same channel, and you can make conda-forge the default by issuing conda config --add channels conda-forge. One of conda-forge’s rules is that maintainers shouldn’t upload a package unless all of its dependencies are also on conda-forge, so you should never find yourself missing libraries unexpectedly.

More About Modules

Back when I first learned Python, I had no idea how modules worked. I would see import statements in examples and copy them without understanding what they meant. Of course, I still don’t fully understand modules, but I know enough to give practical advice.

It’s All About Namespaces

A namespace is how variable names are organized by the program. In a Python program the broadest one is the global namespace, which includes everything defined at the top level of your script, along with the built-in functions and classes (which technically live in their own builtins namespace that Python searches last). Everything defined inside a function or class lives in a local namespace. As an example, if I define a class named Bar and give it a member variable x, I can’t access x unless I go through Bar.

[Screenshot: accessing x through Bar works; accessing bare x raises a NameError]
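Here’s a sketch of what that screenshot showed:

class Bar:
    x = 42

print(Bar.x)   # 42 -- reached through Bar's namespace
print(x)       # NameError: name 'x' is not defined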

This is what is happening when you say something like this.

[Screenshot: import numpy as np, then calling functions through np]
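The cell looked something like this:

import numpy as np   # "bring module numpy into the global namespace with name np"

np.array([1, 2, 3])  # numpy's array function, reached through the np object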

The first line of this cell can be read as “bring the module numpy into the global namespace under the name np”. There is now a module object named np through which everything in numpy’s namespace can be accessed. This is also why it is almost always a bad idea to say from numpy import *: that line reads as “bring every function and class from the numpy module into the global namespace”. As an example of how that can cause problems, take a look at this code block.

def array(arr):
    """ Example function that says whether or not each item
        in an iterable has the same type
    """
    return all([isinstance(item, type(arr[0])) for item in arr])

from numpy import *

print(array([1, 2, 3, 4, 5, 6]))

You might include all of these elements in a script and expect the array function in the print statement to behave as you defined it at the top of the file.


[Screenshot: the call returns a numpy.ndarray instead of True]

Instead of printing the expected True, it returns a numpy.ndarray, because from numpy import * replaced your array with NumPy’s. Now you have to comb through some long script trying to figure out what went wrong, and if you didn’t know about namespaces, or that NumPy has an array function, it would be nearly impossible to figure out.
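The fix is to keep NumPy behind its own name, something like this:

import numpy as np

def array(arr):
    """Say whether every item in an iterable has the same type as the first."""
    return all(isinstance(item, type(arr[0])) for item in arr)

print(array([1, 2, 3, 4, 5, 6]))     # True -- your function, untouched
print(np.array([1, 2, 3, 4, 5, 6]))  # NumPy's array, reached explicitly through np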

Another thing that I didn’t know about imports when I started out is that you only have to import the modules that you are going to directly use. A lot of my scripts used to start with something to the effect of

import numpy as np
from netCDF4 import Dataset
import dask
import xarray as xr

and then never used numpy, dask, or netCDF4 directly. I thought that since xarray used numpy, dask, and netCDF4, I had to import them in my script or else it wouldn’t work. My logic was flawed, but to me everything was black magic, so I did it anyway. The right way is to import only the modules you call directly; they handle their own dependencies in the background.
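In other words, something like this is all you need:

import xarray as xr

# netCDF4, dask, and NumPy get pulled in behind the scenes as xarray needs them
ds = xr.open_dataset("some_file.nc")   # hypothetical file name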

TL;DR

Package management is important with Python, and for data science purposes you’re most likely going to want to use conda. Namespaces may look like black magic at first, but they’re really simple once you get the hang of them. Please never say from module import *.

I’m going to try posting the notebooks and such that I use for this blog on GitHub so you can download and play around with them. Today’s is here.


In The Beginning There was IDLE…

So you’ve decided you’re going to start analyzing weather data yourself. You’ve been a weather weenie for a while now, but you’re getting tired of the inflexibility of all the standard websites. You’ve tried out IDV, but you found it too buggy and resource-intensive. Now you’re taking matters into your own hands to produce your own graphics–the ones that are exactly what you want to see. Good for you! But the only programming experience you have is the BASIC you learned in high school, so you aren’t sure where to begin.

You’ve heard that Python is a popular and beginner-friendly programming language, so you decide to check it out. You download the latest version from python.org, install it, and open up the IDLE program that appeared in your application menu. You open it and are greeted with this beauty.

[Screenshot: the bare IDLE shell. (Yes, I know this is the conda-forge version, but I don’t want to actually install the official-release IDLE on my computer.)]
Wow.

Nope, this won’t do. While you can be a successful Python programmer using just IDLE, its simplicity will be incredibly annoying. I have friends who took a Python class in college, and they all complained about IDLE. The goal of this blog is to minimize the pain of future readers.

What You Should Do

If you want to start programming in Python, the first thing you should do is take a tutorial to learn the syntax. I started with Codecademy, but there are many out there. You should take a tutorial on the terminal as well, since it’s a very valuable skill. (Note to Windows users: most scientific computing is done on Linux, so many important tools won’t work on Windows. Once you need one of those tools, you’ll have to either use a virtual machine or dual-boot a GNU/Linux OS. macOS and Linux follow the same standards underneath, so macOS users will probably have fewer issues.)

Once you have a fair understanding of the language syntax, you’ll be ready to start writing programs to manipulate data. There are several ways to write and execute Python programs, but for simplicity’s sake I’ll describe my preferred workflow: I write my scripts in a text editor and execute them from the command line. There is a world of text editors out there, ranging from the IDE-like (Spyder, etc.) to the very simple (IDLE, Kate, and others). Some people recommend terminal editors like Vim and Emacs, but I don’t think they’re worth the effort of learning their keyboard shortcuts (I do use nano for quick edits). My preferred editors are the mid-range applications such as Visual Studio Code and Atom, which offer a comprehensive extension ecosystem with a lightweight, customizable interface. I use VS Code because I’ve had better luck with its Python extension, and it offers nifty features like an integrated terminal and debugger.

[Screenshot: a Python script open in Visual Studio Code]
Much Friendlier

Many who know a lot about the Python data science ecosystem are probably screaming, “Why isn’t he recommending Jupyter Notebooks and IPython? They’re the best!” I know they’re extremely popular, but they are also limited. Any data scientist is going to be working with many languages and data formats, and good luck having a pleasant experience editing your JavaScript and PHP in a Jupyter Notebook. I believe it’s better to have a few tools that each do many things very well than many tools that each do only a few things well. (And don’t worry, Jupyter will be getting a post all to itself.)

The thing about text editors is that the choice is incredibly personal, and yours should be something that suits you and your workflow. My goal with this post has been to show that there are more options than the default and that you should explore some of them. Stay tuned for my next post–the topic will be the number two thing I hear people complain about with Python: package management.

Raison D’être

Hello World!

My name is Alex, and I’m an atmospheric science graduate student at the University of Utah. This blog is a place where I can share the various programs and algorithms that I use in my research. It’s hard for a new person starting out to become familiar with the tools that we use as earth scientists, so I want to document my learning curve in hopes that others can follow (and share ideas with me in return!). I have been wanting to make this blog for a while, so I have several ideas for posts in my head. For the first while I’ll be writing down what I’ve learned thus far, and then posts will become irregular as I continue to learn about the systems that have been built up over the last 70 years.

I primarily code in the Python programming language. I took a computer science minor in undergrad, and for that I learned Java and C#. While those may be great languages for writing apps, they don’t have the flexibility and library support data-intensive science requires. In my undergraduate atmospheric science classes, we mainly used MATLAB. Because I enjoy stoking Holy Wars (only halfway sarcastic), I’ll devote an entire post to why I don’t use MATLAB later. But since Python is my language of choice, the majority of these posts will be focused on the Python ecosystem.

Much has been said about Python’s prowess at data science; it and R are the two major languages in that field. But that’s not going to stop me from saying it again! Most of the information out there comes from the perspective of data you would typically see in a high school statistics textbook (here’s a list of heights of boys and girls in the class; compute the mean, median, and mode). While this 1D data is certainly interesting, it’s not what I see regularly. I’ve always found it difficult to translate skills from one dimension to n dimensions, so I will be taking an n-dimensional approach from the beginning.

My plan is to start from the very beginning of a person’s entry into data analysis, since I remember how lost I was when I first started. I first began getting interested in using computer programming as a tool two years ago, so we have a lot to catch up on.