In my last post I talked about the basic packages for doing science with Python. Those are the bare necessities, but there are many more packages that make life easier. In this post I’ll give a rundown of some of the more common ones.
pandas and xarray
I group these two together because they fill much the same role: labeled data manipulation. That might sound like a minor thing, but it’s a godsend to researchers. We are frequently dealing with data we didn’t produce ourselves, and we want to abstract away the little details. Do we care that the data we want are at indices 151-180? No, we just want to pull out June 1 to June 30. Do we care that going south to north is axis 0 and west to east is axis 1? No, but without labeled data it can be hard to tell whether we’re sampling a point in Washington or Georgia.
pandas and xarray provide similar interfaces but have very different underlying data models. pandas came first and is ideal for tabular data–the sort of data you would store in a CSV file. Its primary classes are Series and DataFrame, with a Series being a single column of a table and a DataFrame being the whole table. xarray extends pandas’ idea of labeled data to ndarrays–the sort of data you would store in a netCDF file. xarray’s classes are DataArray and Dataset, where a DataArray is a single variable from a Dataset.
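As a rough sketch of the difference (the variable names, values, and coordinates here are made up purely for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

# pandas: tabular data, the kind you would keep in a CSV file
df = pd.DataFrame(
    {"tmax": [31.0, 29.5, 30.2], "tmin": [18.2, 17.9, 19.1]},
    index=pd.date_range("2020-06-01", periods=3),
)
tmax = df["tmax"]  # a single column is a Series

# xarray: labeled n-dimensional data, the kind you would keep in a netCDF file
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), np.random.rand(3, 4, 5))},
    coords={
        "time": pd.date_range("2020-06-01", periods=3),
        "lat": np.linspace(30, 45, 4),
        "lon": np.linspace(-120, -100, 5),
    },
)
temp = ds["temperature"]  # a single variable is a DataArray
```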
I spend most of my time with weather model output and other gridded datasets, and xarray is my go-to package for data I/O and manipulation. Here’s an example of using xarray to discover some things about your data in a Jupyter notebook.
The basic workflow for using computers to learn something from data is: first, get the data into your program; second, manipulate the data; and third, output a figure. In this example the first step was easy because I just used a built-in dataset, but xarray makes it easy to open any netCDF file with its open_dataset function. You can even open a whole folder’s worth of files with open_mfdataset. Next, printing the Dataset object gives loads of useful information about what is there. I wanted to know how long a period the Dataset covered, so I printed the first and last times. I didn’t like the 0-360 longitude system, so I converted it to -180 to 180. Then I performed a groupby operation to split the data into smaller pieces that I could analyze separately. xarray makes it so easy! Before this was available you had to use the standard library’s itertools.groupby function, which meant writing your own function to determine membership in a season. That’s just tedious, and a scientist would rather be doing research. After grouping I plotted the results using the basic xarray plotting wrapper. Of course, the labels aren’t the best, but it’s perfect for a prototype or first draft.
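A minimal sketch of that workflow, assuming xarray’s built-in air_temperature tutorial dataset (the exact calls in the notebook may differ):

```python
import matplotlib.pyplot as plt
import xarray as xr

# A small built-in example dataset (downloaded the first time it's used)
ds = xr.tutorial.open_dataset("air_temperature")

print(ds)                                     # dimensions, coordinates, variables
print(ds.time[0].values, ds.time[-1].values)  # how long a period it covers

# Convert 0-360 longitudes to -180 to 180 and re-sort the coordinate
ds = ds.assign_coords(lon=((ds.lon + 180) % 360) - 180).sortby("lon")

# Split into seasons, average each one, and plot with the built-in wrapper
seasonal = ds.air.groupby("time.season").mean("time")
seasonal.plot(col="season", col_wrap=2)
plt.show()
```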
Dask
When discussing xarray, you have to bring up Dask. That’s because xarray.open_mfdataset will load the data into Dask arrays instead of NumPy arrays, and if you don’t know that in advance it can throw you off. What Dask does is take n-dimensional data and break it into chunks so that operations on it can be automatically parallelized. This is extremely useful for dealing with large datasets, which is why xarray defaults to Dask when opening many files at once. The problem comes down to computer memory. Let’s say you have a global dataset at 0.25° resolution, so 1440*720 points per field. Suppose the data are stored as 32-bit floats; that’s 1440*720*4 bytes, or ~4 MB. No big deal. Now say you have 10 years of this data at 6-hour intervals and you want to compute the mean. Multiply the ~4 MB by 10*365*4 time steps and you get ~60.5 GB. That’s a problem.
Enter Dask. It bills itself as “Scalable analytics in Python”. Dask breaks your data into chunks and uses lazy evaluation to build up an execution graph that maximizes opportunities for parallelism, and it uses out-of-core algorithms to make sure your computer never tries to load that entire 60 GB array at once. What that means for you, the researcher, is that you don’t have to worry about managing that enormous dataset yourself, because Dask does it automatically. With xarray, Dask is even more of a breeze since xarray chooses the chunk sizes for you.
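As a rough sketch of the idea, using the 10-year, 6-hourly example above (the chunk size is an arbitrary choice for illustration):

```python
import dask.array as da

# A lazy 10-year, 6-hourly, 0.25-degree global array: ~60 GB if it all sat in memory
arr = da.random.random((10 * 365 * 4, 720, 1440), chunks=(100, 720, 1440))

# Nothing is computed yet; this only extends the task graph
time_mean = arr.mean(axis=0)

# Chunks are processed in parallel, a few at a time, only when we ask for the answer
result = time_mean.compute()  # a plain (720, 1440) NumPy array, back to ~4 MB
```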
There is one thing to watch out for, however. Dask exists to make things doable, not to make things fast. It comes with a hefty overhead cost that can get ridiculous at times. I’ve spent half an hour waiting on a computation before getting exasperated and cancelling it, then moved the computation from Dask to NumPy and had it finish in less than a minute. I’m not sure exactly what the problem is (it’s probably related to chunk sizes), and I don’t know enough to fix it. There are tips out there on how to avoid performance hits. The way I deal with the problem is by switching from a Dask array to a NumPy array as soon as I can. Time is often the largest dimension in my data, so I do the time-aggregating operation first (mean, median, whatever) and then call the DataArray’s compute method to load the result into memory. Once time is aggregated, as in my example a few paragraphs up, I’m back to the ~4 MB array and can continue as normal with NumPy.
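That pattern might look something like this (the file glob and variable name are hypothetical placeholders):

```python
import xarray as xr

# Opening many files at once returns Dask-backed arrays by default
ds = xr.open_mfdataset("temperature_6hourly_*.nc")  # hypothetical file names

# Aggregate over time while everything is still lazy...
mean_field = ds["t2m"].mean("time")  # hypothetical variable name

# ...then pull the now-small result into memory as a NumPy-backed DataArray
mean_field = mean_field.compute()
```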
Numba and Cython
I group these together not because they are similar projects but because they perform a similar function: accelerating calculations that require explicit looping. Looping in Python can be ridiculously slow. Take, for example, this naive attempt to compute the sum of an array in Fortran and Python. The Fortran loop is three orders of magnitude faster than the equivalent Python loop. For computationally intensive codes this is unacceptable.
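The original comparison isn’t reproduced here, but the Python half of it would look something like this naive loop (array size chosen arbitrarily):

```python
import numpy as np

def loop_sum(arr):
    """Sum an array with an explicit Python loop; every iteration is interpreted."""
    total = 0.0
    for x in arr:
        total += x
    return total

data = np.random.random(10_000_000)
loop_sum(data)  # far slower than a compiled loop over the same data
```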
In the vast majority of cases, you can use NumPy’s built-in functions to get around this problem (for this example, numpy.sum). But there are cases where there’s nothing in NumPy or SciPy to help you and you have to turn to compiled code. There are lots of options here. You could write a C extension, or you could use Fortran and f2py. But the easiest paths (in my opinion) are Cython and Numba. Cython is a Python-like language that gives you the expressiveness of Python while adding static typing and compilation. Numba provides decorators that let you just-in-time compile your Python functions. Here are examples of using them with the same functions as above.
Both loops are just as fast as the Fortran subroutine! The Cython function has some nice things, like static typing, and some ugliness, like the cimport statements. Also, Cython functions have to live in a separate file and be compiled via a setup.py before you can use them (unless you’re using the IPython magic). Numba makes it really easy: all we had to do was add a decorator and we got near-Fortran speeds! Of course, this is a simple example, and in the real world your mileage will vary. Numba especially can be variable since it’s so high level and a relatively new project. My strategy when optimizing scientific code is to a) try to do it in NumPy/SciPy, and if it can’t be done, b) feed it to Numba’s JIT compiler, and if that throws an error, c) write it in Cython.
Python has a robust ecosystem of packages that make life easier for a researcher. Beyond those mentioned here, there are loads of narrowly-focused packages like wrf-python, geopandas, and the scikits that solve more problems for you. As always, you can check out the Jupyter Notebook used in this post here.