{ "metadata": { }, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
In this tutorial, we will learn about Xarray, one of the most used Python library from the Pangeo ecosystem.
\nWe will be using data from Copernicus Atmosphere Monitoring Service\nand more precisely PM2.5 (Particle Matter < 2.5 μm) 4 days forecast from December, 22 2021. Parallel data analysis with Pangeo is not covered in this tutorial.
\n\n\n💬 Remark
\nThis tutorial uses data on a regular latitude-longitude grid. More complex and irregular grids are not discussed in this tutorial. In addition,\nthis tutorial is not meant to cover all the different possibilities offered by Xarrays but shows functionalities we find useful for day to day\nanalysis.
\n
\n\nAgenda
\nIn this tutorial, we will cover:
\n\n
\n- Analysis \n
\n
Some packages may need to be installed first. For example cmcrameri
is missing, so we need to install it by entering the following command in a new cell of your Jupyter Notebook:
Then we need to import all the necessary packages in our Jupyter Notebook.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-3", "source": [ "import numpy as np\n", "import xarray as xr\n", "import cartopy.crs as ccrs\n", "import matplotlib.pyplot as plt\n", "import cmcrameri.cm as cmc\n", "import pandas as pd" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-4", "source": "The netCDF dataset can now be opened with Xarray:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-5", "source": [ "dset = xr.open_dataset(\"data/CAMS-PM2_5-20211222.netcdf\")" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-6", "source": "Once opened, we can get metadata using print
statement.
Below is what you should get if everything goes fine.
\n\n\n🖥 Output
\n\n<xarray.Dataset>\n Dimensions: (longitude: 700, latitude: 400, level: 1, time: 97)\n Coordinates:\n * longitude (longitude) float32 -24.95 -24.85 -24.75 ... 44.75 44.85 44.95\n * latitude (latitude) float32 69.95 69.85 69.75 69.65 ... 30.25 30.15 30.05\n * level (level) float32 0.0\n * time (time) timedelta64[ns] 00:00:00 01:00:00 ... 4 days 00:00:00\nData variables:\n pm2p5_conc (time, level, latitude, longitude) float32 0.4202 ... 7.501\nAttributes:\n title: PM25 Air Pollutant FORECAST at the Surface\n institution: Data produced by Meteo France\n source: Data from ENSEMBLE model\n history: Model ENSEMBLE FORECAST\n FORECAST: Europe, 20211222+[0H_96H]\n summary: ENSEMBLE model hourly FORECAST of PM25 concentration at the...\n project: MACC-RAQ (http://macc-raq.gmes-atmosphere.eu)\n
\n\n💡 Command not found
\nIf you get an error with the previous command, first check the location of the input file
\nCAMS-PM2_5-20211222.netcdf
:\nit needs to be in the same directory as your Jupyter Notebook.
We can identify 4 different sections:
\nWe can also get metadata information for each coordinate and data variables using “.” followed by the coordinate or data variable name.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-9", "source": [ "print(dset.time)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-10", "source": "\n\n❓ Questions CAM PM2.5 Dataset
\nWhat is the name of the variable for Particle matter < 2.5 μm and its physical units?
\n\n\nHint: Select the text with your mouse to see the answer👁 Solution
\n\n
\n- To get metadata information from
\npm2p5_conc
Data variable, we use its variable name and print it. Printing it will only print metadata, not the values.\n\n
\n- Variable name:
\nmass_concentration_of_pm2p5_ambient_aerosol_in_air
- Units:
\nµg/m3
\n\n⌨️ Input: Python
\n\nprint(dset.pm2p5_conc)\n
\n\n🖥 Output
\n\n<xarray.DataArray 'pm2p5_conc' (time: 97, level: 1, latitude: 400, longitude: 700)>\n[27160000 values with dtype=float32]\n Coordinates:\n* longitude (longitude) float32 335.0 335.1 335.2 335.4 ... 44.75 44.85 44.95\n* latitude (latitude) float32 69.95 69.85 69.75 69.65 ... 30.25 30.15 30.05\n* level (level) float32 0.0\n* time (time) timedelta64[ns] 00:00:00 01:00:00 ... 4 days 00:00:00\n Attributes:\n species: PM2.5 Aerosol\n units: µg/m3\n value: hourly values\n standard_name: mass_concentration_of_pm2p5_ambient_aerosol_in_air\n
\n\n💬 Different ways to access Data variables
\nTo access a variable or coordinate, we can use “.” or specify its name as a string between squared brackets “[” “]”. For example:
\n\nprint(dset['pm2p5_conc'])\n# or alternatively\nprint(dset.pm2p5_conc)\n
When we print a variable or coordinate, we do not get all the individual values but a
\nDataArray
that contains a lot of very useful metadata such as coordinates (if they have some), all the attributes such as the name, the physical units, etc.
We often want to select elements from the coordinates for instance to subset a geographical area or select specific times or a specific time range.
\nThere are two different ways to make a selection:
\nYou should see that the coordinate time
“disappeared” from the Dimensions
and now the variable pm2p5_conc
is a 3D field with longitude, latitude and level.
When selecting elements by the value of the coordinate, we need to use the same datatype. For instance, to select an element from\ntime
, we need to use timedelta64
. The code below will give the same result as isel(time=0)
.
The output will be very similar to what we did previously when selecting from coordinates by index.
\n\n\n❓ Select a single time for PM2.5
\nHow to select the forecast for December, 24th 2021 at 12:00 UTC?
\n\n\nHint: Select the text with your mouse to see the answer👁 Solution
\nData starts on December, 22nd 2021 at 00:00 UTC so we need to add 2 days and 12 hours to select the correct time index.
\n\n\n⌨️ Input: Python
\n\nprint(dset.sel(time=(np.timedelta64(2,'D')+ np.timedelta64(12,'h'))))\n
\n\n🖥 Output
\n\n<xarray.Dataset>\nDimensions: (longitude: 700, latitude: 400, level: 1)\nCoordinates:\n* longitude (longitude) float32 -24.95 -24.85 -24.75 ... 44.75 44.85 44.95\n* latitude (latitude) float32 69.95 69.85 69.75 69.65 ... 30.25 30.15 30.05\n* level (level) float32 0.0\ntime timedelta64[ns] 2 days 12:00:00\nData variables:\n pm2p5_conc (level, latitude, longitude) float32 0.4499 0.4421 ... 10.71\nAttributes:\n title: PM25 Air Pollutant FORECAST at the Surface\n institution: Data produced by Meteo France\nsource: Data from ENSEMBLE model\nhistory: Model ENSEMBLE FORECAST\n FORECAST: Europe, 20211222+[0H_96H]\n summary: ENSEMBLE model hourly FORECAST of PM25 concentration at the...\n project: MACC-RAQ (http://macc-raq.gmes-atmosphere.eu)\n
To plot a map, you need to select a variable with data on geographical coordinates (latitude, longitude). In addition, coordinates need to be sorted, and preferably in increasing order. This is not the case for the coordinate “longitude” which is given between 360 and 0.
\nLet’s shift the longitudes by 180 degrees so that they come in the range of -180 to 180.
\nWe print the longitudes before and after shifting them so we can see what is happening.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-15", "source": [ "print(dset.longitude)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-16", "source": "The longitude values are between 335.05
and 44.95
degrees.
Let’s now shift the longitudes to get values between -180
, 180
degrees.
Indeed, the longitudes have been shifted and now the values are between -24.95
and 44.95
.
We will get a figure like the one below:
\n\n\n💬 What about
\nlevel
Note that in the previous plot, we did not need to select
\nlevel
because there is one value only. However, if we had more than one level, we would need to add a selection on the level before plotting
There are many ways to customize your plots and we will only detail what we think is important for creating publication ready figures:
\n\n\n
And you should get the following plot:
\nNow, we will plot several times on the same figure in different sub-plots; we will not plot all the times (too many) but the first 24 forecasted values.
\nFirstly, we need to create a list of times and convert it to pandas datetime
in order to make it easier to format times when plotting:
Secondly, we need to use the same plotting method as earlier, but we pass additional parameters:
\nvmin = 0
and vmax = 35
to set the minimum and maximum values when plotting (this is useful to highlight features in your plot)subplot_kws={\"projection\": proj_plot}
to project data on a non-default projection. See cartopy projection for more information about projections.col='time'
because we will plot several time
;col_wrap=4
to have a maximum of 4 plots per row. If we have more times to plot, then the next figures will be on another row.robust=True
and aspect=dset.dims[\"longitude\"] / dset.dims[\"latitude\"]
are additional parameters to make each subplot with a “sensible” figsize.cmap=cmc.roma_r
to select a non-default and color-blind friendly colormap (see scientific colormaps).In the second part of our plot, we are going to customize each subplot (this is why we loop for each of them and get their axes) by adding:
\ncoastlines
: we pass a parameter 10m
to get coastlines with a high resolution (non-default);set_title
to set a title for each subplot.\n\n❓ PM2.5 over Italy
\nUsing a Multi-plot between Rome and Naples, can you tell us if the forecasted PM2.5 will increase or decrease during the first 24 hours?
\n\n\nHint: Select the text with your mouse to see the answer👁 Solution
\nWe will select a sub-area: 11. East to 15.0 East and 40. N to 43. N. PM2.5 will increase and reach values close to 35 μm.m-3.\nWe will use
\nslice
to select the area and we slice latitudes withlatitude=slice(47.3, 36.5)
and notlatitude=slice(36.5, 47.3)
.\nThe reason is that when using slice, you need to specify values using the same order as in the coordinates. Latitudes are specified in\ndecreasing order for CAMS.\n\n⌨️ Input: Python
\n\nfig = plt.figure(1, figsize=[10,10])\n\n# We're using cartopy to project our data.\n# (see documentation on cartopy)\nproj_plot = ccrs.Mercator()\n\n# We need to project our data to the new projection and for this we use <code>transform</code>.\n# we set the original data projection in transform (here PlateCarree)\np = dset.sel(time=slice(np.timedelta64(1,'h'),np.timedelta64(1,'D'))).sel(latitude=slice(43., 40.),\n longitude=slice(11.,15.))['pm2p5_conc'].plot(transform=ccrs.PlateCarree(),\n vmin = 0, vmax = 35,\n subplot_kws={\"projection\": proj_plot},\n col='time', col_wrap=4,\n robust=True,\n aspect=dset.dims[\"longitude\"] / dset.dims[\"latitude\"], # for a sensible figsize\n cmap=cmc.roma_r)\n# We have to set the map's options on all axes\nfor ax,i in zip(p.axes.flat, (np.datetime64('2021-12-22') + dset.time.sel(time=slice(np.timedelta64(0),np.timedelta64(1,'D')))).values):\n ax.coastlines('10m')\n ax.set_title(\"CAMS PM2.5 \" + pd.to_datetime(i).strftime(\"%d %b %H:%S UTC\"), fontsize=12)\n# Save your figure\nplt.savefig(\"CAMS-PM2_5-fc-multi-Italy.png\")\n
\n
Sometimes we may want to make more complex selections with criteria on the values of a given variable and not only on its coordinates. For this purpose, we use the where
method. For instance, we may want to only keep PM2.5 if values are greater than 25 μm.m-3 (or any threshold you would like to choose).
Where
\n\n💬 What happened?
\nEach element of the dataset where the criteria within the
\nwhere
statement is not met, e.g. when PM2.5 <= 25, will be set tonan
.\nYou may not see any changes when printing the dataset but if you look carefuly atpm2p5_conc
values, you will see manynan
.
Let’s plot one time to better see what happened:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-29", "source": [ "######################\n", "# Plotting with mask #\n", "######################\n", "\n", "fig = plt.figure(1, figsize=[15,10])\n", "\n", "# We're using cartopy to project our data.\n", "# (see documentation on cartopy)\n", "ax = plt.subplot(1, 1, 1, projection=ccrs.Mercator())\n", "ax.coastlines(resolution='10m')\n", "\n", "# We need to project our data to the new projection and for this we use `transform`.\n", "# we set the original data projection in transform (here PlateCarree)\n", "dset.where(dset['pm2p5_conc'] > 25).isel(time=0)['pm2p5_conc'].plot(ax=ax,\n", " transform=ccrs.PlateCarree(),\n", " vmin = 0, vmax = 35,\n", " cmap=cmc.roma_r)\n", "# One way to customize your title\n", "plt.title(\"Copernicus Atmosphere Monitoring Service PM2.5, 2 day forecasts\\n 24th December 2021 at 12:00 UTC\\n only values > 25\", fontsize=18)\n", "plt.savefig(\"CAMS-PM2_5-fc-20211224-25.png\")" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-30", "source": "We can then make the same multi-plot as earlier (over Italy) but with a where
statement to mask values lower than 25 μm.m-3:
We often want to compute the mean of all our datasets, or along a dimension (for instance time). If you do not pass any argument to the operation then it is done over all dimensions.
\nWhen we do not specify any parameters, we get a single value.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-33", "source": [ "print(dset.sel(latitude=slice(43., 40.), longitude=slice(11.,15.)).mean())" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-34", "source": "\n\n🖥 Output
\n\n<xarray.Dataset>\nDimensions: ()\n Data variables:\n pm2p5_conc float32 9.118\n
\n\n❓ Maximum PM2.5 over Italy
\nWhat is the maximum forecasted PM2.5 value over the Rome-Naples region?
\n\n\nHint: Select the text with your mouse to see the answer👁 Solution
\nWe select the same sub-area: 11. East to 15.0 East and 40. N to 43. N and compute the maximum with
\nmax
. The maximum PM2.5 value is 59.13694382 μm.m-3 (that is rounded to 59.14).\n\n⌨️ Input: Python
\n\ndset.sel(latitude=slice(43., 40.), longitude=slice(11.,15.)).max()\n
\n\n🖥 Output
\n\nxarray.Dataset\nDimensions:\nCoordinates: (0)\nData variables:\npm2p5_conc\n()\nfloat64\n59.14\narray(59.13694382)\nAttributes: (0)\n
\n\n❓ Find when the maximum PM2.5 is forecasted
\nWhen is the maximum PM2.5 value forecasted?
\n\n\n👁 Solution
\nWe will select a sub-area: 11. East to 15.0 East and 40. N to 43. N and average over the entire selected area and search where the maximum PM2.5 value of 59.13694382 μm.m-3 is found. The maximum PM2.5 value occurs on 2021-12-22 at 20:00 UTC.
\n\n\n⌨️ Input: Python
\n\ndset_tmean = dset.sel(latitude=slice(43., 40.), longitude=slice(11.,15.)).max(dim=('latitude', 'longitude'))\ndset_tmean_max = dset_tmean.where(dset_tmean['pm2p5_conc'] == 59.13694382, drop=True)\nprint(dset_tmean_max)\n
\n\n🖥 Output
\n\n<xarray.Dataset>\nDimensions: (time: 1, level: 1)\nCoordinates:\n* level (level) float32 0.0\n* time (time) timedelta64[ns] 20:00:00\nData variables:\n pm2p5_conc (time, level) float32 59.14\n
\n\n💬 Pixel size when averaging
\nWe average over a relatively small area so we do not make a weighted average. Use weighted averages when averaging over the entire globe or over a large area where the pixel sizes may vary (depending on the latitude).
\n
The resampling frequency is lower than our original data, so we would need to apply a global operation on the data we group together such as mean, min, max:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-35", "source": [ "print(dset.resample(time='1D').mean())" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-36", "source": "\n\n🖥 Output
\n\n<xarray.Dataset>\nDimensions: (time: 5, longitude: 700, latitude: 400, level: 1)\nCoordinates:\n* time (time) timedelta64[ns] 0 days 1 days 2 days 3 days 4 days\n* longitude (longitude) float32 -24.95 -24.85 -24.75 ... 44.75 44.85 44.95\n* latitude (latitude) float32 69.95 69.85 69.75 69.65 ... 30.25 30.15 30.05\n* level (level) float32 0.0\nData variables:\n pm2p5_conc (time, level, latitude, longitude) float32 0.4298 ... 7.501\n
When the resampling frequency is higher than the original data, we need to indicate how to fill the gaps, for instance, interpolate and indicate which interpolation method to apply or select nearest values, etc.:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-37", "source": [ "print(dset.resample(time='30min').interpolate('linear'))" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-38", "source": "\n\n💬 Be careful when sub-sampling!
\nIncreasing the frequency of your data e.g. artificially creating data may not be scientifically relevant. Please use it carefully! Interpolating is not always scientifically relevant and sometimes you may prefer to choose a different method, like taking the nearest value for instance:
\n\n\n⌨️ Input: Python
\n\ndset.resample(time='30min').nearest()\n
\n\n❓ PM2.5 over Italy in the next 4 days
\nUsing a Multi-plot between Rome and Naples, and making averages per day, can you tell us if forecasted PM2.5 will increase or decrease?
\n\n\nHint: Select the text with your mouse to see the answer👁 Solution
\nPM2.5 over Italy is overall decreasing over the next 4 forecasted days.
\n\n\n⌨️ Input: Python
\n\nfig = plt.figure(1, figsize=[10,10])\n\n# We're using cartopy to project our data.\n# (see documentation on cartopy)\nproj_plot = ccrs.Mercator()\n\nsub_dset = dset.sel(latitude=slice(43., 40.), longitude=slice(11.,15.)).resample(time='1D').mean()\n# We need to project our data to the new projection and for this we use <code>transform</code>.\n# we set the original data projection in transform (here PlateCarree)\np = sub_dset['pm2p5_conc'].plot(transform=ccrs.PlateCarree(),\n vmin = 0, vmax = 35,\n subplot_kws={\"projection\": proj_plot},\n col='time', col_wrap=5,\n robust=True,\n aspect=dset.dims[\"longitude\"] / dset.dims[\"latitude\"], # for a sensible figsize\n cmap=cmc.roma_r)\n# We have to set the map's options on all axes\nfor ax,i in zip(p.axes.flat, (np.datetime64('2021-12-22') + dset.time.sel(time=slice(np.timedelta64(0),np.timedelta64(1,'D')))).values):\n ax.coastlines('10m')\n ax.set_title(\"CAMS PM2.5 \" + pd.to_datetime(i).strftime(\"%d %b %H:%S UTC\"), fontsize=12)\n# Save your figure\nplt.savefig(\"CAMS-PM2_5-fc-multi-Italy-mean-per-day.png\")\n
\n\n\n
\n\n💬
\nGrouby
versusresample
Use
\ngroupby
instead ofresample
when you wish to group over a dimension that is nottime
.groupby
is very similar to resample but can be applied to any coordinates and not only to time.
Well done! Pangeo is a fantastic community with many more resources for learning and/or contributing! Please, if you use any Python packages from the Pangeo ecosystem, do not forget to cite Pangeo Abernathey et al. 2017, Abernathey et al. 2021, Gentemann et al. 2021 and Sambasivan et al. 2021!
\nHave a look at the Pangeo Tutorial Gallery to pick up your next Pangeo training material!
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "- Pangeo ecosystem enables big data analysis in geosciences\n", "- Xarray is an important Python package for big data analysis in geosciences\n", "- Xarray can be used to read, select, mask and plot netCDF data\n", "- Xarray can also be used to perform global operations such as mean, max, min or resampling data\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/climate/tutorials/pangeo-notebook/tutorial.html#feedback) and check there for further resources!\n" ] } ] }