A Brief Guide to Climate Data Analysis in Python (+ Code)

By: Ali Ahmadalipour (LinkedIn, Twitter)

April 2022

Link: https://colab.research.google.com/drive/1B7gFBSr0eoZ5IbsA0lY8q3XL8n-3BOn4#scrollTo=_00P5Wq6whH2


Link to a LinkedIn article I wrote explaining the details: https://www.linkedin.com/pulse/python-climate-data-analysis-tutorial-code-ali-ahmadalipour/

Basic info for running the code in a notebook environment:

  • To run a cell, press Shift + Enter
  • To restart the analysis (i.e. clean the variables and RAM, but keep the downloaded data), restart the runtime from the top menu
  • To completely start over (i.e. clean RAM and temporary storage that may contain downloaded data or any saved figures), go to Runtime > Manage sessions > TERMINATE.

Basic info about Google Colab Notebooks:

  • If the page stays idle for several minutes, the runtime may be terminated, and you'll need to connect again.
  • Google provides about 70 GB of free temporary disk space. Any data you download or any model output is by default saved temporarily in this runtime's storage. Therefore, if the runtime is terminated, you'll lose that data. If you would like to keep the data or the outputs, you can mount your Google Drive and choose any specific directory there. Here's how to connect to your Google Drive:
# from google.colab import drive
# drive.mount('/content/drive')

1. Basic (download, extract & save data, concat, groupby, select):

In this section, we will download and analyze gridded precipitation data (from CPC). The goal is to extract daily data, compute monthly totals, find the spatial average of precipitation over a given domain, plot the results, and save the outputs as netCDF files. We will work with some of the commonly used functionalities of xarray (a powerful Python library for analyzing geospatial data), such as:

  • open_mfdataset (in xarray, which opens multiple files at the same time)
  • concatenate datasets
  • groupby
  • slicing and selecting data
  • save as netcdf
# Later in the advanced section of this tutorial (section 3.2), we will be analyzing 
# the zarr data format, and the pre-installed xarray on Google Colab is not able to 
# do so. Thus, we need to install the complete version of xarray.
!pip install xarray[complete] # this may take a few seconds
Requirement already satisfied: xarray[complete] in /usr/local/lib/python3.7/dist-packages (0.18.2)
Collecting h5netcdf
Collecting pydap
Collecting fsspec
Collecting zarr
Collecting numbagg
Collecting nc-time-axis
Collecting cfgrib
Collecting rasterio
...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gym 0.17.3 requires cloudpickle<1.7.0,>=1.2.0, but you have cloudpickle 2.0.0 which is incompatible.
Successfully installed Webob-1.8.7 affine-2.3.1 asciitree-0.3.3 cfgrib-0.9.10.1 click-plugins-1.1.1 cligj-0.7.2 cloudpickle-2.0.0 distributed-2.30.1 eccodes-1.4.1 fasteners-0.17.3 findlibs-0.0.2 fsspec-2022.3.0 h5netcdf-1.0.0 locket-0.2.1 nc-time-axis-1.4.0 numbagg-0.2.1 numcodecs-0.9.1 partd-1.2.0 pydap-3.2.2 rasterio-1.2.10 snuggs-1.4.7 zarr-2.11.2
import glob
import matplotlib.pyplot as plt
import urllib.request
import xarray as xr
/usr/local/lib/python3.7/dist-packages/xarray/backends/cfgrib_.py:28: UserWarning: Failed to load cfgrib - most likely there is a problem accessing the ecCodes library. Try `import cfgrib` to get the full error message
  "Failed to load cfgrib - most likely there is a problem accessing the ecCodes library. "
for yr in range(2011,2015): # note that in python, the end range is not inclusive. So, in this case data for 2015 is not downloaded.
    url = f'https://downloads.psl.noaa.gov/Datasets/cpc_us_precip/RT/precip.V1.0.{yr}.nc'
    savename = url.split('/')[-1]
    urllib.request.urlretrieve(url,savename)

Let's start simple: open data for two years and concatenate them to one file:

ds2011 = xr.open_dataset('precip.V1.0.2011.nc')
ds2012 = xr.open_dataset('precip.V1.0.2012.nc')
ds2011
<xarray.Dataset>
Dimensions:  (lat: 120, lon: 300, time: 365)
Coordinates:
  * lat      (lat) float32 20.12 20.38 20.62 20.88 ... 49.12 49.38 49.62 49.88
  * lon      (lon) float32 230.1 230.4 230.6 230.9 ... 304.1 304.4 304.6 304.9
  * time     (time) datetime64[ns] 2011-01-01 2011-01-02 ... 2011-12-31
Data variables:
    precip   (time, lat, lon) float32 ...
Attributes:
    title:          CPC Unified Gauge-Based Analysis of Daily Precipitation o...
    Conventions:    COARDS
    description:    Gridded daily Precipitation
    platform:       Observations
    Comments:       Preciptation is accumulated from 12z of previous day to 1...
    history:        originally created RT starting 04/2010 by CAS from data o...
    dataset_title:  CPC Unified Gauge-Based Analysis of Daily Precipitation o...
    References:     http://www.psl.noaa.gov/data/gridded/data.unified.daily.c...
ds2012
<xarray.Dataset>
Dimensions:  (lat: 120, lon: 300, time: 366)
Coordinates:
  * lat      (lat) float32 20.12 20.38 20.62 20.88 ... 49.12 49.38 49.62 49.88
  * lon      (lon) float32 230.1 230.4 230.6 230.9 ... 304.1 304.4 304.6 304.9
  * time     (time) datetime64[ns] 2012-01-01 2012-01-02 ... 2012-12-31
Data variables:
    precip   (time, lat, lon) float32 ...
Attributes:
    title:          CPC Unified Gauge-Based Analysis of Daily Precipitation o...
    Conventions:    COARDS
    description:    Gridded daily Precipitation
    platform:       Observations
    Comments:       Preciptation is accumulated from 12z of previous day to 1...
    history:        originally created RT starting 04/2010 by CAS from data o...
    dataset_title:  CPC Unified Gauge-Based Analysis of Daily Precipitation o...
    References:     http://www.psl.noaa.gov/data/gridded/data.unified.daily.c...
ds2011_2012 = xr.concat([ds2011,ds2012], dim='time')
ds2011_2012
<xarray.Dataset>
Dimensions:  (lat: 120, lon: 300, time: 731)
Coordinates:
  * lat      (lat) float32 20.12 20.38 20.62 20.88 ... 49.12 49.38 49.62 49.88
  * lon      (lon) float32 230.1 230.4 230.6 230.9 ... 304.1 304.4 304.6 304.9
  * time     (time) datetime64[ns] 2011-01-01 2011-01-02 ... 2012-12-31
Data variables:
    precip   (time, lat, lon) float32 nan nan nan nan nan ... nan nan nan nan
Attributes:
    title:          CPC Unified Gauge-Based Analysis of Daily Precipitation o...
    Conventions:    COARDS
    description:    Gridded daily Precipitation
    platform:       Observations
    Comments:       Preciptation is accumulated from 12z of previous day to 1...
    history:        originally created RT starting 04/2010 by CAS from data o...
    dataset_title:  CPC Unified Gauge-Based Analysis of Daily Precipitation o...
    References:     http://www.psl.noaa.gov/data/gridded/data.unified.daily.c...

Now, let's try something similar, but in a more efficient way (especially if the number of files is more than two):

ds2011_2014 = xr.open_mfdataset('precip.V1.0.*.nc', concat_dim='time', combine='nested')
# Or, you can use the following command to do the same thing:
# ds2011_2014 = xr.open_mfdataset('precip*.nc', combine='by_coords')
ds2011_2014
<xarray.Dataset>
Dimensions:  (lat: 120, lon: 300, time: 1461)
Coordinates:
  * lat      (lat) float32 20.12 20.38 20.62 20.88 ... 49.12 49.38 49.62 49.88
  * lon      (lon) float32 230.1 230.4 230.6 230.9 ... 304.1 304.4 304.6 304.9
  * time     (time) datetime64[ns] 2011-01-01 2011-01-02 ... 2014-12-31
Data variables:
    precip   (time, lat, lon) float32 dask.array<chunksize=(365, 120, 300), meta=np.ndarray>
Attributes:
    title:          CPC Unified Gauge-Based Analysis of Daily Precipitation o...
    Conventions:    COARDS
    description:    Gridded daily Precipitation
    platform:       Observations
    Comments:       Preciptation is accumulated from 12z of previous day to 1...
    history:        originally created RT starting 04/2010 by CAS from data o...
    dataset_title:  CPC Unified Gauge-Based Analysis of Daily Precipitation o...
    References:     http://www.psl.noaa.gov/data/gridded/data.unified.daily.c...

Now let's focus on 2012 and extract the monthly precipitation sum and make a simple plot of one of the months:

# The great thing about groupby is that you do not need to worry about the leap years or 
# number of days in each month.
# In addition, xarray is label-aware and when you call the plot function, it understands that you want to
# make a spatial plot and finds the lat and lon values and the appropriate title and labels.
ds2012_mon = ds2012.groupby('time.month').sum()
ds2012_mon.precip[0,:,:].plot(cmap='jet', vmax=300)
<matplotlib.collections.QuadMesh at 0x7f32fffe3d50>

[Figure: monthly precipitation total for January 2012]

The above plot is quite simple and not of high quality (e.g. the areas outside the US boundary have no data and are all shown in dark blue, both the x and y axis limits are a bit large and can be narrowed down, the title is not exactly what we may like, etc.). We will now develop a more customized plot for all 12 months, as follows:

import calendar # We'll use this library to easily add month name to subplot titles.
# First, we will develop a land mask data array that we can use to mask out the nan values:
landmask = ds2012.precip.sum(dim='time')>0
fig = plt.figure(figsize=[12,8], facecolor='w')
plt.subplots_adjust(bottom=0.15, top=0.96, left=0.04, right=0.99, 
                    wspace=0.2, hspace=0.27) # wspace and hspace adjust the horizontal and vertical spaces, respectively.
nrows = 3
ncols = 4
for i in range(1, 13):
    plt.subplot(nrows, ncols, i)
    dataplot = ds2012_mon.precip[i-1, :, :].where(landmask) # Remember that in Python, data indices start at 0, but subplot indices start at 1.
    p = plt.pcolormesh(ds2012_mon.lon, ds2012_mon.lat, dataplot,
                   vmax = 400, vmin = 0, cmap = 'nipy_spectral_r',
                   ) 
    plt.xlim([233,295])
    plt.ylim([25,50])
    plt.title(calendar.month_name[dataplot.month.values], fontsize = 13, 
              fontweight = 'bold', color = 'b')
    plt.xticks(fontsize = 11)
    plt.yticks(fontsize = 11)
    if i % ncols == 1: # Add ylabel for the very left subplots
        plt.ylabel('Latitude', fontsize = 11, fontweight = 'bold')
    if i > ncols*(nrows-1): # Add xlabel for the bottom row subplots
        plt.xlabel('Longitude', fontsize = 11, fontweight = 'bold')

# Add a colorbar at the bottom:
cax = fig.add_axes([0.25, 0.06, 0.5, 0.018])
cb = plt.colorbar(cax=cax, orientation='horizontal', extend = 'max',)
cb.ax.tick_params(labelsize=11)
cb.set_label(label='Precipitation (mm)', color = 'k', size=14)

# Now we can save a high resolution (300dpi) version of the figure:
plt.savefig('Fig_prec_cpc_mon_2012.png', format = 'png', dpi = 300)

[Figure: monthly precipitation totals for all 12 months of 2012]

Now let's say we want to extract data for a specific boundary and look at the average condition within that area of interest. For simplicity, we can think of a rectangular box (but you can easily build any landmask as above and use it to focus only on your domain of interest). For this case, let's look at a rectangular box roughly matching the state of Kansas.

top = 40
bottom = 37
left = 258
right = 265.4
ds_sel = ds2011_2014.isel(lon=(ds2011_2014.lon >= left) & (ds2011_2014.lon <= right),
                          lat=(ds2011_2014.lat >= bottom) & (ds2011_2014.lat <= top),
                          )
ds_sel_avg = ds_sel.mean(dim=['lat','lon'])
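
As an aside, here is a sketch that is not in the original notebook: because the lat and lon coordinates in these CPC files increase monotonically, label-based slicing with .sel achieves the same selection more concisely:

# Equivalent label-based selection; valid here because lat and lon
# increase monotonically in the CPC files.
ds_sel_alt = ds2011_2014.sel(lon=slice(left, right), lat=slice(bottom, top))
ds_sel_alt_avg = ds_sel_alt.mean(dim=['lat','lon'])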

Now let's plot the cumulative daily precipitation of the selected area for each year. To make things easier, let's drop Feb 29th from any leap years in the record. Here we go:

ds_sel_avg_noleap = ds_sel_avg.sel(
    time=~((ds_sel_avg.time.dt.month == 2) & (ds_sel_avg.time.dt.day == 29)))
# Here's what the result will look like:
ds_sel_avg_noleap
<xarray.Dataset>
Dimensions:  (time: 1460)
Coordinates:
time     (time) datetime64[ns] 2011-01-01 2011-01-02 ... 2014-12-31
Data variables:
precip   (time) float32 dask.array<chunksize=(365,), meta=np.ndarray>
# Now we can easily save that output as a netcdf file using xarray:
ds_sel_avg_noleap.to_netcdf('ds_prec_Kansas_noleap_2011_2014.nc')
fig = plt.figure(figsize=[8,5], facecolor='w')
for yr in range(2011,2015):
    da_yr = ds_sel_avg_noleap.isel(time = ds_sel_avg_noleap.time.dt.year==yr).precip
    dataplot = da_yr.cumsum()
    plt.plot(dataplot, linewidth=2, label = yr)
plt.legend(fontsize=13)
plt.grid()
plt.xticks(fontsize=12) # we could also place the ticks on Jan-1, Feb-1, etc., but I'll skip that here (see the sketch at the end of this section).
plt.yticks(fontsize=12)
plt.ylabel('Precipitation (mm)', fontsize = 13, fontweight = 'bold')
plt.xlabel('Day of Year', fontsize = 13, fontweight = 'bold')
plt.xlim([0,365])
plt.ylim(bottom=0)
plt.title('Annual cumulative precipitation in Kansas', fontsize=15)
plt.tight_layout()
plt.savefig('Fig_cumsum_prec_Kansas.png', format = 'png', dpi = 300)

[Figure: annual cumulative precipitation in Kansas, 2011-2014]

We could also modify the xticklabels a bit more to show the exact month and day values (instead of the Julian day number). See if you can figure it out yourself; a sketch follows.
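
For instance, here is a minimal sketch of that idea (my addition, not from the original notebook), which places a tick at the first day of each month on the 365-day (no-leap) x-axis used above; it would go just before the savefig call:

import calendar
import numpy as np

# Day-of-year index (0-based) of the first day of each month in a 365-day year.
days_in_month = [calendar.monthrange(2011, m)[1] for m in range(1, 13)]
month_starts = np.cumsum([0] + days_in_month[:-1])
plt.xticks(ticks=month_starts,
           labels=[calendar.month_abbr[m] for m in range(1, 13)],
           fontsize=12)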

2. Intermediate (interpolate, revise coordinates, rolling average):

For this section, we will dive deeper. We will use two datasets with different resolutions and work with interpolation. The two datasets that I have considered are the CPC-Globe and gridMET minimum air temperature data. CPC-Globe has a 0.5 degree (~50 km) spatial resolution, whereas gridMET is at 1/24 degree (~4 km). Here's a summary of the main functionalities that we will practice in this section:

  • Interpolation
  • Converting 0:360 longitude axis to -180:180
  • Scaling a dataset (e.g. converting kelvin to degrees Celsius)
  • Rolling average (with any time window)
# I import the libraries again, to keep the examples separate from each other (in case someone wants to start from here).
import urllib.request
import xarray as xr
import matplotlib.pyplot as plt
import datetime
import pandas as pd

Let's download the two datasets for 2021:

# Downloading gridMET tmin (1/24 deg, ~4km):
url = 'https://www.northwestknowledge.net/metdata/data/tmmn_2021.nc'
savename = 'tmin_gridmet_2021.nc'
urllib.request.urlretrieve(url, savename)

# Downloading CPC-Globe tmin (0.5 deg, ~50km):
url = 'https://downloads.psl.noaa.gov/Datasets/cpc_global_temp/tmin.2021.nc'
savename = 'tmin_CPC_2021.nc'
urllib.request.urlretrieve(url, savename)
('tmin_CPC_2021.nc', <http.client.HTTPMessage at 0x7f32f8b0e550>)
# Now let's open the two datasets and explore them:
ds_gridmet = xr.open_dataset('tmin_gridmet_2021.nc')
ds_CPC = xr.open_dataset('tmin_CPC_2021.nc')
ds_gridmet
<xarray.Dataset>
Dimensions:          (crs: 1, day: 365, lat: 585, lon: 1386)
Coordinates:
  * lon              (lon) float64 -124.8 -124.7 -124.7 ... -67.14 -67.1 -67.06
  * lat              (lat) float64 49.4 49.36 49.32 49.28 ... 25.15 25.11 25.07
  * day              (day) datetime64[ns] 2021-01-01 2021-01-02 ... 2021-12-31
  * crs              (crs) uint16 3
Data variables:
    air_temperature  (day, lat, lon) float32 ...
Attributes: (12/19)
    geospatial_bounds_crs:      EPSG:4326
    Conventions:                CF-1.6
    geospatial_bounds:          POLYGON((-124.7666666333333 49.40000000000000...
    geospatial_lat_min:         25.066666666666666
    geospatial_lat_max:         49.40000000000000
    geospatial_lon_min:         -124.7666666333333
    ...                         ...
    date:                       03 March 2022
    note1:                      The projection information for this file is: ...
    note2:                      Citation: Abatzoglou, J.T., 2013, Development...
    note3:                      Data in slices after last_permanent_slice (1-...
    note4:                      Data in slices after last_provisional_slice (...
    note5:                      Days correspond approximately to calendar day...
ds_CPC
<xarray.Dataset>
Dimensions:  (lat: 360, lon: 720, time: 365)
Coordinates:
  * lat      (lat) float32 89.75 89.25 88.75 88.25 ... -88.75 -89.25 -89.75
  * lon      (lon) float32 0.25 0.75 1.25 1.75 2.25 ... 358.2 358.8 359.2 359.8
  * time     (time) datetime64[ns] 2021-01-01 2021-01-02 ... 2021-12-31
Data variables:
    tmin     (time, lat, lon) float32 ...
Attributes:
    Conventions:    CF-1.0
    Source:         ftp://ftp.cpc.ncep.noaa.gov/precip/wd52ws/global_temp/
    References:     https://www.psl.noaa.gov/data/gridded/data.cpc.globaltemp...
    version:        V1.0
    title:          CPC GLOBAL TEMP V1.0
    dataset_title:  CPC GLOBAL TEMP
    history:        Updated 2022-01-01 16:55:57
# Let's plot the original data for Jan 1st:
fig = plt.figure(figsize = [13,4.5])
plt.subplot(1,2,1)
ds_gridmet.air_temperature[0,:,:].plot(cmap = 'jet')
plt.title('GridMet')
plt.subplot(1,2,2)
ds_CPC.tmin[0,:,:].plot(cmap = 'jet')
plt.title('CPC-Globe')
plt.tight_layout()

[Figure: gridMET and CPC-Globe tmin on Jan 1, 2021]

Looking at the two datasets, we see that there are a few differences that should be addressed:

  • In the gridMET dataset, the "crs" coordinate can be dropped, and the "day" coordinate can be renamed to "time" to be consistent with the CPC dataset (similarly, "air_temperature" can be renamed to "tmin").
  • The gridMET data is in kelvin, but CPC is in degrees Celsius. Let's convert the gridMET data to Celsius.
  • In addition, the lon coordinate is 0:360 in one dataset and -180:180 in the other. Let's change the gridMET data to 0:360.
  • Lastly, let's interpolate the coarser resolution data (CPC-Globe) to the gridMET resolution.
ds_gridmet_revised = ds_gridmet.drop('crs').rename({'day':'time', 'air_temperature':'tmin'})
ds_gridmet_revised = ds_gridmet_revised-273.15 # Convert kelvin to Celsius
lon_revised = ds_gridmet.lon + (ds_gridmet.lon < 0)*360
ds_gridmet_revised = ds_gridmet_revised.assign_coords(lon = lon_revised)
ds_gridmet_revised
<xarray.Dataset>
Dimensions:  (lat: 585, lon: 1386, time: 365)
Coordinates:
lon      (lon) float64 235.2 235.3 235.3 235.4 ... 292.8 292.9 292.9 292.9
lat      (lat) float64 49.4 49.36 49.32 49.28 ... 25.19 25.15 25.11 25.07
time     (time) datetime64[ns] 2021-01-01 2021-01-02 ... 2021-12-31
Data variables:
tmin     (time, lat, lon) float32 nan nan nan nan nan ... nan nan nan nan
Attributes: (12/19)
geospatial_bounds_crs:      EPSG:4326
Conventions:                CF-1.6
geospatial_bounds:          POLYGON((-124.7666666333333 49.40000000000000...
geospatial_lat_min:         25.066666666666666
geospatial_lat_max:         49.40000000000000
geospatial_lon_min:         -124.7666666333333
...                         ...
date:                       03 March 2022
note1:                      The projection information for this file is: ...
note2:                      Citation: Abatzoglou, J.T., 2013, Development...
note3:                      Data in slices after last_permanent_slice (1-...
note4:                      Data in slices after last_provisional_slice (...
note5:                      Days correspond approximately to calendar day...
ds_CPC_interp = ds_CPC.interp(lat = ds_gridmet_revised.lat, lon = ds_gridmet_revised.lon)

Great! We did all that with just a few lines of code. Please note that in interpolation, the data bounds (i.e. the lat/lon extent) are also matched to the output boundary.
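
A related note on the API (a sketch, not part of the original notebook): interp defaults to linear interpolation, and a nearest-neighbor variant can be requested explicitly, which may be preferable when you want to preserve the original coarse-cell values:

# Nearest-neighbor regridding instead of the default linear interpolation.
ds_CPC_nearest = ds_CPC.interp(lat=ds_gridmet_revised.lat,
                               lon=ds_gridmet_revised.lon,
                               method='nearest')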

Now let's take a look at Feb 16, 2021 in Texas, where a severe cold snap occurred. I did a quick analysis of that event last year, which you can find here: https://www.linkedin.com/pulse/how-unusual-2021-texas-cold-span-ali-ahmadalipour/

To make the temporal analyses easier, we first make sure that the "time" coordinate is in proper datetime format.

ds_gridmet_revised = ds_gridmet_revised.assign_coords(
    time = pd.to_datetime(ds_gridmet_revised.time))
ds_CPC_interp = ds_CPC_interp.assign_coords(
    time = pd.to_datetime(ds_CPC_interp.time))
target_date = datetime.date(2021,2,16)
target_date = pd.to_datetime(target_date)
fig = plt.figure(figsize = [13,4.5], facecolor='w')
plt.subplot(1,2,1)
ds_gridmet_revised.sel(time=target_date).tmin.plot(cmap = 'nipy_spectral', vmin = -30, vmax = 20)
plt.title(f'GridMet tmin on {target_date.strftime("%Y-%m-%d")}')

plt.subplot(1,2,2)
ds_CPC_interp.sel(time=target_date).tmin.plot(cmap = 'nipy_spectral', vmin = -30, vmax = 20)
plt.title(f'CPC-Globe tmin on {target_date.strftime("%Y-%m-%d")}')
plt.tight_layout()

[Figure: gridMET and interpolated CPC-Globe tmin on Feb 16, 2021]

It can be seen that the interpolated data (CPC plot shown on the right) does not necessarily capture the details (especially the orographic and elevation effects) of the finer resolution data.

Now let's find the spatial mean of both datasets around Austin, TX. Again, to keep it simple, I defined an approximate rectangular boundary that we will use.

# Rough boundaries for Austin, TX:
left = 360 - 97.9
right = 360 - 97.6
top = 30.5
bottom = 30.2
ds_Austin_gridmet = ds_gridmet_revised.isel(
    lon=(ds_gridmet_revised.lon >= left) & (ds_gridmet_revised.lon <= right),
    lat=(ds_gridmet_revised.lat >= bottom) & (ds_gridmet_revised.lat <= top),
).mean(dim=['lat','lon'])
ds_Austin_CPC = ds_CPC_interp.isel(
    lon=(ds_CPC_interp.lon >= left) & (ds_CPC_interp.lon <= right),
    lat=(ds_CPC_interp.lat >= bottom) & (ds_CPC_interp.lat <= top),
).mean(dim=['lat','lon'])

Now let's plot the two timeseries, but this time in degrees Fahrenheit:

plt.figure(figsize = [12,6])
(ds_Austin_gridmet.tmin*1.8 + 32).plot(label = 'GridMet', color = 'r')
(ds_Austin_CPC.tmin*1.8 + 32).plot(label = 'CPC', color = 'b')
plt.grid(axis='y')
plt.xticks(ticks = [datetime.date(2021,x,1) for x in range(1,13)], fontsize=12)
plt.xlim([datetime.date(2021,1,1), datetime.date(2021,12,31)])
plt.yticks(fontsize=13)
plt.ylabel('Minimum air temperature (Tmin, °F)', fontsize = 12, 
           fontweight = 'bold')
plt.xlabel('')
plt.legend(fontsize=13, loc = 'upper left')
plt.title('Austin, TX', fontsize=13, fontweight = 'bold')
plt.tight_layout()

[Figure: daily tmin for Austin, TX, gridMET vs. CPC]

As you can see, the two datasets are considerably different (more than 10°F) in late February and late October. Notably, daily temperature has a lot of noise (short-term variation). To have a smoother plot, let's calculate the moving average of tmin with a 15-day window:

plt.figure(figsize = [12,6])
(ds_Austin_gridmet.tmin*1.8 + 32).rolling(time=15,center=True).mean().plot(label = 'GridMet', color = 'r')
(ds_Austin_CPC.tmin*1.8 + 32).rolling(time=15,center=True).mean().plot(label = 'CPC', color = 'b')
plt.grid()
plt.xticks(ticks = [datetime.date(2021,x,1) for x in range(1,13)], fontsize=12)
plt.xlim([datetime.date(2021,1,1), datetime.date(2021,12,31)])
plt.yticks(fontsize=13)
plt.ylabel('Minimum air temperature (Tmin, °F)', fontsize = 12, 
           fontweight = 'bold')
plt.xlabel('')
plt.legend(fontsize=13, loc = 'upper left')
plt.title('Austin, TX', fontsize=13, fontweight = 'bold')
plt.tight_layout()

[Figure: 15-day rolling mean of tmin for Austin, TX]

Awesome! We are done with the intermediate example. You should now be able to replicate similar analyses for various datasets. There are a lot of other things that can be adjusted to make the plots more interesting. You can always search for anything you'd like to do, and you will most likely find a decent answer on Stack Overflow.

3. Advanced (skip data download, anomaly calculation, timelapse animation):

3.1. Seasonal Forecast:

In this part, we will analyze seasonal forecast data on the go (without downloading and saving the data on the disk). Then, we will look at calculating monthly anomaly (departure of each month from its historical mean state).

The data that we will focus on is the NMME seasonal climate prediction, which is a global dataset at 1 degree (~100 km) spatial resolution and monthly temporal resolution, with a forecast lead time of multiple months. To make the analysis simpler, we will focus on just one model (instead of the entire ensemble of available NMME models). Let's go!

import xarray as xr
import matplotlib.pyplot as plt
import datetime
import pandas as pd
url = 'http://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME/.GFDL-SPEAR/.FORECAST/.MONTHLY/.tref/dods'
ds = xr.open_dataset(url,engine='netcdf4',decode_times=False)
ds
<xarray.Dataset>
Dimensions:  (L: 12, M: 30, S: 17, X: 360, Y: 181, Z: 1)
Coordinates:
  * S        (S) float32 731.0 732.0 733.0 734.0 ... 744.0 745.0 746.0 747.0
  * M        (M) float32 1.0 2.0 3.0 4.0 5.0 6.0 ... 26.0 27.0 28.0 29.0 30.0
  * X        (X) float32 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
  * L        (L) float32 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5
  * Y        (Y) float32 -90.0 -89.0 -88.0 -87.0 -86.0 ... 87.0 88.0 89.0 90.0
  * Z        (Z) float32 2.0
Data variables:
    tref     (S, L, M, Z, Y, X) float32 ...
Attributes:
    Conventions:  IRIDL

The NMME coordinates are not really self-explanatory. So, here's an overview of what each coordinate stands for:

  • S: Time (number of months since 1960-Jan-1)
  • M: Ensemble member
  • X: Longitude (in 0:360 format)
  • L: Leadtime (in months; 0.5 indicating the current month, 1.5 being one month ahead, and so on)
  • Y: Latitude
  • Z: this coordinate is only found in the GFDL model's temperature forecast, and it indicates that the data is at 2 m height above ground.

To make the data more descriptive and more convenient for analysis, we need to modify it first:

ds = ds.rename({'S':'time', 'X':'lon', 'Y':'lat'})
start_date_NMME = pd.to_datetime(datetime.date(1960,1,1))
time_new = [start_date_NMME + pd.DateOffset(months = x) for x in ds.time.values]
ds = ds.assign_coords(time = time_new)
if 'Z' in ds.dims:
    ds = ds.squeeze(dim = 'Z').drop('Z')
ds
<xarray.Dataset>
Dimensions:  (L: 12, M: 30, lat: 181, lon: 360, time: 17)
Coordinates:
  * time     (time) datetime64[ns] 2020-12-01 2021-01-01 ... 2022-04-01
  * M        (M) float32 1.0 2.0 3.0 4.0 5.0 6.0 ... 26.0 27.0 28.0 29.0 30.0
  * lon      (lon) float32 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
  * L        (L) float32 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5
  * lat      (lat) float32 -90.0 -89.0 -88.0 -87.0 -86.0 ... 87.0 88.0 89.0 90.0
Data variables:
    tref     (time, L, M, lat, lon) float32 ...
Attributes:
    Conventions:  IRIDL

The above dataset has all the available forecast data for all leadtimes. We can now select our area of interest and limit the leadtime to our use case. For this example, let's take a look at the temperature forecast for Feb 2021 that was generated at the beginning of the same month (i.e. lead 0.5):

target_date = pd.to_datetime(datetime.date(2021,2,1))
ds_sel = ds.sel(time=target_date).isel(L=0) 
# Note the different use of "sel" and "isel". For the former, you should indicate 
# the exact value to be selected, but for the latter, the index should be specified.

So far, the data has not been loaded yet (although we can see all the metadata). To make the analysis easier, we will first load the data and then continue with the rest of the analyses. For loading data, I am simply using .load(); a better way of doing so is to use Dask and do the work in parallel. I won't go into that here (partly because I tried it in Google Colab, kept getting several errors, and didn't want to spend too much time on debugging), but a rough sketch follows the output below.

ds_sel.load() # this can take a couple of minutes or so
<xarray.Dataset>
Dimensions:  (M: 30, lat: 181, lon: 360)
Coordinates:
    time     datetime64[ns] 2021-02-01
  * M        (M) float32 1.0 2.0 3.0 4.0 5.0 6.0 ... 26.0 27.0 28.0 29.0 30.0
  * lon      (lon) float32 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
    L        float32 0.5
  * lat      (lat) float32 -90.0 -89.0 -88.0 -87.0 -86.0 ... 87.0 88.0 89.0 90.0
Data variables:
    tref     (M, lat, lon) float32 nan nan nan nan nan ... nan nan nan nan nan
Attributes:
    Conventions:  IRIDL
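
As an aside, a chunked Dask-based alternative to .load() might look like the sketch below (my addition; untested here given the Colab issues mentioned above, and the chunk size is illustrative):

import dask

# Re-select lazily, chunk along the ensemble dimension, and compute in parallel.
ds_sel_chunked = ds.sel(time=target_date).isel(L=0).chunk({'M': 5})
with dask.config.set(scheduler='threads'):
    ds_sel_loaded = ds_sel_chunked.compute()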

Now let's plot the data and explore it a bit:

ds_sel.tref[0,:,:].plot()
<matplotlib.collections.QuadMesh at 0x7f32fddf8950>

[Figure: tref forecast map for the first ensemble member]

Now let's calculate the standard deviation of forecasts among the ensemble members, and then plot them for the entire globe as well as only North America:

ds_std = ds_sel.std(dim='M')
plt.figure(figsize=[12,4.5])
plt.subplot(1,2,1)
ds_std.tref.plot(cmap = 'Spectral_r', vmin = 0, vmax = 4)
plt.subplot(1,2,2)
ds_std.tref.plot(cmap = 'Spectral_r', vmin = 0, vmax = 4)
plt.xlim([230,300])
plt.ylim([20,55])
plt.tight_layout()

[Figure: ensemble standard deviation of tref, global (left) and North America (right)]

Looking at the above plots, we can see that the uncertainty of temperature forecast in February 2021 is much higher across the northern latitudes (i.e. in winter) than the southern latitudes (i.e. in summer).

Now let's check if the forecasts indicated any significant cold anomaly for Feb 2021 across the Midwest US and Southern Plains. To do so, we need to load the climatology (historical mean) of GFDL forecasts and then find the anomaly by subtracting it from forecasts. In other words:

  • Anomaly = Forecast - Climatology
ds_clim = xr.open_dataset('http://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME/.GFDL-SPEAR/.HINDCAST/.mc9120/.tref/dods',
                          decode_times=False)
# Again, to make things easier, we first load the climatology into RAM:
ds_clim.load()
<xarray.Dataset>
Dimensions:  (L: 12, S: 12, X: 360, Y: 181)
Coordinates:
Y        (Y) float32 -90.0 -89.0 -88.0 -87.0 -86.0 ... 87.0 88.0 89.0 90.0
S        (S) float32 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0
X        (X) float32 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
L        (L) float32 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5
Data variables:
tref     (S, L, Y, X) float64 nan nan nan nan nan ... nan nan nan nan nan
Attributes:
Conventions:  IRIDL

In the above dataset, S indicates the month index (0 corresponding to January and 11 indicating December). We should adjust the coordinates of this climatology dataset and match it to the "ds_sel" that we prepared above:

ds_clim = ds_clim.rename({'X':'lon', 'Y':'lat'})
ds_clim_sel = ds_clim.isel(L = 0).sel(S = 1).drop(['S','L']) # S=1 for selecting February
ds_anom = ds_sel - ds_clim_sel
ds_anom.tref.plot(col = 'M', col_wrap = 10, vmin = -8, vmax = 8, cmap = 'bwr')
plt.xlim([230,300])
plt.ylim([20,55])
(20.0, 55.0)

[Figure: Feb 2021 tref anomaly for each of the 30 ensemble members]

As can be seen, most of the ensemble members of the GFDL-SPEAR model correctly forecasted a negative anomaly (i.e. colder-than-usual conditions) for the majority of the Midwest US and Texas. Notably, this is the lead-0 forecast, which was initialized at the beginning of February for the month of February (thus, basically a couple of weeks ahead). If we look back at lead-1 or beyond, such a strong pattern and consensus are not found.

# And here's the ensemble mean plot:
ds_anom.mean(dim='M').tref.plot(cmap='bwr', vmax=5, vmin=-5)
plt.xlim([230,300])
plt.ylim([20,55])
(20.0, 55.0)

[Figure: ensemble-mean Feb 2021 tref anomaly]

I was hoping to use two more libraries in this section:

  • dask --> for parallel computing
  • cartopy --> for customized mapping with geographic projection

However, I ran into several errors trying to use them on Google Colab, so I decided to exclude them from the advanced section. But if you really want to reach an advanced proficiency level, I highly encourage you to try dask and cartopy; a minimal cartopy sketch follows.
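
For reference, here is a minimal cartopy sketch (assuming cartopy installs cleanly in your environment; this is my addition, not part of the notebook above) that plots the ensemble-mean anomaly with coastlines on a geographic projection:

import cartopy.crs as ccrs
import matplotlib.pyplot as plt

ax = plt.axes(projection=ccrs.PlateCarree())
ds_anom.mean(dim='M').tref.plot(ax=ax, transform=ccrs.PlateCarree(),
                                cmap='bwr', vmin=-5, vmax=5)
ax.coastlines()  # draw coastlines on top of the anomaly field
ax.set_extent([230, 300, 20, 55], crs=ccrs.PlateCarree())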

3.2. Climate change assessment using CMIP6 data:

I thought it would be really useful to have a quick exercise in analyzing climate change data. In this part, we will explore CMIP6 air temperature projections from one model. We will look at monthly data for the historical period of 1980-2010 and future SSP585 projections for 2070-2100.

!pip install gcsfs # this will take a few seconds. We need it to extract CMIP6 data from Google Cloud Storage.

# We will be opening the zarr data format, which is a relatively new data structure 
# that is practical for geospatial datasets. The pre-installed xarray on Google
# Colab does not allow this. So, we need to install the complete version of xarray.
!pip install xarray[complete] # (adding this again in case someone wants to start from this part)
Collecting gcsfs
  Downloading gcsfs-2022.3.0-py2.py3-none-any.whl (25 kB)
Collecting aiohttp<4
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Requirement already satisfied: google-auth-oauthlib in /usr/local/lib/python3.7/dist-packages (from gcsfs) (0.4.6)
Requirement already satisfied: decorator>4.1.2 in /usr/local/lib/python3.7/dist-packages (from gcsfs) (4.4.2)
Requirement already satisfied: google-cloud-storage in /usr/local/lib/python3.7/dist-packages (from gcsfs) (1.18.1)

...

from matplotlib import pyplot as plt
import pandas as pd
import xarray as xr
import gcsfs
import datetime
import os
df = pd.read_csv('https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv')
df.head()

   activity_id institution_id     source_id       experiment_id member_id table_id variable_id grid_label                                             zstore  dcpp_init_year   version
0   HighResMIP           CMCC  CMCC-CM2-HR4  highresSST-present  r1i1p1f1     Amon          ps         gn  gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/...             NaN  20170706
1   HighResMIP           CMCC  CMCC-CM2-HR4  highresSST-present  r1i1p1f1     Amon        rsds         gn  gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/...             NaN  20170706
2   HighResMIP           CMCC  CMCC-CM2-HR4  highresSST-present  r1i1p1f1     Amon        rlus         gn  gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/...             NaN  20170706
3   HighResMIP           CMCC  CMCC-CM2-HR4  highresSST-present  r1i1p1f1     Amon        rlds         gn  gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/...             NaN  20170706
4   HighResMIP           CMCC  CMCC-CM2-HR4  highresSST-present  r1i1p1f1     Amon         psl         gn  gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/...             NaN  20170706

len(df)
523774

The df table includes information for all climate change model runs that are available on Google Cloud Storage. As of March 2022, there are over half a million different model outputs (e.g. different models, variables, timescales, scenarios, ensemble members, etc.). Let's narrow down the outputs to only those of monthly air temperature in the historical and ssp585 scenarios:

df_ssp585 = df.query("activity_id=='ScenarioMIP' & table_id == 'Amon' & " +\
    "variable_id == 'tas' & experiment_id == 'ssp585' & member_id == 'r1i1p1f1'")
print('Length of df_ssp585:', len(df_ssp585))
df_ssp585.head(3)
Length of df_ssp585: 35

       activity_id institution_id    source_id experiment_id member_id table_id variable_id grid_label                                             zstore  dcpp_init_year   version
866    ScenarioMIP      NOAA-GFDL     GFDL-CM4        ssp585  r1i1p1f1     Amon         tas        gr1  gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-CM...             NaN  20180701
19008  ScenarioMIP      NOAA-GFDL    GFDL-ESM4        ssp585  r1i1p1f1     Amon         tas        gr1  gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ES...             NaN  20180701
66586  ScenarioMIP            BCC  BCC-CSM2-MR        ssp585  r1i1p1f1     Amon         tas         gn  gs://cmip6/CMIP6/ScenarioMIP/BCC/BCC-CSM2-MR/s...             NaN  20190314

df_historical = df.query("activity_id == 'CMIP' & table_id == 'Amon' & " +\
    "variable_id == 'tas' & experiment_id == 'historical' & member_id == 'r1i1p1f1'")
print('Length of df_historical:', len(df_historical))
df_historical.head(3)
Length of df_historical: 55

      activity_id institution_id     source_id experiment_id member_id table_id variable_id grid_label                                             zstore  dcpp_init_year   version
8074         CMIP      NOAA-GFDL      GFDL-CM4    historical  r1i1p1f1     Amon         tas        gr1  gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo...             NaN  20180701
29231        CMIP           IPSL  IPSL-CM6A-LR    historical  r1i1p1f1     Amon         tas         gr  gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor...             NaN  20180803
32351        CMIP      NASA-GISS   GISS-E2-1-G    historical  r1i1p1f1     Amon         tas         gn  gs://cmip6/CMIP6/CMIP/NASA-GISS/GISS-E2-1-G/hi...             NaN  20180827


Ok, much better. In summary, our CMIP6 search is narrowed down to 55 historical runs and 35 ssp585 future monthly temperature "forecasts". In a decent climate change study, we should first evaluate the models over the historical period and then select a subset of models that perform well over the region of interest (a sketch of that idea follows). However, since this is just an exercise, I will only choose one model that has data in both the historical and future periods.
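
As a rough illustration of what that evaluation step might look like (the observational dataset ds_obs and its prior regridding to the model grid are assumptions, not taken from this notebook), one could rank models by a simple RMSE score over the region and period of interest:

import numpy as np

# Hypothetical skill score: RMSE of a model's historical tas against an
# observational dataset ds_obs that was regridded to the model grid beforehand.
def rmse_score(ds_model, ds_obs):
    diff = ds_model.tas - ds_obs.tas  # xarray aligns time/lat/lon by label
    return float(np.sqrt((diff ** 2).mean()))

# Lower RMSE means closer agreement with observations; the best-scoring
# models would then be retained for the future-scenario analysis.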

model = 'GFDL-CM4'
zstore_hist = df_historical.query(f"source_id == '{model}'").zstore.values[0]
zstore_ssp585 = df_ssp585.query(f"source_id == '{model}'").zstore.values[0]
gcs = gcsfs.GCSFileSystem(token='anon')
mapper = gcs.get_mapper(zstore_hist)
ds_hist = xr.open_zarr(mapper, consolidated = True)
mapper = gcs.get_mapper(zstore_ssp585)
ds_ssp585 = xr.open_zarr(mapper, consolidated = True)
ds_hist
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 180, lon: 288, time: 1980)
Coordinates:
  * bnds       (bnds) float64 1.0 2.0
    height     float64 ...
  * lat        (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(180, 2), meta=np.ndarray>
  * lon        (lon) float64 0.625 1.875 3.125 4.375 ... 355.6 356.9 358.1 359.4
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(288, 2), meta=np.ndarray>
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
Data variables:
    tas        (time, lat, lon) float32 dask.array<chunksize=(600, 180, 288), meta=np.ndarray>
Attributes: (12/49)
    Conventions:            CF-1.7 CMIP-6.0 UGRID-1.0
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  36500.0
    comment:                <null ref>
    ...                     ...
    variable_id:            tas
    variant_info:           N/A
    variant_label:          r1i1p1f1
    status:                 2019-08-07;created;by nhn2@columbia.edu
    netcdf_tracking_ids:    hdl:21.14100/e4193a02-6405-49b6-8ad3-65def741a4dd...
    version_id:             v20180701

Looking at the metadata of the two datasets, the data periods extend beyond our interest (historical: 1850-2014, ssp585: 2015-2100). Therefore, we can select our period of interest and then load that subset of the data for more convenient analysis. But before that, note that the time coordinate is in "object" format, which we need to convert to datetime to easily analyze the timeseries.

print('hist date range  :', ds_hist.time[0].values, ' , ', ds_hist.time[-1].values)
print('ssp585 date range:', ds_ssp585.time[0].values, ' , ', ds_ssp585.time[-1].values)
代码语言:javascript
复制
hist date range  : 1850-01-16 12:00:00  ,  2014-12-16 12:00:00
ssp585 date range: 2015-01-16 12:00:00  ,  2100-12-16 12:00:00
代码语言:javascript
复制
start_time = pd.to_datetime(datetime.date(1850,1,15)) # I chose the 15th for all dates to keep things simple.
time_new_hist = [start_time + pd.DateOffset(months = x) for x in range(len(ds_hist.time))]

start_time = pd.to_datetime(datetime.date(2015,1,15))
time_new_ssp585 = [start_time + pd.DateOffset(months = x) for x in range(len(ds_ssp585.time))]
代码语言:javascript
复制
ds_hist = ds_hist.assign_coords(time = time_new_hist)
ds_ssp585 = ds_ssp585.assign_coords(time = time_new_ssp585)
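As a side note, if you'd rather not rebuild the time axis by hand, xarray's CFTimeIndex offers a direct conversion (a sketch; it would have to run on the raw dataset, i.e. before the manual reassignment above, and it warns when a non-standard model calendar does not map exactly onto the standard calendar):

代码语言:javascript
复制
# Alternative sketch: convert the cftime index directly (run before the
# assign_coords step above). Emits a warning for non-standard calendars.
# ds_hist = ds_hist.assign_coords(time=ds_hist.indexes['time'].to_datetimeindex())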
代码语言:javascript
复制
start_date = pd.to_datetime(datetime.date(1980,1,1))
end_date = pd.to_datetime(datetime.date(2010,12,31))
ds_hist_sel = ds_hist.isel(time=(ds_hist.time >= start_date) & (ds_hist.time <= end_date))

start_date = pd.to_datetime(datetime.date(2070,1,1))
end_date = pd.to_datetime(datetime.date(2099,12,31))
ds_ssp585_sel = ds_ssp585.isel(time=(ds_ssp585.time >= start_date) & (ds_ssp585.time <= end_date))
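Since the time coordinate is now a proper datetime index, the same subsets can also be taken with label-based slicing, which some may find more readable (an equivalent sketch):

代码语言:javascript
复制
# Equivalent label-based selection (same result as the boolean isel masks above):
ds_hist_sel = ds_hist.sel(time=slice('1980-01-01', '2010-12-31'))
ds_ssp585_sel = ds_ssp585.sel(time=slice('2070-01-01', '2099-12-31'))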
代码语言:javascript
复制
ds_hist_sel.load()
ds_ssp585_sel.load()
代码语言:javascript
复制
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 180, lon: 288, time: 360)
Coordinates:
bnds       (bnds) float64 1.0 2.0
height     float64 2.0
lat        (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
lat_bnds   (lat, bnds) float64 -90.0 -89.0 -89.0 -88.0 ... 89.0 89.0 90.0
lon        (lon) float64 0.625 1.875 3.125 4.375 ... 355.6 356.9 358.1 359.4
lon_bnds   (lon, bnds) float64 0.0 1.25 1.25 2.5 ... 357.5 358.8 358.8 360.0
time       (time) datetime64[ns] 2070-01-15 2070-02-15 ... 2099-12-15
time_bnds  (time, bnds) object 2070-01-01 00:00:00 ... 2100-01-01 00:00:00
Data variables:
tas        (time, lat, lon) float32 242.4 242.4 242.4 ... 269.6 269.6 269.6
Attributes: (12/49)
Conventions:            CF-1.7 CMIP-6.0 UGRID-1.0
activity_id:            ScenarioMIP
branch_method:          standard
branch_time_in_child:   0.0
branch_time_in_parent:  60225.0
comment:                <null ref>
...                     ...
variable_id:            tas
variant_info:           N/A
variant_label:          r1i1p1f1
status:                 2019-08-13;created;by nhn2@columbia.edu
netcdf_tracking_ids:    hdl:21.14100/991bc4a4-20d5-4a58-a451-6b3ea33116be
version_id:             v20180701

Great! Now let's take a look at the average monthly temperature change over the globe in the distant future:

代码语言:javascript
复制
tas_avg_hist = ds_hist_sel.groupby('time.month').mean()      # 30-year monthly climatology, historical
tas_avg_ssp585 = ds_ssp585_sel.groupby('time.month').mean()  # 30-year monthly climatology, future
代码语言:javascript
复制
tas_30yr_diff = tas_avg_ssp585 - tas_avg_hist
代码语言:javascript
复制
tas_30yr_diff.tas.plot(col = 'month', col_wrap = 6, vmax = 10, vmin = 0, cmap = 'hot_r')
代码语言:javascript
复制
<xarray.plot.facetgrid.FacetGrid at 0x7f32e9cd8b10>

[Figure: 12-panel facet plot of mean temperature change for each calendar month, ssp585 (2070-2099) vs. historical (1980-2010)]

And here's a plot for the annual mean change:

代码语言:javascript
复制
tas_30yr_diff.mean('month').tas.plot(figsize=[8,5], cmap = 'hot_r', 
                                     vmin = 0, vmax = 10)
plt.title('Mean temperature change: 2070-2100 vs. 1980-2010')
plt.tight_layout()

[Figure: global map of annual mean temperature change, 2070-2100 vs. 1980-2010]

The above plots show that under an extreme scenario (ssp585), the annual mean temperature in northern latitudes can be more than 10°C warmer than in the historical period, while mid-latitudes can be ~5-7°C warmer. Please note that this is the result of a single GCM initialization at coarse resolution. In practice, these outputs should first be bias-corrected and downscaled before further analysis.
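To give a flavor of the simplest possible approach, here is a delta-change sketch (obs_clim is a hypothetical observed monthly climatology regridded to the model grid; it is not defined in this notebook, and real studies use more rigorous methods such as quantile mapping):

代码语言:javascript
复制
# Delta-change sketch (hypothetical: obs_clim is NOT defined in this notebook).
# The model's own change signal is added onto an observed climatology, which
# removes the model's mean bias while preserving its projected change.
# delta = tas_avg_ssp585.tas - tas_avg_hist.tas  # model change signal, by calendar month
# tas_future_corrected = obs_clim + delta        # observed baseline + model delta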

Now let's work with timeseries data (instead of 30-year means) and look at the temperature change for each individual year.

代码语言:javascript
复制
tas_avg_hist_yr = tas_avg_hist.mean('month')  # historical annual-mean climatology (lat, lon)
代码语言:javascript
复制
tas_change_yr = ds_ssp585_sel.groupby('time.year').mean('time')  # annual mean of each future year
代码语言:javascript
复制
tas_change_yr = tas_change_yr - tas_avg_hist_yr  # the (lat, lon) climatology broadcasts across years

Creating a timelapse:

Now let's make a timelapse video of the annual temperature change. For a smoother video, we'll plot the 5-year rolling average.

代码语言:javascript
复制
tas_change_yr_rolling5 = tas_change_yr.rolling(year=5,center=True).mean().dropna('year').tas
# Make a directory to save all the figures there:
if not os.path.exists('./Figures_ssp585/'):
    os.makedirs('./Figures_ssp585/')

for i in range(len(tas_change_yr_rolling5)):
    dataplot = tas_change_yr_rolling5[i,:,:]
    # Convert 0:360 to -180:180 :
    dataplot = dataplot.assign_coords(lon = dataplot.lon - (dataplot.lon > 180)*360)
    dataplot = dataplot.sortby('lon', ascending=True)

    fig = plt.figure(figsize=[9,5], facecolor='w')
    # Adjust plot area (I found these values by trial and error until I got what I wanted)
    plt.subplots_adjust(left=0.075, right=0.895, bottom=0.1, top=0.93)
    plt.pcolormesh(dataplot.lon, dataplot.lat, dataplot, cmap='plasma', vmin=0, vmax=12)
    plt.title(f'Near-surface air temperature change: {model} ssp585, {dataplot.year.values} vs. 1980-2010',
              fontsize = 14)
    plt.ylabel('Latitude', fontsize = 12)
    plt.xlabel('Longitude', fontsize = 12)
    # Add colorbar:
    cax = fig.add_axes([0.91, 0.12, 0.02, 0.8])
    cb = plt.colorbar(cax=cax, orientation='vertical', extend = 'max')
    cb.ax.tick_params(labelsize=11)
    cb.set_label(label='Temperature Change (°C)', color = 'k', size=13)
    # Save and close figure:
    plt.savefig(f'./Figures_ssp585/Fig_tasChange_{dataplot.year.values}.png', 
                format = 'png', dpi=200)
    plt.close()

Great! Now that we have saved all the figures, we can use different libraries to generate a timelapse (/animation) from them. I will use OpenCV here, which is an amazing library for image processing; you can also leverage it for other applications in climate data analysis (e.g. spatial smoothing, applying filters or weighted kernels that emphasize edges for feature detection/extraction, or data augmentation in computer vision applications). OpenCV deserves a whole tutorial of its own, and I'm not going to go into detail here. But again, if you intend to become an advanced user, I highly recommend working with OpenCV.

代码语言:javascript
复制
import cv2
import glob
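As a quick aside before building the video: here is a minimal sketch of OpenCV as a spatial smoother, applying a Gaussian blur to one of the temperature-change fields computed above (the 5x5 kernel and sigma of 1.5 are arbitrary illustrative choices, not part of the original analysis):

代码语言:javascript
复制
import cv2

# Sketch: Gaussian smoothing of one 2D field. OpenCV prefers float32 input;
# the kernel size (5, 5) and sigma 1.5 are arbitrary illustrative choices.
field = tas_change_yr_rolling5[0, :, :].values.astype('float32')
smoothed = cv2.GaussianBlur(field, (5, 5), 1.5)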
代码语言:javascript
复制
files = glob.glob('./Figures_ssp585/Fig_tasChange*.png')
files.sort()  # sort by filename so the frames are in chronological order
代码语言:javascript
复制
img_array = []
for filename in files:
    img = cv2.imread(filename)  # reads the PNG as a BGR array, which VideoWriter expects
    height, width, layers = img.shape
    size = (width, height)
    img_array.append(img)

fps = 4  # frames per second
out = cv2.VideoWriter('Vid_tasChange_ssp585.mp4', cv2.VideoWriter_fourcc(*'MP4V'), fps, size)
for img in img_array:
    out.write(img)
out.release()
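If you prefer a GIF, imageio offers a similarly compact alternative (a sketch; note that imageio expects RGB whereas cv2.imread returns BGR, so here the PNGs are read with imageio directly):

代码语言:javascript
复制
# Alternative sketch with imageio (produces a GIF instead of an MP4):
# import imageio
# frames = [imageio.imread(f) for f in files]
# imageio.mimsave('Vid_tasChange_ssp585.gif', frames, duration=0.25)  # 0.25 s/frame = 4 fps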

There we go! We were able to load CMIP6 data directly from Google Cloud Storage, analyze the data, generate figures, and then make a timelapse animation. Remember that after you close the Colab notebook (or if your session is terminated for any reason), you will lose all the results. So, if you want to keep any of these results, either download them to your local computer or save/copy the outputs to your own Google Drive; for example (a sketch, assuming your Drive is mounted at the default /content/drive):
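代码语言:javascript
复制
# Sketch: copy the video to a Drive folder so it survives the session
# (assumes Google Drive is mounted at /content/drive).
# import shutil
# shutil.copy('Vid_tasChange_ssp585.mp4', '/content/drive/MyDrive/')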

Generating a simple timeseries plot:

We can define a region of interest and explore the spatial mean of the temperature change. For this example, we'll focus on the Northwest US (e.g. the Cascades):

代码语言:javascript
复制
# Bounding box for the US Pacific Northwest. Longitudes use the dataset's
# 0-360 convention, so 236-240 corresponds to 124W-120W:
left = 236
right = 240
bottom = 42
top = 49
代码语言:javascript
复制
tas_NW_yr_hist = ds_hist_sel.isel(lat = (ds_hist_sel.lat>=bottom) & (ds_hist_sel.lat<=top),
                   lon = (ds_hist_sel.lon>=left) & (ds_hist_sel.lon<=right),
                   ).mean(['lat','lon']).drop(['bnds', 'height', 'time_bnds'])
tas_NW_yr_ssp585 = ds_ssp585_sel.isel(lat = (ds_ssp585_sel.lat>=bottom) & (ds_ssp585_sel.lat<=top),
                   lon = (ds_ssp585_sel.lon>=left) & (ds_ssp585_sel.lon<=right),
                   ).mean(['lat','lon']).drop(['bnds', 'height', 'time_bnds'])
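By the way, since lat and lon are sorted in ascending order, the same subset can also be taken with label-based slices (an equivalent sketch, minus the dropped bounds variables):

代码语言:javascript
复制
# Equivalent label-based selection (sketch):
# tas_NW_yr_hist = ds_hist_sel.sel(lat=slice(bottom, top),
#                                  lon=slice(left, right)).mean(['lat', 'lon'])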
代码语言:javascript
复制
plt.figure(figsize=[8,5],)
(tas_NW_yr_hist.groupby('time.year').mean().tas-273.15).plot(
    label='historical', color='b', linewidth=2)
(tas_NW_yr_ssp585.groupby('time.year').mean().tas-273.15).plot(
    label='ssp585', color='r', linewidth=2)
plt.grid()
plt.xlim([1980,2100])
plt.legend(fontsize=13)
plt.xticks(fontsize=12)
plt.yticks(fontsize=13)
plt.ylabel('Average Annual Temperature (°C)', fontsize=13, fontweight='bold')
plt.xlabel('Year', fontsize=13, fontweight='bold')
plt.title(f'Average Annual air temperature in the Cascades: {model} ssp585',
          fontsize=15)

plt.tight_layout()

That concludes the advanced section of this tutorial. I hope you found it useful. Suggestions and feedback are appreciated.

4. Advanced+ (Machine Learning):

I shared a case study more than a year ago in which I used climate data to predict wildfire frequency in California. It is a relatively simple study and should be a good exercise for developing a machine learning prediction model. I have shared all the code and explained the process at this link: https://www.linkedin.com/pulse/ai-climate-data-predicting-fire-frequency-california-ali-ahmadalipour/
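To make the general recipe concrete, here is a minimal scikit-learn sketch. The data below is synthetic and purely illustrative; it is not the dataset from that article:

代码语言:javascript
复制
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, purely for illustration (NOT the article's dataset):
rng = np.random.default_rng(0)
df_fire = pd.DataFrame({'precip': rng.gamma(2, 30, 500),   # monthly precipitation
                        'tmax': rng.normal(30, 5, 500),    # max temperature
                        'wind': rng.gamma(2, 2, 500)})     # wind speed
df_fire['n_fires'] = (0.5*df_fire.tmax - 0.05*df_fire.precip
                      + df_fire.wind + rng.normal(0, 2, 500)).clip(lower=0)

# Standard train/test split and a random forest regressor:
X_train, X_test, y_train, y_test = train_test_split(
    df_fire[['precip', 'tmax', 'wind']], df_fire['n_fires'], test_size=0.3)
model_rf = RandomForestRegressor(n_estimators=200).fit(X_train, y_train)
print('R2 on held-out data:', round(model_rf.score(X_test, y_test), 2))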
