Meteorological AI | Public weather and climate datasets for AI research

bugsuse
Published 2021-01-28 15:38:46
Published in the column: 气象杂货铺

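In recent years, many researchers have released weather and climate datasets for AI research in meteorology. PANGEO[1] has collected and organized these public datasets.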

A community platform for big data in the Earth sciences

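The dataset collection lists most of the weather and climate datasets that are currently public. They are divided into datasets prepared for AI research, commonly used raw datasets, and datasets intended for research on hybrid ML-physics models.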

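Most researchers only need the preprocessed datasets, which include high-quality weather radar and satellite data. A link to the full list of public datasets is given at the end of this article.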

To submit a new dataset or ask a question, go to the GitHub repository[2].

Preprocessed datasets

AI for Earth System Science Summer School Hackathon

  • Code and Data: https://github.com/NCAR/ai4ess-hackathon-2020
  • Source: NCAR, Lawrence Berkeley Lab, and NOAA
  • Description: 5 challenge problems related to prediction and emulation. GOES challenge problem focuses on predicting lightning from GOES-16 satellite imagery. GECKO-A challenge problem focuses on emulating the GECKO-A chemistry model from a large set of model time series. Microphysics challenge problem focuses on emulating the TAU bin microphysics scheme. HOLODEC challenge problem focuses on estimating rain drop distribution properties from synthetic holographic diffraction patterns. ENSO challenge problem focuses on predicting ENSO from gridded model output.

AMS Solar Energy Prediction Contest

  • Code and data: https://www.kaggle.com/c/ams-2014-solar-energy-prediction-contest
  • Source data: GEFS forecasts and Mesonet solar observations
  • Description: Predict total daily solar irradiance from GEFS and Oklahoma Mesonet Data

CAMELS: Catchment Attributes and Meteorology for Large-Sample Studies

  • Data: https://ral.ucar.edu/solutions/products/camels
  • Paper: https://ncar.github.io/hydrology/datasets/CAMELS_timeseries
  • Source data: Weather models (Daymet, NLDAS, Maurer), streamflow observations (USGS), catchment attributes (USGS, MODIS, Daymet, STATSGO, Global Lithological Map (GLiM), GLobal Hydrogeology Maps (GLHYMPS))
  • Description: Weather drivers, streamflow observations, and catchment attributes for 671 catchments across the continental US.
  • Papers using this dataset: https://doi.org/10.5194/hess-22-6005-2018, https://doi.org/10.1029/2019WR026793

ClimateNet: an expert-labelled open dataset and Deep Learning architecture for enabling high-precision analyses of extreme weather

  • Paper: https://gmd.copernicus.org/preprints/gmd-2020-72/
  • Code and data: https://portal.nersc.gov/project/ClimateNet/
  • Source data: Climate Model simulations and expert labels
  • Description: Detect atmospheric rivers and tropical cyclones from climate model simulations. Tool for labeling along with dataset of expert labelled data.

CloudCast: A large-scale dataset and baseline for forecasting clouds

  • Paper: https://arxiv.org/abs/2007.07978
  • Code and data: https://vision.eng.au.dk/cloudcast-dataset/
  • Source data: Meteosat-11 with cloud types annotated on a pixel-level
  • Description: The CloudCast dataset contains 70080 images with 10 different cloud types for multiple layers of the atmosphere annotated on a pixel level. The dataset has a spatial size of 928 x 1530 pixels recorded with 15-min intervals for the period 2017-2018, with a 3.0 km resolution.
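
The image count in the description can be checked against the stated sampling interval. The sketch below assumes one image every 15 minutes with no gaps from the start of 2017 to the end of 2018; the gap-free assumption is ours and is not stated in the list.

```python
from datetime import datetime, timedelta

# Assumption: one image every 15 minutes, no gaps, covering 2017-01-01 up to
# (but not including) 2019-01-01.
start = datetime(2017, 1, 1)
end = datetime(2019, 1, 1)
step = timedelta(minutes=15)

# Dividing one timedelta by another with // gives a whole number of steps.
n_images = (end - start) // step
print(n_images)  # 70080
```

Under that assumption the result matches the 70080 images quoted above (730 days times 96 images per day).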

CUMULO: a benchmark dataset for training and evaluating global cloud classification models

  • Paper: https://arxiv.org/abs/1911.04227
  • Code and data: https://github.com/FrontierDevelopmentLab/CUMULO
  • Source data: Moderate Resolution Imaging Spectroradiometer (MODIS) from Aqua satellite and 2B-CLDCLASS-LIDAR
  • Description: The dataset provides global 1 km-resolution MODIS imagery aligned with the accurately measured cloud properties of the CloudSat products. It contains three years of 1354 x 2030 pixel hyperspectral images combined with pixel-width ‘tracks’ of cloud labels, corresponding to the eight World Meteorological Organization genera.

Deepti: Deep-Learning-Based Tropical Cyclone Intensity Estimation System (+ Competition)

  • Paper: http://doi.org/10.1109/JSTARS.2020.3011907
  • Competition: https://www.drivendata.org/competitions/72/predict-wind-speeds/page/274/
  • Code and data: http://registry.mlhub.earth/10.34911/rdnt.xs53up/
  • Source data: GOES
  • Description: A collection of tropical storms in the Atlantic and East Pacific Oceans from 2000 to 2019 with corresponding maximum sustained surface wind speed. This dataset is split into training and test sets for the purpose of a competition. The training set consists of 70,257 images and the test set consists of 44,377 images, each one being 366 x 366 pixels.

EarthNet2021: A novel large-scale dataset and challenge for forecasting localized climate impacts

  • Paper: https://arxiv.org/abs/2012.06246
  • Code and data: https://www.earthnet.tech/
  • Source data: Sentinel 2
  • Description: Curated dataset containing target spatio-temporal Sentinel 2 satellite imagery at 20 m resolution, matched with high-resolution topography and mesoscale (1.28 km) weather variables. With over 32000 samples it is suitable for training deep neural networks.

The ExtremeWeather Dataset

  • Paper: https://arxiv.org/abs/1612.02095
  • Code and data: https://github.com/eracah/hur-detect, https://extremeweatherdataset.github.io/
  • Source data: CAM5
  • Description: Consists of 768 × 1152 images of the global atmospheric state with a spatial resolution of 25 km, separated by 6-hour intervals from 1979 to 2005. There are 16 image channels that correspond to different variables such as surface pressure, surface temperature and humidity at the reference altitude. In addition, there are bounding boxes and class labels for 4 types of extreme weather events: Tropical Depressions, Tropical Cyclones, Extratropical Cyclones and Atmospheric Rivers.

FlowDB: A new large scale river flow, flash flood, and precipitation dataset

  • Paper: https://arxiv.org/abs/2012.11154
  • Code and data: https://flow-forecast.atlassian.net/wiki/spaces/FF/pages/33456135/FlowDB+Dataset (Not public)
  • Source data: USGS, SNOTEL, NOAA, ASOS, EcoNet
  • Description: An hourly river flow and precipitation dataset and a second subset of flash flood events with damage estimates and injury counts. Created for general stream flow forecasting and flash flood damage estimation.

How Much Did It Rain I and II

  • Code and data: https://www.kaggle.com/c/how-much-did-it-rain and https://www.kaggle.com/c/how-much-did-it-rain-ii
  • Source data: US Radar and rain gauges
  • Description: Estimate the rainfall probability distribution from dual-polarization radar data.

MeteoNet, an open reference weather dataset by METEO FRANCE

  • Code and data: https://github.com/meteofrance/meteonet
  • Source data: AROME/ARPEGE forecasts, radar reflectivity and ground stations over France
  • Description: Multi source dataset of forecasts and observations over France spanning 3 years

Neural Networks for Postprocessing Ensemble Weather Forecasts

  • Paper: Rasp and Lerch 2018
  • Code and data: https://github.com/slerch/ppnn
  • Source data: TIGGE forecasts and station observations over Germany
  • Description: Ensemble temperature postprocessing of station observations over Germany. 9 years of data at 500 stations. Predictors include temperature as well as a range of other variables.

RainBench: Towards Global Precipitation Forecasting from Satellite Imagery

  • Paper: https://arxiv.org/abs/2012.09670
  • Code and data: https://github.com/frontierdevelopmentlab/pyrain
  • Source data: IMERG, ERA5 and SimSat
  • Description: Multi-modal benchmark dataset for data-driven precipitation forecasting at 3 different spatial resolutions: 0.1deg (IMERG and SimSat) and 0.5deg (ERA5). Presented along with an efficient data-loading pipeline, PyRain.

SEVIR Dataset

  • Paper: NeurIPS
  • Code and data: http://sevir.mit.edu/
  • Source data: GOES-16 and NEXRAD over CONUS
  • Description: Preprocessed satellite and radar data over the continental US, served in patches. For a range of challenges with baselines (check website for updates).

SVRIMG - SeVere Reflectivity IMaGe Dataset

  • Presentation: AMS
  • Code and data: https://svrimg.org/
  • Source data: GridRad (which in turn is sourced from NOAA NEXRAD Level II archives)
  • Description: Over 500,000 data-rich, geospatial radar reflectivity images centered on high-impact weather events. These images have consistent dimensions and intensity values on a grid with relatively low spatial distortion over the Conterminous United States. Also includes crowd-sourced labeling.

TAASRAD19, a high-resolution weather radar reflectivity dataset for precipitation nowcasting

  • Paper: https://doi.org/10.1038/s41597-020-0574-8
  • Code and data: https://github.com/MPBA/TAASRAD19
  • Source data: Official public meteorological agency of the civil protection department of the Autonomous Province of Trento (Italy)
  • Description: Benchmark dataset for radar nowcasting with deep learning. The dataset contains 1,732 radar sequences labeled with precipitation type spanning from 2010 to 2019, for a total of 362,233 radar images. Image size is 480 x 480 at 500m resolution (UTM grid) covering a complex orographic area in the Italian Alps.
  • Papers using this dataset: https://doi.org/10.3390/atmos11030267, https://doi.org/10.3390/rs11242922

Understanding Clouds from Satellite Images

  • Paper: BAMS
  • Code and data: https://www.kaggle.com/c/understanding_cloud_organization
  • Source data: TERRA and AQUA MODIS visible images
  • Description: Cloud classification challenge of 4 human-designed shallow cloud patterns of organization: Sugar, Flower, Fish and Gravel with 30,000 human labels

VALUE: A framework to validate downscaling approaches for climate change studies

  • Paper: https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2014EF000259
  • Code and data: http://www.value-cost.eu/
  • Source data: Station observations
  • Description: Framework for evaluating climate model downscaling methods. Validation observations are provided.
  • Papers using this dataset: Many

WeatherBench: A Benchmark Data Set for Data‐Driven Weather Forecasting

  • Paper: https://doi.org/10.1029/2020MS002203
  • Code and data: https://github.com/pangeo-data/WeatherBench
  • Source data: ERA5 and TIGGE for baselines
  • Description: Benchmark dataset for medium-range (3 and 5 day) forecasting of global pressure, temperature and precipitation with preprocessed data (40 years), evaluation and baselines
  • Papers using this dataset: https://arxiv.org/abs/2003.11927, https://arxiv.org/abs/2008.08626
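
Evaluating global gridded forecasts usually needs area weighting, because cells on a regular latitude-longitude grid shrink toward the poles. The following is a minimal sketch of a latitude-weighted RMSE of that kind. The function name, the toy grid and the test values are illustrative assumptions, not code taken from the WeatherBench repository.

```python
import numpy as np

def lat_weighted_rmse(forecast, truth, lats_deg):
    """Latitude-weighted RMSE over a (lat, lon) grid.

    forecast, truth: arrays of shape (n_lat, n_lon)
    lats_deg: 1-D array of grid latitudes in degrees, length n_lat
    """
    weights = np.cos(np.deg2rad(lats_deg))
    weights = weights / weights.mean()  # normalise so the weights average to 1
    sq_err = (forecast - truth) ** 2
    return float(np.sqrt((sq_err * weights[:, None]).mean()))

# Toy coarse grid (hypothetical values, for illustration only).
lats = np.arange(-87.1875, 90, 5.625)  # 32 latitudes
lons = np.arange(0, 360, 5.625)        # 64 longitudes
truth = np.zeros((lats.size, lons.size))
forecast = truth + 2.0                 # constant error of 2 everywhere
score = lat_weighted_rmse(forecast, truth, lats)
print(score)
```

With a constant error the weighting has no effect and the score equals the error magnitude; an error confined to a polar row contributes less than the same error near the equator.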

Raw datasets

CMIP6

  • Reference: Eyring et al. 2016
  • Data: https://pcmdi.llnl.gov/CMIP6/
  • Description: Huge archive of global climate model simulations following all kinds of different scenarios.
  • Examples of papers using this dataset: Ham et al. 2019

ERA5

  • Reference: Hersbach et al. 2020
  • Data: https://cds.climate.copernicus.eu/
  • Description: The ultimate reanalysis dataset covering the last 40 years (1950 to 1978 as a preliminary version) at 0.25 degree global resolution. Hourly data available. Pretty much every variable.
  • Examples of papers using this dataset: WeatherBench

Note: Take care with some surface variables, such as precipitation and wind. These often do not match direct observations very closely.
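
The resolution and hourly sampling quoted above give a rough idea of data volume before downloading anything. This is a back-of-the-envelope sketch only. It assumes a regular latitude-longitude grid that includes both poles, 4-byte floats and a 365-day year, none of which is stated in the list, and it ignores compression and file-format overhead.

```python
# Assumptions (not from the source list): a regular latitude-longitude grid that
# includes both poles, 4-byte floats, and a 365-day year.
resolution_deg = 0.25
n_lon = int(360 / resolution_deg)       # 1440
n_lat = int(180 / resolution_deg) + 1   # 721, both poles included
points_per_field = n_lat * n_lon

hours_per_year = 365 * 24
bytes_per_value = 4

bytes_per_year = points_per_field * hours_per_year * bytes_per_value
print(f"{n_lat} x {n_lon} grid, {points_per_field:,} points per field")
print(f"~{bytes_per_year / 1e9:.1f} GB per single-level variable per year, uncompressed")
```

Under these assumptions a single hourly surface variable is on the order of tens of gigabytes per year, which is one reason the preprocessed, coarser datasets listed earlier are convenient for experiments.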

TIGGE

  • Reference: Bougeault et al. 2010
  • Data: https://apps.ecmwf.int/datasets/data/tigge/levtype=sfc/type=cf/
  • Description: 15-year archive of operational global ensemble forecasts from different centers (not live).
  • Examples of papers using this dataset: WeatherBench

Datasets for hybrid ML-physics models

Lorenz ’63

  • Reference: Lorenz 1963
  • Implementations: Wikipedia
  • Description: Legendary “butterfly” model with three coupled ODEs that exhibit chaotic behavior.
  • Examples of papers using this model: Scher and Messori 2019
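
The three coupled ODEs are short enough to integrate directly. The sketch below uses the conventional parameter values (sigma = 10, rho = 28, beta = 8/3) and a fixed-step fourth-order Runge-Kutta scheme. The parameters, step size and initial conditions are assumptions chosen for illustration and are not taken from the list above.

```python
import numpy as np

def lorenz63(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz 1963 system (conventional parameters assumed)."""
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

def integrate(f, state0, dt, n_steps):
    """Return the trajectory as an array of shape (n_steps + 1, len(state0))."""
    traj = np.empty((n_steps + 1, len(state0)))
    traj[0] = state0
    for i in range(n_steps):
        traj[i + 1] = rk4_step(f, traj[i], dt)
    return traj

# Two runs whose initial conditions differ by 1e-8 in z.
a = integrate(lorenz63, np.array([1.0, 1.0, 1.0]), dt=0.01, n_steps=3000)
b = integrate(lorenz63, np.array([1.0, 1.0, 1.0 + 1e-8]), dt=0.01, n_steps=3000)
separation = np.linalg.norm(a - b, axis=1)
print(separation[0], separation[-1])
```

The growth of the separation between the two runs illustrates the sensitivity to initial conditions that makes this model a common test bed.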

Lorenz ’96

  • Reference: Lorenz 1996
  • Implementations: https://github.com/raspstephan/Lorenz-Online
  • Description: Chaotic model. Often used in its two-layer version for parameterization research.
  • Examples of papers using this model: Rasp 2020, Gagne et al. 2020
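
For orientation, here is a minimal sketch of the single-layer form of the model, dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F with cyclic indices. It is not the two-layer version mentioned above and it is not code from the linked repository. The choices of 40 variables, F = 8, the step size and the initial perturbation are conventional but are assumptions here.

```python
import numpy as np

def lorenz96(x, forcing=8.0):
    """Single-layer Lorenz 96 tendencies with cyclic boundary conditions."""
    # np.roll(x, -1)[i] is x[i+1]; np.roll(x, 2)[i] is x[i-2]; np.roll(x, 1)[i] is x[i-1]
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

n_vars = 40                       # assumed number of variables
x = np.full(n_vars, 8.0)          # x_i = F is an (unstable) fixed point
x[0] += 0.01                      # small perturbation to leave the fixed point
for _ in range(2000):             # 20 model time units at dt = 0.01
    x = rk4_step(lorenz96, x, 0.01)

print(x.min(), x.max(), x.std())
```

Starting from a slightly perturbed fixed point, the state spreads out into irregular, bounded fluctuations, which is the behavior that parameterization studies build on.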

Official link: https://docs.google.com/document/d/1A7w64qoGMGVUm0O7azHdcjHOYAYxHOHdzm0kEsdOwDc/edit

Shimo Docs mirror: https://shimo.im/docs/1rXbxQXnLvwNtpFs/ ("List of AI-related climate_weather datasets"). Copy the link and open it in the Shimo Docs app or mini program.

If the official link does not open, use the Shimo Docs link instead.

References

[1] PANGEO: https://pangeo.io/

[2] GitHub repository: https://github.com/pangeo-data/mldata

Originally published on 2021-01-20 on the WeChat public account 气象杂货铺.
