1096 lines
36 KiB
Plaintext
1096 lines
36 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 赛题简介\n",
|
||
"\n",
|
||
"## 赛题背景\n",
|
||
"\n",
|
||
"发生在热带太平洋上的厄尔尼诺-南方涛动(ENSO)现象是地球上最强、最显著的年际气候信号。通过大气或海洋遥相关过程,经常会引发洪涝、干旱、高温、雪灾等极端事件,对全球的天气、气候以及粮食产量具有重要的影响。准确预测ENSO,是提高东亚和全球气候预测水平和防灾减灾的关键。\n",
|
||
"\n",
|
||
"本次赛题是一个时间序列预测问题。基于历史气候观测和模式模拟数据,利用T时刻过去12个月(包含T时刻)的时空序列(气象因子),构建预测ENSO的深度学习模型,预测未来1-24个月的Nino3.4指数,如下图所示:\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"<img src=\"https://mmbiz.qpic.cn/mmbiz_jpg/ZQhHsg2x8ficqo4ibQMBLLlIDzHtDotd8aQ4nTGTxuNQeAicZBa9KPgqT95tsd0shdwQVdQEQJg4AXWvM642G6Pug/640?wx_fmt=jpeg\" width = \"400\" height = \"200\" alt=\"logo\" align=center />\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 数据描述\n",
|
||
"\n",
|
||
"### 数据简介\n",
|
||
"\n",
|
||
"本次比赛使用的数据包括CMIP5/6模式的历史模拟数据和美国SODA模式重建的近100多年历史观测同化数据。每个样本包含以下气象及时空变量:海表温度异常(SST),热含量异常(T300),纬向风异常(Ua),经向风异常(Va),数据维度为(year,month,lat,lon)。对于训练数据提供对应月份的Nino3.4 index标签数据。\n",
|
||
"\n",
|
||
"### 训练数据说明\n",
|
||
"\n",
|
||
"每个数据样本第一维度(year)表征数据所对应起始年份,对于CMIP数据共4645年,其中1-2265为CMIP6中15个模式提供的151年的历史模拟数据(总共:151年 *15 个模式=2265);2266-4645为CMIP5中17个模式提供的140年的历史模拟数据(总共:140年 *17 个模式=2380)。对于历史观测同化数据为美国提供的SODA数据。\n",
|
||
"\n",
|
||
"其中每个样本第二维度(mouth)表征数据对应的月份,对于训练数据均为36,对应的从当前年份开始连续三年数据(从1月开始,共36月),比如:\n",
|
||
"\n",
|
||
"SODA_train.nc中[0,0:36,:,:]为第1-第3年逐月的历史观测数据;\n",
|
||
"\n",
|
||
"SODA_train.nc中[1,0:36,:,:]为第2-第4年逐月的历史观测数据;\n",
|
||
"…,\n",
|
||
"SODA_train.nc中[99,0:36,:,:]为第100-102年逐月的历史观测数据。\n",
|
||
"\n",
|
||
"和\n",
|
||
"CMIP_train.nc中[0,0:36,:,:]为CMIP6第一个模式提供的第1-第3年逐月的历史模拟数据;\n",
|
||
"…,\n",
|
||
"CMIP_train.nc中[150,0:36,:,:]为CMIP6第一个模式提供的第151-第153年逐月的历史模拟数据;\n",
|
||
"\n",
|
||
"CMIP_train.nc中[151,0:36,:,:]为CMIP6第二个模式提供的第1-第3年逐月的历史模拟数据;\n",
|
||
"…,\n",
|
||
"CMIP_train.nc中[2265,0:36,:,:]为CMIP5第一个模式提供的第1-第3年逐月的历史模拟数据;\n",
|
||
"…,\n",
|
||
"CMIP_train.nc中[2405,0:36,:,:]为CMIP5第二个模式提供的第1-第3年逐月的历史模拟数据;\n",
|
||
"…,\n",
|
||
"CMIP_train.nc中[4644,0:36,:,:]为CMIP5第17个模式提供的第140-第142年逐月的历史模拟数据。\n",
|
||
"\n",
|
||
"其中每个样本第三、第四维度分别代表经纬度(南纬55度北纬60度,东经0360度),所有数据的经纬度范围相同。\n",
|
||
"\n",
|
||
"### 训练数据标签说明\n",
|
||
"\n",
|
||
"标签数据为Nino3.4 SST异常指数,数据维度为(year,month)。\n",
|
||
"\n",
|
||
"CMIP(SODA)_train.nc对应的标签数据当前时刻Nino3.4 SST异常指数的三个月滑动平均值,因此数据维度与维度介绍同训练数据一致\n",
|
||
"\n",
|
||
"注:三个月滑动平均值为当前月与未来两个月的平均值。\n",
|
||
"\n",
|
||
"### 测试数据说明\n",
|
||
"测试用的初始场(输入)数据为国际多个海洋资料同化结果提供的随机抽取的n段12个时间序列,数据格式采用NPY格式保存,维度为(12,lat,lon, 4),12为t时刻及过去11个时刻,4为预测因子,并按照SST,T300,Ua,Va的顺序存放。\n",
|
||
"\n",
|
||
"测试集文件序列的命名规则:test_编号_起始月份_终止月份.npy,如test_00001_01_12_.npy。"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 评估指标\n",
|
||
"评分细则说明: 根据所提供的n个测试数据,对模型进行测试,得到n组未来1-24个月的序列选取对应预测时效的n个数据与标签值进行计算相关系数和均方根误差,如下图所示。并计算得分。计算公式为:\n",
|
||
"\n",
|
||
"$$\n",
|
||
"Score = \\frac{2}{3} * accskill - RMSE\n",
|
||
"$$\n",
|
||
"\n",
|
||
"其中,\n",
|
||
"\n",
|
||
"$$\n",
|
||
"accskill = \\sum_{i=1}^{24} a * ln(i) * cor_i, \\\\\n",
|
||
"(i \\le,a = 1.5; 5 \\le i \\le 11, a= 2; 12 \\le i \\le 18,a=3;19 \\le i, a = 4)\n",
|
||
"$$\n",
|
||
"\n",
|
||
"而:\n",
|
||
"\n",
|
||
"$$\n",
|
||
"cor = \\frac{\\sum(X-\\bar(X))\\sum(Y-\\bar(Y)}{\\sqrt{\\sum(X-\\bar{X})^2)\\sum(Y-\\bar{Y})^2)}}\n",
|
||
"$$\n",
|
||
"\n",
|
||
"$$\n",
|
||
"RMSE = \\sum_{i=1}^{24} rmse_i, \n",
|
||
"$$\n",
|
||
" \n",
|
||
"<img src=\"https://mmbiz.qpic.cn/mmbiz_jpg/ZQhHsg2x8ficqo4ibQMBLLlIDzHtDotd8aXaGlo3zrPU7Hq6XfcVNrrwXdsPQsg9k3HoCojhezjAI5EAWWJUZeFQ/640?wx_fmt=jpeg\" width = \"400\" height = \"200\" alt=\"metric\" align=center />\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 线下数据转换\n",
|
||
"\n",
|
||
"- 将数据转化为我们所熟悉的形式,每个人的风格不一样,此处可以作为如何将nc文件转化为csv等文件\n",
|
||
"\n",
|
||
"## 数据转化"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"## 工具包导入&数据读取\n",
|
||
"### 工具包导入\n",
|
||
"\n",
|
||
"'''\n",
|
||
"安装工具\n",
|
||
"# !pip install netCDF4 \n",
|
||
"''' \n",
|
||
"import pandas as pd\n",
|
||
"import numpy as np\n",
|
||
"import tensorflow as tf\n",
|
||
"from tensorflow.keras.optimizers import Adam\n",
|
||
"import matplotlib.pyplot as plt\n",
|
||
"import scipy \n",
|
||
"from netCDF4 import Dataset\n",
|
||
"import netCDF4 as nc\n",
|
||
"import gc\n",
|
||
"%matplotlib inline "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 数据读取\n",
|
||
"\n",
|
||
"#### SODA_label处理\n",
|
||
"\n",
|
||
"1. 标签含义\n",
|
||
"\n",
|
||
"```\n",
|
||
"标签数据为Nino3.4 SST异常指数,数据维度为(year,month)。 \n",
|
||
"CMIP(SODA)_train.nc对应的标签数据当前时刻Nino3.4 SST异常指数的三个月滑动平均值,因此数据维度与维度介绍同训练数据一致\n",
|
||
"注:三个月滑动平均值为当前月与未来两个月的平均值。\n",
|
||
"```\n",
|
||
"\n",
|
||
"2. 将标签转化为我们熟悉的pandas形式"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 139,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"label_path = './data/SODA_label.nc'\n",
|
||
"label_trans_path = './data/' \n",
|
||
"nc_label = Dataset(label_path,'r')\n",
|
||
" \n",
|
||
"years = np.array(nc_label['year'][:])\n",
|
||
"months = np.array(nc_label['month'][:])\n",
|
||
"\n",
|
||
"year_month_index = []\n",
|
||
"vs = []\n",
|
||
"for i,year in enumerate(years):\n",
|
||
" for j,month in enumerate(months):\n",
|
||
" year_month_index.append('year_{}_month_{}'.format(year,month))\n",
|
||
" vs.append(np.array(nc_label['nino'][i,j]))\n",
|
||
"\n",
|
||
"df_SODA_label = pd.DataFrame({'year_month':year_month_index}) \n",
|
||
"df_SODA_label['year_month'] = year_month_index\n",
|
||
"df_SODA_label['label'] = vs\n",
|
||
"\n",
|
||
"df_SODA_label.to_csv(label_trans_path + 'df_SODA_label.csv',index = None)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 132,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>year_month</th>\n",
|
||
" <th>label</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>year_1_month_1</td>\n",
|
||
" <td>-0.40720701217651367</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>year_1_month_2</td>\n",
|
||
" <td>-0.20244435966014862</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>year_1_month_3</td>\n",
|
||
" <td>-0.10386104136705399</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>year_1_month_4</td>\n",
|
||
" <td>-0.02910841442644596</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>year_1_month_5</td>\n",
|
||
" <td>-0.13252995908260345</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" year_month label\n",
|
||
"0 year_1_month_1 -0.40720701217651367\n",
|
||
"1 year_1_month_2 -0.20244435966014862\n",
|
||
"2 year_1_month_3 -0.10386104136705399\n",
|
||
"3 year_1_month_4 -0.02910841442644596\n",
|
||
"4 year_1_month_5 -0.13252995908260345"
|
||
]
|
||
},
|
||
"execution_count": 132,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df_SODA_label.head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 转化\n",
|
||
"#### SODA_train处理\n",
|
||
"\n",
|
||
"```\n",
|
||
"SODA_train.nc中[0,0:36,:,:]为第1-第3年逐月的历史观测数据;\n",
|
||
"\n",
|
||
"SODA_train.nc中[1,0:36,:,:]为第2-第4年逐月的历史观测数据;\n",
|
||
"…,\n",
|
||
"SODA_train.nc中[99,0:36,:,:]为第100-102年逐月的历史观测数据。\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 68,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"SODA_path = './data/SODA_train.nc'\n",
|
||
"nc_SODA = Dataset(SODA_path,'r') "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"- 自定义抽取对应数据&转化为df的形式;\n",
|
||
"\n",
|
||
"> index为年月; columns为lat和lon的组合"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def trans_df(df, vals, lats, lons, years, months):\n",
|
||
" '''\n",
|
||
" (100, 36, 24, 72) -- year, month,lat,lon \n",
|
||
" ''' \n",
|
||
" for j,lat_ in enumerate(lats):\n",
|
||
" for i,lon_ in enumerate(lons):\n",
|
||
" c = 'lat_lon_{}_{}'.format(int(lat_),int(lon_)) \n",
|
||
" v = []\n",
|
||
" for y in range(len(years)):\n",
|
||
" for m in range(len(months)): \n",
|
||
" v.append(vals[y,m,j,i])\n",
|
||
" df[c] = v\n",
|
||
" return df"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 94,
|
||
"metadata": {
|
||
"scrolled": true
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"year_month_index = []\n",
|
||
"\n",
|
||
"years = np.array(nc_SODA['year'][:])\n",
|
||
"months = np.array(nc_SODA['month'][:])\n",
|
||
"lats = np.array(nc_SODA['lat'][:])\n",
|
||
"lons = np.array(nc_SODA['lon'][:])\n",
|
||
"\n",
|
||
"\n",
|
||
"for year in years:\n",
|
||
" for month in months:\n",
|
||
" year_month_index.append('year_{}_month_{}'.format(year,month))\n",
|
||
"\n",
|
||
"df_sst = pd.DataFrame({'year_month':year_month_index}) \n",
|
||
"df_t300 = pd.DataFrame({'year_month':year_month_index}) \n",
|
||
"df_ua = pd.DataFrame({'year_month':year_month_index}) \n",
|
||
"df_va = pd.DataFrame({'year_month':year_month_index})"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"%%time\n",
|
||
"df_sst = trans_df(df = df_sst, vals = np.array(nc_SODA['sst'][:]), lats = lats, lons = lons, years = years, months = months)\n",
|
||
"df_t300 = trans_df(df = df_t300, vals = np.array(nc_SODA['t300'][:]), lats = lats, lons = lons, years = years, months = months)\n",
|
||
"df_ua = trans_df(df = df_ua, vals = np.array(nc_SODA['ua'][:]), lats = lats, lons = lons, years = years, months = months)\n",
|
||
"df_va = trans_df(df = df_va, vals = np.array(nc_SODA['va'][:]), lats = lats, lons = lons, years = years, months = months)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 133,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"label_trans_path = './data/'\n",
|
||
"df_sst.to_csv(label_trans_path + 'df_sst_SODA.csv',index = None)\n",
|
||
"df_t300.to_csv(label_trans_path + 'df_t300_SODA.csv',index = None)\n",
|
||
"df_ua.to_csv(label_trans_path + 'df_ua_SODA.csv',index = None)\n",
|
||
"df_va.to_csv(label_trans_path + 'df_va_SODA.csv',index = None)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### CMIP_label处理"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 140,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"label_path = './data/CMIP_label.nc'\n",
|
||
"label_trans_path = './data/'\n",
|
||
"nc_label = Dataset(label_path,'r')\n",
|
||
" \n",
|
||
"years = np.array(nc_label['year'][:])\n",
|
||
"months = np.array(nc_label['month'][:])\n",
|
||
"\n",
|
||
"year_month_index = []\n",
|
||
"vs = []\n",
|
||
"for i,year in enumerate(years):\n",
|
||
" for j,month in enumerate(months):\n",
|
||
" year_month_index.append('year_{}_month_{}'.format(year,month))\n",
|
||
" vs.append(np.array(nc_label['nino'][i,j]))\n",
|
||
"\n",
|
||
"df_CMIP_label = pd.DataFrame({'year_month':year_month_index}) \n",
|
||
"df_CMIP_label['year_month'] = year_month_index\n",
|
||
"df_CMIP_label['label'] = vs\n",
|
||
"\n",
|
||
"df_CMIP_label.to_csv(label_trans_path + 'df_CMIP_label.csv',index = None)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 141,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>year_month</th>\n",
|
||
" <th>label</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>year_1_month_1</td>\n",
|
||
" <td>-0.26102548837661743</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>year_1_month_2</td>\n",
|
||
" <td>-0.1332537680864334</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>year_1_month_3</td>\n",
|
||
" <td>-0.014831557869911194</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>year_1_month_4</td>\n",
|
||
" <td>0.10506672412157059</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>year_1_month_5</td>\n",
|
||
" <td>0.24070978164672852</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" year_month label\n",
|
||
"0 year_1_month_1 -0.26102548837661743\n",
|
||
"1 year_1_month_2 -0.1332537680864334\n",
|
||
"2 year_1_month_3 -0.014831557869911194\n",
|
||
"3 year_1_month_4 0.10506672412157059\n",
|
||
"4 year_1_month_5 0.24070978164672852"
|
||
]
|
||
},
|
||
"execution_count": 141,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df_CMIP_label.head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### CMIP_train处理\n",
|
||
"\n",
|
||
"```\n",
|
||
"\n",
|
||
"CMIP_train.nc中[0,0:36,:,:]为CMIP6第一个模式提供的第1-第3年逐月的历史模拟数据;\n",
|
||
"…,\n",
|
||
"CMIP_train.nc中[150,0:36,:,:]为CMIP6第一个模式提供的第151-第153年逐月的历史模拟数据;\n",
|
||
"\n",
|
||
"CMIP_train.nc中[151,0:36,:,:]为CMIP6第二个模式提供的第1-第3年逐月的历史模拟数据;\n",
|
||
"…,\n",
|
||
"CMIP_train.nc中[2265,0:36,:,:]为CMIP5第一个模式提供的第1-第3年逐月的历史模拟数据;\n",
|
||
"…,\n",
|
||
"CMIP_train.nc中[2405,0:36,:,:]为CMIP5第二个模式提供的第1-第3年逐月的历史模拟数据;\n",
|
||
"…,\n",
|
||
"CMIP_train.nc中[4644,0:36,:,:]为CMIP5第17个模式提供的第140-第142年逐月的历史模拟数据。\n",
|
||
"\n",
|
||
"其中每个样本第三、第四维度分别代表经纬度(南纬55度北纬60度,东经0360度),所有数据的经纬度范围相同。\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"CMIP_path = './data/CMIP_train.nc'\n",
|
||
"CMIP_trans_path = './data'\n",
|
||
"nc_CMIP = Dataset(CMIP_path,'r') "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"dict_keys(['sst', 't300', 'ua', 'va', 'year', 'month', 'lat', 'lon'])"
|
||
]
|
||
},
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"nc_CMIP.variables.keys()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(4645, 36, 24, 72)"
|
||
]
|
||
},
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"nc_CMIP['t300'][:].shape"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 24,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"year_month_index = []\n",
|
||
"\n",
|
||
"years = np.array(nc_CMIP['year'][:])\n",
|
||
"months = np.array(nc_CMIP['month'][:])\n",
|
||
"lats = np.array(nc_CMIP['lat'][:])\n",
|
||
"lons = np.array(nc_CMIP['lon'][:])\n",
|
||
"\n",
|
||
"last_thre_years = 1000\n",
|
||
"for year in years:\n",
|
||
" '''\n",
|
||
" 数据的原因,我们\n",
|
||
" '''\n",
|
||
" if year >= 4645 - last_thre_years:\n",
|
||
" for month in months:\n",
|
||
" year_month_index.append('year_{}_month_{}'.format(year,month))\n",
|
||
"\n",
|
||
"df_CMIP_sst = pd.DataFrame({'year_month':year_month_index}) \n",
|
||
"df_CMIP_t300 = pd.DataFrame({'year_month':year_month_index}) \n",
|
||
"df_CMIP_ua = pd.DataFrame({'year_month':year_month_index}) \n",
|
||
"df_CMIP_va = pd.DataFrame({'year_month':year_month_index})"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"- 因为内存限制,我们暂时取最后1000个year的数据"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 47,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def trans_thre_df(df, vals, lats, lons, years, months, last_thre_years = 1000):\n",
|
||
" '''\n",
|
||
" (4645, 36, 24, 72) -- year, month,lat,lon \n",
|
||
" ''' \n",
|
||
" for j,lat_ in (enumerate(lats)):\n",
|
||
"# print(j)\n",
|
||
" for i,lon_ in enumerate(lons):\n",
|
||
" c = 'lat_lon_{}_{}'.format(int(lat_),int(lon_)) \n",
|
||
" v = []\n",
|
||
" for y_,y in enumerate(years):\n",
|
||
" '''\n",
|
||
" 数据的原因,我们\n",
|
||
" '''\n",
|
||
" if y >= 4645 - last_thre_years:\n",
|
||
" for m_,m in enumerate(months): \n",
|
||
" v.append(vals[y_,m_,j,i])\n",
|
||
" df[c] = v\n",
|
||
" return df"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 38,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(36036, 1729)"
|
||
]
|
||
},
|
||
"execution_count": 38,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"%%time\n",
|
||
"df_CMIP_sst = trans_thre_df(df = df_CMIP_sst, vals = np.array(nc_CMIP['sst'][:]), lats = lats, lons = lons, years = years, months = months)\n",
|
||
"df_CMIP_sst.to_csv(CMIP_trans_path + 'df_CMIP_sst.csv',index = None)\n",
|
||
"del df_CMIP_sst\n",
|
||
"gc.collect()\n",
|
||
"\n",
|
||
"df_CMIP_t300 = trans_thre_df(df = df_CMIP_t300, vals = np.array(nc_CMIP['t300'][:]), lats = lats, lons = lons, years = years, months = months)\n",
|
||
"df_CMIP_t300.to_csv(CMIP_trans_path + 'df_CMIP_t300.csv',index = None)\n",
|
||
"del df_CMIP_t300\n",
|
||
"gc.collect()\n",
|
||
"\n",
|
||
"df_CMIP_ua = trans_thre_df(df = df_CMIP_ua, vals = np.array(nc_CMIP['ua'][:]), lats = lats, lons = lons, years = years, months = months)\n",
|
||
"df_CMIP_ua.to_csv(CMIP_trans_path + 'df_CMIP_ua.csv',index = None)\n",
|
||
"del df_CMIP_ua\n",
|
||
"gc.collect()\n",
|
||
"\n",
|
||
"df_CMIP_va = trans_thre_df(df = df_CMIP_va, vals = np.array(nc_CMIP['va'][:]), lats = lats, lons = lons, years = years, months = months)\n",
|
||
"df_CMIP_va.to_csv(CMIP_trans_path + 'df_CMIP_va.csv',index = None)\n",
|
||
"del df_CMIP_va\n",
|
||
"gc.collect()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"scrolled": true
|
||
},
|
||
"source": [
|
||
"# 数据建模\n",
|
||
"\n",
|
||
"## 工具包导入&数据读取\n",
|
||
"### 工具包导入"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import pandas as pd\n",
|
||
"import numpy as np\n",
|
||
"import tensorflow as tf\n",
|
||
"from tensorflow.keras.optimizers import Adam \n",
|
||
"import joblib\n",
|
||
"from netCDF4 import Dataset\n",
|
||
"import netCDF4 as nc\n",
|
||
"import gc\n",
|
||
"from sklearn.metrics import mean_squared_error\n",
|
||
"import numpy as np\n",
|
||
"from tensorflow.keras.callbacks import LearningRateScheduler, Callback\n",
|
||
"import tensorflow.keras.backend as K\n",
|
||
"from tensorflow.keras.layers import *\n",
|
||
"from tensorflow.keras.models import *\n",
|
||
"from tensorflow.keras.optimizers import *\n",
|
||
"from tensorflow.keras.callbacks import *\n",
|
||
"from tensorflow.keras.layers import Input \n",
|
||
"%matplotlib inline "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 数据读取\n",
|
||
"\n",
|
||
"#### SODA_label处理\n",
|
||
"\n",
|
||
"1. 标签含义\n",
|
||
"\n",
|
||
"```\n",
|
||
"标签数据为Nino3.4 SST异常指数,数据维度为(year,month)。 \n",
|
||
"CMIP(SODA)_train.nc对应的标签数据当前时刻Nino3.4 SST异常指数的三个月滑动平均值,因此数据维度与维度介绍同训练数据一致\n",
|
||
"注:三个月滑动平均值为当前月与未来两个月的平均值。\n",
|
||
"```\n",
|
||
"\n",
|
||
"2. 将标签转化为我们熟悉的pandas形式"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"df_SODA_label = pd.read_csv('./data/df_SODA_label.csv')\n",
|
||
"df_CMIP_label = pd.read_csv('./data/df_CMIP_label.csv') "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 训练集验证集构建"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"df_SODA_label['year'] = df_SODA_label['year_month'].apply(lambda x: x[:x.find('m') - 1])\n",
|
||
"df_SODA_label['month'] = df_SODA_label['year_month'].apply(lambda x: x[x.find('m') :])\n",
|
||
"\n",
|
||
"df_train = pd.pivot_table(data = df_SODA_label, values = 'label',index = 'year', columns = 'month')\n",
|
||
"year_new_index = ['year_{}'.format(i+1) for i in range(df_train.shape[0])]\n",
|
||
"month_new_columns = ['month_{}'.format(i+1) for i in range(df_train.shape[1])]\n",
|
||
"df_train = df_train[month_new_columns].loc[year_new_index]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 模型构建\n",
|
||
"\n",
|
||
"#### MLP框架"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def RMSE(y_true, y_pred):\n",
|
||
" return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))\n",
|
||
"\n",
|
||
"def RMSE_fn(y_true, y_pred):\n",
|
||
" return np.sqrt(np.mean(np.power(np.array(y_true, float).reshape(-1, 1) - np.array(y_pred, float).reshape(-1, 1), 2)))\n",
|
||
"\n",
|
||
"def build_model(train_feat, test_feat): #allfeatures, \n",
|
||
" inp = Input(shape=(len(train_feat))) \n",
|
||
" \n",
|
||
" x = Dense(1024, activation='relu')(inp) \n",
|
||
" x = Dropout(0.25)(x) \n",
|
||
" x = Dense(512, activation='relu')(x) \n",
|
||
" x = Dropout(0.25)(x) \n",
|
||
" output = Dense(len(test_feat), activation='linear')(x) \n",
|
||
" model = Model(inputs=inp, outputs=output)\n",
|
||
"\n",
|
||
" adam = tf.optimizers.Adam(lr=1e-3,beta_1=0.99,beta_2 = 0.99) \n",
|
||
" model.compile(optimizer=adam, loss=RMSE)\n",
|
||
"\n",
|
||
" return model"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### 模型训练"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"feature_cols = ['month_{}'.format(i+1) for i in range(12)]\n",
|
||
"label_cols = ['month_{}'.format(i+1) for i in range(12, df_train.shape[1])] "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Model: \"model\"\n",
|
||
"_________________________________________________________________\n",
|
||
"Layer (type) Output Shape Param # \n",
|
||
"=================================================================\n",
|
||
"input_1 (InputLayer) [(None, 12)] 0 \n",
|
||
"_________________________________________________________________\n",
|
||
"dense (Dense) (None, 1024) 13312 \n",
|
||
"_________________________________________________________________\n",
|
||
"dropout (Dropout) (None, 1024) 0 \n",
|
||
"_________________________________________________________________\n",
|
||
"dense_1 (Dense) (None, 512) 524800 \n",
|
||
"_________________________________________________________________\n",
|
||
"dropout_1 (Dropout) (None, 512) 0 \n",
|
||
"_________________________________________________________________\n",
|
||
"dense_2 (Dense) (None, 24) 12312 \n",
|
||
"=================================================================\n",
|
||
"Total params: 550,424\n",
|
||
"Trainable params: 550,424\n",
|
||
"Non-trainable params: 0\n",
|
||
"_________________________________________________________________\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"model_mlp = build_model(feature_cols, label_cols)\n",
|
||
"model_mlp.summary()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"tr_len = int(df_train.shape[0] * 0.8)\n",
|
||
"tr_fea = df_train[feature_cols].iloc[:tr_len,:].copy()\n",
|
||
"tr_label = df_train[label_cols].iloc[:tr_len,:].copy()\n",
|
||
" \n",
|
||
"val_fea = df_train[feature_cols].iloc[tr_len:,:].copy()\n",
|
||
"val_label = df_train[label_cols].iloc[tr_len:,:].copy() \n",
|
||
"\n",
|
||
"\n",
|
||
"model_weights = './user_data/model_data/model_mlp_baseline.h5'\n",
|
||
"\n",
|
||
"checkpoint = ModelCheckpoint(model_weights, monitor='val_loss', verbose=0, save_best_only=True, mode='min',\n",
|
||
" save_weights_only=True)\n",
|
||
"\n",
|
||
"plateau = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, verbose=1, min_delta=1e-4, mode='min')\n",
|
||
"early_stopping = EarlyStopping(monitor=\"val_loss\", patience=20)\n",
|
||
"history = model_mlp.fit(tr_fea.values, tr_label.values,\n",
|
||
" validation_data=(val_fea.values, val_label.values),\n",
|
||
" batch_size=4096, epochs=200,\n",
|
||
" callbacks=[plateau, checkpoint, early_stopping],\n",
|
||
" verbose=2) "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"#### Metrics"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def rmse(y_true, y_preds):\n",
|
||
" return np.sqrt(mean_squared_error(y_pred = y_preds, y_true = y_true))\n",
|
||
"\n",
|
||
"def score(y_true, y_preds):\n",
|
||
" accskill_score = 0\n",
|
||
" rmse_score = 0\n",
|
||
" a = [1.5] * 4 + [2] * 7 + [3] * 7 + [4] * 6\n",
|
||
" y_true_mean = np.mean(y_true,axis=0) \n",
|
||
" y_pred_mean = np.mean(y_true,axis=0)\n",
|
||
"\n",
|
||
" for i in range(24): \n",
|
||
" fenzi = np.sum((y_true[:,i] - y_true_mean[i]) *(y_preds[:,i] - y_pred_mean[i]) ) \n",
|
||
" fenmu = np.sqrt(np.sum((y_true[:,i] - y_true_mean[i])**2) * np.sum((y_preds[:,i] - y_pred_mean[i])**2) ) \n",
|
||
" cor_i= fenzi / fenmu\n",
|
||
" \n",
|
||
" accskill_score += a[i] * np.log(i+1) * cor_i\n",
|
||
" \n",
|
||
" rmse_score += rmse(y_true[:,i], y_preds[:,i]) \n",
|
||
" return 2 / 3.0 * accskill_score - rmse_score"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"y_val_preds = model_mlp.predict(val_fea.values, batch_size=1024)\n",
|
||
"print('score', score(y_true = val_label.values, y_preds = y_val_preds))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 模型预测\n",
|
||
"\n",
|
||
"## 模型构建\n",
|
||
"\n",
|
||
"在上面的部分,我们已经训练好了模型,接下来就是提交模型并在线上进行预测,这块可以分为三步:\n",
|
||
"\n",
|
||
"- 导入模型;\n",
|
||
"- 读取测试数据并且进行预测;\n",
|
||
"- 生成提交所需的版本;"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import tensorflow as tf\n",
|
||
"import tensorflow.keras.backend as K\n",
|
||
"from tensorflow.keras.layers import *\n",
|
||
"from tensorflow.keras.models import *\n",
|
||
"from tensorflow.keras.optimizers import *\n",
|
||
"from tensorflow.keras.callbacks import *\n",
|
||
"from tensorflow.keras.layers import Input \n",
|
||
"import numpy as np\n",
|
||
"import os\n",
|
||
"import zipfile\n",
|
||
"\n",
|
||
"def RMSE(y_true, y_pred):\n",
|
||
" return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))\n",
|
||
"\n",
|
||
"def build_model(train_feat, test_feat): #allfeatures, \n",
|
||
" inp = Input(shape=(len(train_feat))) \n",
|
||
" \n",
|
||
" x = Dense(1024, activation='relu')(inp) \n",
|
||
" x = Dropout(0.25)(x) \n",
|
||
" x = Dense(512, activation='relu')(x) \n",
|
||
" x = Dropout(0.25)(x) \n",
|
||
" output = Dense(len(test_feat), activation='linear')(x) \n",
|
||
" model = Model(inputs=inp, outputs=output)\n",
|
||
"\n",
|
||
" adam = tf.optimizers.Adam(lr=1e-3,beta_1=0.99,beta_2 = 0.99) \n",
|
||
" model.compile(optimizer=adam, loss=RMSE)\n",
|
||
"\n",
|
||
" return model\n",
|
||
"\n",
|
||
"feature_cols = ['month_{}'.format(i+1) for i in range(12)]\n",
|
||
"label_cols = ['month_{}'.format(i+1) for i in range(12, 36)] \n",
|
||
"model = build_model(train_feat=feature_cols, test_feat=label_cols)\n",
|
||
"model.load_weights('./user_data/model_data/model_mlp_baseline.h5')\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 模型预测"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"\n",
|
||
"test_path = './tcdata/enso_round1_test_20210201/'\n",
|
||
"\n",
|
||
"### 0. 模拟线上的测试集合\n",
|
||
"# for i in range(10):\n",
|
||
"# x = np.random.random(12) \n",
|
||
"# np.save(test_path + \"{}.npy\".format(i+1),x)\n",
|
||
"\n",
|
||
"### 1. 测试数据读取\n",
|
||
"files = os.listdir(test_path)\n",
|
||
"test_feas_dict = {}\n",
|
||
"for file in files:\n",
|
||
" test_feas_dict[file] = np.load(test_path + file)\n",
|
||
" \n",
|
||
"### 2. 结果预测\n",
|
||
"test_predicts_dict = {}\n",
|
||
"for file_name,val in test_feas_dict.items():\n",
|
||
" test_predicts_dict[file_name] = model.predict(val.reshape([-1,12]))\n",
|
||
"# test_predicts_dict[file_name] = model.predict(val.reshape([-1,12])[0,:])\n",
|
||
"\n",
|
||
"### 3.存储预测结果\n",
|
||
"for file_name,val in test_predicts_dict.items(): \n",
|
||
" np.save('./result/' + file_name,val)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 打包到run.sh目录下方"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"#打包目录为zip文件\n",
|
||
"def make_zip(source_dir='./result/', output_filename = 'result.zip'):\n",
|
||
" zipf = zipfile.ZipFile(output_filename, 'w')\n",
|
||
" pre_len = len(os.path.dirname(source_dir))\n",
|
||
" source_dirs = os.walk(source_dir)\n",
|
||
" print(source_dirs)\n",
|
||
" for parent, dirnames, filenames in source_dirs:\n",
|
||
" print(parent, dirnames)\n",
|
||
" for filename in filenames:\n",
|
||
" if '.npy' not in filename:\n",
|
||
" continue\n",
|
||
" pathfile = os.path.join(parent, filename)\n",
|
||
" arcname = pathfile[pre_len:].strip(os.path.sep) #相对路径\n",
|
||
" zipf.write(pathfile, arcname)\n",
|
||
" zipf.close()\n",
|
||
"make_zip() "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 提升方向\n",
|
||
"\n",
|
||
"\n",
|
||
"- 模型角度:我们只使用了简单的MLP模型进行建模,可以考虑使用其它的更加fancy的模型进行尝试;\n",
|
||
"- 数据层面:构建一些特征或者对数据进行一些数据变换等;\n",
|
||
"- 针对损失函数设计各种trick的提升技巧;"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.7.6"
|
||
},
|
||
"toc": {
|
||
"base_numbering": 1,
|
||
"nav_menu": {},
|
||
"number_sections": true,
|
||
"sideBar": true,
|
||
"skip_h1_title": false,
|
||
"title_cell": "Table of Contents",
|
||
"title_sidebar": "Contents",
|
||
"toc_cell": false,
|
||
"toc_position": {
|
||
"height": "calc(100% - 180px)",
|
||
"left": "10px",
|
||
"top": "150px",
|
||
"width": "288px"
|
||
},
|
||
"toc_section_display": true,
|
||
"toc_window_display": true
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 4
|
||
}
|