From cebf4b2e08ce68208a136284c39e7b424f4b46c0 Mon Sep 17 00:00:00 2001 From: Harold-Ran <56714856+Harold-Ran@users.noreply.github.com> Date: Thu, 21 Oct 2021 16:45:38 +0800 Subject: [PATCH] =?UTF-8?q?Delete=20Task3=20=E6=A8=A1=E5=9E=8B=E5=BB=BA?= =?UTF-8?q?=E7=AB=8B=E4=B9=8BCNN+LSTM.ipynb?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- Task3 模型建立之CNN+LSTM.ipynb | 1 - 1 file changed, 1 deletion(-) delete mode 100644 Task3 模型建立之CNN+LSTM.ipynb diff --git a/Task3 模型建立之CNN+LSTM.ipynb b/Task3 模型建立之CNN+LSTM.ipynb deleted file mode 100644 index b3918ce..0000000 --- a/Task3 模型建立之CNN+LSTM.ipynb +++ /dev/null @@ -1 +0,0 @@ -{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.10","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Datawhale 气象海洋预测-Task3 模型建立之 CNN+LSTM\n本次任务我们将学习来自TOP选手“学习AI的打工人”的建模方案,该方案中采用的模型是CNN+LSTM。\n\n在Task2的数据分析中我们发现,构建新的特征是很困难的,因此本赛题是一个模型问题,我们的主要目标是构造一个能够从小数据集中充分挖掘空间信息和时间信息的模型。那么,说到挖掘空间信息的模型,我们会很自然的想到CNN,同样的,挖掘时间信息的模型我们会很容易想到LSTM,我们本次学习的这个TOP方案正是构造了CNN+LSTM的串行结构。","metadata":{}},{"cell_type":"markdown","source":"## 学习目标\n1. 学习TOP方案的数据处理方法。\n2. 学习TOP方案的模型构建方法。","metadata":{}},{"cell_type":"markdown","source":"## 内容介绍\n1. 数据处理\n - 增加月特征\n - 数据扁平化\n - 空值填充\n - 构造数据集\n2. 模型构建\n - 构造评估函数\n - 模型构造与训练\n - 模型评估\n3. 总结","metadata":{}},{"cell_type":"markdown","source":"## 代码示例","metadata":{}},{"cell_type":"markdown","source":"### 数据处理","metadata":{}},{"cell_type":"markdown","source":"该TOP方案的数据处理主要包括四部分:\n\n1. 增加月特征。将序列数据的起始月份作为新的特征。\n2. 数据扁平化。将序列数据按月拼接起来通过滑窗增加数据量。\n3. 空值填充。\n4. 
构造数据集。随机采样构造数据集。","metadata":{}},{"cell_type":"markdown","source":"在本次任务中我们需要用到TensorFlow Probability(TFP)这个库来计算皮尔逊相关系数,TFP是TensorFlow生态的一部分,它继承了TensorFlow的优势,主要用于概率推理和统计分析。使用方法可以参考TensorFlow Probability指南:https://tensorflow.google.cn/probability/overview?hl=zh-cn\n\n在TensorFlow 1.x中也有计算皮尔逊相关系数的函数tf.contrib.metrics.streaming_pearson_correlation,这里使用的是TensorFlow 2.x,没有相应的函数,因此需要借助TFP这个库,在使用时需要注意与TensorFlow的版本匹配。","metadata":{}},{"cell_type":"code","source":"# 安装TensorFlow Probability\n!pip install tensorflow-probability","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:24:47.918055Z","iopub.execute_input":"2021-10-21T07:24:47.918397Z","iopub.status.idle":"2021-10-21T07:24:55.957962Z","shell.execute_reply.started":"2021-10-21T07:24:47.918310Z","shell.execute_reply":"2021-10-21T07:24:55.957146Z"},"trusted":true},"execution_count":1,"outputs":[]},{"cell_type":"code","source":"import math\nimport pandas as pd\nimport numpy as np\n\nimport netCDF4 as nc\nfrom netCDF4 import Dataset\n\nimport matplotlib.pyplot as plt\n%matplotlib inline\n\nimport seaborn as sns\ncolor = sns.color_palette()\nsns.set_style('darkgrid')\n\n#忽略警告\nimport warnings\ndef ignore_warn(*args, **kwargs):\n pass\nwarnings.warn = ignore_warn\n\nfrom sklearn.model_selection import KFold, train_test_split\nfrom sklearn.metrics import mean_squared_error\n\nimport tensorflow as tf\nimport tensorflow_probability as tfp\nfrom tensorflow.keras.models import Model, Sequential\nfrom tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D, AveragePooling2D, TimeDistributed, BatchNormalization, Activation, LSTM\nfrom tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping\nfrom tensorflow.keras.regularizers import L2","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:24:55.959842Z","iopub.execute_input":"2021-10-21T07:24:55.960128Z","iopub.status.idle":"2021-10-21T07:25:02.221929Z","shell.execute_reply.started":"2021-10-21T07:24:55.960084Z","shell.execute_reply":"2021-10-21T07:25:02.221222Z"},"trusted":true},"execution_count":2,"outputs":[]},{"cell_type":"code","source":"# 固定随机种子\nSEED = 42\n\ntf.random.set_seed(SEED)\nnp.random.seed(SEED)","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:25:02.223265Z","iopub.execute_input":"2021-10-21T07:25:02.223537Z","iopub.status.idle":"2021-10-21T07:25:02.227675Z","shell.execute_reply.started":"2021-10-21T07:25:02.223502Z","shell.execute_reply":"2021-10-21T07:25:02.226978Z"},"trusted":true},"execution_count":3,"outputs":[]},{"cell_type":"code","source":"# 读取数据\n\n# 存放数据的路径\npath = '/kaggle/input/ninoprediction/'\nsoda_train = Dataset(path + 'SODA_train.nc')\nsoda_label = Dataset(path + 'SODA_label.nc')\ncmip_train = Dataset(path + 'CMIP_train.nc')\ncmip_label = Dataset(path + 'CMIP_label.nc')","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:25:02.229652Z","iopub.execute_input":"2021-10-21T07:25:02.230131Z","iopub.status.idle":"2021-10-21T07:25:02.334363Z","shell.execute_reply.started":"2021-10-21T07:25:02.230095Z","shell.execute_reply":"2021-10-21T07:25:02.333651Z"},"trusted":true},"execution_count":4,"outputs":[]},{"cell_type":"markdown","source":"#### 
增加月特征","metadata":{}},{"cell_type":"markdown","source":"本赛题的线上测试集是任意选取某个月为起始的长度为12的序列,因此该方案中增加了起始月份作为新的特征。但是使用整数1~12不能反映12月与1月相邻这一特点,因此需要借助三角函数的周期性,同时考虑到单独使用sin函数或cos函数会存在某些月份的函数值相同的现象,因此同时使用sin函数和cos函数作为两个新增月份特征,保证每个起始月份的这两个特征组合都是独一无二的,并且又能够很好地表现出月份的周期性特征。","metadata":{}},{"cell_type":"markdown","source":"我们可以通过可视化直观地感受下每个月份所构造的月份特征组合。","metadata":{}},{"cell_type":"code","source":"months = range(0, 12)\nmonth_labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec']\n# sin月份特征\nmonths_sin = map(lambda x: math.sin(2 * math.pi * x / len(months)), months)\n# cos月份特征\nmonths_cos = map(lambda x: math.cos(2 * math.pi * x / len(months)), months)\n\n# 绘制每个月的月份特征组合\nplt.figure(figsize=(20, 5))\nx_axis = np.arange(-1, 13, 1e-2)\nsns.lineplot(x=x_axis, y=np.sin(2 * math.pi * x_axis / len(months)))\nsns.lineplot(x=x_axis, y=np.cos(2 * math.pi * x_axis / len(months)))\nsns.scatterplot(x=months, y=months_sin, s=200)\nsns.scatterplot(x=months, y=months_cos, s=200)\nplt.xticks(ticks=months, labels=month_labels)\nplt.show()","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:25:02.335883Z","iopub.execute_input":"2021-10-21T07:25:02.336173Z","iopub.status.idle":"2021-10-21T07:25:02.897459Z","shell.execute_reply.started":"2021-10-21T07:25:02.336139Z","shell.execute_reply":"2021-10-21T07:25:02.896681Z"},"trusted":true},"execution_count":5,"outputs":[]},{"cell_type":"markdown","source":"构造SODA数据的sin月份特征。","metadata":{}},{"cell_type":"code","source":"# 构造一个维度为100*36*24*72的矩阵,矩阵中的每个值为所在月份的sin函数值\nsoda_month_sin = np.zeros((100, 36, 24, 72))\nfor y in range(100):\n for m in range(36):\n for lat in range(24):\n for lon in range(72):\n soda_month_sin[y, m, lat, lon] = math.sin(2 * math.pi * (m % 12) / 12)\n \nsoda_month_sin.shape","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:25:02.898646Z","iopub.execute_input":"2021-10-21T07:25:02.899235Z","iopub.status.idle":"2021-10-21T07:25:07.739024Z","shell.execute_reply.started":"2021-10-21T07:25:02.899195Z","shell.execute_reply":"2021-10-21T07:25:07.738197Z"},"trusted":true},"execution_count":6,"outputs":[]},{"cell_type":"markdown","source":"构造SODA数据的cos月份特征。","metadata":{}},{"cell_type":"code","source":"# 构造一个维度为100*36*24*72的矩阵,矩阵中的每个值为所在月份的cos函数值\nsoda_month_cos = np.zeros((100, 36, 24, 72))\nfor y in range(100):\n for m in range(36):\n for lat in range(24):\n for lon in range(72):\n soda_month_cos[y, m, lat, lon] = math.cos(2 * math.pi * (m % 12) / 12)\n \nsoda_month_cos.shape","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:25:07.740128Z","iopub.execute_input":"2021-10-21T07:25:07.740388Z","iopub.status.idle":"2021-10-21T07:25:12.921466Z","shell.execute_reply.started":"2021-10-21T07:25:07.740353Z","shell.execute_reply":"2021-10-21T07:25:12.920789Z"},"trusted":true},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":"构造CMIP数据的sin月份特征。","metadata":{}},{"cell_type":"code","source":"# 构造一个维度为4645*36*24*72的矩阵,矩阵中的每个值为所在月份的sin函数值\ncmip_month_sin = np.zeros((4645, 36, 24, 72))\nfor y in range(4645):\n for m in range(36):\n for lat in range(24):\n for lon in range(72):\n cmip_month_sin[y, m, lat, lon] = math.sin(2 * math.pi * (m % 12) / 12)\n 
\ncmip_month_sin.shape","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:25:12.922828Z","iopub.execute_input":"2021-10-21T07:25:12.923073Z","iopub.status.idle":"2021-10-21T07:29:05.428099Z","shell.execute_reply.started":"2021-10-21T07:25:12.923041Z","shell.execute_reply":"2021-10-21T07:29:05.427241Z"},"trusted":true},"execution_count":8,"outputs":[]},{"cell_type":"markdown","source":"构造CMIP数据的cos月份特征。","metadata":{}},{"cell_type":"code","source":"# 构造一个维度为4645*36*24*72的矩阵,矩阵中的每个值为所在月份的cos函数值\ncmip_month_cos = np.zeros((4645, 36, 24, 72))\nfor y in range(4645):\n for m in range(36):\n for lat in range(24):\n for lon in range(72):\n cmip_month_cos[y, m, lat, lon] = math.cos(2 * math.pi * (m % 12) / 12)\n \ncmip_month_cos.shape","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:29:05.429467Z","iopub.execute_input":"2021-10-21T07:29:05.429832Z","iopub.status.idle":"2021-10-21T07:32:53.820066Z","shell.execute_reply.started":"2021-10-21T07:29:05.429790Z","shell.execute_reply":"2021-10-21T07:32:53.819270Z"},"trusted":true},"execution_count":9,"outputs":[]},{"cell_type":"markdown","source":"#### 数据扁平化","metadata":{}},{"cell_type":"markdown","source":"在Task2中我们发现,赛题中给出的数据量非常少,如何增加数据量呢?对于时序数据,一种常用的做法就是滑窗。\n\n由于每条数据在时间上有重叠,我们取数据的前12个月拼接起来,就得到了长度为(数据条数×12个月)的序列数据,如图1所示:\n![图1 样本拼接示意图](http://)\n\n然后我们以每个月为起始月,接下来的12个月作为模型输入X,后24个月的Nino3.4指数作为预测目标Y构建训练样本,如图2所示:\n![图2 滑窗构造训练样本](http://)\n\n需要注意的是,CMIP数据提供了不同的拟合模式,只有在同种模式下各个年份的数据在时间上是连续的,因此同种模式的数据才能在时间上拼接起来,除去最后11个月不能构成训练样本外,滑窗最终能获得的训练样本数量可以按以下方式计算得到:\n\n- SODA:1种模式×(100年×12-11)=1189条样本\n- CMIP6:15种模式×(151年×12-11)=27015条样本\n- CMIP5:17种模式×(140年×12-11)=28373条样本","metadata":{}},{"cell_type":"markdown","source":"在下面的代码中,我们只将各个模式的数据拼接起来而没有采用滑窗,这是因为考虑到采用滑窗得到的训练样本维度是(数据条数×12×24×72),需要占用大量的内存资源。我们在之后构建数据集时,随机抽取了部分样本,大家在实际问题中,如果资源足够的话,可以采用滑窗构建的全部的数据,不过需要注意数据量大的情况下可以考虑构建更深的模型来挖掘更多信息。","metadata":{}},{"cell_type":"code","source":"def make_flatted(train_ds, label_ds, month_sin, month_cos, info, start_idx=0):\n keys = ['sst', 't300', 'ua', 'va']\n label_key = 'nino'\n # 年数\n years = info[1]\n # 模式数\n models = info[2]\n \n train_list = []\n label_list = []\n \n # 将同种模式下的数据拼接起来\n for model_i in range(models):\n blocks = []\n \n # 对每个特征,取每条数据的前12个月进行拼接\n for key in keys:\n block = train_ds[key][start_idx + model_i * years: start_idx + (model_i + 1) * years, :12].reshape(-1, 24, 72, 1).data\n blocks.append(block)\n # 增加sin月份特征\n block_sin = month_sin[start_idx + model_i * years: start_idx + (model_i + 1) * years, :12].reshape(-1, 24, 72, 1)\n blocks.append(block_sin)\n # 增加cos月份特征\n block_cos = month_cos[start_idx + model_i * years: start_idx + (model_i + 1) * years, :12].reshape(-1, 24, 72, 1)\n blocks.append(block_cos)\n \n # 将所有特征在最后一个维度上拼接起来\n train_flatted = np.concatenate(blocks, axis=-1)\n \n # 取12-23月的标签进行拼接,注意加上最后一年的最后12个月的标签(与最后一年12-23月的标签共同构成最后一年前12个月的预测目标)\n label_flatted = np.concatenate([\n label_ds[label_key][start_idx + model_i * years: start_idx + (model_i + 1) * years, 12: 24].reshape(-1).data,\n label_ds[label_key][start_idx + (model_i + 1) * years - 1, 24: 36].reshape(-1).data\n ], axis=0)\n \n train_list.append(train_flatted)\n label_list.append(label_flatted)\n \n return train_list, 
label_list","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:32:53.823659Z","iopub.execute_input":"2021-10-21T07:32:53.823886Z","iopub.status.idle":"2021-10-21T07:32:53.836173Z","shell.execute_reply.started":"2021-10-21T07:32:53.823861Z","shell.execute_reply":"2021-10-21T07:32:53.835245Z"},"trusted":true},"execution_count":10,"outputs":[]},{"cell_type":"code","source":"soda_info = ('soda', 100, 1)\ncmip6_info = ('cmip6', 151, 15)\ncmip5_info = ('cmip5', 140, 17)\n\nsoda_trains, soda_labels = make_flatted(soda_train, soda_label, soda_month_sin, soda_month_cos, soda_info)\ncmip6_trains, cmip6_labels = make_flatted(cmip_train, cmip_label, cmip_month_sin, cmip_month_cos, cmip6_info)\ncmip5_trains, cmip5_labels = make_flatted(cmip_train, cmip_label, cmip_month_sin, cmip_month_cos, cmip5_info, cmip6_info[1]*cmip6_info[2])\n\n# 得到扁平化后的数据维度为(模式数×序列长度×纬度×经度×特征数),其中序列长度=年数×12\nnp.shape(soda_trains), np.shape(cmip6_trains), np.shape(cmip5_trains)","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:32:53.839952Z","iopub.execute_input":"2021-10-21T07:32:53.840294Z","iopub.status.idle":"2021-10-21T07:33:48.038980Z","shell.execute_reply.started":"2021-10-21T07:32:53.840256Z","shell.execute_reply":"2021-10-21T07:33:48.038188Z"},"trusted":true},"execution_count":11,"outputs":[]},{"cell_type":"code","source":"del soda_month_sin, soda_month_cos\ndel cmip_month_sin, cmip_month_cos\ndel soda_train, soda_label\ndel cmip_train, cmip_label","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:33:48.040351Z","iopub.execute_input":"2021-10-21T07:33:48.040623Z","iopub.status.idle":"2021-10-21T07:33:48.222642Z","shell.execute_reply.started":"2021-10-21T07:33:48.040591Z","shell.execute_reply":"2021-10-21T07:33:48.221881Z"},"trusted":true},"execution_count":12,"outputs":[]},{"cell_type":"markdown","source":"#### 空值填充","metadata":{}},{"cell_type":"markdown","source":"在Task2中我们发现,除SST外,其它特征中都存在空值,这些空值基本都在陆地上,因此我们直接将空值填充为0。","metadata":{}},{"cell_type":"code","source":"# 填充SODA数据中的空值\nsoda_trains = np.array(soda_trains)\nsoda_trains_nan = np.isnan(soda_trains)\nsoda_trains[soda_trains_nan] = 0\nprint('Number of null in soda_trains after fillna:', np.sum(np.isnan(soda_trains)))","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:33:48.223769Z","iopub.execute_input":"2021-10-21T07:33:48.224085Z","iopub.status.idle":"2021-10-21T07:33:48.299272Z","shell.execute_reply.started":"2021-10-21T07:33:48.224046Z","shell.execute_reply":"2021-10-21T07:33:48.298533Z"},"trusted":true},"execution_count":13,"outputs":[]},{"cell_type":"code","source":"# 填充CMIP6数据中的空值\ncmip6_trains = np.array(cmip6_trains)\ncmip6_trains_nan = np.isnan(cmip6_trains)\ncmip6_trains[cmip6_trains_nan] = 0\nprint('Number of null in cmip6_trains after fillna:', np.sum(np.isnan(cmip6_trains)))","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:33:48.300605Z","iopub.execute_input":"2021-10-21T07:33:48.301017Z","iopub.status.idle":"2021-10-21T07:33:50.048014Z","shell.execute_reply.started":"2021-10-21T07:33:48.300970Z","shell.execute_reply":"2021-10-21T07:33:50.047072Z"},"trusted":true},"execution_count":14,"outputs":[]},{"cell_type":"code","source":"# 填充CMIP5数据中的空值\ncmip5_trains = np.array(cmip5_trains)\ncmip5_trains_nan = np.isnan(cmip5_trains)\ncmip5_trains[cmip5_trains_nan] = 0\nprint('Number of null in cmip6_trains after fillna:', 
{"cell_type":"markdown","source":"#### Constructing the datasets","metadata":{}},{"cell_type":"markdown","source":"When splitting off training and validation sets, one thing to consider is whether the training, validation and test sets follow the same distribution. In this competition we are given two kinds of data: the CMIP data are historical simulations produced by the CMIP5/6 models, while the SODA data are historical observational assimilation data reconstructed by the SODA model; the online test set comes from assimilation data built from several international ocean datasets. The SODA data are therefore distributed much like the online test set, whereas the CMIP data are not. When the three differ, we usually try to make the validation set match the test set as closely as possible, so that a model which does well on validation can also be expected to do well on test.\n\nWe therefore draw 100 samples from each CMIP mode as the training set (100 is only an example; how many samples to use in practice depends on the available resources and on how deep a model is built) and 100 samples from the SODA mode as the validation set. You might wonder why the SODA data are flattened and then sampled instead of being used directly: taking the first 12 months of each raw sequence as input and the following 24 months as the label would give a validation set in which every sample starts at month 0, while the online test sequences start at random months. To keep the validation distribution close to the test distribution, the validation samples should also start at random months.\n\nWe do not build a separate test set here because the online test set has already been released and can be used directly. During the competition the online test set is hidden, so an offline test set has to be constructed to evaluate the model, and one has to check that the offline results and the online submissions are close, or at least move in the same direction; if they do not, the offline test set has to be adjusted until its distribution matches the online one well enough to give a reliable signal for tuning the model.","metadata":{}},{"cell_type":"code","source":"# build the training set\n\nX_train = []\ny_train = []\n# draw 100 samples from each of the 17 CMIP5 modes\nfor model_i in range(17):\n    samples = np.random.choice(cmip5_trains.shape[1]-12, size=100)\n    for ind in samples:\n        X_train.append(cmip5_trains[model_i, ind: ind+12])\n        y_train.append(cmip5_labels[model_i][ind: ind+24])\n# draw 100 samples from each of the 15 CMIP6 modes\nfor model_i in range(15):\n    samples = np.random.choice(cmip6_trains.shape[1]-12, size=100)\n    for ind in samples:\n        X_train.append(cmip6_trains[model_i, ind: ind+12])\n        y_train.append(cmip6_labels[model_i][ind: ind+24])\nX_train = np.array(X_train)\ny_train = np.array(y_train)","metadata":{"trusted":true},"execution_count":16,"outputs":[]},{"cell_type":"code","source":"# build the validation set\n\nX_valid = []\ny_valid = []\nsamples = np.random.choice(soda_trains.shape[1]-12, size=100)\nfor ind in samples:\n    X_valid.append(soda_trains[0, ind: ind+12])\n    y_valid.append(soda_labels[0][ind: ind+24])\nX_valid = np.array(X_valid)\ny_valid = np.array(y_valid)","metadata":{"trusted":true},"execution_count":17,"outputs":[]},{"cell_type":"code","source":"# check the dataset shapes\nX_train.shape, y_train.shape, X_valid.shape, y_valid.shape","metadata":{"trusted":true},"execution_count":18,"outputs":[]},{"cell_type":"code","source":"del cmip5_trains, cmip5_labels\ndel cmip6_trains, cmip6_labels\ndel soda_trains, 
soda_labels","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:33:52.802440Z","iopub.execute_input":"2021-10-21T07:33:52.802952Z","iopub.status.idle":"2021-10-21T07:33:52.821661Z","shell.execute_reply.started":"2021-10-21T07:33:52.802917Z","shell.execute_reply":"2021-10-21T07:33:52.821005Z"},"trusted":true},"execution_count":19,"outputs":[]},{"cell_type":"code","source":"# 保存数据集\nnp.save('X_train_sample.npy', X_train)\nnp.save('y_train_sample.npy', y_train)\nnp.save('X_valid_sample.npy', X_valid)\nnp.save('y_valid_sample.npy', y_valid)","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:33:52.822849Z","iopub.execute_input":"2021-10-21T07:33:52.823328Z","iopub.status.idle":"2021-10-21T07:33:59.285875Z","shell.execute_reply.started":"2021-10-21T07:33:52.823277Z","shell.execute_reply":"2021-10-21T07:33:59.285139Z"},"trusted":true},"execution_count":20,"outputs":[]},{"cell_type":"markdown","source":"### 模型构建","metadata":{}},{"cell_type":"markdown","source":"在模型构建部分的通用流程是:构造评估函数 -> 构建并训练模型 -> 模型评估,后两步是循环的,可以根据评估结果重新调整并训练模型,再重新进行评估。","metadata":{}},{"cell_type":"markdown","source":"#### 构造评估函数","metadata":{}},{"cell_type":"markdown","source":"模型的评估函数通常就是官方给出的评估指标,不过在比赛中经常会出现线下的评估结果和提交后的线上评估结果不一致的情况,这通常是线下测试集和线上测试集不一致造成的。","metadata":{}},{"cell_type":"code","source":"# 读取数据集\nX_train = np.load('../input/ai-earth-task03-samples/X_train_sample.npy')\ny_train = np.load('../input/ai-earth-task03-samples/y_train_sample.npy')\nX_valid = np.load('../input/ai-earth-task03-samples/X_valid_sample.npy')\ny_valid = np.load('../input/ai-earth-task03-samples/y_valid_sample.npy')","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:33:59.289160Z","iopub.execute_input":"2021-10-21T07:33:59.289375Z","iopub.status.idle":"2021-10-21T07:34:22.470749Z","shell.execute_reply.started":"2021-10-21T07:33:59.289350Z","shell.execute_reply":"2021-10-21T07:34:22.469957Z"},"trusted":true},"execution_count":21,"outputs":[]},{"cell_type":"code","source":"X_train.shape, y_train.shape, X_valid.shape, y_valid.shape","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:34:22.473256Z","iopub.execute_input":"2021-10-21T07:34:22.473768Z","iopub.status.idle":"2021-10-21T07:34:22.481006Z","shell.execute_reply.started":"2021-10-21T07:34:22.473730Z","shell.execute_reply":"2021-10-21T07:34:22.480199Z"},"trusted":true},"execution_count":22,"outputs":[]},{"cell_type":"code","source":"def rmse(y_true, y_preds):\n return np.sqrt(mean_squared_error(y_pred = y_preds, y_true = y_true))\n\n# 评估函数\ndef score(y_true, y_preds):\n # 相关性技巧评分\n accskill_score = 0\n # RMSE\n rmse_scores = 0\n a = [1.5] * 4 + [2] * 7 + [3] * 7 + [4] * 6\n y_true_mean = np.mean(y_true, axis=0)\n y_pred_mean = np.mean(y_preds, axis=0)\n for i in range(24):\n fenzi = np.sum((y_true[:, i] - y_true_mean[i]) * (y_preds[:, i] - y_pred_mean[i]))\n fenmu = np.sqrt(np.sum((y_true[:, i] - y_true_mean[i])**2) * np.sum((y_preds[:, i] - y_pred_mean[i])**2))\n cor_i = fenzi / fenmu\n accskill_score += a[i] * np.log(i+1) * cor_i\n rmse_score = rmse(y_true[:, i], y_preds[:, i])\n rmse_scores += rmse_score\n return 2/3.0 * accskill_score - rmse_scores","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:34:22.482435Z","iopub.execute_input":"2021-10-21T07:34:22.482741Z","iopub.status.idle":"2021-10-21T07:34:22.493682Z","shell.execute_reply.started":"2021-10-21T07:34:22.482705Z","shell.execute_reply":"2021-10-21T07:34:22.492901Z"},"trusted":true},"execution_count":23,"outputs":[]},{"cell_type":"markdown","source":"#### 
模型构造与训练","metadata":{}},{"cell_type":"markdown","source":"这部分是赛题的重点,该TOP方案采用的是CNN+LSTM的串行结构,其中CNN用来提取空间信息,LSTM用来提取时间信息。\n\n- CNN部分\n\nCNN常用于处理图像信息,它在处理空间信息上也有很好的表现。CNN的输入尺寸是(Batch,H,W,C),其中Batch是批量梯度下降中一个批次的样本数量,H和W分别是输入图像的高和宽,C是输入图像的通道数,对于本题中的空间数据,H和W就对应数据的纬度和经度,C对应特征数。我们的训练样本中还多了一个时间维度,因此需要用TimeDistributed来包装卷积层。\n\nTimeDistributed是一个层封装器,它的参数是一个神经网络层,它的作用是将这个网络层应用到输入的每一个时间步上。例如,我们的每个输入样本的维度是(12,24,72,6),用TimeDistributed包装卷积层Conv2D,它就会将卷积层应用于这12个时间步的每一个,而卷积层作用的输入维度就是(24,72,6)。说明文档可以参考https://keras.io/zh/layers/wrappers/\n\nBatchNormalization(后面简称BN)是批标准化层,通常放在卷积层后用于标准化数据的分布,能够减少各层不同数据分布之间的相互影响和依赖,具有加快模型训练速度、避免梯度爆炸、在一定程度上能增强模型泛化能力等优点,是神经网络问题中常用的“大杀器”。不过目前关于BN层和ReLU激活函数的放置顺序孰先孰后的问题众说纷纭,具体还是看模型的效果。关于这个问题的讨论可以参考https://www.zhihu.com/question/283715823\n\n总体来看CNN这一部分采用的是比较通用的结构,第一层采用比较大的卷积核(7×7),后面接多层的小卷积核(3×3),并用BN提升模型效果,用池化层减少模型参数,池化层常用的有MaxPooling和AveragePooling,通常MaxPooling效果更好,不过具体看模型效果。模型的主要难点就在于调参,目前模型调参没有标准的答案,更多地是参考前人的经验以及不断地尝试。\n\n- LSTM部分\n\nCNN部分经过Flatten层将除时间维度以外的维度压平(即除时间步长12外的其它维度大小相乘,例如CNN部分最后的池化层输出维度是(T,H,W,C),则压平后的维度是(T,H×W×C)),输入LSTM层。LSTM层接受的输入维度为(Time_steps,Input_size),其中Time_steps就是时间步长12,Input_size是压平后的维度大小。keras中LSTM的主要参数是units(输出维度)、activation(激活函数)、return_sequences(是否返回最后一个输出),return_sequences为True时返回的是全部的时间序列,即维度是(Time_steps,units),为False时只返回最后一个时间步的输出,即维度(units)。LSTM的使用方法可以参考https://keras.io/zh/layers/recurrent/\n\nLSTM的参数量是4倍的Input_size×units,参数量过多就容易过拟合,同时我们之间构建的训练样本也不够多,因此该方案中只堆叠了两个LSTM层。","metadata":{}},{"cell_type":"code","source":"# 模型构造\ndef build_model(learning_rate, reg_lambda = 0.001): \n # 定义模型输入\n inp = tf.keras.layers.Input(shape=(12, 24, 72, 6))\n\n # CNN部分\n x = TimeDistributed(Conv2D(32, (7, 7), strides=(2, 2), padding='same', kernel_regularizer=L2(reg_lambda)))(inp)\n x = TimeDistributed(BatchNormalization())(x)\n x = TimeDistributed(Activation('relu'))(x)\n x = TimeDistributed(Conv2D(32, (3, 3), padding='same', kernel_regularizer=L2(reg_lambda)))(x)\n x = TimeDistributed(BatchNormalization())(x)\n x = TimeDistributed(Activation('relu'))(x)\n x = TimeDistributed(AveragePooling2D((2, 2), strides=(2, 2)))(x)\n x = TimeDistributed(Flatten())(x)\n\n # LSTM部分\n x = LSTM(2048, return_sequences=True, dropout=0.5)(x)\n x = LSTM(1024, return_sequences=False)(x)\n\n # 模型输出\n out = Dense(24, activation='linear')(x)\n # 建立模型\n model = Model(inputs=inp, outputs=out)\n # 使用RMSprop作为模型优化器\n opt = tf.optimizers.RMSprop(lr=learning_rate)\n # 模型编译,损失函数是MSE,赛题的评估指标中包括相关系数,因此用皮尔逊相关系数作为评价指标\n model.compile(optimizer = opt, loss = 'mean_squared_error', metrics=[tfp.stats.correlation])\n\n return model","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:34:22.495080Z","iopub.execute_input":"2021-10-21T07:34:22.495381Z","iopub.status.idle":"2021-10-21T07:34:22.507840Z","shell.execute_reply.started":"2021-10-21T07:34:22.495345Z","shell.execute_reply":"2021-10-21T07:34:22.506982Z"},"trusted":true},"execution_count":24,"outputs":[]},{"cell_type":"code","source":"learning_rate = 3e-4\nmodel = build_model(learning_rate)\nmodel.summary()","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:34:22.509144Z","iopub.execute_input":"2021-10-21T07:34:22.509475Z","iopub.status.idle":"2021-10-21T07:34:34.873887Z","shell.execute_reply.started":"2021-10-21T07:34:22.509438Z","shell.execute_reply":"2021-10-21T07:34:34.873164Z"},"trusted":true},"execution_count":25,"outputs":[]},{"cell_type":"markdown","source":"在模型训练时,有一些常用的回调函数:\n\n- ModelCheckPoint:在每个训练周期后保存模型,其中参数monitor为模型的监测指标,save_best_only设为True时监测指标最优的模型不会被覆盖,mode用来定义监测指标为最大或最小时模型最优。\n\n- 
{"cell_type":"markdown","source":"Several callbacks are commonly used during training:\n\n- ModelCheckpoint: saves the model after every epoch; monitor is the metric to watch, save_best_only=True prevents the best model from being overwritten, and mode states whether the best value of the metric is its maximum or its minimum.\n\n- EarlyStopping: stops training when the monitored metric has not improved for a given number of epochs; patience is the number of epochs without improvement that will be tolerated.\n\n- ReduceLROnPlateau: lowers the learning rate when the monitored metric has not improved for a given number of epochs; factor is the reduction factor, new learning rate = factor × learning rate.\n\nCustom callbacks can also be written by subclassing keras.callbacks.Callback; see https://keras.io/zh/callbacks/","metadata":{}},{"cell_type":"code","source":"# path where the model weights are saved\nmodel_weights = './cnn_lstm_baseline.h5'\n# set up the callbacks\ncheckpoint = ModelCheckpoint(model_weights, monitor='val_loss', verbose=0, save_best_only=True, mode='min',\n                             save_weights_only=True)\nearly_stopping = EarlyStopping(monitor=\"val_loss\", patience=5)\n# train the model and keep the training history\nhistory = model.fit(X_train, y_train,\n                    validation_data=(X_valid, y_valid),\n                    batch_size=256, epochs=100,\n                    callbacks=[checkpoint, early_stopping],\n                    verbose=2)","metadata":{"trusted":true},"execution_count":26,"outputs":[]},{"cell_type":"code","source":"# metrics recorded in the training history\nhistory.history.keys()","metadata":{"trusted":true},"execution_count":27,"outputs":[]},{"cell_type":"code","source":"# plot the training/validation curves\ndef training_vis(hist):\n    loss = hist.history['loss']\n    val_loss = hist.history['val_loss']\n    corr = hist.history['correlation']\n    val_corr = hist.history['val_correlation']\n    \n    # loss curves\n    fig = plt.figure(figsize=(8,4))\n    # subplot loss\n    ax1 = fig.add_subplot(121)\n    ax1.plot(loss,label='train_loss')\n    ax1.plot(val_loss,label='val_loss')\n    ax1.set_xlabel('Epochs')\n    ax1.set_ylabel('Loss')\n    ax1.set_title('Loss on Training and Validation Data')\n    ax1.legend()\n    # Pearson correlation curves\n    ax2 = fig.add_subplot(122)\n    ax2.plot(corr,label='train_pearson_correlation')\n    ax2.plot(val_corr,label='val_pearson_correlation')\n    ax2.set_xlabel('Epochs')\n    ax2.set_ylabel('Pearson Correlation')\n    ax2.set_title('Pearson Correlation on Training and Validation Data')\n    ax2.legend()\n    plt.tight_layout()","metadata":{"trusted":true},"execution_count":28,"outputs":[]},{"cell_type":"code","source":"training_vis(history)","metadata":{"trusted":true},"execution_count":29,"outputs":[]},{"cell_type":"markdown","source":"Training/validation curves are the usual way to judge how well the model fits. The figure above shows the loss and the Pearson correlation on the training and validation sets over the course of training. The training loss drops quickly, but the validation loss oscillates without any clear decrease, which indicates that the model is learning poorly and overfitting; the relevant hyper-parameters need to be adjusted.","metadata":{}},{"cell_type":"markdown","source":"#### Model evaluation","metadata":{}},{"cell_type":"markdown","source":"Finally, we evaluate the trained model on the test set. (Note that ModelCheckpoint saved the best weights to cnn_lstm_baseline.h5; to evaluate that checkpoint rather than the final epoch, the weights would first have to be loaded back with model.load_weights.)","metadata":{}},{"cell_type":"code","source":"# path to the test set\ntest_path = '../input/ai-earth-tests/'\n# path to the test-set labels\ntest_label_path = 
'../input/ai-earth-tests-labels/'","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:35:19.298002Z","iopub.execute_input":"2021-10-21T07:35:19.299442Z","iopub.status.idle":"2021-10-21T07:35:19.303903Z","shell.execute_reply.started":"2021-10-21T07:35:19.299402Z","shell.execute_reply":"2021-10-21T07:35:19.302832Z"},"trusted":true},"execution_count":30,"outputs":[]},{"cell_type":"code","source":"import os\n\n# 读取测试数据和测试数据的标签,并记录每个测试样本的起始月份用于之后构造月份特征\nfiles = os.listdir(test_path)\nX_test = []\ny_test = []\nfirst_months = [] # 样本起始月份\nfor file in files:\n X_test.append(np.load(test_path + file))\n y_test.append(np.load(test_label_path + file))\n first_months.append(int(file.split('_')[2]))","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:35:19.305419Z","iopub.execute_input":"2021-10-21T07:35:19.305703Z","iopub.status.idle":"2021-10-21T07:35:20.606627Z","shell.execute_reply.started":"2021-10-21T07:35:19.305668Z","shell.execute_reply":"2021-10-21T07:35:20.605862Z"},"trusted":true},"execution_count":31,"outputs":[]},{"cell_type":"code","source":"X_test = np.array(X_test)\ny_test = np.array(y_test)\nX_test.shape, y_test.shape","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:35:20.609331Z","iopub.execute_input":"2021-10-21T07:35:20.609756Z","iopub.status.idle":"2021-10-21T07:35:20.639717Z","shell.execute_reply.started":"2021-10-21T07:35:20.609718Z","shell.execute_reply":"2021-10-21T07:35:20.638877Z"},"trusted":true},"execution_count":32,"outputs":[]},{"cell_type":"code","source":"# 构造一个维度为103*12*24*72的矩阵,矩阵中的每个值为所在月份的sin函数值\ntest_month_sin = np.zeros((103, 12, 24, 72, 1))\nfor y in range(103):\n for m in range(12):\n for lat in range(24):\n for lon in range(72):\n test_month_sin[y, m, lat, lon] = math.sin(2 * math.pi * ((m + first_months[y]-1) % 12) / 12)\n \ntest_month_sin.shape","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:35:20.641336Z","iopub.execute_input":"2021-10-21T07:35:20.641596Z","iopub.status.idle":"2021-10-21T07:35:23.271966Z","shell.execute_reply.started":"2021-10-21T07:35:20.641560Z","shell.execute_reply":"2021-10-21T07:35:23.271252Z"},"trusted":true},"execution_count":33,"outputs":[]},{"cell_type":"code","source":"# 构造一个维度为103*12*24*72的矩阵,矩阵中的每个值为所在月份的cos函数值\ntest_month_cos = np.zeros((103, 12, 24, 72, 1))\nfor y in range(103):\n for m in range(12):\n for lat in range(24):\n for lon in range(72):\n test_month_cos[y, m, lat, lon] = math.cos(2 * math.pi * ((m + first_months[y]-1) % 12) / 12)\n \ntest_month_cos.shape","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:35:23.273185Z","iopub.execute_input":"2021-10-21T07:35:23.273530Z","iopub.status.idle":"2021-10-21T07:35:26.128961Z","shell.execute_reply.started":"2021-10-21T07:35:23.273485Z","shell.execute_reply":"2021-10-21T07:35:26.128274Z"},"trusted":true},"execution_count":34,"outputs":[]},{"cell_type":"code","source":"# 构造测试集\nX_test = np.concatenate([X_test, test_month_sin, test_month_cos], axis=-1)\nX_test.shape","metadata":{"execution":{"iopub.status.busy":"2021-10-21T07:35:26.130332Z","iopub.execute_input":"2021-10-21T07:35:26.130600Z","iopub.status.idle":"2021-10-21T07:35:26.221279Z","shell.execute_reply.started":"2021-10-21T07:35:26.130568Z","shell.execute_reply":"2021-10-21T07:35:26.220403Z"},"trusted":true},"execution_count":35,"outputs":[]},{"cell_type":"code","source":"# 模型预测\ntest_preds = 
model.predict(X_test)","metadata":{"trusted":true},"execution_count":36,"outputs":[]},{"cell_type":"code","source":"# evaluate the model (use a new name so the score function itself is not overwritten)\nfinal_score = score(y_test, test_preds)\nprint(final_score)","metadata":{"trusted":true},"execution_count":37,"outputs":[]},{"cell_type":"markdown","source":"## Summary","metadata":{}},{"cell_type":"markdown","source":"- In the data-processing step this solution adds a pair of month features. In my view the benefit is limited once a sequence model is already used, and the extra dimensions noticeably increase the memory taken by the training data, so the gain in this competition is small; in other settings, however, this way of constructing features is still well worth borrowing.\n- The way the model is assembled is very instructive for beginners: chaining different models in series combines their individual strengths. The point to watch with this construction is that the output dimensions of one model must match the input dimensions expected by the next.","metadata":{}},{"cell_type":"markdown","source":"## References\n1. Experience write-up by the TOP competitor “学习AI的打工人”: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.18.561d5330HKwYOW&postId=196536","metadata":{}}]} \ No newline at end of file