lyj_commit

2021-04-06 16:40:09 +08:00
parent 7842d69867
commit 2191631ba1
9 changed files with 7671 additions and 11 deletions
--- a/地理数据分析常用工具-checkpoint.ipynb
+++ b/地理数据分析常用工具-checkpoint.ipynb
--- a/wisdomOcean/.ipynb_checkpoints/Task2
+++ b/wisdomOcean/.ipynb_checkpoints/Task2
--- a/wisdomOcean/.ipynb_checkpoints/Task3
+++ b/wisdomOcean/.ipynb_checkpoints/Task3
--- a/wisdomOcean/.ipynb_checkpoints/Task5
+++ b/wisdomOcean/.ipynb_checkpoints/Task5
@@ -0,0 +1,710 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Datawhale Smart Ocean Construction - Task 5: Model Ensembling"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5.1 Learning objectives"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Learn model ensembling strategies.\n",
+    "\n",
+    "Complete the corresponding check-in tasks."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5.2 Overview"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "https://mlwave.com/kaggle-ensembling-guide/  \n",
+    "https://github.com/MLWave/Kaggle-Ensemble-Guide"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
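+    "Model ensembling is an important step in the late stage of a competition. Broadly, it comes in the following forms.\n",
+    "\n",
+    "1. Simple weighted ensembling:\n",
+    "    - Regression (or classification probabilities): arithmetic mean, geometric mean;\n",
+    "    - Classification: voting.\n",
+    "\n",
+    "\n",
+    "2. boosting/bagging (already used in XGBoost, AdaBoost and GBDT):\n",
+    "    - Boosting-style methods that combine many trees.\n",
+    "    \n",
+    "    \n",
+    "3. stacking/blending:\n",
+    "    - Build a multi-layer model and fit a further model on the predictions of the earlier layer."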
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5.3 Theory"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 5.3.1 Simple weighted ensembling"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
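+    "**Averaging**\n",
+    "\n",
+    "1. For regression problems, a simple and direct idea is to take the mean: average the regression outputs of several models and use the result as the final prediction, thereby fusing several weak learners into a stronger one.\n",
+    "\n",
+    "2. A slight improvement is the weighted average. The weights can be set by ranking. For example, take three base models A, B and C and rank them by performance. If their ranks are 1, 2 and 3, the weights assigned to them are 3/6, 2/6 and 1/6 respectively.\n",
+    "\n",
+    "3. Averaging and weighted averaging look simple, but the more advanced algorithms that follow can be said to build on them: Bagging and Boosting both rest on the idea of fusing many weak learners into a strong one.\n",
+    "\n",
+    "4. Averaging can also be applied to the predicted probabilities in classification problems."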
+   ]
+  },
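+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The rank-based weighting in point 2 can be sketched as follows. This is a minimal illustration, not part of the original pipeline: the arrays `pred_a`, `pred_b`, `pred_c` and the ranks are made-up values, and the rule `weight = (n + 1 - rank) / (1 + 2 + ... + n)` is simply one reading of the 3/6, 2/6, 1/6 example."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "# Hypothetical predictions of three regression models A, B, C on four samples\n",
+    "pred_a = np.array([1.0, 2.0, 3.0, 4.0])\n",
+    "pred_b = np.array([1.2, 1.8, 3.3, 3.9])\n",
+    "pred_c = np.array([0.6, 2.6, 2.4, 4.8])\n",
+    "preds = np.vstack([pred_a, pred_b, pred_c])  # shape (n_models, n_samples)\n",
+    "\n",
+    "# Assumed validation ranks (1 = best model)\n",
+    "ranks = np.array([1, 2, 3])\n",
+    "n = len(ranks)\n",
+    "\n",
+    "# Rank r out of n gets weight (n + 1 - r) / (1 + 2 + ... + n)\n",
+    "weights = (n + 1 - ranks) / ranks.sum()\n",
+    "\n",
+    "simple_avg = preds.mean(axis=0)   # plain arithmetic mean\n",
+    "weighted_avg = weights @ preds    # rank-weighted mean\n",
+    "\n",
+    "print('weights:', weights)\n",
+    "print('simple average:', simple_avg)\n",
+    "print('weighted average:', weighted_avg)"
+   ]
+  },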
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
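+    "**Voting**\n",
+    "\n",
+    "1. Take a binary classification problem with 3 base models. On top of these base learners we can build a voting classifier that predicts the class receiving the most votes.\n",
+    "\n",
+    "2. Voting comes in two forms: hard voting and soft voting.\n",
+    "\n",
+    "3. Hard voting: each model votes for its predicted class label, and the class with the most votes is the final prediction. By default every model counts equally.\n",
+    "\n",
+    "4. Soft voting: the class probabilities predicted by the models are averaged, and the class with the highest average probability is the final prediction. Weights can be assigned to the models to reflect their relative importance.\n"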
+   ]
+  },
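+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To make the difference between the two voting schemes concrete, here is a small sketch using NumPy only. The probability matrix `proba` is invented for illustration (one sample, three models, two classes); it is chosen so that the two schemes disagree."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "# Hypothetical class probabilities for one sample: rows = models, columns = classes\n",
+    "proba = np.array([\n",
+    "    [0.45, 0.55],  # model 1 leans slightly towards class 1\n",
+    "    [0.40, 0.60],  # model 2 leans slightly towards class 1\n",
+    "    [0.95, 0.05],  # model 3 is very confident in class 0\n",
+    "])\n",
+    "\n",
+    "# Hard voting: every model casts one vote for its most likely class\n",
+    "votes = proba.argmax(axis=1)\n",
+    "hard_pred = np.bincount(votes, minlength=proba.shape[1]).argmax()\n",
+    "\n",
+    "# Soft voting: average the probabilities first, then pick the best class\n",
+    "soft_pred = proba.mean(axis=0).argmax()\n",
+    "\n",
+    "print('hard voting predicts class', hard_pred)\n",
+    "print('soft voting predicts class', soft_pred)"
+   ]
+  },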
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 5.3.2 stacking/blending"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Stacking"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
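+    "**Basic idea**: after training several base learners on the original training data (first layer), use their predictions as a new training set to fit a new learner (second layer).\n"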
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
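+    "**Background**: to make the mechanics easier to follow, we first assume the following data setup.\n",
+    "1. The training set has shape `10000*100` and the test set has shape `3000*100`; that is, 10000 training rows and 3000 test rows, each with 100 features. The data corresponds to a **regression problem**.\n",
+    "\n",
+    "2. The first layer uses three algorithms: XGB, LGB and NN. The second layer uses GBDT."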
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
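+    "**Algorithm walkthrough**\n",
+    "1. **Stacking, first layer**\n",
+    "\n",
+    "  1. XGB - corresponds to the `model 1` part of the figure\n",
+    "    - Input: the training set, split into 5 folds\n",
+    "    - Processing: the details are as follows\n",
+    "        - Train an XGB model on folds 1, 2, 3, 4 and predict fold 5 and the test set; call the predictions **XGB-pred-tran5** (shape `2000*1`) and **XGB-pred-test1** (shape `3000*1`).\n",
+    "        - Train an XGB model on folds 1, 2, 3, 5 and predict fold 4 and the test set; call the predictions **XGB-pred-tran4** (shape `2000*1`) and **XGB-pred-test2** (shape `3000*1`).\n",
+    "        - Train an XGB model on folds 1, 2, 4, 5 and predict fold 3 and the test set; call the predictions **XGB-pred-tran3** (shape `2000*1`) and **XGB-pred-test3** (shape `3000*1`).\n",
+    "        - Train an XGB model on folds 1, 3, 4, 5 and predict fold 2 and the test set; call the predictions **XGB-pred-tran2** (shape `2000*1`) and **XGB-pred-test4** (shape `3000*1`).\n",
+    "        - Train an XGB model on folds 2, 3, 4, 5 and predict fold 1 and the test set; call the predictions **XGB-pred-tran1** (shape `2000*1`) and **XGB-pred-test5** (shape `3000*1`).\n",
+    "    - Output:\n",
+    "        - Concatenate the XGB predictions for folds 1 to 5 to obtain **XGB-pred-tran** (shape `10000*1`). By construction of 5-fold CV, each row lines up with a row of the original data, which is why the figure labels it NEW FEATURE.\n",
+    "        - Average XGB-pred-test1 to XGB-pred-test5 (the Averaging method) to obtain **XGB-pred-test** (shape `3000*1`).\n",
+    "    \n",
+    "  2. LGB - likewise corresponds to the `model 1` part of the figure\n",
+    "    - Input: same as for XGB\n",
+    "    - Processing: same as for XGB; only the names of the predictions change, e.g. **LGB-pred-tran5** and **LGB-pred-test1**\n",
+    "    - Output:\n",
+    "        - Concatenate the LGB predictions for folds 1 to 5 to obtain **LGB-pred-tran** (shape `10000*1`).\n",
+    "        - Average LGB-pred-test1 to LGB-pred-test5 (the Averaging method) to obtain **LGB-pred-test** (shape `3000*1`).\n",
+    "        \n",
+    "  3. NN - likewise corresponds to the `model 1` part of the figure\n",
+    "    - Input: same as for XGB\n",
+    "    - Processing: same as for XGB; only the names of the predictions change, e.g. **NN-pred-tran5** and **NN-pred-test1**\n",
+    "    - Output:\n",
+    "        - Concatenate the NN predictions for folds 1 to 5 to obtain **NN-pred-tran** (shape `10000*1`).\n",
+    "        - Average NN-pred-test1 to NN-pred-test5 (the Averaging method) to obtain **NN-pred-test** (shape `3000*1`).\n",
+    "\n",
+    "2. **Stacking, second layer**\n",
+    "  - Training set: concatenate the three new features **XGB-pred-tran**, **LGB-pred-tran** and **NN-pred-tran** into a new training set (shape `10000*3`)\n",
+    "  - Test set: concatenate the three new test predictions **XGB-pred-test**, **LGB-pred-test** and **NN-pred-test** into a new test set (shape `3000*3`)\n",
+    "  - Fit the second-layer predictor, i.e. the GBDT model, on the new training set and use it to predict the new test set"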
+   ]
+  },
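+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The two layers described above can be sketched on a regression problem as follows. This is a scaled-down illustration under stated assumptions, not the setup from the text: the data is synthetic (`make_regression`, 400 training rows and 100 test rows), and `Ridge` and `DecisionTreeRegressor` stand in for XGB/LGB/NN in the first layer and for GBDT in the second layer, to keep the cell fast and dependency-free."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from sklearn.datasets import make_regression\n",
+    "from sklearn.linear_model import Ridge\n",
+    "from sklearn.model_selection import KFold\n",
+    "from sklearn.tree import DecisionTreeRegressor\n",
+    "\n",
+    "# Synthetic stand-in data: 400 training rows, 100 test rows, 10 features\n",
+    "X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)\n",
+    "X_tr, y_tr, X_te = X[:400], y[:400], X[400:]\n",
+    "\n",
+    "kf = KFold(n_splits=5, shuffle=True, random_state=0)\n",
+    "base_models = [Ridge(), DecisionTreeRegressor(max_depth=5, random_state=0)]\n",
+    "\n",
+    "# First layer: one out-of-fold column per base model\n",
+    "new_train = np.zeros((len(X_tr), len(base_models)))\n",
+    "new_test = np.zeros((len(X_te), len(base_models)))\n",
+    "for j, model in enumerate(base_models):\n",
+    "    for fit_idx, oof_idx in kf.split(X_tr):\n",
+    "        model.fit(X_tr[fit_idx], y_tr[fit_idx])\n",
+    "        # predictions for the held-out fold become the NEW FEATURE rows\n",
+    "        new_train[oof_idx, j] = model.predict(X_tr[oof_idx])\n",
+    "        # test predictions are averaged over the folds\n",
+    "        new_test[:, j] += model.predict(X_te) / kf.get_n_splits()\n",
+    "\n",
+    "# Second layer: fit a meta-model on the new features\n",
+    "meta_model = Ridge().fit(new_train, y_tr)\n",
+    "final_pred = meta_model.predict(new_test)\n",
+    "print(new_train.shape, new_test.shape, final_pred.shape)"
+   ]
+  },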
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "![Stacking diagram](https://img-blog.csdnimg.cn/20210401090352724.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NDU4NTgzOQ==,size_16,color_FFFFFF,t_70)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Blending"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
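+    "Blending is broadly the same as Stacking. The main difference is that Blending does not use a K-Fold CV strategy on the training set to produce the predictions that become the second-stage features; instead it sets aside a holdout set. In short, Blending trains the different layers on disjoint subsets of the data.\n",
+    "\n",
+    "Using the same dataset as above, we construct a two-layer Blending model.\n",
+    "\n",
+    "First split the training set into two parts (d1, d2); for example, d1 holds 4000 rows for the first layer and d2 holds 6000 rows for the second layer.\n",
+    "\n",
+    "First layer: train several models on d1 and use their predictions on d2 and test as the New Features for the second layer. For example, with the same three models as above, this yields a `6000*3` feature matrix for d2 and a `3000*3` feature matrix for test.\n",
+    "\n",
+    "Second layer: train a new model on d2's New Features and labels, then feed it test's New Features; its predictions on test are the final ensembled output.\n"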
+   ]
+  },
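+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A minimal sketch of the two-layer Blending procedure just described. The assumptions are loud ones: the data is synthetic (`make_regression`), the 4:6 split into d1/d2 mirrors the example ratio only, and `Ridge`, `RandomForestRegressor` and `LinearRegression` are placeholder models chosen for brevity, not the models named in the text."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from sklearn.datasets import make_regression\n",
+    "from sklearn.ensemble import RandomForestRegressor\n",
+    "from sklearn.linear_model import LinearRegression, Ridge\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "# Synthetic stand-in data, split into train and test\n",
+    "X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)\n",
+    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n",
+    "\n",
+    "# Split the training set into d1 (first layer) and d2 (second layer), roughly 4:6\n",
+    "X_d1, X_d2, y_d1, y_d2 = train_test_split(X_train, y_train, test_size=0.6, random_state=0)\n",
+    "\n",
+    "# First layer: fit on d1 only, predict d2 and test to build the New Features\n",
+    "base_models = [Ridge(alpha=1.0), RandomForestRegressor(n_estimators=50, random_state=0)]\n",
+    "d2_feats, test_feats = [], []\n",
+    "for model in base_models:\n",
+    "    model.fit(X_d1, y_d1)\n",
+    "    d2_feats.append(model.predict(X_d2))\n",
+    "    test_feats.append(model.predict(X_test))\n",
+    "d2_feats = np.column_stack(d2_feats)\n",
+    "test_feats = np.column_stack(test_feats)\n",
+    "\n",
+    "# Second layer: fit the blender on d2's New Features, then predict test\n",
+    "blender = LinearRegression().fit(d2_feats, y_d2)\n",
+    "final_pred = blender.predict(test_feats)\n",
+    "print(d2_feats.shape, test_feats.shape, final_pred.shape)"
+   ]
+  },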
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Pros and cons"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
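+    "Advantages of Blending:\n",
+    "\n",
+    "1. It is simpler than stacking (no k-fold cross-validation is needed to obtain the stacker features).\n",
+    "\n",
+    "2. It sidesteps an information-leakage problem: the generalizers and the stacker use different data.\n",
+    "\n",
+    "3. In team modelling, there is no need to share random seeds with teammates.\n",
+    "\n",
+    "Disadvantages:\n",
+    "\n",
+    "1. It uses much less data (the second layer is trained only on the holdout split, not on CV predictions for the whole training set).\n",
+    "\n",
+    "2. The blender may overfit (most likely a consequence of the first point).\n",
+    "\n",
+    "3. Stacking, which uses several CV folds, tends to be more robust."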
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5.4 Implementation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import warnings\n",
+    "import matplotlib\n",
+    "import matplotlib.pyplot as plt\n",
+    "import seaborn as sns\n",
+    "\n",
+    "warnings.filterwarnings('ignore')\n",
+    "%matplotlib inline\n",
+    "\n",
+    "import itertools\n",
+    "import matplotlib.gridspec as gridspec\n",
+    "import lightgbm as lgb\n",
+    "from sklearn import datasets\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.neighbors import KNeighborsClassifier\n",
+    "from sklearn.naive_bayes import GaussianNB\n",
+    "from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n",
+    "from sklearn.ensemble import AdaBoostClassifier, VotingClassifier\n",
+    "from sklearn.neural_network import MLPClassifier, MLPRegressor\n",
+    "from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split\n",
+    "from sklearn.metrics import mean_squared_error, mean_absolute_error\n",
+    "# from mlxtend.classifier import StackingClassifier\n",
+    "# from mlxtend.plotting import plot_learning_curves\n",
+    "# from mlxtend.plotting import plot_decision_regions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 5.4.1 Load data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from sklearn.metrics import classification_report, f1_score\n",
+    "from sklearn.model_selection import StratifiedKFold, KFold,train_test_split"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def reduce_mem_usage(df):\n",
+    "    start_mem = df.memory_usage().sum() / 1024**2 \n",
+    "    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))\n",
+    "    \n",
+    "    for col in df.columns:\n",
+    "        col_type = df[col].dtype\n",
+    "        \n",
+    "        if col_type != object:\n",
+    "            c_min = df[col].min()\n",
+    "            c_max = df[col].max()\n",
+    "            if str(col_type)[:3] == 'int':\n",
+    "                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:\n",
+    "                    df[col] = df[col].astype(np.int8)\n",
+    "                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:\n",
+    "                    df[col] = df[col].astype(np.int16)\n",
+    "                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:\n",
+    "                    df[col] = df[col].astype(np.int32)\n",
+    "                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:\n",
+    "                    df[col] = df[col].astype(np.int64)  \n",
+    "            else:\n",
+    "                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:\n",
+    "                    df[col] = df[col].astype(np.float16)\n",
+    "                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:\n",
+    "                    df[col] = df[col].astype(np.float32)\n",
+    "                else:\n",
+    "                    df[col] = df[col].astype(np.float64)\n",
+    "        else:\n",
+    "            df[col] = df[col].astype('category')\n",
+    "\n",
+    "    end_mem = df.memory_usage().sum() / 1024**2 \n",
+    "    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))\n",
+    "    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))\n",
+    "    \n",
+    "    return df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Memory usage of dataframe is 30.28 MB\n",
+      "Memory usage after optimization is: 7.59 MB\n",
+      "Decreased by 74.9%\n"
+     ]
+    }
+   ],
+   "source": [
+    "all_df = pd.read_csv('data/group_df.csv',index_col=0)\n",
+    "all_df = reduce_mem_usage(all_df)\n",
+    "all_df = all_df.fillna(99)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(9000, 440)"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "all_df.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       " 2    4361\n",
+       "-1    2000\n",
+       " 0    1621\n",
+       " 1    1018\n",
+       "Name: label, dtype: int64"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "all_df['label'].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
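+    "In all_df, rows with label 0/1/2 form the training set (7000 rows in total); rows with label -1 form the test set (2000 rows in total).\n",
+    "1. The test rows (label -1) have no true label; this part of the data simulates the data submitted in a real competition.\n",
+    "\n",
+    "2. All train rows are labelled. We hold out 30% of them as a validation set and use the rest for training. Models are compared on the validation set, always using the F1 score (macro-averaged in the code) as the metric.\n"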
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train = all_df[all_df['label'] != -1]\n",
+    "test =  all_df[all_df['label'] == -1]\n",
+    "feats = [c for c in train.columns if c not in ['ID', 'label']]\n",
+    "\n",
+    "# Split into training and validation sets at a 7:3 ratio\n",
+    "X_train,X_val,y_train,y_val= train_test_split(train[feats],train['label'],test_size=0.3,random_state=0)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 5.4.2 Single models and weighted ensembling"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
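+    "Here we train three single models: one RF and two differently configured LGB models. In practice, model ensembling needs diversity among the base classifiers, so one would not normally pick the same model type twice; this is for demonstration only."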
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Single-model builder functions\n",
+    "def build_model_rf(X_train,y_train):\n",
+    "    model = RandomForestClassifier(n_estimators = 100)\n",
+    "    model.fit(X_train, y_train)\n",
+    "    return model\n",
+    "\n",
+    "\n",
+    "def build_model_lgb(X_train,y_train):\n",
+    "    model = lgb.LGBMClassifier(num_leaves=127,learning_rate = 0.1,n_estimators = 200)\n",
+    "    model.fit(X_train, y_train)\n",
+    "    return model\n",
+    "\n",
+    "\n",
+    "def build_model_lgb2(X_train,y_train):\n",
+    "    model = lgb.LGBMClassifier(num_leaves=63,learning_rate = 0.05,n_estimators = 400)\n",
+    "    model.fit(X_train, y_train)\n",
+    "    return model\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "predict rf ...\n",
+      "0.8987051046527208\n",
+      "predict lgb...\n",
+      "0.9144414270113281\n",
+      "predict lgb 2...\n",
+      "0.9183965870229657\n"
+     ]
+    }
+   ],
+   "source": [
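+    "# Train the three single models; subA_rf/subA_lgb/subA_lgb2 are predictions that could each be submitted\n",
+    "# The single models are not tuned, so treat them as weak classifiers whose scores may not be great.\n",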
+    "\n",
+    "print('predict rf ...')\n",
+    "model_rf = build_model_rf(X_train,y_train)\n",
+    "val_rf = model_rf.predict(X_val)\n",
+    "subA_rf = model_rf.predict(test[feats])\n",
+    "rf_f1_score = f1_score(y_val,val_rf,average='macro')\n",
+    "print(rf_f1_score)\n",
+    "\n",
+    "print('predict lgb...')\n",
+    "model_lgb = build_model_lgb(X_train,y_train)\n",
+    "val_lgb = model_lgb.predict(X_val)\n",
+    "subA_lgb = model_lgb.predict(test[feats])\n",
+    "lgb_f1_score = f1_score(y_val,val_lgb,average='macro')\n",
+    "print(lgb_f1_score)\n",
+    "\n",
+    "\n",
+    "print('predict lgb 2...')\n",
+    "model_lgb2 = build_model_lgb2(X_train,y_train)\n",
+    "val_lgb2 = model_lgb2.predict(X_val)\n",
+    "subA_lgb2 = model_lgb2.predict(test[feats])\n",
+    "lgb2_f1_score = f1_score(y_val,val_lgb2,average='macro')\n",
+    "print(lgb2_f1_score)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0.9142736444973326\n"
+     ]
+    }
+   ],
+   "source": [
+    "voting_clf = VotingClassifier(estimators=[('rf',model_rf ),\n",
+    "                                          ('lgb',model_lgb),\n",
+    "                                          ('lgb2',model_lgb2 )],voting='hard')\n",
+    "\n",
+    "voting_clf.fit(X_train,y_train)\n",
+    "val_voting = voting_clf.predict(X_val)\n",
+    "subA_voting = voting_clf.predict(test[feats])\n",
+    "voting_f1_score = f1_score(y_val,val_voting,average='macro')\n",
+    "print(voting_f1_score)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 5.4.3 Stacking"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
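+    "_N_FOLDS = 5  # use 5-fold cross-validation\n",
+    "# sklearn's cross-validation splitter; recent scikit-learn versions require shuffle=True when random_state is set\n",
+    "kf = KFold(n_splits=_N_FOLDS, shuffle=True, random_state=42)\n",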
+    "\n",
+    "\n",
+    "def get_oof(clf, X_train, y_train, X_test):\n",
+    "    oof_train = np.zeros((X_train.shape[0], 1)) \n",
+    "    oof_test_skf = np.empty((_N_FOLDS, X_test.shape[0], 1))  \n",
+    "    \n",
+    "    for i, (train_index, test_index) in enumerate(kf.split(X_train)): # split into this fold's training and validation parts\n",
+    "        kf_X_train = X_train.iloc[train_index,]\n",
+    "        kf_y_train = y_train.iloc[train_index,]\n",
+    "        kf_X_val = X_train.iloc[test_index,]\n",
+    "        \n",
+    "        clf.fit(kf_X_train, kf_y_train)\n",
+    " \n",
+    "        oof_train[test_index] = clf.predict(kf_X_val).reshape(-1, 1) \n",
+    "        oof_test_skf[i, :] = clf.predict(X_test).reshape(-1, 1)  \n",
+    " \n",
+    "    oof_test = oof_test_skf.mean(axis=0)  # average the test predictions over the folds\n",
+    "    return oof_train, oof_test  # this classifier's predictions for the training set and the test set"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Call get_oof for each classifier and concatenate the results to obtain the new training and test data, new_train and new_test\n",
+    "new_train, new_test = [], []\n",
+    "\n",
+    "\n",
+    "model1 = RandomForestClassifier(n_estimators = 100)\n",
+    "model2 = lgb.LGBMClassifier(num_leaves=127,learning_rate = 0.1,n_estimators = 200)\n",
+    "model3 = lgb.LGBMClassifier(num_leaves=63,learning_rate = 0.05,n_estimators = 400)\n",
+    "\n",
+    "for clf in [model1, model2, model3]:\n",
+    "    oof_train, oof_test = get_oof(clf, X_train, y_train, X_val)\n",
+    "    new_train.append(oof_train)\n",
+    "    new_test.append(oof_test)\n",
+    "    \n",
+    "new_train = np.concatenate(new_train, axis=1)\n",
+    "new_test = np.concatenate(new_test, axis=1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0.8816601744239989\n"
+     ]
+    }
+   ],
+   "source": [
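+    "# Use new_train as the input of a new model: the second layer of stacking\n",
+    "# LogisticRegression is used as the second layer to guard against overfitting\n",
+    "# The models used here still need tuning, so the ensemble does not perform very well\n",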
+    "clf = LogisticRegression()\n",
+    "clf.fit(new_train, y_train)\n",
+    "result = clf.predict(new_test)\n",
+    "\n",
+    "stacking_f1_score = f1_score(y_val,result,average='macro')\n",
+    "print(stacking_f1_score)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5.5 Questions for reflection"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
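+    "1. How can stacking be adapted into blending? Hint: stacking uses K-fold CV, blending uses a holdout set.\n",
+    "\n",
+    "2. What further optimizations could raise the F1 score of stacking? Consider the number of first-layer models and the diversity among them."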
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**References**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "https://blog.csdn.net/weixin_44585839/article/details/110148396\n",
+    "\n",
+    "https://blog.csdn.net/weixin_39962758/article/details/111101263"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "**END.**\n",
+    "\n",
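+    "[张晋: Datawhale member and algorithm competition enthusiast. CSDN: https://blog.csdn.net/weixin_44585839/]\n",
+    "\n",
+    "\n",
+    "\n",
+    "About Datawhale:\n",
+    "\n",
+    "> Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and gathers team members who share an open-source and exploratory spirit. With the vision “for the learner, growing together with learners”, Datawhale encourages people to present themselves authentically, to be open and inclusive, to trust and help one another, to dare to experiment and to take responsibility. Datawhale also applies open-source ideas to open content, open learning and open solutions, supporting talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future.\n",
+    "\n",
+    "The topic material for this data mining learning path will be shared on Tianchi; follow Datawhale for details:\n",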
+    "\n",
+    "![logo.png](https://img-blog.csdnimg.cn/2020090509294089.png)"
+   ]
+  }
+ ],
+ "metadata": {
+  "hide_input": false,
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {},
+   "toc_section_display": true,
+   "toc_window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/地理数据分析常用工具.ipynb
+++ b/地理数据分析常用工具.ipynb
@@ -1116,7 +1116,7 @@
    "    tmp_df['lat'] = tmp_df['lat'].astype(float)\n",
    "    tmp_df['lon'] = tmp_df['lon'].astype(float)\n",
    "    tmp_df['speed'] = tmp_df['speed'].astype(float)\n",
-    "    tmp_df['direction'] = tmp_df['direction'].astype(int)\n",
+    "    tmp_df['direction'] = tmp_df['direction'].astype(int)  # if this line fails, try upgrading pandas\n",
    "    return tmp_df\n",
    "# Convert planar coordinates to latitude/longitude, for the preliminary-round data\n",
    "# The chosen standard is NAD83 / California zone 6 (ftUS) (EPSG:2230); lookup link: https://mygeodata.cloud/cs2cs/\n",
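@@ -1674,6 +1674,20 @@
    "Advanced exercise:   \n",
    "2. In this module we introduced various libraries and their commonly used methods. If you can, try to keep, in the outlier-cleaned data (DF), only the key points identified by the douglas-peucker algorithm, drop the points that douglas-peucker does not retain, and then try to geohash-encode these coordinates"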
   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# References"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "https://tianchi.aliyun.com/forum/postDetail?spm=5176.12586969.1002.3.163c24d1HiGiFo&postId=110644"
+   ]
  }
 ],
 "metadata": {
--- a/数据分析.ipynb
+++ b/数据分析.ipynb
@@ -1556,6 +1556,7 @@
  }
 ],
 "metadata": {
+  "hide_input": false,
  "kernelspec": {
   "display_name": "Python [conda env:seacom]",
   "language": "python",
--- a/特征工程.ipynb
+++ b/特征工程.ipynb
@@ -3560,6 +3560,7 @@
  }
 ],
 "metadata": {
+  "hide_input": false,
  "kernelspec": {
   "display_name": "Python [conda env:all] *",
   "language": "python",
--- a/wisdomOcean/readme.md
+++ b/wisdomOcean/readme.md
@@ -37,9 +37,8 @@ https://tianchi.aliyun.com/competition/entrance/231768/information

 II. Competition data and the attachment data for the introduction to common geographic data analysis tools

-Link: https://pan.baidu.com/s/1AEWhNkSzx6Ls8XmVXFOQMg
-
-Extraction code: wrgg 
+Link: https://pan.xunlei.com/s/VMX5JAhFN7ZmPaaCVsHQEVkrA1
+Extraction code: hmtz

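 Of the competition data, this team-learning session only uses the files hy_round1_testA_20200102 and hy_round1_train_20200102. DF.csv and df_gpd_change.pkl are the data needed in Task1: DF.csv is the trajectory data after outlier handling, and df_gpd_change.pkl is that outlier-handled data after compression with the douglas-peucker algorithm.  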

--- a/模型融合.ipynb
+++ b/模型融合.ipynb
@@ -670,16 +670,10 @@
    "\n",
    "![logo.png](https://img-blog.csdnimg.cn/2020090509294089.png)"
   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
  }
 ],
 "metadata": {
+  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
@@ -696,6 +690,19 @@
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {},
+   "toc_section_display": true,
+   "toc_window_display": false
  }
 },
 "nbformat": 4,