lyj_commit
This commit is contained in:
1733
wisdomOcean/.ipynb_checkpoints/Task1 地理数据分析常用工具-checkpoint.ipynb
Normal file
1733
wisdomOcean/.ipynb_checkpoints/Task1 地理数据分析常用工具-checkpoint.ipynb
Normal file
File diff suppressed because one or more lines are too long
1593
wisdomOcean/.ipynb_checkpoints/Task2 数据分析-checkpoint.ipynb
Normal file
1593
wisdomOcean/.ipynb_checkpoints/Task2 数据分析-checkpoint.ipynb
Normal file
File diff suppressed because one or more lines are too long
3602
wisdomOcean/.ipynb_checkpoints/Task3 特征工程-checkpoint.ipynb
Normal file
3602
wisdomOcean/.ipynb_checkpoints/Task3 特征工程-checkpoint.ipynb
Normal file
File diff suppressed because it is too large
Load Diff
710
wisdomOcean/.ipynb_checkpoints/Task5 模型融合-checkpoint.ipynb
Normal file
710
wisdomOcean/.ipynb_checkpoints/Task5 模型融合-checkpoint.ipynb
Normal file
@@ -0,0 +1,710 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Datawhale 智慧海洋建设-Task5 模型融合"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 5.1 学习目标"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"学习融合策略\n",
|
||||
"\n",
|
||||
"完成相应学习打卡任务"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 5.2 内容介绍"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"https://mlwave.com/kaggle-ensembling-guide/ \n",
|
||||
"https://github.com/MLWave/Kaggle-Ensemble-Guide"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"模型融合是比赛后期一个重要的环节,大体来说有如下的类型方式。\n",
|
||||
"\n",
|
||||
"1. 简单加权融合:\n",
|
||||
" - 回归(分类概率):算术平均融合(Arithmetic mean),几何平均融合(Geometric mean);\n",
|
||||
" - 分类:投票(Voting)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"2. boosting/bagging(在xgboost,Adaboost,GBDT中已经用到):\n",
|
||||
" - 多树的提升方法\n",
|
||||
" \n",
|
||||
" \n",
|
||||
"3. stacking/blending:\n",
|
||||
" - 构建多层模型,并利用预测结果再拟合预测。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 5.3 相关理论介绍"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 5.3.1 简单加权融合"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**平均法-Averaging**\n",
|
||||
"\n",
|
||||
"1. 对于回归问题,一个简单直接的思路是取平均。将多个模型的回归结果取平均值作为最终预测结果,进而把多个弱分类器荣和城强分类器。\n",
|
||||
"\n",
|
||||
"2. 稍稍改进的方法是进行加权平均,权值可以用排序的方法确定,举个例子,比如A、B、C三种基本模型,模型效果进行排名,假设排名分别是1,2,3,那么给这三个模型赋予的权值分别是3/6、2/6、1/6。\n",
|
||||
"\n",
|
||||
"3. 平均法或加权平均法看似简单,其实后面的高级算法也可以说是基于此而产生的,Bagging或者Boosting都是一种把许多弱分类器这样融合成强分类器的思想。\n",
|
||||
"\n",
|
||||
"4. Averaging也可以用于对分类问题的概率进行平均。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**投票法-voting**\n",
|
||||
"\n",
|
||||
"1. 对于一个二分类问题,有3个基础模型,现在我们可以在这些基学习器的基础上得到一个投票的分类器,把票数最多的类作为我们要预测的类别。\n",
|
||||
"\n",
|
||||
"2. 投票法有硬投票(hard voting)和软投票(soft voting)\n",
|
||||
"\n",
|
||||
"3. 硬投票: 对多个模型直接进行投票,不区分模型结果的相对重要度,最终投票数最多的类为最终被预测的类。\n",
|
||||
"\n",
|
||||
"4. 软投票:增加了设置权重的功能,可以为不同模型设置不同权重,进而区别模型不同的重要度。\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 5.3.2 stacking/blending"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### 堆叠法-stacking "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"\n",
|
||||
"**基本思想**:用初始训练数据学习出若干个基学习器后,将这几个学习器的预测结果作为新的训练集(第一层),来学习一个新的学习器(第二层)。\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**背景**: 为了帮助大家理解模型的原理,我们先假定一下数据背景。\n",
|
||||
"1. 训练集数据大小为`10000*100`,测试集大小为`3000*100`。即训练集有10000条数据、100个特征;测试集有3000条数据、100个特征。该数据对应**回归问题**。\n",
|
||||
"\n",
|
||||
"2. 第一层使用三种算法-XGB、LGB、NN。第二层使用GBDT。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**算法解读**\n",
|
||||
"1. **stacking 第一层**\n",
|
||||
"\n",
|
||||
" 1. XGB算法 - 对应图中`model 1`部分\n",
|
||||
" - 输入:使用训练集进行5-fold处理\n",
|
||||
" - 处理:具体处理细节如下\n",
|
||||
" - 使用1、2、3、4折作为训练集,训练一个XGB模型并预测第5折和测试集,将预测结果分别称为**XGB-pred-tran5**(shape `2000*1`)和**XGB-pred-test1**(shape `3000*1`).\n",
|
||||
" - 使用1、2、3、5折作为训练集,训练一个XGB模型并预测第4折和测试集,将预测结果分别称为**XGB-pred-tran4**(shape `2000*1`)和**XGB-pred-test2**(shape `3000*1`).\n",
|
||||
" - 使用1、2、4、5折作为训练集,训练一个XGB模型并预测第3折和测试集,将预测结果分别称为**XGB-pred-tran3**(shape `2000*1`)和**XGB-pred-test3**(shape `3000*1`).\n",
|
||||
" - 使用1、3、4、5折作为训练集,训练一个XGB模型并预测第2折和测试集,将预测结果分别称为**XGB-pred-tran2**(shape `2000*1`)和**XGB-pred-test4**(shape `3000*1`).\n",
|
||||
" - 使用2、3、4、5折作为训练集,训练一个XGB模型并预测第1折和测试集,将预测结果分别称为**XGB-pred-tran1**(shape `2000*1`)和**XGB-pred-test5**(shape `3000*1`).\n",
|
||||
" - 输出:\n",
|
||||
" - 将XGB分别对1、2、3、4、5折进行预测的结果合并,得到**XGB-pred-tran**(shape `10000*1`)。并且根据5-fold的原理可以知道,与原数据可以形成对应关系。因此在图中称为NEW FEATURE。\n",
|
||||
" - 将XGB-pred-test1 - 5 的结果使用Averaging的方法求平均值,最终得到**XGB-pred-test**(shape `3000*1`)。\n",
|
||||
" \n",
|
||||
" 2. LGB算法 - 同样对应图中`model 1`部分\n",
|
||||
" - 输入:与XGB算法一致\n",
|
||||
" - 处理:与XGB算法一致。只需更改预测结果的命名即可,如**LGB-pred-tran5**和**LGB-pred-test1**\n",
|
||||
" - 输出:\n",
|
||||
" - 将LGB分别对1、2、3、4、5折进行预测的结果合并,得到**LGB-pred-tran**(shape `10000*1`)。\n",
|
||||
" - 将LGB-pred-test1 - 5 的结果使用Averaging的方法求平均值,最终得到**LGB-pred-test**(shape `3000*1`)。\n",
|
||||
" \n",
|
||||
" 3. NN算法 - 同样对应图中`model 1`部分\n",
|
||||
" - 输入:与XGB算法一致\n",
|
||||
" - 处理:与XGB算法一致。只需更改预测结果的命名即可,如**NN-pred-tran5**和**NN-pred-test1**\n",
|
||||
" - 输出:\n",
|
||||
" - 将NN分别对1、2、3、4、5折进行预测的结果合并,得到**NN-pred-tran**(shape `10000*1`)。\n",
|
||||
" - 将NN-pred-test1 - 5 的结果使用Averaging的方法求平均值,最终得到**NN-pred-test**(shape `3000*1`)。\n",
|
||||
"\n",
|
||||
"2. **stacking 第二层**\n",
|
||||
" - 训练集:将三个新特征 **XGB-pred-tran**、**LGB-pred-tran**、**NN-pred-tran**合并得到新的训练集(shape `10000*3`)\n",
|
||||
" - 测试集:将三个新测试集**XGB-pred-test**、**LGB-pred-test**、**NN-pred-test**合并得到新的测试集(shape `30000*3`)\n",
|
||||
" - 用新训练集和测试集构造第二层的预测器,即GBDT模型"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### 混合法 - blending"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Blending与Stacking大致相同,只是Blending的主要区别在于训练集不是通过K-Fold的CV策略来获得预测值从而生成第二阶段模型的特征,而是建立一个Holdout集。简单来说,Blending直接用不相交的数据集用于不同层的训练。\n",
|
||||
"\n",
|
||||
"同样以上述数据集为例,构造一个两层的Blending模型。\n",
|
||||
"\n",
|
||||
"首先将训练集划分为两部分(d1,d2),例如d1为4000条数据用于blending的第一层,d2是6000条数据用于blending的第二层。\n",
|
||||
"\n",
|
||||
"第一层:用d1训练多个模型,将其对d2和test的预测结果作为第二层的New Features。例如同样适用上述三个模型,对d2生成`6000*3`的新特征数据;对test生成`3000*3`的新特征矩阵。\n",
|
||||
"\n",
|
||||
"第二层:用d2的New Features和标签训练新的分类器,然后把test的New Features输入作为最终的测试集,对test预测出的结果就是最终的模型融合的值。\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### 优缺点对比"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Blending的优点在于:\n",
|
||||
"\n",
|
||||
"1. 比stacking简单(因为不用进行k次的交叉验证来获得stacker feature)\n",
|
||||
"\n",
|
||||
"2. 避开了一个信息泄露问题:generlizers和stacker使用了不一样的数据集\n",
|
||||
"\n",
|
||||
"3. 在团队建模过程中,不需要给队友分享自己的随机种子\n",
|
||||
"\n",
|
||||
"而缺点在于:\n",
|
||||
"\n",
|
||||
"1. 使用了很少的数据(是划分hold-out作为测试集,并非cv)\n",
|
||||
"\n",
|
||||
"2. blender可能会过拟合(其实大概率是第一点导致的)\n",
|
||||
"\n",
|
||||
"3. stacking使用多次的CV会比较稳健"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 5.4 代码实现"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"import numpy as np\n",
|
||||
"import warnings\n",
|
||||
"import matplotlib\n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"import seaborn as sns\n",
|
||||
"\n",
|
||||
"warnings.filterwarnings('ignore')\n",
|
||||
"%matplotlib inline\n",
|
||||
"\n",
|
||||
"import itertools\n",
|
||||
"import matplotlib.gridspec as gridspec\n",
|
||||
"from sklearn import datasets\n",
|
||||
"from sklearn.linear_model import LogisticRegression\n",
|
||||
"from sklearn.neighbors import KNeighborsClassifier\n",
|
||||
"from sklearn.naive_bayes import GaussianNB \n",
|
||||
"from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor\n",
|
||||
"from sklearn.linear_model import LogisticRegression\n",
|
||||
"# from mlxtend.classifier import StackingClassifier\n",
|
||||
"from sklearn.model_selection import cross_val_score, train_test_split\n",
|
||||
"# from mlxtend.plotting import plot_learning_curves\n",
|
||||
"# from mlxtend.plotting import plot_decision_regions\n",
|
||||
"\n",
|
||||
"from sklearn.model_selection import StratifiedKFold\n",
|
||||
"from sklearn.model_selection import train_test_split\n",
|
||||
"from sklearn.model_selection import StratifiedKFold\n",
|
||||
"from sklearn.model_selection import train_test_split\n",
|
||||
"from sklearn.ensemble import AdaBoostClassifier\n",
|
||||
"from sklearn.ensemble import VotingClassifier\n",
|
||||
"import lightgbm as lgb\n",
|
||||
"from sklearn.neural_network import MLPClassifier,MLPRegressor\n",
|
||||
"from sklearn.metrics import mean_squared_error, mean_absolute_error"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 5.4.1 load data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"import numpy as np\n",
|
||||
"from sklearn.metrics import classification_report, f1_score\n",
|
||||
"from sklearn.model_selection import StratifiedKFold, KFold,train_test_split"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def reduce_mem_usage(df):\n",
|
||||
" start_mem = df.memory_usage().sum() / 1024**2 \n",
|
||||
" print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))\n",
|
||||
" \n",
|
||||
" for col in df.columns:\n",
|
||||
" col_type = df[col].dtype\n",
|
||||
" \n",
|
||||
" if col_type != object:\n",
|
||||
" c_min = df[col].min()\n",
|
||||
" c_max = df[col].max()\n",
|
||||
" if str(col_type)[:3] == 'int':\n",
|
||||
" if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:\n",
|
||||
" df[col] = df[col].astype(np.int8)\n",
|
||||
" elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:\n",
|
||||
" df[col] = df[col].astype(np.int16)\n",
|
||||
" elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:\n",
|
||||
" df[col] = df[col].astype(np.int32)\n",
|
||||
" elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:\n",
|
||||
" df[col] = df[col].astype(np.int64) \n",
|
||||
" else:\n",
|
||||
" if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:\n",
|
||||
" df[col] = df[col].astype(np.float16)\n",
|
||||
" elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:\n",
|
||||
" df[col] = df[col].astype(np.float32)\n",
|
||||
" else:\n",
|
||||
" df[col] = df[col].astype(np.float64)\n",
|
||||
" else:\n",
|
||||
" df[col] = df[col].astype('category')\n",
|
||||
"\n",
|
||||
" end_mem = df.memory_usage().sum() / 1024**2 \n",
|
||||
" print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))\n",
|
||||
" print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))\n",
|
||||
" \n",
|
||||
" return df"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Memory usage of dataframe is 30.28 MB\n",
|
||||
"Memory usage after optimization is: 7.59 MB\n",
|
||||
"Decreased by 74.9%\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"all_df = pd.read_csv('data/group_df.csv',index_col=0)\n",
|
||||
"all_df = reduce_mem_usage(all_df)\n",
|
||||
"all_df = all_df.fillna(99)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"(9000, 440)"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"all_df.shape"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
" 2 4361\n",
|
||||
"-1 2000\n",
|
||||
" 0 1621\n",
|
||||
" 1 1018\n",
|
||||
"Name: label, dtype: int64"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"all_df['label'].value_counts()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"\n",
|
||||
"all_df中label为0/1/2的为训练集,一共有7000条;label为-1的为测试集,一共有2000条。\n",
|
||||
"1. label为-1的测试集没有label,这部分数据用于模拟真实比赛提交数据。\n",
|
||||
"\n",
|
||||
"2. train数据均有标签,我们将从中分出30%作为验证集,其余作为训练集。在验证集上比较模型性能优劣,模型性能均使用f1作为评分。\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train = all_df[all_df['label'] != -1]\n",
|
||||
"test = all_df[all_df['label'] == -1]\n",
|
||||
"feats = [c for c in train.columns if c not in ['ID', 'label']]\n",
|
||||
"\n",
|
||||
"# 根据7:3划分训练集和测试集\n",
|
||||
"X_train,X_val,y_train,y_val= train_test_split(train[feats],train['label'],test_size=0.3,random_state=0)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 5.4.2 单模及加权融合"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"这里训练三个单模,分别是用了一个三种不同的RF/LGB/LGB模型。事实上模型融合需要基础分类器之间存在差异,一般不会选用相同的分类器模型。这里只是作为展示。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# 单模函数\n",
|
||||
"def build_model_rf(X_train,y_train):\n",
|
||||
" model = RandomForestClassifier(n_estimators = 100)\n",
|
||||
" model.fit(X_train, y_train)\n",
|
||||
" return model\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def build_model_lgb(X_train,y_train):\n",
|
||||
" model = lgb.LGBMClassifier(num_leaves=127,learning_rate = 0.1,n_estimators = 200)\n",
|
||||
" model.fit(X_train, y_train)\n",
|
||||
" return model\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def build_model_lgb2(X_train,y_train):\n",
|
||||
" model = lgb.LGBMClassifier(num_leaves=63,learning_rate = 0.05,n_estimators = 400)\n",
|
||||
" model.fit(X_train, y_train)\n",
|
||||
" return model\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"predict rf ...\n",
|
||||
"0.8987051046527208\n",
|
||||
"predict lgb...\n",
|
||||
"0.9144414270113281\n",
|
||||
"predict lgb 2...\n",
|
||||
"0.9183965870229657\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# 这里针对三个单模进行训练,其中subA_rf/lgb/nn都是可以提交的模型\n",
|
||||
"# 单模没有进行调参,因此是弱分类器,效果可能不是很好。\n",
|
||||
"\n",
|
||||
"print('predict rf ...')\n",
|
||||
"model_rf = build_model_rf(X_train,y_train)\n",
|
||||
"val_rf = model_rf.predict(X_val)\n",
|
||||
"subA_rf = model_rf.predict(test[feats])\n",
|
||||
"rf_f1_score = f1_score(y_val,val_rf,average='macro')\n",
|
||||
"print(rf_f1_score)\n",
|
||||
"\n",
|
||||
"print('predict lgb...')\n",
|
||||
"model_lgb = build_model_lgb(X_train,y_train)\n",
|
||||
"val_lgb = model_lgb.predict(X_val)\n",
|
||||
"subA_lgb = model_lgb.predict(test[feats])\n",
|
||||
"lgb_f1_score = f1_score(y_val,val_lgb,average='macro')\n",
|
||||
"print(lgb_f1_score)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"print('predict lgb 2...')\n",
|
||||
"model_lgb2 = build_model_lgb2(X_train,y_train)\n",
|
||||
"val_lgb2 = model_lgb2.predict(X_val)\n",
|
||||
"subA_lgb2 = model_lgb2.predict(test[feats])\n",
|
||||
"lgb2_f1_score = f1_score(y_val,val_lgb2,average='macro')\n",
|
||||
"print(lgb2_f1_score)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"0.9142736444973326\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"voting_clf = VotingClassifier(estimators=[('rf',model_rf ),\n",
|
||||
" ('lgb',model_lgb),\n",
|
||||
" ('lgb2',model_lgb2 )],voting='hard')\n",
|
||||
"\n",
|
||||
"voting_clf.fit(X_train,y_train)\n",
|
||||
"val_voting = voting_clf.predict(X_val)\n",
|
||||
"subA_voting = voting_clf.predict(test[feats])\n",
|
||||
"voting_f1_score = f1_score(y_val,val_voting,average='macro')\n",
|
||||
"print(voting_f1_score)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 5.4.3 Stacking融合"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"_N_FOLDS = 5 # 采用5折交叉验证\n",
|
||||
"kf = KFold(n_splits=_N_FOLDS, random_state=42) # sklearn的交叉验证模块,用于划分数据\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def get_oof(clf, X_train, y_train, X_test):\n",
|
||||
" oof_train = np.zeros((X_train.shape[0], 1)) \n",
|
||||
" oof_test_skf = np.empty((_N_FOLDS, X_test.shape[0], 1)) \n",
|
||||
" \n",
|
||||
" for i, (train_index, test_index) in enumerate(kf.split(X_train)): # 交叉验证划分此时的训练集和验证集\n",
|
||||
" kf_X_train = X_train.iloc[train_index,]\n",
|
||||
" kf_y_train = y_train.iloc[train_index,]\n",
|
||||
" kf_X_val = X_train.iloc[test_index,]\n",
|
||||
" \n",
|
||||
" clf.fit(kf_X_train, kf_y_train)\n",
|
||||
" \n",
|
||||
" oof_train[test_index] = clf.predict(kf_X_val).reshape(-1, 1) \n",
|
||||
" oof_test_skf[i, :] = clf.predict(X_test).reshape(-1, 1) \n",
|
||||
" \n",
|
||||
" oof_test = oof_test_skf.mean(axis=0) # 对每一则交叉验证的结果取平均\n",
|
||||
" return oof_train, oof_test # 返回当前分类器对训练集和测试集的预测结果"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# 将你的每个分类器都调用get_oof函数,并把它们的结果合并,就得到了新的训练和测试数据new_train,new_test\n",
|
||||
"new_train, new_test = [], []\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"model1 = RandomForestClassifier(n_estimators = 100)\n",
|
||||
"model2 = lgb.LGBMClassifier(num_leaves=127,learning_rate = 0.1,n_estimators = 200)\n",
|
||||
"model3 = lgb.LGBMClassifier(num_leaves=63,learning_rate = 0.05,n_estimators = 400)\n",
|
||||
"\n",
|
||||
"for clf in [model1, model2, model3]:\n",
|
||||
" oof_train, oof_test = get_oof(clf, X_train, y_train, X_val)\n",
|
||||
" new_train.append(oof_train)\n",
|
||||
" new_test.append(oof_test)\n",
|
||||
" \n",
|
||||
"new_train = np.concatenate(new_train, axis=1)\n",
|
||||
"new_test = np.concatenate(new_test, axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"0.8816601744239989\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# 用新的训练数据new_train作为新的模型的输入,stacking第二层\n",
|
||||
"# 使用LogisticRegression作为第二层是为了防止模型过拟合\n",
|
||||
"# 这里使用的模型还有待优化,因此模型融合效果并不是很好\n",
|
||||
"clf = LogisticRegression()\n",
|
||||
"clf.fit(new_train, y_train)\n",
|
||||
"result = clf.predict(new_test)\n",
|
||||
"\n",
|
||||
"stacking_f1_score = f1_score(y_val,result,average='macro')\n",
|
||||
"print(stacking_f1_score)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 5.5 思考题"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. 如何基于stacking改进出blending - stacking使用了foldCV,blending使用了holdout.\n",
|
||||
"\n",
|
||||
"2. stacking还可以进行哪些优化提升F1-score - 从第一层模型数量?模型差异性?角度出发"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**参考内容**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"https://blog.csdn.net/weixin_44585839/article/details/110148396\n",
|
||||
"\n",
|
||||
"https://blog.csdn.net/weixin_39962758/article/details/111101263"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"\n",
|
||||
"**END.**\n",
|
||||
"\n",
|
||||
"【 张晋 :Datawhale成员,算法竞赛爱好者。CSDN:https://blog.csdn.net/weixin_44585839/】\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"关于Datawhale:\n",
|
||||
"\n",
|
||||
"> Datawhale是一个专注于数据科学与AI领域的开源组织,汇集了众多领域院校和知名企业的优秀学习者,聚合了一群有开源精神和探索精神的团队成员。Datawhale 以“for the learner,和学习者一起成长”为愿景,鼓励真实地展现自我、开放包容、互信互助、敢于试错和勇于担当。同时 Datawhale 用开源的理念去探索开源内容、开源学习和开源方案,赋能人才培养,助力人才成长,建立起人与人,人与知识,人与企业和人与未来的联结。\n",
|
||||
"\n",
|
||||
"本次数据挖掘路径学习,专题知识将在天池分享,详情可关注Datawhale:\n",
|
||||
"\n",
|
||||
""
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"hide_input": false,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.5"
|
||||
},
|
||||
"toc": {
|
||||
"base_numbering": 1,
|
||||
"nav_menu": {},
|
||||
"number_sections": true,
|
||||
"sideBar": true,
|
||||
"skip_h1_title": false,
|
||||
"title_cell": "Table of Contents",
|
||||
"title_sidebar": "Contents",
|
||||
"toc_cell": false,
|
||||
"toc_position": {},
|
||||
"toc_section_display": true,
|
||||
"toc_window_display": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
||||
@@ -1116,7 +1116,7 @@
|
||||
" tmp_df['lat'] = tmp_df['lat'].astype(float)\n",
|
||||
" tmp_df['lon'] = tmp_df['lon'].astype(float)\n",
|
||||
" tmp_df['speed'] = tmp_df['speed'].astype(float)\n",
|
||||
" tmp_df['direction'] = tmp_df['direction'].astype(int)\n",
|
||||
" tmp_df['direction'] = tmp_df['direction'].astype(int)#如果该行代码运行失败,请尝试更新pandas的版本\n",
|
||||
" return tmp_df\n",
|
||||
"# 平面坐标转经纬度,供初赛数据使用\n",
|
||||
"# 选择标准为NAD83 / California zone 6 (ftUS) (EPSG:2230),查询链接:https://mygeodata.cloud/cs2cs/\n",
|
||||
@@ -1674,6 +1674,20 @@
|
||||
"进阶作业: \n",
|
||||
"2.在这个模块中,我们介绍了各种库以及他们常用的方法。如果可以,请同学们尝试在原有剔除异常点的数据(DF)中保留douglas-peucker算法所识别的关键点的数据,删除douglas-peucker未保存的数据,并尝试对这些坐标点进行geohash编码"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 参考内容"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"https://tianchi.aliyun.com/forum/postDetail?spm=5176.12586969.1002.3.163c24d1HiGiFo&postId=110644"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
|
||||
@@ -1556,6 +1556,7 @@
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"hide_input": false,
|
||||
"kernelspec": {
|
||||
"display_name": "Python [conda env:seacom]",
|
||||
"language": "python",
|
||||
|
||||
@@ -3560,6 +3560,7 @@
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"hide_input": false,
|
||||
"kernelspec": {
|
||||
"display_name": "Python [conda env:all] *",
|
||||
"language": "python",
|
||||
|
||||
@@ -37,9 +37,8 @@ https://tianchi.aliyun.com/competition/entrance/231768/information
|
||||
|
||||
二、比赛数据和地理数据分析常用工具介绍中的附件数据
|
||||
|
||||
链接:https://pan.baidu.com/s/1AEWhNkSzx6Ls8XmVXFOQMg
|
||||
|
||||
提取码:wrgg
|
||||
链接:https://pan.xunlei.com/s/VMX5JAhFN7ZmPaaCVsHQEVkrA1
|
||||
提取码:hmtz
|
||||
|
||||
比赛数据在本次组队学习中只用到了hy_round1_testA_20200102与hy_round1_train_20200102文件。其中DF.csv和df_gpd_change.pkl 分别是Task1中所需要的数据。 其中DF.csv是将轨迹数据进行异常处理之后的数据,而df_gpd_change.pkl是将异常处理之后的数据进行douglas-peucker算法进行压缩之后的数据。
|
||||
|
||||
|
||||
@@ -670,16 +670,10 @@
|
||||
"\n",
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"hide_input": false,
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
@@ -696,6 +690,19 @@
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.5"
|
||||
},
|
||||
"toc": {
|
||||
"base_numbering": 1,
|
||||
"nav_menu": {},
|
||||
"number_sections": true,
|
||||
"sideBar": true,
|
||||
"skip_h1_title": false,
|
||||
"title_cell": "Table of Contents",
|
||||
"title_sidebar": "Contents",
|
||||
"toc_cell": false,
|
||||
"toc_position": {},
|
||||
"toc_section_display": true,
|
||||
"toc_window_display": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
||||
Reference in New Issue
Block a user