lyj_commit

liyunjia
2021-04-06 16:40:09 +08:00
parent 7842d69867
commit 2191631ba1
9 changed files with 7671 additions and 11 deletions

@@ -0,0 +1,710 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Datawhale Smart Ocean Construction - Task 5: Model Ensembling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5.1 Learning Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Learn model ensembling strategies\n",
"\n",
"Complete the corresponding check-in assignment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5.2 Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://mlwave.com/kaggle-ensembling-guide/ \n",
"https://github.com/MLWave/Kaggle-Ensemble-Guide"
]
},
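{
"cell_type": "markdown",
"metadata": {},
"source": [
"Model ensembling is an important step in the later stages of a competition. Broadly, the approaches fall into the following types.\n",
"\n",
"1. Simple weighted ensembling:\n",
"    - Regression (or classification probabilities): arithmetic mean, geometric mean\n",
"    - Classification: voting\n",
"\n",
"\n",
"2. Boosting/bagging (already used inside XGBoost, AdaBoost, and GBDT):\n",
"    - Boosting methods built on many trees\n",
"\n",
"\n",
"3. Stacking/blending:\n",
"    - Build a multi-layer model and fit a further model to the predictions of the previous layer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5.3 Theory"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.3.1 Simple Weighted Ensembling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Averaging**\n",
"\n",
"1. For regression problems, a simple and direct approach is to average: take the mean of several models' outputs as the final prediction, thereby combining multiple weak learners into a stronger one.\n",
"\n",
"2. A slight improvement is weighted averaging. The weights can be set by ranking. For example, rank three base models A, B, and C by performance; if their ranks are 1, 2, and 3, assign them weights of 3/6, 2/6, and 1/6 respectively.\n",
"\n",
"3. Averaging and weighted averaging look simple, but the more advanced algorithms that follow can be seen as building on them: bagging and boosting both combine many weak learners into a strong one in this spirit.\n",
"\n",
"4. Averaging can also be applied to the predicted probabilities of a classification problem."
]
},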
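{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Voting**\n",
"\n",
"1. For a binary classification problem with 3 base models, we can build a voting classifier on top of these base learners and take the class with the most votes as the predicted class.\n",
"\n",
"2. Voting comes in two forms: hard voting and soft voting.\n",
"\n",
"3. Hard voting: the models vote directly with their predicted labels, without distinguishing how important each model's result is; the class with the most votes is the final prediction.\n",
"\n",
"4. Soft voting: the models' predicted class probabilities are averaged, and the class with the highest average probability is the final prediction. Different weights can be assigned to different models to reflect their relative importance.\n"
]
},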
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.3.2 stacking/blending"
]
},
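{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Stacking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**Basic idea**: after training several base learners on the original training data (the first layer), use their predictions as a new training set to train a new learner (the second layer).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Setup**: to make the mechanics easier to follow, we first assume a concrete data setting.\n",
"1. The training set has shape `10000*100` and the test set has shape `3000*100`; that is, the training set has 10000 rows and 100 features, and the test set has 3000 rows and 100 features. The task is a **regression problem**.\n",
"\n",
"2. The first layer uses three algorithms: XGB, LGB, and NN. The second layer uses GBDT."
]
},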
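{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Algorithm walkthrough**\n",
"1. **Stacking, first layer**\n",
"\n",
"    1. XGB - corresponds to the `model 1` part of the figure\n",
"        - Input: the training set, split into 5 folds\n",
"        - Processing: the details are as follows\n",
"            - Train an XGB model on folds 1, 2, 3, and 4, then predict fold 5 and the test set; call the results **XGB-pred-tran5** (shape `2000*1`) and **XGB-pred-test1** (shape `3000*1`).\n",
"            - Train an XGB model on folds 1, 2, 3, and 5, then predict fold 4 and the test set; call the results **XGB-pred-tran4** (shape `2000*1`) and **XGB-pred-test2** (shape `3000*1`).\n",
"            - Train an XGB model on folds 1, 2, 4, and 5, then predict fold 3 and the test set; call the results **XGB-pred-tran3** (shape `2000*1`) and **XGB-pred-test3** (shape `3000*1`).\n",
"            - Train an XGB model on folds 1, 3, 4, and 5, then predict fold 2 and the test set; call the results **XGB-pred-tran2** (shape `2000*1`) and **XGB-pred-test4** (shape `3000*1`).\n",
"            - Train an XGB model on folds 2, 3, 4, and 5, then predict fold 1 and the test set; call the results **XGB-pred-tran1** (shape `2000*1`) and **XGB-pred-test5** (shape `3000*1`).\n",
"        - Output:\n",
"            - Concatenate the XGB predictions for folds 1 through 5 to obtain **XGB-pred-tran** (shape `10000*1`). Because of how 5-fold splitting works, each prediction lines up with a row of the original data, which is why the figure calls it NEW FEATURE.\n",
"            - Average XGB-pred-test1 through XGB-pred-test5 to obtain **XGB-pred-test** (shape `3000*1`).\n",
"\n",
"    2. LGB - also corresponds to the `model 1` part of the figure\n",
"        - Input: same as for XGB\n",
"        - Processing: same as for XGB; only the names of the predictions change, e.g. **LGB-pred-tran5** and **LGB-pred-test1**\n",
"        - Output:\n",
"            - Concatenate the LGB predictions for folds 1 through 5 to obtain **LGB-pred-tran** (shape `10000*1`).\n",
"            - Average LGB-pred-test1 through LGB-pred-test5 to obtain **LGB-pred-test** (shape `3000*1`).\n",
"\n",
"    3. NN - also corresponds to the `model 1` part of the figure\n",
"        - Input: same as for XGB\n",
"        - Processing: same as for XGB; only the names of the predictions change, e.g. **NN-pred-tran5** and **NN-pred-test1**\n",
"        - Output:\n",
"            - Concatenate the NN predictions for folds 1 through 5 to obtain **NN-pred-tran** (shape `10000*1`).\n",
"            - Average NN-pred-test1 through NN-pred-test5 to obtain **NN-pred-test** (shape `3000*1`).\n",
"\n",
"2. **Stacking, second layer**\n",
"    - Training set: concatenate the three new features **XGB-pred-tran**, **LGB-pred-tran**, and **NN-pred-tran** into a new training set (shape `10000*3`)\n",
"    - Test set: concatenate **XGB-pred-test**, **LGB-pred-test**, and **NN-pred-test** into a new test set (shape `3000*3`)\n",
"    - Build the second-layer predictor, i.e. the GBDT model, from the new training and test sets: fit it on the new training set with the original labels, then predict on the new test set"
]
},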
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Stacking diagram](https://img-blog.csdnimg.cn/20210401090352724.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NDU4NTgzOQ==,size_16,color_FFFFFF,t_70)"
]
},
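{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Blending"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Blending is largely the same as stacking. The main difference is that the second-stage features are not produced by a K-fold CV scheme over the training set; instead, a holdout set is set aside. Put simply, blending trains the different layers on disjoint subsets of the data.\n",
"\n",
"Taking the same dataset as above, we build a two-layer blending model.\n",
"\n",
"First, split the training set into two parts (d1, d2). For example, d1 has 4000 rows and is used for the first layer, and d2 has 6000 rows and is used for the second layer.\n",
"\n",
"In the first layer, train several models on d1 and use their predictions on d2 and on test as the New Features for the second layer. For example, using the same three models as above yields a `6000*3` feature matrix for d2 and a `3000*3` feature matrix for test.\n",
"\n",
"In the second layer, train a new model on d2's New Features and labels, then feed in test's New Features as the final test set; the predictions on test are the final output of the ensemble.\n"
]
},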
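{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Pros and Cons"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The advantages of blending are:\n",
"\n",
"1. It is simpler than stacking, because no k-fold cross-validation is needed to obtain the stacker features\n",
"\n",
"2. It sidesteps an information-leakage problem: the generalizers and the stacker use different data\n",
"\n",
"3. When modeling as a team, there is no need to share your random seed with teammates\n",
"\n",
"Its disadvantages are:\n",
"\n",
"1. It uses much less data (the second stage is trained on a hold-out split rather than through CV)\n",
"\n",
"2. The blender may overfit (most likely a consequence of the first point)\n",
"\n",
"3. Stacking, which uses repeated CV, is more robust"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5.4 Implementation"
]
},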
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import warnings\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"warnings.filterwarnings('ignore')\n",
"%matplotlib inline\n",
"\n",
"import itertools\n",
"import matplotlib.gridspec as gridspec\n",
"from sklearn import datasets\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n",
"# from mlxtend.classifier import StackingClassifier\n",
"from sklearn.model_selection import cross_val_score, train_test_split\n",
"# from mlxtend.plotting import plot_learning_curves\n",
"# from mlxtend.plotting import plot_decision_regions\n",
"\n",
"from sklearn.model_selection import StratifiedKFold\n",
"from sklearn.ensemble import AdaBoostClassifier\n",
"from sklearn.ensemble import VotingClassifier\n",
"import lightgbm as lgb\n",
"from sklearn.neural_network import MLPClassifier, MLPRegressor\n",
"from sklearn.metrics import mean_squared_error, mean_absolute_error"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.4.1 load data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.metrics import classification_report, f1_score\n",
"from sklearn.model_selection import StratifiedKFold, KFold,train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def reduce_mem_usage(df):\n",
"    start_mem = df.memory_usage().sum() / 1024**2\n",
"    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))\n",
"\n",
"    for col in df.columns:\n",
"        col_type = df[col].dtype\n",
"\n",
"        if col_type != object:\n",
"            c_min = df[col].min()\n",
"            c_max = df[col].max()\n",
"            if str(col_type)[:3] == 'int':\n",
"                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:\n",
"                    df[col] = df[col].astype(np.int8)\n",
"                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:\n",
"                    df[col] = df[col].astype(np.int16)\n",
"                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:\n",
"                    df[col] = df[col].astype(np.int32)\n",
"                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:\n",
"                    df[col] = df[col].astype(np.int64)\n",
"            else:\n",
"                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:\n",
"                    df[col] = df[col].astype(np.float16)\n",
"                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:\n",
"                    df[col] = df[col].astype(np.float32)\n",
"                else:\n",
"                    df[col] = df[col].astype(np.float64)\n",
"        else:\n",
"            df[col] = df[col].astype('category')\n",
"\n",
"    end_mem = df.memory_usage().sum() / 1024**2\n",
"    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))\n",
"    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))\n",
"\n",
"    return df"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Memory usage of dataframe is 30.28 MB\n",
"Memory usage after optimization is: 7.59 MB\n",
"Decreased by 74.9%\n"
]
}
],
"source": [
"all_df = pd.read_csv('data/group_df.csv',index_col=0)\n",
"all_df = reduce_mem_usage(all_df)\n",
"all_df = all_df.fillna(99)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(9000, 440)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_df.shape"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 2 4361\n",
"-1 2000\n",
" 0 1621\n",
" 1 1018\n",
"Name: label, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_df['label'].value_counts()"
]
},
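{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"In all_df, rows with label 0/1/2 form the training set (7000 rows in total), and rows with label -1 form the test set (2000 rows in total).\n",
"1. The test set (label -1) has no labels; this part of the data simulates a real competition submission.\n",
"\n",
"2. All train rows are labeled. We hold out 30% of them as a validation set and use the rest for training. Models are compared on the validation set, with F1 (macro-averaged) as the score.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"train = all_df[all_df['label'] != -1]\n",
"test = all_df[all_df['label'] == -1]\n",
"feats = [c for c in train.columns if c not in ['ID', 'label']]\n",
"\n",
"# Split into training and validation sets at a 7:3 ratio\n",
"X_train,X_val,y_train,y_val= train_test_split(train[feats],train['label'],test_size=0.3,random_state=0)"
]
},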
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.4.2 Single Models and Weighted Ensembling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we train three single models: one RF and two LGB models with different hyperparameters. In practice, ensembling benefits from diversity among the base classifiers, so one would not normally use the same model type twice; this is for demonstration only."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Single-model builder functions\n",
"def build_model_rf(X_train,y_train):\n",
"    model = RandomForestClassifier(n_estimators = 100)\n",
"    model.fit(X_train, y_train)\n",
"    return model\n",
"\n",
"\n",
"def build_model_lgb(X_train,y_train):\n",
"    model = lgb.LGBMClassifier(num_leaves=127,learning_rate = 0.1,n_estimators = 200)\n",
"    model.fit(X_train, y_train)\n",
"    return model\n",
"\n",
"\n",
"def build_model_lgb2(X_train,y_train):\n",
"    model = lgb.LGBMClassifier(num_leaves=63,learning_rate = 0.05,n_estimators = 400)\n",
"    model.fit(X_train, y_train)\n",
"    return model\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"predict rf ...\n",
"0.8987051046527208\n",
"predict lgb...\n",
"0.9144414270113281\n",
"predict lgb 2...\n",
"0.9183965870229657\n"
]
}
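],
"source": [
"# Train the three single models; subA_rf, subA_lgb, and subA_lgb2 are each predictions that could be submitted\n",
"# The single models are not tuned, so they act as weak classifiers and their scores may be modest.\n",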
"\n",
"print('predict rf ...')\n",
"model_rf = build_model_rf(X_train,y_train)\n",
"val_rf = model_rf.predict(X_val)\n",
"subA_rf = model_rf.predict(test[feats])\n",
"rf_f1_score = f1_score(y_val,val_rf,average='macro')\n",
"print(rf_f1_score)\n",
"\n",
"print('predict lgb...')\n",
"model_lgb = build_model_lgb(X_train,y_train)\n",
"val_lgb = model_lgb.predict(X_val)\n",
"subA_lgb = model_lgb.predict(test[feats])\n",
"lgb_f1_score = f1_score(y_val,val_lgb,average='macro')\n",
"print(lgb_f1_score)\n",
"\n",
"\n",
"print('predict lgb 2...')\n",
"model_lgb2 = build_model_lgb2(X_train,y_train)\n",
"val_lgb2 = model_lgb2.predict(X_val)\n",
"subA_lgb2 = model_lgb2.predict(test[feats])\n",
"lgb2_f1_score = f1_score(y_val,val_lgb2,average='macro')\n",
"print(lgb2_f1_score)\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.9142736444973326\n"
]
}
],
"source": [
"voting_clf = VotingClassifier(estimators=[('rf',model_rf ),\n",
" ('lgb',model_lgb),\n",
" ('lgb2',model_lgb2 )],voting='hard')\n",
"\n",
"voting_clf.fit(X_train,y_train)\n",
"val_voting = voting_clf.predict(X_val)\n",
"subA_voting = voting_clf.predict(test[feats])\n",
"voting_f1_score = f1_score(y_val,val_voting,average='macro')\n",
"print(voting_f1_score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.4.3 Stacking Ensemble"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"_N_FOLDS = 5  # use 5-fold cross-validation\n",
"# random_state only takes effect with shuffle=True; recent scikit-learn versions raise an error otherwise\n",
"kf = KFold(n_splits=_N_FOLDS, shuffle=True, random_state=42)  # scikit-learn splitter used to partition the data\n",
"\n",
"\n",
"def get_oof(clf, X_train, y_train, X_test):\n",
"    oof_train = np.zeros((X_train.shape[0], 1))\n",
"    oof_test_skf = np.empty((_N_FOLDS, X_test.shape[0], 1))\n",
"\n",
"    for i, (train_index, test_index) in enumerate(kf.split(X_train)):  # split into this fold's training and validation parts\n",
"        kf_X_train = X_train.iloc[train_index,]\n",
"        kf_y_train = y_train.iloc[train_index,]\n",
"        kf_X_val = X_train.iloc[test_index,]\n",
"\n",
"        clf.fit(kf_X_train, kf_y_train)\n",
"\n",
"        oof_train[test_index] = clf.predict(kf_X_val).reshape(-1, 1)\n",
"        oof_test_skf[i, :] = clf.predict(X_test).reshape(-1, 1)\n",
"\n",
"    # Average the test predictions over the folds. Note that this averages hard class labels as if they were numbers;\n",
"    # using predict_proba outputs as features is usually the safer choice for classification.\n",
"    oof_test = oof_test_skf.mean(axis=0)\n",
"    return oof_train, oof_test  # out-of-fold predictions for the training set, fold-averaged predictions for the test set"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Call get_oof for each classifier and concatenate the results to obtain the new training and test data: new_train, new_test\n",
"new_train, new_test = [], []\n",
"\n",
"\n",
"model1 = RandomForestClassifier(n_estimators = 100)\n",
"model2 = lgb.LGBMClassifier(num_leaves=127,learning_rate = 0.1,n_estimators = 200)\n",
"model3 = lgb.LGBMClassifier(num_leaves=63,learning_rate = 0.05,n_estimators = 400)\n",
"\n",
"for clf in [model1, model2, model3]:\n",
"    oof_train, oof_test = get_oof(clf, X_train, y_train, X_val)\n",
"    new_train.append(oof_train)\n",
"    new_test.append(oof_test)\n",
"\n",
"new_train = np.concatenate(new_train, axis=1)\n",
"new_test = np.concatenate(new_test, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.8816601744239989\n"
]
}
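],
"source": [
"# Use the new training data new_train as input to a new model: the second layer of stacking\n",
"# A simple LogisticRegression is used as the second layer to reduce the risk of overfitting\n",
"# The models used here have not been optimized, so the ensemble does not perform particularly well\n",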
"clf = LogisticRegression()\n",
"clf.fit(new_train, y_train)\n",
"result = clf.predict(new_test)\n",
"\n",
"stacking_f1_score = f1_score(y_val,result,average='macro')\n",
"print(stacking_f1_score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5.5 Exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. How can blending be derived from stacking? Hint: stacking uses K-fold CV, while blending uses a holdout set.\n",
"\n",
"2. What further optimizations could raise the F1-score of stacking? Hint: consider the number of first-layer models and the diversity among them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**References**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://blog.csdn.net/weixin_44585839/article/details/110148396\n",
"\n",
"https://blog.csdn.net/weixin_39962758/article/details/111101263"
]
},
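{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**END.**\n",
"\n",
"[张晋: Datawhale member and algorithm-competition enthusiast. CSDN: https://blog.csdn.net/weixin_44585839/]\n",
"\n",
"\n",
"\n",
"About Datawhale\n",
"\n",
"> Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and has gathered team members who share an open-source and exploratory spirit. With the vision 'for the learner, growing together with learners', Datawhale encourages showing one's true self, openness and inclusiveness, mutual trust and help, and the courage to experiment and take responsibility. Datawhale also applies open-source ideas to explore open content, open learning, and open solutions, empowering talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future.\n",
"\n",
"The material for this data-mining learning track will be shared on Tianchi; follow Datawhale for details.\n",
"\n",
"![logo.png](https://img-blog.csdnimg.cn/2020090509294089.png)"
]
}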
],
"metadata": {
"hide_input": false,
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}
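
Note on section 5.3.1 of the notebook above: the rank-based weighting and the two voting schemes can be sketched in plain Python. This is a minimal illustration, not code from the notebook. The helper names `rank_weights`, `weighted_average`, `hard_vote`, and `soft_vote` are hypothetical, and the toy numbers are made up for the example.

```python
from collections import Counter


def rank_weights(ranks):
    """Turn performance ranks (1 = best) into weights that sum to 1.

    With ranks 1, 2, 3 this gives 3/6, 2/6, 1/6, as in section 5.3.1.
    """
    n = len(ranks)
    scores = [n + 1 - r for r in ranks]  # the best rank gets the largest score
    total = sum(scores)
    return [s / total for s in scores]


def weighted_average(predictions, weights):
    """Weighted mean of per-model predictions (one list of values per model)."""
    n_samples = len(predictions[0])
    return [sum(w * p[i] for w, p in zip(weights, predictions))
            for i in range(n_samples)]


def hard_vote(labels_per_model):
    """Majority vote over predicted class labels (one list of labels per model)."""
    n_samples = len(labels_per_model[0])
    return [Counter(m[i] for m in labels_per_model).most_common(1)[0][0]
            for i in range(n_samples)]


def soft_vote(probas_per_model, weights=None):
    """Average class probabilities (optionally weighted), then take the argmax."""
    n_models = len(probas_per_model)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    result = []
    for i in range(len(probas_per_model[0])):
        n_classes = len(probas_per_model[0][i])
        avg = [sum(w * m[i][c] for w, m in zip(weights, probas_per_model))
               for c in range(n_classes)]
        result.append(max(range(n_classes), key=avg.__getitem__))
    return result


# Toy data (made up): three models, one sample, two classes.
weights = rank_weights([1, 2, 3])
avg_pred = weighted_average([[10.0], [20.0], [30.0]], weights)

probas = [[[0.9, 0.1]], [[0.4, 0.6]], [[0.4, 0.6]]]
labels = [[max(range(2), key=p.__getitem__) for p in model] for model in probas]
hard_result = hard_vote(labels)
soft_result = soft_vote(probas)
print(weights, avg_pred, hard_result, soft_result)
```

In the toy case two of the three models favor class 1, so hard voting returns class 1, while the one confident model tips the averaged probabilities toward class 0. In scikit-learn, `VotingClassifier(voting='soft')` performs this probability averaging and accepts an optional `weights` argument.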

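Note on the blending part of section 5.3.2 of the notebook above: the two-layer holdout procedure can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions, not the notebook's code. The data are synthetic, and plain least-squares models on disjoint feature groups stand in for XGB, LGB, and NN as well as for the second-layer model; `fit_linear` and `predict_linear` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data with the shapes used in the text (assumption).
n_train, n_test, n_feat = 10000, 3000, 100
true_w = rng.normal(size=n_feat)
X = rng.normal(size=(n_train + n_test, n_feat))
y = X @ true_w + rng.normal(scale=0.5, size=n_train + n_test)
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# Holdout split: d1 (4000 rows) trains layer 1, d2 (6000 rows) trains layer 2.
X_d1, y_d1 = X_train[:4000], y_train[:4000]
X_d2, y_d2 = X_train[4000:], y_train[4000:]


def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Predict with coefficients produced by fit_linear."""
    return np.column_stack([X, np.ones(len(X))]) @ coef


# Three deliberately different (and individually weak) stand-in base models:
# each one sees only about a third of the features.
feature_groups = [np.arange(0, 34), np.arange(34, 67), np.arange(67, 100)]
base_models = [fit_linear(X_d1[:, g], y_d1) for g in feature_groups]

# New Features: layer-1 predictions on d2 (6000*3) and on test (3000*3).
d2_feats = np.column_stack(
    [predict_linear(m, X_d2[:, g]) for m, g in zip(base_models, feature_groups)])
test_feats = np.column_stack(
    [predict_linear(m, X_test[:, g]) for m, g in zip(base_models, feature_groups)])

# Layer 2 is trained only on d2's New Features and labels, never on d1.
meta = fit_linear(d2_feats, y_d2)
blend_pred = predict_linear(meta, test_feats)


def mse(a, b):
    return float(np.mean((a - b) ** 2))


base_mse = [mse(test_feats[:, j], y_test) for j in range(3)]
blend_mse = mse(blend_pred, y_test)
print('base model MSE:', base_mse)
print('blended MSE:', blend_mse)
```

Because each stand-in base model sees only part of the features, the second layer has a lot to gain from combining them. With real, stronger base models the improvement is usually far smaller, and the split ratio between d1 and d2 becomes a trade-off between the two layers.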
@@ -1116,7 +1116,7 @@
" tmp_df['lat'] = tmp_df['lat'].astype(float)\n",
" tmp_df['lon'] = tmp_df['lon'].astype(float)\n",
" tmp_df['speed'] = tmp_df['speed'].astype(float)\n",
" tmp_df['direction'] = tmp_df['direction'].astype(int)\n",
" tmp_df['direction'] = tmp_df['direction'].astype(int)  # If this line fails, try updating pandas\n",
" return tmp_df\n",
"# Convert planar coordinates to latitude/longitude; used for the preliminary-round data\n",
"# Reference system: NAD83 / California zone 6 (ftUS) (EPSG:2230); lookup link: https://mygeodata.cloud/cs2cs/\n",
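@@ -1674,6 +1674,20 @@
"Advanced exercise: \n",
"2. This module introduced a number of libraries and their common methods. If you can, take the outlier-cleaned data DF, keep only the key points identified by the douglas-peucker algorithm, delete the points the algorithm did not retain, and then try geohash-encoding the remaining coordinates."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# References"
]
},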
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://tianchi.aliyun.com/forum/postDetail?spm=5176.12586969.1002.3.163c24d1HiGiFo&postId=110644"
]
}
],
"metadata": {

@@ -1556,6 +1556,7 @@
}
],
"metadata": {
"hide_input": false,
"kernelspec": {
"display_name": "Python [conda env:seacom]",
"language": "python",

@@ -3560,6 +3560,7 @@
}
],
"metadata": {
"hide_input": false,
"kernelspec": {
"display_name": "Python [conda env:all] *",
"language": "python",

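@@ -37,9 +37,8 @@ https://tianchi.aliyun.com/competition/entrance/231768/information
2. Competition data and the attachment data for the introduction to common geographic data analysis tools
Link: https://pan.baidu.com/s/1AEWhNkSzx6Ls8XmVXFOQMg
Extraction code: wrgg
Link: https://pan.xunlei.com/s/VMX5JAhFN7ZmPaaCVsHQEVkrA1
Extraction code: hmtz
Of the competition data, this team-study course uses only the hy_round1_testA_20200102 and hy_round1_train_20200102 files. DF.csv and df_gpd_change.pkl are the data needed for Task1: DF.csv is the trajectory data after outlier handling, and df_gpd_change.pkl is that outlier-handled data after compression with the douglas-peucker algorithm.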

@@ -670,16 +670,10 @@
"\n",
"![logo.png](https://img-blog.csdnimg.cn/2020090509294089.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"hide_input": false,
"kernelspec": {
"display_name": "Python 3",
"language": "python",
@@ -696,6 +690,19 @@
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,