team-learning-data-mining/SecondHandCarPriceForecast/Baseline.md
2020-07-30 09:17:18 +08:00

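# Datawhale Introduction to Data Mining for Beginners: Baseline
## Baseline v1.0
Tip: This is a first, minimal baseline, offered as a starting point to spark better ideas. It provides a basic baseline and a short introduction to the competition workflow. Discussion is welcome.
**Competition: Introduction to Data Mining for Beginners - Used Car Transaction Price Prediction**
URL: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX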
```python
# List the data files in the datalab directory
!ls datalab/
```
231784
### Step 1: Import the toolbox
```python
## Basic tools
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
import time
warnings.filterwarnings('ignore')
%matplotlib inline
## Models for prediction
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
## Dimensionality reduction
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA
import lightgbm as lgb
import xgboost as xgb
## Parameter search and evaluation
from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
```
### Step 2: Read the data
```python
## Read the data with pandas (a user-friendly library for loading data)
Train_data = pd.read_csv('datalab/231784/used_car_train_20200313.csv', sep=' ')
TestA_data = pd.read_csv('datalab/231784/used_car_testA_20200313.csv', sep=' ')
## Print the shape of each dataset
print('Train data shape:',Train_data.shape)
print('TestA data shape:',TestA_data.shape)
```
    Train data shape: (150000, 31)
    TestA data shape: (50000, 30)
#### 1) A quick look at the data
```python
## Use .head() to take a quick look at the loaded data
Train_data.head()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>SaleID</th>
<th>name</th>
<th>regDate</th>
<th>model</th>
<th>brand</th>
<th>bodyType</th>
<th>fuelType</th>
<th>gearbox</th>
<th>power</th>
<th>kilometer</th>
<th>...</th>
<th>v_5</th>
<th>v_6</th>
<th>v_7</th>
<th>v_8</th>
<th>v_9</th>
<th>v_10</th>
<th>v_11</th>
<th>v_12</th>
<th>v_13</th>
<th>v_14</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>736</td>
<td>20040402</td>
<td>30.0</td>
<td>6</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>60</td>
<td>12.5</td>
<td>...</td>
<td>0.235676</td>
<td>0.101988</td>
<td>0.129549</td>
<td>0.022816</td>
<td>0.097462</td>
<td>-2.881803</td>
<td>2.804097</td>
<td>-2.420821</td>
<td>0.795292</td>
<td>0.914762</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>2262</td>
<td>20030301</td>
<td>40.0</td>
<td>1</td>
<td>2.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0</td>
<td>15.0</td>
<td>...</td>
<td>0.264777</td>
<td>0.121004</td>
<td>0.135731</td>
<td>0.026597</td>
<td>0.020582</td>
<td>-4.900482</td>
<td>2.096338</td>
<td>-1.030483</td>
<td>-1.722674</td>
<td>0.245522</td>
</tr>
<tr>
<th>2</th>
<td>2</td>
<td>14874</td>
<td>20040403</td>
<td>115.0</td>
<td>15</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>163</td>
<td>12.5</td>
<td>...</td>
<td>0.251410</td>
<td>0.114912</td>
<td>0.165147</td>
<td>0.062173</td>
<td>0.027075</td>
<td>-4.846749</td>
<td>1.803559</td>
<td>1.565330</td>
<td>-0.832687</td>
<td>-0.229963</td>
</tr>
<tr>
<th>3</th>
<td>3</td>
<td>71865</td>
<td>19960908</td>
<td>109.0</td>
<td>10</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>193</td>
<td>15.0</td>
<td>...</td>
<td>0.274293</td>
<td>0.110300</td>
<td>0.121964</td>
<td>0.033395</td>
<td>0.000000</td>
<td>-4.509599</td>
<td>1.285940</td>
<td>-0.501868</td>
<td>-2.438353</td>
<td>-0.478699</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>111080</td>
<td>20120103</td>
<td>110.0</td>
<td>5</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>68</td>
<td>5.0</td>
<td>...</td>
<td>0.228036</td>
<td>0.073205</td>
<td>0.091880</td>
<td>0.078819</td>
<td>0.121534</td>
<td>-1.896240</td>
<td>0.910783</td>
<td>0.931110</td>
<td>2.834518</td>
<td>1.923482</td>
</tr>
</tbody>
</table>
<p>5 rows × 31 columns</p>
</div>
#### 2) Inspect the data information
```python
## .info() gives a quick view of the column names and the NaN (missing value) counts
Train_data.info()
```
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 150000 entries, 0 to 149999
    Data columns (total 31 columns):
    SaleID               150000 non-null int64
    name                 150000 non-null int64
    regDate              150000 non-null int64
    model                149999 non-null float64
    brand                150000 non-null int64
    bodyType             145494 non-null float64
    fuelType             141320 non-null float64
    gearbox              144019 non-null float64
    power                150000 non-null int64
    kilometer            150000 non-null float64
    notRepairedDamage    150000 non-null object
    regionCode           150000 non-null int64
    seller               150000 non-null int64
    offerType            150000 non-null int64
    creatDate            150000 non-null int64
    price                150000 non-null int64
    v_0                  150000 non-null float64
    v_1                  150000 non-null float64
    v_2                  150000 non-null float64
    v_3                  150000 non-null float64
    v_4                  150000 non-null float64
    v_5                  150000 non-null float64
    v_6                  150000 non-null float64
    v_7                  150000 non-null float64
    v_8                  150000 non-null float64
    v_9                  150000 non-null float64
    v_10                 150000 non-null float64
    v_11                 150000 non-null float64
    v_12                 150000 non-null float64
    v_13                 150000 non-null float64
    v_14                 150000 non-null float64
    dtypes: float64(20), int64(10), object(1)
    memory usage: 35.5+ MB
```python
## Use .columns to view the column names
Train_data.columns
```
    Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
           'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
           'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
           'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
           'v_13', 'v_14'],
          dtype='object')
```python
TestA_data.info()
```
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 50000 entries, 0 to 49999
    Data columns (total 30 columns):
    SaleID               50000 non-null int64
    name                 50000 non-null int64
    regDate              50000 non-null int64
    model                50000 non-null float64
    brand                50000 non-null int64
    bodyType             48587 non-null float64
    fuelType             47107 non-null float64
    gearbox              48090 non-null float64
    power                50000 non-null int64
    kilometer            50000 non-null float64
    notRepairedDamage    50000 non-null object
    regionCode           50000 non-null int64
    seller               50000 non-null int64
    offerType            50000 non-null int64
    creatDate            50000 non-null int64
    v_0                  50000 non-null float64
    v_1                  50000 non-null float64
    v_2                  50000 non-null float64
    v_3                  50000 non-null float64
    v_4                  50000 non-null float64
    v_5                  50000 non-null float64
    v_6                  50000 non-null float64
    v_7                  50000 non-null float64
    v_8                  50000 non-null float64
    v_9                  50000 non-null float64
    v_10                 50000 non-null float64
    v_11                 50000 non-null float64
    v_12                 50000 non-null float64
    v_13                 50000 non-null float64
    v_14                 50000 non-null float64
    dtypes: float64(20), int64(9), object(1)
    memory usage: 11.4+ MB
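
Reading the non-null counts off the `.info()` output, the training set is missing 4506 `bodyType`, 8680 `fuelType`, 5981 `gearbox`, and 1 `model` value. Rather than subtracting by hand, the same information can be tabulated directly. The following is a minimal sketch on a made-up toy frame; `missing_summary` is a hypothetical helper name, not part of the baseline.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and fraction of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'n_missing': counts,
        'frac_missing': counts / len(df),
    })
    # Keep only columns with at least one missing value, largest first
    summary = summary[summary['n_missing'] > 0]
    return summary.sort_values('n_missing', ascending=False)

# Toy frame standing in for Train_data (values are made up for illustration)
toy = pd.DataFrame({
    'power': [60, 0, 163, 193],
    'gearbox': [0.0, np.nan, 0.0, 1.0],
    'fuelType': [np.nan, np.nan, 0.0, 0.0],
})
print(missing_summary(toy))
```

On the real data, the same call would be `missing_summary(Train_data)` and `missing_summary(TestA_data)`.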
#### 3) Browse the summary statistics
```python
## .describe() shows summary statistics for the numerical feature columns
Train_data.describe()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>SaleID</th>
<th>name</th>
<th>regDate</th>
<th>model</th>
<th>brand</th>
<th>bodyType</th>
<th>fuelType</th>
<th>gearbox</th>
<th>power</th>
<th>kilometer</th>
<th>...</th>
<th>v_5</th>
<th>v_6</th>
<th>v_7</th>
<th>v_8</th>
<th>v_9</th>
<th>v_10</th>
<th>v_11</th>
<th>v_12</th>
<th>v_13</th>
<th>v_14</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>150000.000000</td>
<td>150000.000000</td>
<td>1.500000e+05</td>
<td>149999.000000</td>
<td>150000.000000</td>
<td>145494.000000</td>
<td>141320.000000</td>
<td>144019.000000</td>
<td>150000.000000</td>
<td>150000.000000</td>
<td>...</td>
<td>150000.000000</td>
<td>150000.000000</td>
<td>150000.000000</td>
<td>150000.000000</td>
<td>150000.000000</td>
<td>150000.000000</td>
<td>150000.000000</td>
<td>150000.000000</td>
<td>150000.000000</td>
<td>150000.000000</td>
</tr>
<tr>
<th>mean</th>
<td>74999.500000</td>
<td>68349.172873</td>
<td>2.003417e+07</td>
<td>47.129021</td>
<td>8.052733</td>
<td>1.792369</td>
<td>0.375842</td>
<td>0.224943</td>
<td>119.316547</td>
<td>12.597160</td>
<td>...</td>
<td>0.248204</td>
<td>0.044923</td>
<td>0.124692</td>
<td>0.058144</td>
<td>0.061996</td>
<td>-0.001000</td>
<td>0.009035</td>
<td>0.004813</td>
<td>0.000313</td>
<td>-0.000688</td>
</tr>
<tr>
<th>std</th>
<td>43301.414527</td>
<td>61103.875095</td>
<td>5.364988e+04</td>
<td>49.536040</td>
<td>7.864956</td>
<td>1.760640</td>
<td>0.548677</td>
<td>0.417546</td>
<td>177.168419</td>
<td>3.919576</td>
<td>...</td>
<td>0.045804</td>
<td>0.051743</td>
<td>0.201410</td>
<td>0.029186</td>
<td>0.035692</td>
<td>3.772386</td>
<td>3.286071</td>
<td>2.517478</td>
<td>1.288988</td>
<td>1.038685</td>
</tr>
<tr>
<th>min</th>
<td>0.000000</td>
<td>0.000000</td>
<td>1.991000e+07</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.500000</td>
<td>...</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>-9.168192</td>
<td>-5.558207</td>
<td>-9.639552</td>
<td>-4.153899</td>
<td>-6.546556</td>
</tr>
<tr>
<th>25%</th>
<td>37499.750000</td>
<td>11156.000000</td>
<td>1.999091e+07</td>
<td>10.000000</td>
<td>1.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>75.000000</td>
<td>12.500000</td>
<td>...</td>
<td>0.243615</td>
<td>0.000038</td>
<td>0.062474</td>
<td>0.035334</td>
<td>0.033930</td>
<td>-3.722303</td>
<td>-1.951543</td>
<td>-1.871846</td>
<td>-1.057789</td>
<td>-0.437034</td>
</tr>
<tr>
<th>50%</th>
<td>74999.500000</td>
<td>51638.000000</td>
<td>2.003091e+07</td>
<td>30.000000</td>
<td>6.000000</td>
<td>1.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>110.000000</td>
<td>15.000000</td>
<td>...</td>
<td>0.257798</td>
<td>0.000812</td>
<td>0.095866</td>
<td>0.057014</td>
<td>0.058484</td>
<td>1.624076</td>
<td>-0.358053</td>
<td>-0.130753</td>
<td>-0.036245</td>
<td>0.141246</td>
</tr>
<tr>
<th>75%</th>
<td>112499.250000</td>
<td>118841.250000</td>
<td>2.007111e+07</td>
<td>66.000000</td>
<td>13.000000</td>
<td>3.000000</td>
<td>1.000000</td>
<td>0.000000</td>
<td>150.000000</td>
<td>15.000000</td>
<td>...</td>
<td>0.265297</td>
<td>0.102009</td>
<td>0.125243</td>
<td>0.079382</td>
<td>0.087491</td>
<td>2.844357</td>
<td>1.255022</td>
<td>1.776933</td>
<td>0.942813</td>
<td>0.680378</td>
</tr>
<tr>
<th>max</th>
<td>149999.000000</td>
<td>196812.000000</td>
<td>2.015121e+07</td>
<td>247.000000</td>
<td>39.000000</td>
<td>7.000000</td>
<td>6.000000</td>
<td>1.000000</td>
<td>19312.000000</td>
<td>15.000000</td>
<td>...</td>
<td>0.291838</td>
<td>0.151420</td>
<td>1.404936</td>
<td>0.160791</td>
<td>0.222787</td>
<td>12.357011</td>
<td>18.819042</td>
<td>13.847792</td>
<td>11.147669</td>
<td>8.658418</td>
</tr>
</tbody>
</table>
<p>8 rows × 30 columns</p>
</div>
```python
TestA_data.describe()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>SaleID</th>
<th>name</th>
<th>regDate</th>
<th>model</th>
<th>brand</th>
<th>bodyType</th>
<th>fuelType</th>
<th>gearbox</th>
<th>power</th>
<th>kilometer</th>
<th>...</th>
<th>v_5</th>
<th>v_6</th>
<th>v_7</th>
<th>v_8</th>
<th>v_9</th>
<th>v_10</th>
<th>v_11</th>
<th>v_12</th>
<th>v_13</th>
<th>v_14</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>50000.000000</td>
<td>50000.000000</td>
<td>5.000000e+04</td>
<td>50000.000000</td>
<td>50000.000000</td>
<td>48587.000000</td>
<td>47107.000000</td>
<td>48090.000000</td>
<td>50000.000000</td>
<td>50000.000000</td>
<td>...</td>
<td>50000.000000</td>
<td>50000.000000</td>
<td>50000.000000</td>
<td>50000.000000</td>
<td>50000.000000</td>
<td>50000.000000</td>
<td>50000.000000</td>
<td>50000.000000</td>
<td>50000.000000</td>
<td>50000.000000</td>
</tr>
<tr>
<th>mean</th>
<td>174999.500000</td>
<td>68542.223280</td>
<td>2.003393e+07</td>
<td>46.844520</td>
<td>8.056240</td>
<td>1.782185</td>
<td>0.373405</td>
<td>0.224350</td>
<td>119.883620</td>
<td>12.595580</td>
<td>...</td>
<td>0.248669</td>
<td>0.045021</td>
<td>0.122744</td>
<td>0.057997</td>
<td>0.062000</td>
<td>-0.017855</td>
<td>-0.013742</td>
<td>-0.013554</td>
<td>-0.003147</td>
<td>0.001516</td>
</tr>
<tr>
<th>std</th>
<td>14433.901067</td>
<td>61052.808133</td>
<td>5.368870e+04</td>
<td>49.469548</td>
<td>7.819477</td>
<td>1.760736</td>
<td>0.546442</td>
<td>0.417158</td>
<td>185.097387</td>
<td>3.908979</td>
<td>...</td>
<td>0.044601</td>
<td>0.051766</td>
<td>0.195972</td>
<td>0.029211</td>
<td>0.035653</td>
<td>3.747985</td>
<td>3.231258</td>
<td>2.515962</td>
<td>1.286597</td>
<td>1.027360</td>
</tr>
<tr>
<th>min</th>
<td>150000.000000</td>
<td>0.000000</td>
<td>1.991000e+07</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.500000</td>
<td>...</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>-9.160049</td>
<td>-5.411964</td>
<td>-8.916949</td>
<td>-4.123333</td>
<td>-6.112667</td>
</tr>
<tr>
<th>25%</th>
<td>162499.750000</td>
<td>11203.500000</td>
<td>1.999091e+07</td>
<td>10.000000</td>
<td>1.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>75.000000</td>
<td>12.500000</td>
<td>...</td>
<td>0.243762</td>
<td>0.000044</td>
<td>0.062644</td>
<td>0.035084</td>
<td>0.033714</td>
<td>-3.700121</td>
<td>-1.971325</td>
<td>-1.876703</td>
<td>-1.060428</td>
<td>-0.437920</td>
</tr>
<tr>
<th>50%</th>
<td>174999.500000</td>
<td>52248.500000</td>
<td>2.003091e+07</td>
<td>29.000000</td>
<td>6.000000</td>
<td>1.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>109.000000</td>
<td>15.000000</td>
<td>...</td>
<td>0.257877</td>
<td>0.000815</td>
<td>0.095828</td>
<td>0.057084</td>
<td>0.058764</td>
<td>1.613212</td>
<td>-0.355843</td>
<td>-0.142779</td>
<td>-0.035956</td>
<td>0.138799</td>
</tr>
<tr>
<th>75%</th>
<td>187499.250000</td>
<td>118856.500000</td>
<td>2.007110e+07</td>
<td>65.000000</td>
<td>13.000000</td>
<td>3.000000</td>
<td>1.000000</td>
<td>0.000000</td>
<td>150.000000</td>
<td>15.000000</td>
<td>...</td>
<td>0.265328</td>
<td>0.102025</td>
<td>0.125438</td>
<td>0.079077</td>
<td>0.087489</td>
<td>2.832708</td>
<td>1.262914</td>
<td>1.764335</td>
<td>0.941469</td>
<td>0.681163</td>
</tr>
<tr>
<th>max</th>
<td>199999.000000</td>
<td>196805.000000</td>
<td>2.015121e+07</td>
<td>246.000000</td>
<td>39.000000</td>
<td>7.000000</td>
<td>6.000000</td>
<td>1.000000</td>
<td>20000.000000</td>
<td>15.000000</td>
<td>...</td>
<td>0.291618</td>
<td>0.153265</td>
<td>1.358813</td>
<td>0.156355</td>
<td>0.214775</td>
<td>12.338872</td>
<td>18.856218</td>
<td>12.950498</td>
<td>5.913273</td>
<td>2.624622</td>
</tr>
</tbody>
</table>
<p>8 rows × 29 columns</p>
</div>
### Step 3: Build features and labels
#### 1) Extract the names of the numerical feature columns
```python
numerical_cols = Train_data.select_dtypes(exclude = 'object').columns
print(numerical_cols)
```
    Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
           'gearbox', 'power', 'kilometer', 'regionCode', 'seller', 'offerType',
           'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
           'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
          dtype='object')
```python
categorical_cols = Train_data.select_dtypes(include = 'object').columns
print(categorical_cols)
```
    Index(['notRepairedDamage'], dtype='object')
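
`notRepairedDamage` is the only `object` column, and because the baseline builds its features from the numerical columns only, it is left out of the model. If one wanted to bring such a column in, a common approach is to coerce it to numbers and let unparseable entries become NaN. The sketch below uses a toy series; the string values, including the `'-'` placeholder, are assumptions made for illustration and are not taken from the notebook output.

```python
import pandas as pd

# Toy column; the string values (including the '-' placeholder) are assumed
raw = pd.Series(['0.0', '-', '1.0', '0.0'], name='notRepairedDamage')

# Strings that cannot be parsed as numbers become NaN
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric.dtype)            # float64
print(numeric.isnull().sum())   # 1
```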
#### 2) Build the training and test samples
```python
## Select the feature columns
feature_cols = [col for col in numerical_cols if col not in ['SaleID','name','regDate','creatDate','price','model','brand','regionCode','seller']]
feature_cols = [col for col in feature_cols if 'Type' not in col]
## Extract the feature columns and the label column to build the training and test samples
X_data = Train_data[feature_cols]
Y_data = Train_data['price']
X_test = TestA_data[feature_cols]
print('X train shape:',X_data.shape)
print('X test shape:',X_test.shape)
```
    X train shape: (150000, 18)
    X test shape: (50000, 18)
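
To see where the 18 features come from, the two list comprehensions can be replayed on the column names printed earlier. This is only a sketch in plain Python (no data needed); the `excluded` variable name is introduced here for readability.

```python
# Numerical column names as printed by select_dtypes(exclude='object') above
numerical_cols = (
    ['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
     'gearbox', 'power', 'kilometer', 'regionCode', 'seller', 'offerType',
     'creatDate', 'price'] + ['v_%d' % i for i in range(15)]
)

excluded = ['SaleID', 'name', 'regDate', 'creatDate', 'price',
            'model', 'brand', 'regionCode', 'seller']

feature_cols = [col for col in numerical_cols if col not in excluded]
# Substring filter: removes bodyType, fuelType and offerType
feature_cols = [col for col in feature_cols if 'Type' not in col]

print(len(feature_cols))    # 18
print(feature_cols[:3])     # ['gearbox', 'power', 'kilometer']
```

The remaining features are `gearbox`, `power`, `kilometer`, and the 15 anonymous columns `v_0` to `v_14`.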
```python
## Define a helper that prints summary statistics, for convenient reuse later
def Sta_inf(data):
    print('_min',np.min(data))
    print('_max:',np.max(data))
    print('_mean',np.mean(data))
    print('_ptp',np.ptp(data))
    print('_std',np.std(data))
    print('_var',np.var(data))
```
#### 3) Basic distribution statistics of the label
```python
print('Sta of label:')
Sta_inf(Y_data)
```
    Sta of label:
    _min 11
    _max: 99999
    _mean 5923.32733333
    _ptp 99988
    _std 7501.97346988
    _var 56279605.9427
```python
## Plot a histogram of the label to inspect its distribution
plt.hist(Y_data)
plt.show()
plt.close()
```
![output_24_0](https://img-blog.csdnimg.cn/20200321232442489.png)
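
The label statistics (mean about 5923, standard deviation about 7502, maximum 99999) point to a long right tail. One common option for such targets, which this baseline does not use, is to train on `log1p(price)` and map predictions back with `expm1`. A minimal sketch with made-up prices spanning the reported range:

```python
import numpy as np

# Made-up prices spanning the range reported above (11 to 99999)
price = np.array([11.0, 500.0, 5923.0, 20000.0, 99999.0])

log_price = np.log1p(price)       # compress the long right tail
restored = np.expm1(log_price)    # inverse transform back to the price scale

print(np.ptp(price), np.ptp(log_price))
print(np.allclose(restored, price))
```

Whether this helps under an MAE metric would need to be checked by validation.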
#### 4) Fill missing values with -1
```python
X_data = X_data.fillna(-1)
X_test = X_test.fillna(-1)
```
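Among the 18 selected features, `gearbox` is the only one with missing values according to `.info()`, and `.describe()` shows its observed values lie between 0 and 1. Filling with -1 therefore acts as a separate "missing" marker that does not collide with real values. A small sketch on a toy series (the values are made up):

```python
import numpy as np
import pandas as pd

# Toy gearbox column; its observed range in describe() above is 0 to 1
gearbox = pd.Series([0.0, 1.0, np.nan, 0.0, np.nan], name='gearbox')

filled = gearbox.fillna(-1)

# -1 does not collide with any observed value, so "missing" stays distinguishable
print(filled.value_counts().sort_index())
```

Note that this reasoning would not carry over to columns such as `v_10` to `v_14`, whose observed ranges include -1; they simply have no missing values here.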
### Step 4: Model training and prediction
#### 1) Use 5-fold cross-validation with xgb to check how the model parameters perform
```python
## xgb-Model
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) #,objective ='reg:squarederror'
scores_train = []
scores = []
## 5-fold cross-validation
sk=StratifiedKFold(n_splits=5,shuffle=True,random_state=0)
for train_ind,val_ind in sk.split(X_data,Y_data):
    train_x=X_data.iloc[train_ind].values
    train_y=Y_data.iloc[train_ind]
    val_x=X_data.iloc[val_ind].values
    val_y=Y_data.iloc[val_ind]
    xgr.fit(train_x,train_y)
    pred_train_xgb=xgr.predict(train_x)
    pred_xgb=xgr.predict(val_x)
    score_train = mean_absolute_error(train_y,pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y,pred_xgb)
    scores.append(score)
## Report the MAE averaged over the five folds
print('Train mae:',np.mean(scores_train))
print('Val mae',np.mean(scores))
```
    Train mae: 628.086664863
    Val mae 715.990013454
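
`StratifiedKFold` is designed around class labels; for a continuous target such as price, plain `KFold` is the more usual choice. The sketch below shows the same fold loop with `KFold`. To keep it self-contained it runs on synthetic data and uses scikit-learn's `GradientBoostingRegressor` as a stand-in for `xgb.XGBRegressor`; both substitutions are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data standing in for X_data / Y_data
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = 1000 * X[:, 0] + 500 * X[:, 1] + rng.normal(scale=10, size=300)

# Stand-in model; the baseline itself uses xgb.XGBRegressor
model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)

scores_train, scores_val = [], []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_ind, val_ind in kf.split(X):
    model.fit(X[train_ind], y[train_ind])
    scores_train.append(mean_absolute_error(y[train_ind], model.predict(X[train_ind])))
    scores_val.append(mean_absolute_error(y[val_ind], model.predict(X[val_ind])))

# Average over all folds
print('Train mae:', np.mean(scores_train))
print('Val mae:', np.mean(scores_val))
```

The gap between the training and validation MAE gives a rough indication of how much the model overfits.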
#### 2) Define the xgb and lgb model functions
```python
def build_model_xgb(x_train,y_train):
    model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) #, objective ='reg:squarederror'
    model.fit(x_train, y_train)
    return model
def build_model_lgb(x_train,y_train):
    estimator = lgb.LGBMRegressor(num_leaves=127,n_estimators = 150)
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
    }
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm
```
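When `GridSearchCV` is given no `scoring` argument, it ranks candidates with the estimator's own `score` method, which for scikit-learn style regressors is R². Since this notebook evaluates with MAE, one option is to pass `scoring='neg_mean_absolute_error'` so the search optimizes the same metric. A minimal sketch on synthetic data, with `GradientBoostingRegressor` standing in for `lgb.LGBMRegressor` (an assumption, to avoid extra dependencies):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for x_train / y_train
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = 100 * X[:, 0] + rng.normal(scale=1.0, size=200)

# Stand-in estimator; the baseline uses lgb.LGBMRegressor
estimator = GradientBoostingRegressor(n_estimators=30, random_state=0)
param_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.2]}

# Select the learning rate by cross-validated MAE (negated, so larger is better)
search = GridSearchCV(estimator, param_grid,
                      scoring='neg_mean_absolute_error', cv=3)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean validation MAE of the best setting
```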
#### 3) Split the dataset into Train and Val for model training, evaluation, and prediction
```python
## Split data with val
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)
```
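The split above passes no `random_state`, so each run draws a different 70/30 partition and the validation MAE will vary slightly between runs. Fixing the seed makes the split repeatable. A small sketch on toy arrays; the seed value 42 is arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same seed, same split: 70% train, 30% validation
split_a = train_test_split(X, y, test_size=0.3, random_state=42)
split_b = train_test_split(X, y, test_size=0.3, random_state=42)

x_train, x_val, y_train, y_val = split_a
print(x_train.shape, x_val.shape)               # (7, 2) (3, 2)
print(np.array_equal(split_a[3], split_b[3]))   # True
```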
```python
print('Train lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
MAE_lgb = mean_absolute_error(y_val,val_lgb)
print('MAE of val with lgb:',MAE_lgb)
print('Predict lgb...')
model_lgb_pre = build_model_lgb(X_data,Y_data)
subA_lgb = model_lgb_pre.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb)
```
    Train lgb...
    MAE of val with lgb: 689.084070621
    Predict lgb...
    Sta of Predict lgb:
    _min -519.150259864
    _max: 88575.1087721
    _mean 5922.98242599
    _ptp 89094.259032
    _std 7377.29714126
    _var 54424513.1104
```python
print('Train xgb...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
MAE_xgb = mean_absolute_error(y_val,val_xgb)
print('MAE of val with xgb:',MAE_xgb)
print('Predict xgb...')
model_xgb_pre = build_model_xgb(X_data,Y_data)
subA_xgb = model_xgb_pre.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb)
```
    Train xgb...
    MAE of val with xgb: 715.37757816
    Predict xgb...
    Sta of Predict xgb:
    _min -165.479
    _max: 90051.8
    _mean 5922.9
    _ptp 90217.3
    _std 7361.13
    _var 5.41862e+07
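#### 4) Weighted blending of the two models' results
```python
## Here we use a simple weighted average of the two models
val_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*val_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*val_xgb
val_Weighted[val_Weighted<0]=10 # Some predictions are negative, but a real price is never negative, so we apply a post-hoc correction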
print('MAE of val with Weighted ensemble:',mean_absolute_error(y_val,val_Weighted))
```
    MAE of val with Weighted ensemble: 687.275745703
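
The weights in the blend simplify nicely: `1 - MAE_lgb/(MAE_xgb+MAE_lgb)` equals `MAE_xgb/(MAE_xgb+MAE_lgb)`, so each model is weighted in proportion to the *other* model's error and the two weights sum to 1. The sketch below works this through with the validation MAEs printed above; `mae_weights` is a hypothetical helper name and the prediction arrays are made up.

```python
import numpy as np

def mae_weights(mae_a, mae_b):
    """Weights for two models, each proportional to the other model's MAE."""
    total = mae_a + mae_b
    return 1 - mae_a / total, 1 - mae_b / total

# Validation MAEs printed above
MAE_lgb, MAE_xgb = 689.084070621, 715.37757816
w_lgb, w_xgb = mae_weights(MAE_lgb, MAE_xgb)
print(w_lgb, w_xgb)   # roughly 0.509 and 0.491

# Toy predictions (made up) to show the blend and the non-negativity fix
pred_lgb = np.array([-519.0, 1000.0, 6000.0])
pred_xgb = np.array([-165.0, 1200.0, 5800.0])
blend = w_lgb * pred_lgb + w_xgb * pred_xgb
blend[blend < 0] = 10
print(blend)
```

Because the two MAEs are close, the weights are nearly equal, which is consistent with the blended MAE being only slightly better than lgb alone.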
```python
sub_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*subA_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*subA_xgb
## Inspect the distribution of the predicted values
plt.hist(Y_data)
plt.show()
plt.close()
```
![output_38_0](https://img-blog.csdnimg.cn/20200321232503927.png)
#### 5) Output the results
```python
sub = pd.DataFrame()
sub['SaleID'] = TestA_data['SaleID'] # X_test holds only the feature columns, so take SaleID from TestA_data
sub['price'] = sub_Weighted
sub.to_csv('./sub_Weighted.csv',index=False)
```
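In the notebook, negative blended predictions are replaced with 10 on the validation set, but `sub_Weighted` is written out without that step. A hedged sketch of building a submission frame with the same floor applied is shown below. The IDs and predictions are toy values (test `SaleID` starts at 150000 according to the `.describe()` output), and the file is written to an in-memory buffer instead of disk.

```python
import io
import numpy as np
import pandas as pd

# Toy stand-ins: IDs as they would come from TestA_data, predictions made up
sale_id = pd.Series([150000, 150001, 150002], name='SaleID')
pred = np.array([-35.2, 386.1, 7792.0])

sub = pd.DataFrame({
    'SaleID': sale_id.values,
    # Apply the same floor that was used on the validation blend
    'price': np.where(pred < 0, 10, pred),
})

# Write to an in-memory buffer here; a real run would pass a file path instead
buffer = io.StringIO()
sub.to_csv(buffer, index=False)
print(buffer.getvalue())
```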
```python
sub.head()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>SaleID</th>
<th>price</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>39533.727414</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>386.081960</td>
</tr>
<tr>
<th>2</th>
<td>2</td>
<td>7791.974571</td>
</tr>
<tr>
<th>3</th>
<td>3</td>
<td>11835.211966</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>585.420407</td>
</tr>
</tbody>
</table>
</div>
**Baseline END.**
--- By: ML67
Email: maolinw67@163.com
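PS: Graduate student at Huazhong University of Science and Technology, long active on Tianchi and similar platforms, and happy to exchange ideas with everyone.
github: https://github.com/mlw67 (will soon be organizing some book derivations and code)
--- By: AI蜗牛车
PS: Graduate student at Southeast University; research focuses on spatio-temporal sequence prediction and time-series data mining.
WeChat official account: AI蜗牛车
Zhihu: https://www.zhihu.com/people/seu-aigua-niu-che
github: https://github.com/chehongshu
--- By: 阿泽
PS: Computer science graduate student at Fudan University.
Zhihu: 阿泽 https://www.zhihu.com/people/is-aze (knowledge notes aimed mainly at beginners)
--- By: 小雨姑娘
PS: Data mining enthusiast who has placed at the top of competitions many times.
Zhihu column 小雨姑娘的机器学习笔记: https://zhuanlan.zhihu.com/mlbasic
**About Datawhale**
> Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and has gathered a team of members with an open-source and exploratory spirit. With the vision of "for the learner, growing together with learners", Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and support, and the courage to experiment and take responsibility. Datawhale also applies open-source principles to explore open-source content, open-source learning, and open-source solutions, empowering talent development, supporting talent growth, and building connections between people, between people and knowledge, between people and companies, and between people and the future.
This data mining learning path series will be shared on Tianchi. For details, follow Datawhale: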
![](http://jupter-oss.oss-cn-hangzhou.aliyuncs.com/public/files/image/2326541042/1584426326920_9FOUExG2be.jpg)