Files
2020-09-07 16:10:26 +08:00

975 lines
30 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# baseline
```python
import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')
```
```python
train = pd.read_csv('train.csv')
testA = pd.read_csv('testA.csv')
```
```python
train.head()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>id</th>
<th>loanAmnt</th>
<th>term</th>
<th>interestRate</th>
<th>installment</th>
<th>grade</th>
<th>subGrade</th>
<th>employmentTitle</th>
<th>employmentLength</th>
<th>homeOwnership</th>
<th>...</th>
<th>n5</th>
<th>n6</th>
<th>n7</th>
<th>n8</th>
<th>n9</th>
<th>n10</th>
<th>n11</th>
<th>n12</th>
<th>n13</th>
<th>n14</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>35000.0</td>
<td>5</td>
<td>19.52</td>
<td>917.97</td>
<td>E</td>
<td>E2</td>
<td>320.0</td>
<td>2 years</td>
<td>2</td>
<td>...</td>
<td>9.0</td>
<td>8.0</td>
<td>4.0</td>
<td>12.0</td>
<td>2.0</td>
<td>7.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>2.0</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>18000.0</td>
<td>5</td>
<td>18.49</td>
<td>461.90</td>
<td>D</td>
<td>D2</td>
<td>219843.0</td>
<td>5 years</td>
<td>0</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>13.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2</td>
<td>12000.0</td>
<td>5</td>
<td>16.99</td>
<td>298.17</td>
<td>D</td>
<td>D3</td>
<td>31698.0</td>
<td>8 years</td>
<td>0</td>
<td>...</td>
<td>0.0</td>
<td>21.0</td>
<td>4.0</td>
<td>5.0</td>
<td>3.0</td>
<td>11.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>4.0</td>
</tr>
<tr>
<th>3</th>
<td>3</td>
<td>11000.0</td>
<td>3</td>
<td>7.26</td>
<td>340.96</td>
<td>A</td>
<td>A4</td>
<td>46854.0</td>
<td>10+ years</td>
<td>1</td>
<td>...</td>
<td>16.0</td>
<td>4.0</td>
<td>7.0</td>
<td>21.0</td>
<td>6.0</td>
<td>9.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>3000.0</td>
<td>3</td>
<td>12.99</td>
<td>101.07</td>
<td>C</td>
<td>C2</td>
<td>54.0</td>
<td>NaN</td>
<td>1</td>
<td>...</td>
<td>4.0</td>
<td>9.0</td>
<td>10.0</td>
<td>15.0</td>
<td>7.0</td>
<td>12.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>4.0</td>
</tr>
</tbody>
</table>
<p>5 rows × 47 columns</p>
</div>
```python
data = pd.concat([train, testA], axis=0, ignore_index=True)
```
## 数据预处理
- 可以看到很多变量不能直接训练比如grade、subGrade、employmentLength、issueDate、earliesCreditLine需要进行预处理
```python
print(sorted(data['grade'].unique()))
print(sorted(data['subGrade'].unique()))
```
['A', 'B', 'C', 'D', 'E', 'F', 'G']
['A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4', 'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4', 'F5', 'G1', 'G2', 'G3', 'G4', 'G5']
```python
data['employmentLength'].value_counts(dropna=False).sort_index()
```
1 year 65671
10+ years 328525
2 years 90565
3 years 80163
4 years 59818
5 years 62645
6 years 46582
7 years 44230
8 years 45168
9 years 37866
< 1 year 80226
NaN 58541
Name: employmentLength, dtype: int64
- 首先对employmentLength进行转换到数值
```python
data['employmentLength'].replace(to_replace='10+ years', value='10 years', inplace=True)
data['employmentLength'].replace('< 1 year', '0 years', inplace=True)
def employmentLength_to_int(s):
if pd.isnull(s):
return s
else:
return np.int8(s.split()[0])
data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)
```
```python
data['employmentLength'].value_counts(dropna=False).sort_index()
```
0.0 80226
1.0 65671
2.0 90565
3.0 80163
4.0 59818
5.0 62645
6.0 46582
7.0 44230
8.0 45168
9.0 37866
10.0 328525
NaN 58541
Name: employmentLength, dtype: int64
- 对earliesCreditLine进行预处理
```python
data['earliesCreditLine'].sample(5)
```
375743 Jun-2003
361340 Jul-1999
716602 Aug-1995
893559 Oct-1982
221525 Nov-2004
Name: earliesCreditLine, dtype: object
```python
data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]))
```
```python
data['earliesCreditLine'].describe()
```
count 1000000.000000
mean 1998.688632
std 7.606231
min 1944.000000
25% 1995.000000
50% 2000.000000
75% 2004.000000
max 2015.000000
Name: earliesCreditLine, dtype: float64
```python
data.head()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>id</th>
<th>loanAmnt</th>
<th>term</th>
<th>interestRate</th>
<th>installment</th>
<th>grade</th>
<th>subGrade</th>
<th>employmentTitle</th>
<th>employmentLength</th>
<th>homeOwnership</th>
<th>...</th>
<th>n7</th>
<th>n8</th>
<th>n9</th>
<th>n10</th>
<th>n11</th>
<th>n12</th>
<th>n13</th>
<th>n14</th>
<th>n2.2</th>
<th>n2.3</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>35000.0</td>
<td>5</td>
<td>19.52</td>
<td>917.97</td>
<td>E</td>
<td>E2</td>
<td>320.0</td>
<td>2.0</td>
<td>2</td>
<td>...</td>
<td>4.0</td>
<td>12.0</td>
<td>2.0</td>
<td>7.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>2.0</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>18000.0</td>
<td>5</td>
<td>18.49</td>
<td>461.90</td>
<td>D</td>
<td>D2</td>
<td>219843.0</td>
<td>5.0</td>
<td>0</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>13.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2</td>
<td>12000.0</td>
<td>5</td>
<td>16.99</td>
<td>298.17</td>
<td>D</td>
<td>D3</td>
<td>31698.0</td>
<td>8.0</td>
<td>0</td>
<td>...</td>
<td>4.0</td>
<td>5.0</td>
<td>3.0</td>
<td>11.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>4.0</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>3</td>
<td>11000.0</td>
<td>3</td>
<td>7.26</td>
<td>340.96</td>
<td>A</td>
<td>A4</td>
<td>46854.0</td>
<td>10.0</td>
<td>1</td>
<td>...</td>
<td>7.0</td>
<td>21.0</td>
<td>6.0</td>
<td>9.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>3000.0</td>
<td>3</td>
<td>12.99</td>
<td>101.07</td>
<td>C</td>
<td>C2</td>
<td>54.0</td>
<td>NaN</td>
<td>1</td>
<td>...</td>
<td>10.0</td>
<td>15.0</td>
<td>7.0</td>
<td>12.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>4.0</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>5 rows × 49 columns</p>
</div>
- 类别特征处理
```python
# 部分类别特征
cate_features = ['grade', 'subGrade', 'employmentTitle', 'homeOwnership', 'verificationStatus', 'purpose', 'postCode', 'regionCode', \
'applicationType', 'initialListStatus', 'title', 'policyCode']
for f in cate_features:
print(f, '类型数:', data[f].nunique())
```
grade 类型数: 7
subGrade 类型数: 35
employmentTitle 类型数: 298101
homeOwnership 类型数: 6
verificationStatus 类型数: 3
purpose 类型数: 14
postCode 类型数: 935
regionCode 类型数: 51
applicationType 类型数: 2
initialListStatus 类型数: 2
title 类型数: 47903
policyCode 类型数: 1
```python
# 类型数在2之上又不是高维稀疏的
data = pd.get_dummies(data, columns=['grade', 'subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)
```
```python
# 高维类别特征需要进行转换
for f in ['employmentTitle', 'postCode', 'title']:
data[f+'_cnts'] = data.groupby([f])['id'].transform('count')
data[f+'_rank'] = data.groupby([f])['id'].rank(ascending=False).astype(int)
del data[f]
```
## 训练数据/测试数据准备
```python
features = [f for f in data.columns if f not in ['id','issueDate','isDefault']]
train = data[data.isDefault.notnull()].reset_index(drop=True)
test = data[data.isDefault.isnull()].reset_index(drop=True)
x_train = train[features]
x_test = test[features]
y_train = train['isDefault']
```
## 模型训练
- 直接构建了一个函数,可以调用三种树模型,方便快捷
```python
def cv_model(clf, train_x, train_y, test_x, clf_name):
folds = 5
seed = 2020
kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
train = np.zeros(train_x.shape[0])
test = np.zeros(test_x.shape[0])
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
print('************************************ {} ************************************'.format(str(i+1)))
trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
if clf_name == "lgb":
train_matrix = clf.Dataset(trn_x, label=trn_y)
valid_matrix = clf.Dataset(val_x, label=val_y)
params = {
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': 'auc',
'min_child_weight': 5,
'num_leaves': 2 ** 5,
'lambda_l2': 10,
'feature_fraction': 0.8,
'bagging_fraction': 0.8,
'bagging_freq': 4,
'learning_rate': 0.1,
'seed': 2020,
'nthread': 28,
'n_jobs':24,
'silent': True,
'verbose': -1,
}
model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200,early_stopping_rounds=200)
val_pred = model.predict(val_x, num_iteration=model.best_iteration)
test_pred = model.predict(test_x, num_iteration=model.best_iteration)
# print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])
if clf_name == "xgb":
train_matrix = clf.DMatrix(trn_x , label=trn_y)
valid_matrix = clf.DMatrix(val_x , label=val_y)
test_matrix = clf.DMatrix(test_x)
params = {'booster': 'gbtree',
'objective': 'binary:logistic',
'eval_metric': 'auc',
'gamma': 1,
'min_child_weight': 1.5,
'max_depth': 5,
'lambda': 10,
'subsample': 0.7,
'colsample_bytree': 0.7,
'colsample_bylevel': 0.7,
'eta': 0.04,
'tree_method': 'exact',
'seed': 2020,
'nthread': 36,
"silent": True,
}
watchlist = [(train_matrix, 'train'),(valid_matrix, 'eval')]
model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=200, early_stopping_rounds=200)
val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
test_pred = model.predict(test_matrix , ntree_limit=model.best_ntree_limit)
if clf_name == "cat":
params = {'learning_rate': 0.05, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}
model = clf(iterations=20000, **params)
model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
cat_features=[], use_best_model=True, verbose=500)
val_pred = model.predict(val_x)
test_pred = model.predict(test_x)
train[valid_index] = val_pred
test = test_pred / kf.n_splits
cv_scores.append(roc_auc_score(val_y, val_pred))
print(cv_scores)
print("%s_scotrainre_list:" % clf_name, cv_scores)
print("%s_score_mean:" % clf_name, np.mean(cv_scores))
print("%s_score_std:" % clf_name, np.std(cv_scores))
return train, test
```
```python
def lgb_model(x_train, y_train, x_test):
lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
return lgb_train, lgb_test
def xgb_model(x_train, y_train, x_test):
xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
return xgb_train, xgb_test
def cat_model(x_train, y_train, x_test):
cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
return cat_train, cat_test
```
```python
lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)
```
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.742884 valid_1's auc: 0.73055
[400] training's auc: 0.755686 valid_1's auc: 0.731888
[600] training's auc: 0.766421 valid_1's auc: 0.731988
[800] training's auc: 0.776244 valid_1's auc: 0.731868
Early stopping, best iteration is:
[656] training's auc: 0.769146 valid_1's auc: 0.732081
[0.7320814878889421]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.74372 valid_1's auc: 0.726466
[400] training's auc: 0.756459 valid_1's auc: 0.727727
[600] training's auc: 0.767156 valid_1's auc: 0.727776
Early stopping, best iteration is:
[520] training's auc: 0.762985 valid_1's auc: 0.727902
[0.7320814878889421, 0.7279015876934286]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.742884 valid_1's auc: 0.731466
[400] training's auc: 0.755466 valid_1's auc: 0.732748
[600] training's auc: 0.766313 valid_1's auc: 0.733069
[800] training's auc: 0.776349 valid_1's auc: 0.732892
Early stopping, best iteration is:
[694] training's auc: 0.771133 valid_1's auc: 0.73312
[0.7320814878889421, 0.7279015876934286, 0.7331203287449972]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.742632 valid_1's auc: 0.730114
[400] training's auc: 0.755357 valid_1's auc: 0.731443
[600] training's auc: 0.765983 valid_1's auc: 0.731566
[800] training's auc: 0.776112 valid_1's auc: 0.731805
Early stopping, best iteration is:
[706] training's auc: 0.771324 valid_1's auc: 0.731887
[0.7320814878889421, 0.7279015876934286, 0.7331203287449972, 0.731886588682118]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds.
[200] training's auc: 0.743113 valid_1's auc: 0.729226
[400] training's auc: 0.7559 valid_1's auc: 0.730816
[600] training's auc: 0.766388 valid_1's auc: 0.73092
[800] training's auc: 0.77627 valid_1's auc: 0.731029
[1000] training's auc: 0.785791 valid_1's auc: 0.730933
Early stopping, best iteration is:
[883] training's auc: 0.780369 valid_1's auc: 0.731096
[0.7320814878889421, 0.7279015876934286, 0.7331203287449972, 0.731886588682118, 0.7310960057774112]
lgb_scotrainre_list: [0.7320814878889421, 0.7279015876934286, 0.7331203287449972, 0.731886588682118, 0.7310960057774112]
lgb_score_mean: 0.7312171997573793
lgb_score_std: 0.001779041696522632
```python
xgb_train, xgb_test = xgb_model(x_train, y_train, x_test)
```
************************************ 1 ************************************
[0] train-auc:0.677293 eval-auc:0.678869
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[200] train-auc:0.727527 eval-auc:0.723771
[400] train-auc:0.73516 eval-auc:0.727725
[600] train-auc:0.740458 eval-auc:0.729631
[800] train-auc:0.744963 eval-auc:0.730829
[1000] train-auc:0.748802 eval-auc:0.731495
[1200] train-auc:0.752295 eval-auc:0.732074
[1400] train-auc:0.755574 eval-auc:0.732421
[1600] train-auc:0.758671 eval-auc:0.732674
[1800] train-auc:0.761605 eval-auc:0.732964
[2000] train-auc:0.764627 eval-auc:0.733111
[2200] train-auc:0.767443 eval-auc:0.733201
[2400] train-auc:0.770204 eval-auc:0.733224
Stopping. Best iteration:
[2328] train-auc:0.7692 eval-auc:0.733246
[0.7332460852050292]
************************************ 2 ************************************
[0] train-auc:0.677718 eval-auc:0.672523
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[200] train-auc:0.728628 eval-auc:0.720255
[400] train-auc:0.736149 eval-auc:0.724308
[600] train-auc:0.741354 eval-auc:0.726443
[800] train-auc:0.745611 eval-auc:0.72746
[1000] train-auc:0.749627 eval-auc:0.728194
[1200] train-auc:0.753176 eval-auc:0.728711
[1400] train-auc:0.756476 eval-auc:0.72899
[1600] train-auc:0.759574 eval-auc:0.729224
[1800] train-auc:0.762608 eval-auc:0.729501
[2000] train-auc:0.765549 eval-auc:0.729627
[2200] train-auc:0.768304 eval-auc:0.729782
[2400] train-auc:0.771131 eval-auc:0.729922
[2600] train-auc:0.773769 eval-auc:0.729961
[2800] train-auc:0.776371 eval-auc:0.72999
Stopping. Best iteration:
[2697] train-auc:0.775119 eval-auc:0.730036
[0.7332460852050292, 0.7300358478747684]
************************************ 3 ************************************
[0] train-auc:0.676641 eval-auc:0.67765
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[200] train-auc:0.72757 eval-auc:0.724632
[400] train-auc:0.735185 eval-auc:0.728571
[600] train-auc:0.740671 eval-auc:0.73067
[800] train-auc:0.745049 eval-auc:0.731899
[1000] train-auc:0.748976 eval-auc:0.732787
[1200] train-auc:0.752383 eval-auc:0.73321
[1400] train-auc:0.75564 eval-auc:0.733548
[1600] train-auc:0.758796 eval-auc:0.733825
[1800] train-auc:0.761717 eval-auc:0.734007
[2000] train-auc:0.76459 eval-auc:0.734193
[2200] train-auc:0.767399 eval-auc:0.734261
[2400] train-auc:0.770174 eval-auc:0.734362
[2600] train-auc:0.772818 eval-auc:0.734369
[2800] train-auc:0.775568 eval-auc:0.734391
[3000] train-auc:0.777985 eval-auc:0.73444
[3200] train-auc:0.780514 eval-auc:0.734477
[3400] train-auc:0.782893 eval-auc:0.734427
Stopping. Best iteration:
[3207] train-auc:0.780621 eval-auc:0.734494
[0.7332460852050292, 0.7300358478747684, 0.7344942212088965]
************************************ 4 ************************************
[0] train-auc:0.677768 eval-auc:0.677179
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[200] train-auc:0.727614 eval-auc:0.72295
[400] train-auc:0.735165 eval-auc:0.726994
[600] train-auc:0.740498 eval-auc:0.729116
[800] train-auc:0.744884 eval-auc:0.730417
[1000] train-auc:0.748782 eval-auc:0.731318
[1200] train-auc:0.75225 eval-auc:0.731899
[1400] train-auc:0.755505 eval-auc:0.732295
[1600] train-auc:0.758618 eval-auc:0.732629
[1800] train-auc:0.76176 eval-auc:0.733046
[2000] train-auc:0.764736 eval-auc:0.733189
[2200] train-auc:0.767476 eval-auc:0.733276
[2400] train-auc:0.770154 eval-auc:0.733409
[2600] train-auc:0.772874 eval-auc:0.733469
[2800] train-auc:0.77541 eval-auc:0.733405
Stopping. Best iteration:
[2644] train-auc:0.773429 eval-auc:0.733488
[0.7332460852050292, 0.7300358478747684, 0.7344942212088965, 0.7334876284761012]
************************************ 5 ************************************
[0] train-auc:0.677768 eval-auc:0.676353
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[200] train-auc:0.728072 eval-auc:0.722913
[400] train-auc:0.735517 eval-auc:0.726582
[600] train-auc:0.740782 eval-auc:0.728449
[800] train-auc:0.745258 eval-auc:0.729653
[1000] train-auc:0.749185 eval-auc:0.730489
[1200] train-auc:0.752723 eval-auc:0.731038
[1400] train-auc:0.755985 eval-auc:0.731466
[1600] train-auc:0.759166 eval-auc:0.731758
[1800] train-auc:0.762205 eval-auc:0.73199
[2000] train-auc:0.765197 eval-auc:0.732145
[2200] train-auc:0.767976 eval-auc:0.732194
Stopping. Best iteration:
[2191] train-auc:0.767852 eval-auc:0.732213
[0.7332460852050292, 0.7300358478747684, 0.7344942212088965, 0.7334876284761012, 0.7322134048106561]
xgb_scotrainre_list: [0.7332460852050292, 0.7300358478747684, 0.7344942212088965, 0.7334876284761012, 0.7322134048106561]
xgb_score_mean: 0.7326954375150903
xgb_score_std: 0.0015147392354657807
```python
cat_train, cat_test = cat_model(x_train, y_train, x_test)
```
************************************ 1 ************************************
0: learn: 0.4415198 test: 0.4387088 best: 0.4387088 (0) total: 111ms remaining: 37m 6s
500: learn: 0.3772118 test: 0.3759665 best: 0.3759665 (500) total: 37.7s remaining: 24m 25s
1000: learn: 0.3756709 test: 0.3752058 best: 0.3752058 (1000) total: 1m 14s remaining: 23m 41s
1500: learn: 0.3745785 test: 0.3748423 best: 0.3748423 (1500) total: 1m 52s remaining: 23m 7s
2000: learn: 0.3736834 test: 0.3746564 best: 0.3746564 (2000) total: 2m 29s remaining: 22m 28s
2500: learn: 0.3728568 test: 0.3745180 best: 0.3745165 (2492) total: 3m 7s remaining: 21m 52s
3000: learn: 0.3720793 test: 0.3744201 best: 0.3744198 (2998) total: 3m 44s remaining: 21m 14s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.3744006318
bestIteration = 3086
Shrink model to first 3087 iterations.
[0.7326058985428212]
************************************ 2 ************************************
0: learn: 0.4406928 test: 0.4420714 best: 0.4420714 (0) total: 53.3ms remaining: 17m 46s
500: learn: 0.3765250 test: 0.3787287 best: 0.3787287 (500) total: 38.7s remaining: 25m 8s
1000: learn: 0.3749822 test: 0.3779503 best: 0.3779503 (998) total: 1m 16s remaining: 24m 18s
1500: learn: 0.3738772 test: 0.3775654 best: 0.3775654 (1500) total: 1m 54s remaining: 23m 34s
2000: learn: 0.3729354 test: 0.3773407 best: 0.3773401 (1999) total: 2m 33s remaining: 22m 56s
2500: learn: 0.3721077 test: 0.3771987 best: 0.3771971 (2496) total: 3m 10s remaining: 22m 15s
3000: learn: 0.3713621 test: 0.3771114 best: 0.3771114 (3000) total: 3m 49s remaining: 21m 37s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.3770400469
bestIteration = 3382
Shrink model to first 3383 iterations.
[0.7326058985428212, 0.7292909146788396]
************************************ 3 ************************************
0: learn: 0.4408230 test: 0.4418939 best: 0.4418939 (0) total: 59.1ms remaining: 19m 42s
500: learn: 0.3767851 test: 0.3776319 best: 0.3776319 (500) total: 40.4s remaining: 26m 12s
1000: learn: 0.3752331 test: 0.3768292 best: 0.3768292 (1000) total: 1m 20s remaining: 25m 19s
1500: learn: 0.3741550 test: 0.3764926 best: 0.3764926 (1500) total: 2m remaining: 24m 39s
2000: learn: 0.3732520 test: 0.3762840 best: 0.3762832 (1992) total: 2m 40s remaining: 24m 2s
2500: learn: 0.3724303 test: 0.3761303 best: 0.3761279 (2490) total: 3m 20s remaining: 23m 22s
3000: learn: 0.3716684 test: 0.3760402 best: 0.3760395 (2995) total: 4m remaining: 22m 42s
3500: learn: 0.3709308 test: 0.3759509 best: 0.3759502 (3495) total: 4m 40s remaining: 22m 2s
4000: learn: 0.3702269 test: 0.3759039 best: 0.3759027 (3993) total: 5m 20s remaining: 21m 20s
4500: learn: 0.3695477 test: 0.3758698 best: 0.3758663 (4459) total: 6m remaining: 20m 40s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.3758663409
bestIteration = 4459
Shrink model to first 4460 iterations.
[0.7326058985428212, 0.7292909146788396, 0.7341207611812285]
************************************ 4 ************************************
0: learn: 0.4408778 test: 0.4413264 best: 0.4413264 (0) total: 46.6ms remaining: 15m 32s
500: learn: 0.3768022 test: 0.3777678 best: 0.3777678 (500) total: 40.3s remaining: 26m 7s
1000: learn: 0.3753097 test: 0.3769403 best: 0.3769403 (1000) total: 1m 20s remaining: 25m 24s
1500: learn: 0.3742418 test: 0.3765698 best: 0.3765698 (1500) total: 2m remaining: 24m 41s
2000: learn: 0.3733478 test: 0.3763500 best: 0.3763496 (1998) total: 2m 40s remaining: 23m 59s
2500: learn: 0.3725263 test: 0.3762101 best: 0.3762093 (2488) total: 3m 20s remaining: 23m 19s
3000: learn: 0.3717486 test: 0.3760966 best: 0.3760966 (2999) total: 3m 59s remaining: 22m 36s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.3760182133
bestIteration = 3432
Shrink model to first 3433 iterations.
[0.7326058985428212, 0.7292909146788396, 0.7341207611812285, 0.7324483603137153]
************************************ 5 ************************************
0: learn: 0.4409876 test: 0.4409159 best: 0.4409159 (0) total: 52.3ms remaining: 17m 26s
500: learn: 0.3768055 test: 0.3776229 best: 0.3776229 (500) total: 38s remaining: 24m 38s
1000: learn: 0.3752600 test: 0.3768397 best: 0.3768397 (1000) total: 1m 15s remaining: 23m 57s
1500: learn: 0.3741843 test: 0.3764855 best: 0.3764855 (1500) total: 1m 53s remaining: 23m 16s
2000: learn: 0.3732691 test: 0.3762491 best: 0.3762490 (1998) total: 2m 31s remaining: 22m 40s
2500: learn: 0.3724407 test: 0.3761154 best: 0.3761154 (2500) total: 3m 9s remaining: 22m 5s
3000: learn: 0.3716764 test: 0.3760184 best: 0.3760184 (3000) total: 3m 47s remaining: 21m 26s
3500: learn: 0.3709545 test: 0.3759453 best: 0.3759453 (3500) total: 4m 24s remaining: 20m 47s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.3759421091
bestIteration = 3544
Shrink model to first 3545 iterations.
[0.7326058985428212, 0.7292909146788396, 0.7341207611812285, 0.7324483603137153, 0.7312334660628076]
cat_scotrainre_list: [0.7326058985428212, 0.7292909146788396, 0.7341207611812285, 0.7324483603137153, 0.7312334660628076]
cat_score_mean: 0.7319398801558824
cat_score_std: 0.001610863965629903
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-25-2e9bafef31e8> in <module>
----> 1 cat_train, cat_test = cat_model(x_train, y_train, x_test)
TypeError: 'NoneType' object is not iterable
```python
rh_test = lgb_test*0.5 + xgb_test*0.5
```
```python
testA['isDefault'] = rh_test
```
```python
testA[['id','isDefault']].to_csv('test_sub.csv', index=False)
```
```python
```