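This baseline scores 0.42557. That is a middling score, but it helps you get into the competition quickly; combined with methods from earlier, related competitions, it can be pushed to a good score.
Recommended reading: https://cloud.tencent.com/developer/article/1505672 (a complete solution write-up for the IJCAI-18 competition)
When starting a competition, it usually helps to look at a few related competitions first to get familiar with the standard methods.
This competition provides a large volume of ad-delivery data from the iFLYTEK AI Marketing Cloud. Participants build a model to estimate the probability that a user clicks an ad, that is, to predict the click probability given information about the ad, the media, the user, and the context. The task resembles earlier competitions, so their solutions are a useful reference.
The evaluation metric is logloss.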
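For reference, logloss for binary labels is the mean of -[y·log(p) + (1-y)·log(1-p)] over all samples, so confident wrong predictions are punished heavily. Below is a minimal pure-Python sketch of that formula; the function name `binary_logloss` and the clipping constant `eps` are illustrative choices, not part of the competition kit. The baseline itself relies on LightGBM's built-in `binary_logloss` metric.

```python
import math

def binary_logloss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip so that log(0) never occurs for over-confident predictions
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Predicting 0.5 everywhere costs ln(2) per sample
print(binary_logloss([1, 0, 0, 1], [0.5, 0.5, 0.5, 0.5]))
# A confident correct prediction costs far less than a confident wrong one
print(binary_logloss([1], [0.9]), binary_logloss([1], [0.1]))
```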
import numpy as np
import pandas as pd
import time
import datetime
import gc
# sklearn.cross_validation was removed in scikit-learn 0.20; StratifiedKFold now lives in model_selection
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score, log_loss
import lightgbm as lgb
# Load the data
train = pd.read_table('./data/round1_iflyad_train.txt')
test = pd.read_table('./data/round1_iflyad_test_feature.txt')
# Concatenate the training and test sets so both get identical preprocessing
data = pd.concat([train, test], axis=0, ignore_index=True)
# Fill missing values (test rows have no label, so their 'click' becomes -1)
data['make'] = data['make'].fillna(str(-1))
data['model'] = data['model'].fillna(str(-1))
data['osv'] = data['osv'].fillna(str(-1))
data['app_cate_id'] = data['app_cate_id'].fillna(-1)
data['app_id'] = data['app_id'].fillna(-1)
data['click'] = data['click'].fillna(-1)
data['user_tags'] = data['user_tags'].fillna(str(-1))
data['f_channel'] = data['f_channel'].fillna(str(-1))
# Convert boolean flag columns to 0/1
replace = ['creative_is_jump', 'creative_is_download', 'creative_is_js', 'creative_is_voicead', 'creative_has_deeplink', 'app_paid']
for feat in replace:
    data[feat] = data[feat].replace([False, True], [0, 1])
# Label-encode the categorical columns into integer codes
encoder = ['city', 'province', 'make', 'model', 'osv', 'os_name', 'adid', 'advert_id', 'orderid',
           'advert_industry_inner', 'campaign_id', 'creative_id', 'app_cate_id',
           'app_id', 'inner_slot_id', 'advert_name', 'f_channel', 'creative_tp_dnf']
col_encoder = LabelEncoder()
for feat in encoder:
    col_encoder.fit(data[feat])
    data[feat] = col_encoder.transform(data[feat])
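This part does the data cleaning and basic type conversion, all fairly routine. Missing values are simply replaced with -1; I rarely fill with the mean or the mode, because in my view a missing value is itself a signal and deserves to be its own category. String columns are converted to integer codes.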
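To make the "missing value as its own category" idea concrete, here is a small pure-Python sketch of what filling with a sentinel and then label-encoding does. The helper `label_encode` and the sample `makes` list are hypothetical and only for illustration; the baseline uses `fillna` and scikit-learn's `LabelEncoder`, which likewise assigns codes following the sorted order of the distinct values.

```python
def label_encode(values, missing_token="-1"):
    """Map each distinct value to an integer code, treating None as its own category."""
    filled = [missing_token if v is None else str(v) for v in values]
    # Sorted unique values give a deterministic code for every category
    classes = sorted(set(filled))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[v] for v in filled], classes

makes = ["apple", None, "huawei", "apple", None]
codes, classes = label_encode(makes)
print(classes)  # ['-1', 'apple', 'huawei']
print(codes)    # [1, 0, 2, 1, 0]
```

The missing entries end up sharing code 0, so a tree model can split on "missing vs. not missing" just like on any other category.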
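# Extract day-of-month and hour from the Unix timestamp (interpreted in the local time zone)
data['day'] = data['time'].apply(lambda x : int(time.strftime("%d", time.localtime(x))))
data['hour'] = data['time'].apply(lambda x : int(time.strftime("%H", time.localtime(x))))
# Historical click-through rate
# Build a continuous period index: days before the 27th are shifted by 31 so they sort after days 27 to 31
data['period'] = data['day']
data.loc[data['period'] < 27, 'period'] += 31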
for feat_1 in ['advert_id', 'advert_industry_inner', 'advert_name', 'campaign_id', 'creative_height',
               'creative_tp_dnf', 'creative_width', 'province', 'f_channel']:
    gc.collect()
    frames = []
    temp = data[[feat_1, 'period', 'click']]
    for period in range(27, 35):
        if period == 27:
            # The first period has no earlier days, so it uses its own day's clicks
            count = temp.groupby([feat_1]).apply(lambda x: x['click'][(x['period'] <= period).values].count()).reset_index(name=feat_1 + '_all')
            count1 = temp.groupby([feat_1]).apply(lambda x: x['click'][(x['period'] <= period).values].sum()).reset_index(name=feat_1 + '_1')
        else:
            # Later periods only use strictly earlier days
            count = temp.groupby([feat_1]).apply(lambda x: x['click'][(x['period'] < period).values].count()).reset_index(name=feat_1 + '_all')
            count1 = temp.groupby([feat_1]).apply(lambda x: x['click'][(x['period'] < period).values].sum()).reset_index(name=feat_1 + '_1')
        count[feat_1 + '_1'] = count1[feat_1 + '_1']
        count.fillna(value=0, inplace=True)
        count[feat_1 + '_rate'] = round(count[feat_1 + '_1'] / count[feat_1 + '_all'], 5)
        count['period'] = period
        count.drop([feat_1 + '_all', feat_1 + '_1'], axis=1, inplace=True)
        count.fillna(value=0, inplace=True)
        frames.append(count)
    # DataFrame.append was removed in pandas 2.0, so the per-period frames are concatenated once
    res = pd.concat(frames, ignore_index=True)
    print(feat_1, ' over')
    data = pd.merge(data, res, how='left', on=[feat_1, 'period'])
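This is also basic feature construction: the day and hour features are extracted, and historical click-through-rate features are built to capture how the click rate differs across the values of each feature. More combined (crossed) click-rate features are worth trying. Click counts, and the share of clicks that fall into each category, can also be used to capture user preference.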
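The core of the historical click-rate feature is that the rate for a given period is computed only from earlier periods, so a row never sees its own label. A minimal pure-Python sketch of that idea follows; the function `historical_ctr` and the toy `rows` are hypothetical, and explicitly skipping unlabeled rows (`click == -1`) is an addition of this sketch rather than something the loop above does.

```python
from collections import defaultdict

def historical_ctr(rows, key, period):
    """Click rate per `key` value, using only rows from periods before `period`."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for row in rows:
        # Skip current/future periods and unlabeled rows (click == -1)
        if row["period"] >= period or row["click"] < 0:
            continue
        shown[row[key]] += 1
        clicked[row[key]] += row["click"]
    return {k: round(clicked[k] / shown[k], 5) for k in shown}

rows = [
    {"advert_id": "a", "period": 27, "click": 1},
    {"advert_id": "a", "period": 27, "click": 0},
    {"advert_id": "a", "period": 28, "click": 0},
    {"advert_id": "b", "period": 28, "click": 1},
    {"advert_id": "a", "period": 29, "click": -1},  # unlabeled test row
]
print(historical_ctr(rows, "advert_id", 28))  # {'a': 0.5}
print(historical_ctr(rows, "advert_id", 29))  # {'a': 0.33333, 'b': 1.0}
```

Values that never appeared before a period (such as `b` at period 28) get no rate at all; in the baseline the left merge leaves them as missing values in the rate column.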
# Drop columns that are not used as features
drop = ['click', 'time', 'instance_id', 'user_tags',
        'app_paid', 'creative_is_js', 'creative_is_voicead']
# Split the combined frame back into training and test rows
train = data[:train.shape[0]]
test = data[train.shape[0]:]
y_train = train.loc[:, 'click']
# Copy so that prediction columns can be added later without a SettingWithCopyWarning
res = test.loc[:, ['instance_id']].copy()
train = train.drop(drop, axis=1)
print('train:', train.shape)
test = test.drop(drop, axis=1)
print('test:', test.shape)
X_loc_train = train.values
y_loc_train = y_train.values
X_loc_test = test.values
# Model (the `silent` argument is omitted because recent LightGBM releases no longer accept it)
model = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=48, max_depth=-1, learning_rate=0.05, n_estimators=2000,
                           max_bin=425, subsample_for_bin=50000, objective='binary', min_split_gain=0,
                           min_child_weight=5, min_child_samples=10, subsample=0.8, subsample_freq=1,
                           colsample_bytree=1, reg_alpha=3, reg_lambda=5, seed=1000, n_jobs=10)
# Five-fold stratified cross-validation: one model per fold
# (uses the sklearn.model_selection API: n_splits plus .split(X, y))
skf = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=1024).split(X_loc_train, y_loc_train))
baseloss = []
loss = 0
for i, (train_index, test_index) in enumerate(skf):
    print("Fold", i)
    # Early stopping is passed as a callback; the early_stopping_rounds fit argument was removed in LightGBM 4.0
    lgb_model = model.fit(X_loc_train[train_index], y_loc_train[train_index],
                          eval_names=['train', 'valid'],
                          eval_metric='logloss',
                          eval_set=[(X_loc_train[train_index], y_loc_train[train_index]),
                                    (X_loc_train[test_index], y_loc_train[test_index])],
                          callbacks=[lgb.early_stopping(100)])
    baseloss.append(lgb_model.best_score_['valid']['binary_logloss'])
    loss += lgb_model.best_score_['valid']['binary_logloss']
    # Predict the test set with the best iteration of this fold's model
    test_pred = lgb_model.predict_proba(X_loc_test, num_iteration=lgb_model.best_iteration_)[:, 1]
    print('test mean:', test_pred.mean())
    res['prob_%s' % str(i)] = test_pred
print('logloss:', baseloss, loss / 5)
# Average the five fold predictions (equal weights)
res['predicted_score'] = 0
for i in range(5):
    res['predicted_score'] += res['prob_%s' % str(i)]
res['predicted_score'] = res['predicted_score'] / 5
# Write the submission file
mean = res['predicted_score'].mean()
print('mean:',mean)
now = datetime.datetime.now()
now = now.strftime('%m-%d-%H-%M')
res[['instance_id', 'predicted_score']].to_csv("./result_sub/lgb_baseline_%s.csv" % now, index=False)
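The model part is fairly simple: vary the number of folds, the parameters, the features, and the samples (via the random seed), then take a weighted average of the resulting predictions to get the final score. So far I have tried 5 folds and 10 folds, and 5 folds works better. In addition, training on the full dataset improves logloss by a further 3 to 5 ten-thousandths (0.0003 to 0.0005).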
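The blending step can be sketched in a few lines of plain Python. The function `blend` and the example weights are hypothetical; the baseline code above uses equal weights over its five folds, and non-uniform weights are just one way to combine models trained with different seeds or parameters.

```python
def blend(fold_preds, weights=None):
    """Weighted average of per-model probability lists; equal weights by default."""
    if weights is None:
        weights = [1.0] * len(fold_preds)
    total = sum(weights)
    n_rows = len(fold_preds[0])
    return [
        sum(w * preds[row] for w, preds in zip(weights, fold_preds)) / total
        for row in range(n_rows)
    ]

# Three models, two test rows each
fold_preds = [[0.10, 0.80], [0.20, 0.60], [0.30, 0.70]]
print(blend(fold_preds))                     # plain mean per row
print(blend(fold_preds, weights=[2, 1, 1]))  # first model counts double
```

Dividing by the sum of the weights keeps every blended value a valid probability, whatever scale the weights are on.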
I will keep refining and extending this write-up. Enjoy the competition, everyone.