
100,000 RMB in prize money and 1.9 million real Q&A records: the 智源·看山杯 challenge invites you to find the most fitting "谢邀" (expert answerer) for 1 million questions

Produced by Big Data Digest (大数据文摘)

Knowledge-sharing services have become one of the most important and popular types of application on the global internet. In knowledge-sharing and Q&A communities, the number of questions far exceeds the number of high-quality answers. How to connect knowledge, experts, and users, and how to increase experts' willingness to answer, has therefore become the central problem for such services.

The 智源·看山杯 Expert Finding Challenge, jointly organized by BAAI (智源) and Zhihu, is now in full swing. The competition runs from August 28 to November 28, and the task is to find, for 1 million questions, the experts best suited to answer them based on their interests and expertise.

For the competition, Zhihu has provided 100,000 topics, 1.8 million questions with 4.75 million answers, 1.9 million anonymized user profiles with answering histories, and 10 million invitation records.

Building features in three directions: question, user, and question-user interaction

Below, contestant CChan, a graduate student at South China University of Technology, shares his model and code.

The model builds features in three directions: question features, user features, and question-user interaction features. It uses a CatBoost model with 5-fold cross-validation, trained on an RTX 2080 Ti GPU.

The final offline and online scores fall between 0.70 and 0.71.

Feature description

User features

'gender', 'freq', 'A1', 'B1', 'C1', ...: raw user features
'num_atten_topic', 'num_interest_topic': number of topics the user follows / is interested in
'most_interest_topic': the topic with the user's highest interest value
'min_interest_values', 'max...', 'std...', 'mean...': statistics of the user's topic-interest values

Question features

'num_title_sw', 'num_title_w': title word counts (single-character tokens and word tokens)
'num_desc_sw', 'num_desc_w': description word counts (single-character tokens and word tokens)
'num_qtopic': number of topics bound to the question

User-question interaction features

'num_topic_attent_intersection': size of the intersection between the user's followed topics and the question's topics
'num_topic_interest_intersection': size of the intersection between the user's interest topics and the question's topics
'min_topic_interest...', 'max...', 'std...', 'mean...': statistics of the interest values over the intersecting topics

Code and notes

preprocess

Data preprocessing: parsing the list-valued fields, re-encoding ids, and saving the results as pickle files. Running time: 1388 s; peak memory usage: about 30% of 125 GB.

import pandas as pd
import numpy as np
import pickle
import gc
from tqdm import tqdm_notebook
import os
import time

tic = time.time()

# reduce memory usage
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

# parse list fields and re-encode ids
def parse_str(d):
    return np.array(list(map(float, d.split())))

def parse_list_1(d):
    if d == '-1':
        return [0]
    return list(map(lambda x: int(x[1:]), str(d).split(',')))

def parse_list_2(d):
    if d == '-1':
        return [0]
    return list(map(lambda x: int(x[2:]), str(d).split(',')))

def parse_map(d):
    if d == '-1':
        return {}
    return dict([int(z.split(':')[0][1:]), float(z.split(':')[1])] for z in d.split(','))

PATH = '../data'
SAVE_PATH = '../pkl'
if not os.path.exists(SAVE_PATH):
    print('create dir: %s' % SAVE_PATH)
    os.mkdir(SAVE_PATH)

single word:

single_word = pd.read_csv(os.path.join(PATH, 'single_word_vectors_64d.txt'),
                          names=['id', 'embed'], sep='\t')
single_word.head()

single_word['embed'] = single_word['embed'].apply(parse_str)
single_word['id'] = single_word['id'].apply(lambda x: int(x[2:]))
single_word.head()

with open('../pkl/single_word.pkl', 'wb') as file:
    pickle.dump(single_word, file)

del single_word
gc.collect()

word:

word = pd.read_csv(os.path.join(PATH, 'word_vectors_64d.txt'),
                   names=['id', 'embed'], sep='\t')
word.head()

word['embed'] = word['embed'].apply(parse_str)
word['id'] = word['id'].apply(lambda x: int(x[1:]))
word.head()

with open('../pkl/word.pkl', 'wb') as file:
    pickle.dump(word, file)

del word
gc.collect()

topic:

topic = pd.read_csv(os.path.join(PATH, 'topic_vectors_64d.txt'),
                    names=['id', 'embed'], sep='\t')
topic.head()

topic['embed'] = topic['embed'].apply(parse_str)
topic['id'] = topic['id'].apply(lambda x: int(x[1:]))
topic.head()

with open('../pkl/topic.pkl', 'wb') as file:
    pickle.dump(topic, file)

del topic
gc.collect()

invite:

invite_info = pd.read_csv(os.path.join(PATH, 'invite_info_0926.txt'),
                          names=['question_id', 'author_id', 'invite_time', 'label'], sep='\t')
invite_info_evaluate = pd.read_csv(os.path.join(PATH, 'invite_info_evaluate_1_0926.txt'),
                                   names=['question_id', 'author_id', 'invite_time'], sep='\t')
invite_info.head()

invite_info['invite_day'] = invite_info['invite_time'].apply(lambda x: int(x.split('-')[0][1:])).astype(np.int16)
invite_info['invite_hour'] = invite_info['invite_time'].apply(lambda x: int(x.split('-')[1][1:])).astype(np.int8)

invite_info_evaluate['invite_day'] = invite_info_evaluate['invite_time'].apply(lambda x: int(x.split('-')[0][1:])).astype(np.int16)
invite_info_evaluate['invite_hour'] = invite_info_evaluate['invite_time'].apply(lambda x: int(x.split('-')[1][1:])).astype(np.int8)

invite_info = reduce_mem_usage(invite_info)

with open('../pkl/invite_info.pkl', 'wb') as file:
    pickle.dump(invite_info, file)

with open('../pkl/invite_info_evaluate.pkl', 'wb') as file:
    pickle.dump(invite_info_evaluate, file)

del invite_info, invite_info_evaluate
gc.collect()

member:

member_info = pd.read_csv(os.path.join(PATH, 'member_info_0926.txt'),
                          names=['author_id', 'gender', 'keyword', 'grade', 'hotness', 'reg_type', 'reg_plat', 'freq',
                                 'A1', 'B1', 'C1', 'D1', 'E1', 'A2', 'B2', 'C2', 'D2', 'E2',
                                 'score', 'topic_attent', 'topic_interest'], sep='\t')
member_info.head()

member_info['topic_attent'] = member_info['topic_attent'].apply(parse_list_1)
member_info['topic_interest'] = member_info['topic_interest'].apply(parse_map)

member_info = reduce_mem_usage(member_info)

with open('../pkl/member_info.pkl', 'wb') as file:
    pickle.dump(member_info, file)

del member_info
gc.collect()

question:

question_info = pd.read_csv(os.path.join(PATH, 'question_info_0926.txt'),
                            names=['question_id', 'question_time', 'title_sw_series', 'title_w_series', 'desc_sw_series', 'desc_w_series', 'topic'], sep='\t')
question_info.head()

question_info['title_sw_series'] = question_info['title_sw_series'].apply(parse_list_2)
question_info['title_w_series'] = question_info['title_w_series'].apply(parse_list_1)
question_info['desc_sw_series'] = question_info['desc_sw_series'].apply(parse_list_2)
question_info['desc_w_series'] = question_info['desc_w_series'].apply(parse_list_1)
question_info['topic'] = question_info['topic'].apply(parse_list_1)
question_info.head()

question_info['question_day'] = question_info['question_time'].apply(lambda x: int(x.split('-')[0][1:])).astype(np.int16)
question_info['question_hour'] = question_info['question_time'].apply(lambda x: int(x.split('-')[1][1:])).astype(np.int8)
del question_info['question_time']
gc.collect()

question_info = reduce_mem_usage(question_info)

with open('../pkl/question_info.pkl', 'wb') as file:
    pickle.dump(question_info, file)

del question_info
gc.collect()

answer:

%%time
answer_info = pd.read_csv(os.path.join(PATH, 'answer_info_0926.txt'),
                          names=['answer_id', 'question_id', 'author_id', 'answer_time', 'content_sw_series', 'content_w_series',
                                 'excellent', 'recommend', 'round_table', 'figure', 'video',
                                 'num_word', 'num_like', 'num_unlike', 'num_comment',
                                 'num_favor', 'num_thank', 'num_report', 'num_nohelp', 'num_oppose'], sep='\t')
answer_info.head()

answer_info['content_sw_series'] = answer_info['content_sw_series'].apply(parse_list_2)
answer_info['content_w_series'] = answer_info['content_w_series'].apply(parse_list_1)
answer_info.head()

answer_info['answer_day'] = answer_info['answer_time'].apply(lambda x: int(x.split('-')[0][1:])).astype(np.int16)
answer_info['answer_hour'] = answer_info['answer_time'].apply(lambda x: int(x.split('-')[1][1:])).astype(np.int8)
del answer_info['answer_time']
gc.collect()

answer_info = reduce_mem_usage(answer_info)

with open('../pkl/answer_info.pkl', 'wb') as file:
    pickle.dump(answer_info, file)

del answer_info
gc.collect()

toc = time.time()
print('Used time: %d' % int(toc-tic))

gen_feat

Feature construction, as described in the feature tables above. Running time (32 cores): 1764 s; peak memory usage: about 20% of 125 GB.

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import pickle
import gc
import os
import time
import multiprocessing as mp

from sklearn.preprocessing import LabelEncoder

tic = time.time()

SAVE_PATH = './feats'
if not os.path.exists(SAVE_PATH):
    print('create dir: %s' % SAVE_PATH)
    os.mkdir(SAVE_PATH)

member_info: user features

with open('../pkl/member_info.pkl', 'rb') as file:
    member_info = pickle.load(file)
member_info.head(2)

# raw categorical features
member_cat_feats = ['gender', 'freq', 'A1', 'B1', 'C1', 'D1', 'E1', 'A2', 'B2', 'C2', 'D2', 'E2']
for feat in member_cat_feats:
    member_info[feat] = LabelEncoder().fit_transform(member_info[feat])

# number of topics the user follows / is interested in
member_info['num_atten_topic'] = member_info['topic_attent'].apply(len)
member_info['num_interest_topic'] = member_info['topic_interest'].apply(len)

def most_interest_topic(d):
    if len(d) == 0:
        return -1
    return list(d.keys())[np.argmax(list(d.values()))]

# topic with the user's highest interest value
member_info['most_interest_topic'] = member_info['topic_interest'].apply(most_interest_topic)
member_info['most_interest_topic'] = LabelEncoder().fit_transform(member_info['most_interest_topic'])

def get_interest_values(d):
    if len(d) == 0:
        return [0]
    return list(d.values())

# statistics of the user's topic-interest values
member_info['interest_values'] = member_info['topic_interest'].apply(get_interest_values)
member_info['min_interest_values'] = member_info['interest_values'].apply(np.min)
member_info['max_interest_values'] = member_info['interest_values'].apply(np.max)
member_info['mean_interest_values'] = member_info['interest_values'].apply(np.mean)
member_info['std_interest_values'] = member_info['interest_values'].apply(np.std)

# collect the feature columns
feats = ['author_id', 'gender', 'freq', 'A1', 'B1', 'C1', 'D1', 'E1', 'A2', 'B2', 'C2', 'D2', 'E2', 'score']
feats += ['num_atten_topic', 'num_interest_topic', 'most_interest_topic']
feats += ['min_interest_values', 'max_interest_values', 'mean_interest_values', 'std_interest_values']
member_feat = member_info[feats]

member_feat.head(3)

member_feat.to_hdf('./feats/member_feat.h5', key='data')

del member_feat, member_info
gc.collect()

question_info: question features

with open('../pkl/question_info.pkl', 'rb') as file:
    question_info = pickle.load(file)

question_info.head(2)

# word counts for title and description, topic count
question_info['num_title_sw'] = question_info['title_sw_series'].apply(len)
question_info['num_title_w'] = question_info['title_w_series'].apply(len)
question_info['num_desc_sw'] = question_info['desc_sw_series'].apply(len)
question_info['num_desc_w'] = question_info['desc_w_series'].apply(len)
question_info['num_qtopic'] = question_info['topic'].apply(len)

feats = ['question_id', 'num_title_sw', 'num_title_w', 'num_desc_sw', 'num_desc_w', 'num_qtopic', 'question_hour']
feats += []
question_feat = question_info[feats]

question_feat.head(3)

question_feat.to_hdf('./feats/question_feat.h5', key='data')

del question_info, question_feat
gc.collect()

member_info & question_info: user-question interaction features

with open('../pkl/invite_info.pkl', 'rb') as file:
    invite_info = pickle.load(file)
with open('../pkl/invite_info_evaluate.pkl', 'rb') as file:
    invite_info_evaluate = pickle.load(file)
with open('../pkl/member_info.pkl', 'rb') as file:
    member_info = pickle.load(file)
with open('../pkl/question_info.pkl', 'rb') as file:
    question_info = pickle.load(file)

# concatenate train and test invitations, build the author_id + question_id key
invite = pd.concat([invite_info, invite_info_evaluate])
invite_id = invite[['author_id', 'question_id']]
invite_id['author_question_id'] = invite_id['author_id'] + invite_id['question_id']
invite_id.drop_duplicates(subset='author_question_id', inplace=True)
invite_id_qm = invite_id.merge(member_info[['author_id', 'topic_attent', 'topic_interest']], 'left', 'author_id').merge(question_info[['question_id', 'topic']], 'left', 'question_id')
invite_id_qm.head(2)

Note: multiprocessing is used here to speed up the computation. The combination of Windows + multiprocessing + Jupyter can be buggy, so running this on Linux is recommended.
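If you still want to run it on Windows or from a standalone script, the usual workaround is to create the pool only under a __main__ guard. A minimal, self-contained sketch of that pattern (illustrative only, with toy data; it is not part of the original notebook):

import multiprocessing as mp
import pandas as pd

def process(df):
    # stand-in for one of the row-wise intersection functions defined below
    return df

if __name__ == '__main__':
    mp.freeze_support()  # only matters for frozen Windows executables
    chunks = [pd.DataFrame({'a': [1]}), pd.DataFrame({'a': [2]})]  # toy chunks; use split_df(...) in practice
    with mp.Pool() as pool:
        ret = pool.map(process, chunks)
    print(pd.concat(ret))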

# split the DataFrame into chunks for multiprocessing
def split_df(df, n):
    chunk_size = int(np.ceil(len(df) / n))
    return [df[i*chunk_size:(i+1)*chunk_size] for i in range(n)]

def gc_mp(pool, ret, chunk_list):
    del pool
    for r in ret:
        del r
    del ret
    for cl in chunk_list:
        del cl
    del chunk_list
    gc.collect()

# intersection of the user's followed topics and the question's topics
def process(df):
    return df.apply(lambda row: list(set(row['topic_attent']) & set(row['topic'])), axis=1)

pool = mp.Pool()
chunk_list = split_df(invite_id_qm, 100)
ret = pool.map(process, chunk_list)
invite_id_qm['topic_attent_intersection'] = pd.concat(ret)
gc_mp(pool, ret, chunk_list)

# intersection of the user's interest topics and the question's topics
def process(df):
    return df.apply(lambda row: list(set(row['topic_interest'].keys()) & set(row['topic'])), axis=1)

pool = mp.Pool()
chunk_list = split_df(invite_id_qm, 100)
ret = pool.map(process, chunk_list)
invite_id_qm['topic_interest_intersection'] = pd.concat(ret)
gc_mp(pool, ret, chunk_list)

# interest values over the intersecting topics
def process(df):
    return df.apply(lambda row: [row['topic_interest'][t] for t in row['topic_interest_intersection']], axis=1)

pool = mp.Pool()
chunk_list = split_df(invite_id_qm, 100)
ret = pool.map(process, chunk_list)
invite_id_qm['topic_interest_intersection_values'] = pd.concat(ret)
gc_mp(pool, ret, chunk_list)

# counts of intersecting topics
invite_id_qm['num_topic_attent_intersection'] = invite_id_qm['topic_attent_intersection'].apply(len)
invite_id_qm['num_topic_interest_intersection'] = invite_id_qm['topic_interest_intersection'].apply(len)

# statistics of the interest values over the intersection
invite_id_qm['topic_interest_intersection_values'] = invite_id_qm['topic_interest_intersection_values'].apply(lambda x: [0] if len(x) == 0 else x)
invite_id_qm['min_topic_interest_intersection_values'] = invite_id_qm['topic_interest_intersection_values'].apply(np.min)
invite_id_qm['max_topic_interest_intersection_values'] = invite_id_qm['topic_interest_intersection_values'].apply(np.max)
invite_id_qm['mean_topic_interest_intersection_values'] = invite_id_qm['topic_interest_intersection_values'].apply(np.mean)
invite_id_qm['std_topic_interest_intersection_values'] = invite_id_qm['topic_interest_intersection_values'].apply(np.std)

feats = ['author_question_id', 'num_topic_attent_intersection', 'num_topic_interest_intersection', 'min_topic_interest_intersection_values', 'max_topic_interest_intersection_values', 'mean_topic_interest_intersection_values', 'std_topic_interest_intersection_values']
feats += []
member_question_feat = invite_id_qm[feats]
member_question_feat.head(3)

member_question_feat.to_hdf('./feats/member_question_feat.h5', key='data')

del invite_id_qm, member_question_feat
gc.collect()

toc = time.time()
print('Used time: %d' % int(toc-tic))

baseline

Model training and prediction.

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import gc
import pickle
import time

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold

from catboost import CatBoostClassifier, Pool

tic = time.time()

with open('../pkl/invite_info.pkl', 'rb') as file:
    invite_info = pickle.load(file)
with open('../pkl/invite_info_evaluate.pkl', 'rb') as file:
    invite_info_evaluate = pickle.load(file)

member_feat = pd.read_hdf('./feats/member_feat.h5', key='data')  # 0.689438
question_feat = pd.read_hdf('./feats/question_feat.h5', key='data')  # 0.706848

member_question_feat = pd.read_hdf('./feats/member_question_feat.h5', key='data')  # 719116 d12
invite_info['author_question_id'] = invite_info['author_id'] + invite_info['question_id']
invite_info_evaluate['author_question_id'] = invite_info_evaluate['author_id'] + invite_info_evaluate['question_id']

train = invite_info.merge(member_feat, 'left', 'author_id')
test = invite_info_evaluate.merge(member_feat, 'left', 'author_id')

train = train.merge(question_feat, 'left', 'question_id')
test = test.merge(question_feat, 'left', 'question_id')

train = train.merge(member_question_feat, 'left', 'author_question_id')
test = test.merge(member_question_feat, 'left', 'author_question_id')

del member_feat, question_feat, member_question_feat
gc.collect()

drop_feats = ['question_id', 'author_id', 'author_question_id', 'invite_time', 'label', 'invite_day']

used_feats = [f for f in train.columns if f not in drop_feats]
print(len(used_feats))
print(used_feats)

train_x = train[used_feats].reset_index(drop=True)
train_y = train['label'].reset_index(drop=True)
test_x = test[used_feats].reset_index(drop=True)

preds = np.zeros((test_x.shape[0], 2))
scores = []
has_saved = False
imp = pd.DataFrame()
imp['feat'] = used_feats

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for index, (tr_idx, va_idx) in enumerate(kfold.split(train_x, train_y)):
    print('*' * 30)
    X_train, y_train, X_valid, y_valid = train_x.iloc[tr_idx], train_y.iloc[tr_idx], train_x.iloc[va_idx], train_y.iloc[va_idx]
    cate_features = []
    train_pool = Pool(X_train, y_train, cat_features=cate_features)
    eval_pool = Pool(X_valid, y_valid, cat_features=cate_features)
    if not has_saved:
        cbt_model = CatBoostClassifier(iterations=10000,
                           learning_rate=0.1,
                           eval_metric='AUC',
                           use_best_model=True,
                           random_seed=42,
                           logging_level='Verbose',
                           task_type='GPU',
                           devices='0',
                           early_stopping_rounds=300,
                           loss_function='Logloss',
                           depth=12,
                           )
        cbt_model.fit(train_pool, eval_set=eval_pool, verbose=100)
#         with open('./models/fold%d_cbt_v1.mdl' % index, 'wb') as file:
#             pickle.dump(cbt_model, file)
    else:
        with open('./models/fold%d_cbt_v1.mdl' % index, 'rb') as file:
            cbt_model = pickle.load(file)

    imp['score%d' % (index+1)] = cbt_model.feature_importances_

    score = cbt_model.best_score_['validation']['AUC']
    scores.append(score)
    print('fold %d round %d : score: %.6f | mean score %.6f' % (index+1, cbt_model.best_iteration_, score, np.mean(scores)))
    preds += cbt_model.predict_proba(test_x)

    del cbt_model, train_pool, eval_pool
    del X_train, y_train, X_valid, y_valid
    gc.collect()

#     mdls.append(cbt_model)

imp.sort_values(by='score1', ascending=False)

result = invite_info_evaluate[['question_id', 'author_id', 'invite_time']]
result['result'] = preds[:, 1] / 5
result.head()

result.to_csv('./result.txt', sep='\t', index=False, header=False)

toc = time.time()
print('Used time: %d' % int(toc - tic))

Exploring gender, salt value (盐值), and other features, then classifying with LightGBM

Code author: Liu Kang, a computer science master's student at Hefei University of Technology and second-place finisher in the second round of the 2018 Zhihu-CCIR competition.

This baseline explores features such as gender and salt value, then uses LightGBM for classification. It scores 0.76 and can be run on a laptop.

Task analysis

The simplest statement of the task: given a question Q recommended (invited) to user U, estimate the probability that user U will answer question Q.

Specifically, the competition provides question information (question_info_0926.txt, which can be used to build extra features for question Q), user profiles (member_info_0926.txt, for extra features on user U), and answer information (answer_info_0926.txt, which can be used to build extra features for both Q and U).

Data analysis

User and question datasets

import pandas as pd
user_info = pd.read_csv('member_info_0926.txt', header=None, sep='\t')
user_info.columns = ['用户id','性别','创作关键词','创作数量等级','创作热度等级','注册类型','注册平台','访问评率','用户二分类特征a','用户二分类特征b','用户二分类特征c','用户二分类特征d','用户二分类特征e','用户多分类特征a','用户多分类特征b','用户多分类特征c','用户多分类特征d','用户多分类特征e','盐值','关注话题','感兴趣话题']
for col in user_info.columns:
    print(col, len(user_info[col].unique()))

question_info = pd.read_csv('question_info_0926.txt', header=None, sep='\t')
question_info.columns = ['问题id','问题创建时间','问题标题单字编码','问题标题切词编码','问题描述单字编码','问题描述切词编码','问题绑定话题']
for col in question_info.columns:
    print(col, len(question_info[col].unique()))

The analysis above shows that the user dataset has 21 features, 5 of which (创作关键词, 创作数量等级, 创作热度等级, 注册类型, 注册平台, i.e. creation keywords, creation count level, creation hotness level, registration type, and registration platform) take only a single value in the dataset. These 5 features carry no information and can simply be dropped.
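A quick way to confirm which columns are constant (a small illustrative check, assuming user_info has been loaded and renamed as above; not part of the original baseline):

constant_cols = [col for col in user_info.columns if user_info[col].nunique() == 1]
print(constant_cols)  # expected to list the 5 single-valued feature columns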

Merging the datasets

To check whether the features in these two datasets are informative for the prediction target (in other words, whether they are discriminative, strong features), we first merge them with the training set (invite_info_0926.txt) and then examine some of the features with charts.

train = pd.read_csv('invite_info_0926.txt', header=None, sep='\t')
train.columns = ['问题id', '用户id', '邀请创建时间','是否回答']
train = pd.merge(train, user_info, how='left', on='用户id')
train = pd.merge(train, question_info, how='left', on='问题id')
print(train.columns)

Index(['问题id', '用户id', '邀请创建时间', '是否回答', '性别', '创作关键词', '创作数量等级', '创作热度等级', '注册类型', '注册平台', '访问评率', '用户二分类特征a', '用户二分类特征b', '用户二分类特征c', '用户二分类特征d', '用户二分类特征e', '用户多分类特征a', '用户多分类特征b', '用户多分类特征c', '用户多分类特征d', '用户多分类特征e', '盐值', '关注话题', '感兴趣话题', '问题创建时间', '问题标题单字编码', '问题标题切词编码', '问题描述单字编码', '问题描述切词编码', '问题绑定话题'], dtype='object')

Gender: the gender feature has three categories: male, female, and unknown. The bar chart below shows that the male and female distributions are very similar, while the unknown category is distributed quite differently, so this feature is clearly discriminative.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')

sns.countplot(x='性别', hue='是否回答', data=train)

Visit frequency: this feature has 5 categories in total ([weekly, monthly, daily, new, unknown]). The bar chart below shows that the categories have markedly different distributions, so this is clearly a highly discriminative feature.

sns.countplot(x='访问评率', hue='是否回答', data=train)

Binary user feature a: this is a binary feature, and the chart below shows that it is quite discriminative (the same holds for the remaining binary and multi-class features, so they are not repeated here).

sns.countplot(x='用户二分类特征a', hue='是否回答', data=train)

Salt value (盐值): we first bucket the salt value and then look at the distribution within each bucket. The chart below shows that users in different salt-value ranges behave quite differently. Whether to bucket this feature, and how, is left to your own more detailed analysis; the baseline below uses the feature as-is.

def trans(x):
    if x <= 0:
        return x
    if 1 <= x <= 10:
        return 1
    if 10 < x <= 100:
        return 2
    if 100 < x <= 200:
        return 3
    if 200 < x <= 300:
        return 4
    if 300 < x <= 500:  # the original skipped the 300-500 range at 400, which would return None
        return 5
    if 500 < x <= 600:
        return 6
    if x > 600:
        return 7
train['盐值'] = train['盐值'].apply(lambda x: trans(x))
sns.countplot(x='盐值', hue='是否回答', data=train)

Time: all timestamps in the dataset use the format "DX-HX", where D marks the day and H marks the hour. This single field needs to be split into two features: day and hour.
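For example, a hypothetical value such as 'D3848-H22' (the day and hour numbers here are made up for illustration) can be split like this:

invite_time = 'D3848-H22'  # hypothetical example value
day = int(invite_time.split('-')[0][1:])   # 3848
hour = int(invite_time.split('-')[1][1:])  # 22
print(day, hour)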

Data processing

Everything from this point on is the full baseline code (the preceding sections were only simple data analysis and are not part of the baseline).

import pandas as pd

# load data
user_info = pd.read_csv('member_info_0926.txt', header=None, sep='\t')
question_info = pd.read_csv('question_info_0926.txt', header=None, sep='\t')
train = pd.read_csv('train.txt', header=None, sep='\t')
test = pd.read_csv('test.txt', header=None, sep='\t')

user_info.columns = ['用户id','性别','创作关键词','创作数量等级','创作热度等级','注册类型','注册平台','访问评率','用户二分类特征a','用户二分类特征b','用户二分类特征c','用户二分类特征d','用户二分类特征e','用户多分类特征a','用户多分类特征b','用户多分类特征c','用户多分类特征d','用户多分类特征e','盐值','关注话题','感兴趣话题']
user_info = user_info.drop(['创作关键词','创作数量等级','创作热度等级','注册类型','注册平台'], axis=1)
question_info.columns = ['问题id','问题创建时间','问题标题单字编码','问题标题切词编码','问题描述单字编码','问题描述切词编码','问题绑定话题']

train.columns = ['问题id', '用户id', '邀请创建时间','是否回答']
train = pd.merge(train, user_info, how='left', on='用户id')
train = pd.merge(train, question_info, how='left', on='问题id')

test.columns = ['问题id', '用户id', '邀请创建时间']
test = pd.merge(test, user_info, how='left', on='用户id')
test = pd.merge(test, question_info, how='left', on='问题id')

# concatenate train and test
data = pd.concat([train, test], axis=0, sort=True)

# kept aside for writing the submission file
result_append = data[['问题id', '用户id', '邀请创建时间']][train.shape[0]:]

data['邀请创建时间-day'] = data['邀请创建时间'].apply(lambda x: x.split('-')[0].split('D')[1])
data['邀请创建时间-hour'] = data['邀请创建时间'].apply(lambda x: x.split('-')[1].split('H')[1])

data['问题创建时间-day'] = data['问题创建时间'].apply(lambda x: x.split('-')[0].split('D')[1])
data['问题创建时间-hour'] = data['问题创建时间'].apply(lambda x: x.split('-')[1].split('H')[1])

# The dropped features are not unimportant; on the contrary, they matter a lot and leave plenty of room
# for improvement. This baseline simply does not use them.
drop_feat = ['问题标题单字编码','问题标题切词编码','问题描述单字编码','问题描述切词编码','问题绑定话题', '关注话题','感兴趣话题','问题创建时间','邀请创建时间']
data = data.drop(drop_feat, axis=1)

print(data.columns)

Index(['性别', '是否回答', '用户id', '用户二分类特征a', '用户二分类特征b', '用户二分类特征c', '用户二分类特征d', '用户二分类特征e', '用户多分类特征a', '用户多分类特征b', '用户多分类特征c', '用户多分类特征d', '用户多分类特征e', '盐值', '访问评率', '问题id', '邀请创建时间-day', '邀请创建时间-hour', '问题创建时间-day', '问题创建时间-hour'], dtype='object')

Feature engineering

Encoding: discrete features are numerically encoded with LabelEncoder.

from sklearn.preprocessing import LabelEncoder
class_feat = ['用户id','问题id','性别', '访问评率','用户多分类特征a','用户多分类特征b','用户多分类特征c','用户多分类特征d','用户多分类特征e']
encoder = LabelEncoder()
for feat in class_feat:
    encoder.fit(data[feat])
    data[feat] = encoder.transform(data[feat])

Count features: single-feature counts are built for the discriminative features above (this gives a clear improvement).

for feat in ['用户id','问题id','性别', '访问评率','用户二分类特征a', '用户二分类特征b', '用户二分类特征c', '用户二分类特征d',
             '用户二分类特征e','用户多分类特征a','用户多分类特征b','用户多分类特征c','用户多分类特征d','用户多分类特征e']:
    col_name = '{}_count'.format(feat)
    data[col_name] = data[feat].map(data[feat].value_counts().astype(int))
    data.loc[data[col_name] < 2, feat] = -1
    data[feat] += 1
    data[col_name] = data[feat].map(data[feat].value_counts().astype(int))
    data[col_name] = (data[col_name] - data[col_name].min()) / (data[col_name].max() - data[col_name].min())

Model training and prediction

from lightgbm import LGBMClassifier

# split back into train and test
y_train = data[:train.shape[0]]['是否回答'].values
X_train = data[:train.shape[0]].drop(['是否回答'], axis=1).values

X_test = data[train.shape[0]:].drop(['是否回答'], axis=1).values

model_lgb = LGBMClassifier(boosting_type='gbdt', num_leaves=64, learning_rate=0.01, n_estimators=2000,
                           max_bin=425, subsample_for_bin=50000, objective='binary', min_split_gain=0,
                           min_child_weight=5, min_child_samples=10, subsample=0.8, subsample_freq=1,
                           colsample_bytree=1, reg_alpha=3, reg_lambda=5, seed=1000, n_jobs=-1, silent=True)
# Cross-validation is recommended for training and prediction.
model_lgb.fit(X_train, y_train,
              eval_names=['train'],
              eval_metric=['logloss','auc'],
              eval_set=[(X_train, y_train)],
              early_stopping_rounds=10)
y_pred = model_lgb.predict_proba(X_test)[:, 1]
result_append['是否回答'] = y_pred
result_append.to_csv('result.txt', index=False, header=False, sep='\t')

Compress the submission file result.txt and submit it; this yields a score of 0.763213863070066.
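The comment in the training code above recommends cross-validation rather than a single fit on the full training set. A minimal sketch of one reasonable 5-fold setup (it reuses X_train, y_train, and X_test from the baseline and averages the fold predictions; this is an illustration of the suggestion, not the author's code):

from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold
import numpy as np

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2019)
test_pred = np.zeros(X_test.shape[0])
for fold, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    clf = LGBMClassifier(num_leaves=64, learning_rate=0.01, n_estimators=2000, n_jobs=-1)
    clf.fit(X_train[tr_idx], y_train[tr_idx],
            eval_set=[(X_train[va_idx], y_train[va_idx])],
            eval_metric='auc',
            early_stopping_rounds=10)
    # average the per-fold test predictions
    test_pred += clf.predict_proba(X_test)[:, 1] / skf.n_splits
# test_pred can then be written to result_append['是否回答'] as in the baseline above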

This article was originally shared from the WeChat public account Big Data Digest (BigDataDigest).

Originally published: 2019-10-16
