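The annual NFL Big Data Bowl: this year's task is to predict, from the static data of the players on both teams, how many yards a given offensive play will gain, and to convert that prediction into the corresponding probability distribution.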
https://www.kaggle.com/c/nfl-big-data-bowl-2020
https://www.kaggle.com/holoong9291/nfl-big-data-bowl
https://github.com/NemoHoHaloAi/Competition/tree/master/kaggle/Top61%-0.01404-zzz-NFL-Big-Data-Bowl
Field information:
GameId - a unique game identifier
PlayId - a unique play identifier
Team - home or away
X - player position along the long axis of the field. See figure below.
Y - player position along the short axis of the field. See figure below.
S - speed in yards/second
A - acceleration in yards/second^2
Dis - distance traveled from prior time point, in yards
Orientation - orientation of player (deg), i.e. the direction the player is facing
Dir - angle of player motion (deg), i.e. the direction the player is moving
NflId - a unique identifier of the player
DisplayName - player's name
JerseyNumber - jersey number
Season - year of the season
YardLine - the yard line of the line of scrimmage
Quarter - game quarter (1-5, 5 == overtime)
GameClock - time on the game clock
PossessionTeam - team with possession
Down - the down (1-4)
Distance - yards needed for a first down
FieldPosition - which side of the field the play is happening on
HomeScoreBeforePlay - home team score before play started
VisitorScoreBeforePlay - visitor team score before play started
NflIdRusher - the NflId of the rushing player
OffenseFormation - offense formation
OffensePersonnel - offensive team positional grouping
DefendersInTheBox - number of defenders lined up near the line of scrimmage, spanning the width of the offensive line
DefensePersonnel - defensive team positional grouping
PlayDirection - direction the play is headed
TimeHandoff - UTC time of the handoff
TimeSnap - UTC time of the snap
Yards - the yardage gained on the play (you are predicting this); this is the target
PlayerHeight - player height (ft-in)
PlayerWeight - player weight (lbs)
PlayerBirthDate - birth date (mm/dd/yyyy); the player's age can be derived from it
PlayerCollegeName - where the player attended college
Position - the player's position (the specific role on the field that they typically play)
HomeTeamAbbr - home team abbreviation
VisitorTeamAbbr - visitor team abbreviation
Week - week into the season
Stadium - stadium where the game is being played
Location - city where the game is being played
StadiumType - description of the stadium environment
Turf - description of the field surface
GameWeather - description of the game weather
Temperature - temperature (deg F)
Humidity - humidity
WindSpeed - wind speed in miles/hour
WindDirection - wind direction

This is a regression problem: the target is the yardage, but the final result has to be converted into a cumulative probability distribution.
The evaluation metric is the Continuous Ranked Probability Score (CRPS).
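For reference, the competition scores each play over the 199 integer yardages from -99 to 99: it averages the squared difference between the predicted cumulative probability P(y <= n) and a step function that is 0 below the true yardage and 1 from it onward. A minimal sketch in plain Python is shown here; the function name `crps` and the list-of-lists input layout are assumptions made for illustration.

```python
def crps(cdfs, yards, lo=-99, hi=99):
    """Mean squared gap between predicted CDFs and the true step functions.

    cdfs  : one list per play, holding P(y <= n) for n = lo..hi (199 values)
    yards : the true yardage of each play
    """
    width = hi - lo + 1
    total = 0.0
    for cdf, y in zip(cdfs, yards):
        for i in range(width):
            n = lo + i
            step = 1.0 if n >= y else 0.0  # Heaviside step at the true yardage
            total += (cdf[i] - step) ** 2
    return total / (width * len(yards))


# A perfect prediction is exactly the step function, so its score is 0.
perfect = [1.0 if n >= 3 else 0.0 for n in range(-99, 100)]
print(crps([perfect], [3]))
```

Lower is better, and the score is bounded between 0 and 1.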
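The competition does not ask for a single yardage value but for the probability distribution over yardage, i.e. the probability of every possible yardage on a play, so a conversion class is needed, as follows: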
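One possible conversion, given as a minimal sketch rather than the class used in the original notebook: spread the point prediction into a Gaussian and read off its CDF at every integer yardage from -99 to 99. The function name `yards_to_cdf` and the default `sigma=6.0` are hypothetical; in practice the spread would have to be tuned, for example against the residuals on the training data.

```python
import math


def yards_to_cdf(pred, sigma=6.0, lo=-99, hi=99):
    """Turn one predicted yardage into P(y <= n) for n = lo..hi.

    Assumes the prediction error is roughly Gaussian with standard
    deviation `sigma` (a hypothetical value, not taken from the source).
    """
    cdf = []
    for n in range(lo, hi + 1):
        z = (n - pred) / (sigma * math.sqrt(2.0))
        cdf.append(0.5 * (1.0 + math.erf(z)))  # Gaussian CDF at n
    return cdf


cdf = yards_to_cdf(4.0)
print(len(cdf), cdf[4 + 99])  # 199 values; probability 0.5 at the predicted yardage
```

A smaller `sigma` approaches a hard step at the predicted yardage, which is heavily penalised by CRPS whenever the point prediction is off.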
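The training data has few missing values; the fields with missing values are as follows:
Missing values are handled differently depending on the type of field: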
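A per-type strategy can be sketched as follows. This is an assumption about what such handling might look like, not the original code: numeric fields take the column median and categorical fields take an explicit "unknown" label. The helper name `fill_missing` and the two example columns are only illustrative.

```python
from statistics import median


def fill_missing(rows, numeric_cols, categorical_cols):
    """Return copies of `rows` (dicts) with None replaced per column type."""
    # Median of the observed values for every numeric column.
    medians = {}
    for col in numeric_cols:
        present = [r[col] for r in rows if r[col] is not None]
        medians[col] = median(present) if present else 0.0

    filled = []
    for r in rows:
        new = dict(r)  # leave the input rows untouched
        for col in numeric_cols:
            if new[col] is None:
                new[col] = medians[col]
        for col in categorical_cols:
            if new[col] is None:
                new[col] = "unknown"
        filled.append(new)
    return filled


rows = [
    {"Temperature": 60, "StadiumType": "Outdoor"},
    {"Temperature": None, "StadiumType": None},
    {"Temperature": 70, "StadiumType": "Dome"},
]
print(fill_missing(rows, ["Temperature"], ["StadiumType"])[1])
```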
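Below is a matplotlib plot of several offensive and defensive plays from one game; the black triangle is the QB, red marks the offense, and light blue marks the defense:
You can clearly see the different alignment on each play, as well as how the drive progresses as a whole. I also wrote up my notes on an NFL game, Patriots vs. Ravens, a head-to-head clash between a veteran QB and a young one that was a great watch; it is worth reading alongside the plots.
Because I do not know much about American football myself (I strongly recommend the movie The Blind Side), the feature engineering is not particularly good, and the Top 61% result reflects that. I still think it has some value as a reference, so below I go through the purpose of each new feature and my reasoning.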
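Note that each row of the training data describes one player in one play, whereas the prediction is made per play, so every 22 rows have to be aggregated into one. Some statistical features come out of this aggregation. The overall process is outlined below: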
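The 22-rows-to-1 aggregation can be sketched in plain Python as below (with pandas the same idea would be a groupby on PlayId). The output feature names such as `mean_speed` and `rusher_speed` are hypothetical examples, not the features from the original notebook.

```python
from collections import defaultdict
from statistics import mean


def aggregate_plays(rows):
    """Collapse per-player rows (dicts) into one feature dict per PlayId."""
    by_play = defaultdict(list)
    for r in rows:
        by_play[r["PlayId"]].append(r)

    plays = []
    for play_id, players in by_play.items():
        # The ball carrier is the row whose NflId matches NflIdRusher.
        rusher = next(p for p in players if p["NflId"] == p["NflIdRusher"])
        plays.append({
            "PlayId": play_id,
            "n_players": len(players),  # 22 in the real data
            "mean_speed": mean(p["S"] for p in players),
            "max_speed": max(p["S"] for p in players),
            "rusher_speed": rusher["S"],
            "Yards": rusher["Yards"],  # target, identical on every row of a play
        })
    return plays
```

Play-level columns such as Yards or Down repeat on all rows of a play, so they can be taken from any single row; only the player-level columns need real aggregation.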
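Whether a play succeeds depends, most of the time, on how the quarterback performs. Apart from the quarterback himself, what matters most to that performance is the number of teammates and opponents around him, which to some extent determines how much room he has to choose from.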
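One way to quantify this crowding is to count the teammates and opponents within a fixed distance of a given player, using the X, Y and Team fields. This is a minimal sketch, not the feature code from the original notebook; the function name `count_nearby` and the 5-yard default radius are assumptions.

```python
import math


def count_nearby(players, center_id, radius=5.0):
    """Count teammates and opponents within `radius` yards of one player.

    players   : the rows (dicts) of a single play
    center_id : NflId of the player of interest
    """
    center = next(p for p in players if p["NflId"] == center_id)
    teammates = opponents = 0
    for p in players:
        if p["NflId"] == center_id:
            continue
        dist = math.hypot(p["X"] - center["X"], p["Y"] - center["Y"])
        if dist <= radius:
            if p["Team"] == center["Team"]:
                teammates += 1
            else:
                opponents += 1
    return teammates, opponents
```

Calling it once per play, with `center_id` set to the player of interest, yields two extra columns for the aggregated row.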
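The processing code for this part is long, so only a portion is excerpted, as follows:
The test data simply needs to be processed in the same way as the training data.
At this point the data processing is done; what remains is modeling, hyperparameter tuning, combining models and similar optimizations. I did not put much effort into this step. The model is ExtraTreesRegressor; because it was trained with out-of-bag (OOB) estimation, no separate cross-validation (CV) was needed. The results are as follows: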
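The OOB setup can be sketched as below, on synthetic data rather than the competition features. In scikit-learn, `oob_score=True` only works together with `bootstrap=True`, which is not the default for ExtraTreesRegressor; the hyperparameter values here are placeholders, not the ones used for the submission.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in for the per-play feature matrix and the Yards target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=500)

model = ExtraTreesRegressor(
    n_estimators=200,
    bootstrap=True,   # required for out-of-bag estimates
    oob_score=True,   # score each sample with the trees that did not see it
    random_state=0,
)
model.fit(X, y)

print(model.oob_score_)             # R^2 computed on out-of-bag predictions
oob_pred = model.oob_prediction_    # one out-of-bag prediction per training row
```

Because every row is scored only by trees that did not train on it, `oob_prediction_` can be fed through the yardage-to-distribution conversion to estimate CRPS without a separate hold-out split.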