前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >关于空难数据集的探索分析导入数据集伤亡分析机型处理时间分析

关于空难数据集的探索分析导入数据集伤亡分析机型处理时间分析

作者头像
月见樽
发布2018-04-27 12:10:24
2K0
发布2018-04-27 12:10:24
举报

写在前面: 这是我见过的最严肃的数据集,几乎每一行数据背后都是生命和鲜血的代价。这次探索分析并不妄图说明什么,仅仅是对数据处理能力的锻炼。因此本次的探索分析只会展示数据该有的样子而不会进行太多的评价。有一句话叫“因为珍爱和平,我们回首战争”。这里也是,因为珍爱生命,所以回首空难。现在安全的飞行是10万多无辜的人通过性命换来的,向这些伟大的探索者致敬。

代码语言:javascript
复制
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

导入数据集

代码语言:javascript
复制
crash = pd.read_csv("./Airplane_Crashes_and_Fatalities_Since_1908.csv")
crash.info()
代码语言:javascript
复制
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 13 columns):
Date            5268 non-null object
Time            3049 non-null object
Location        5248 non-null object
Operator        5250 non-null object
Flight #        1069 non-null object
Route           3562 non-null object
Type            5241 non-null object
Registration    4933 non-null object
cn/In           4040 non-null object
Aboard          5246 non-null float64
Fatalities      5256 non-null float64
Ground          5246 non-null float64
Summary         4878 non-null object
dtypes: float64(3), object(10)
memory usage: 535.1+ KB
代码语言:javascript
复制
crash = crash.drop(["Summary","cn/In","Flight #","Route","Location"],axis=1)
crash.info()
代码语言:javascript
复制
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 8 columns):
Date            5268 non-null object
Time            3049 non-null object
Operator        5250 non-null object
Type            5241 non-null object
Registration    4933 non-null object
Aboard          5246 non-null float64
Fatalities      5256 non-null float64
Ground          5246 non-null float64
dtypes: float64(3), object(5)
memory usage: 329.3+ KB
代码语言:javascript
复制
print(crash[2200:2205])
代码语言:javascript
复制
            Date   Time                      Operator              Type  \
2200  03/06/1968   8:00     Military - U.S. Air Force  Fairchild C-123K   
2201  03/08/1968  19:18                    Air Manila    Fairchild F-27   
2202  03/09/1968  23:20   Military - French Air Force      Douglas DC6B   
2203  03/19/1968  19:37     Viking Airways - Air Taxi        Cessna 182   
2204  03/23/1968  13:00  Fortaire Aviation - Air Taxi       Brantly 305   

     Registration  Aboard  Fatalities  Ground  
2200      54-0590    49.0        49.0     0.0  
2201      PI-C871    14.0        14.0     0.0  
2202        43748    20.0        19.0     0.0  
2203       N2623F     2.0         2.0     0.0  
2204       N2224U     5.0         3.0     0.0  

伤亡分析

伤亡排序

代码语言:javascript
复制
print(crash["Fatalities"].sum())
fatal_crash = crash[crash["Fatalities"].notnull()]
fatal_crash = fatal_crash.sort_values(by="Fatalities")
print(fatal_crash[-5:])
代码语言:javascript
复制
105479.0
            Date   Time                                    Operator  \
3562  06/23/1985   7:15                                   Air India   
2726  03/03/1974  11:41                      Turkish Airlines (THY)   
4455  11/12/1996  18:40  Saudi Arabian Airlines / Kazastan Airlines   
3568  08/12/1985  18:56                             Japan Air Lines   
2963  03/27/1977  17:07            Pan American World Airways / KLM   

                                      Type    Registration  Aboard  \
3562                     Boeing B-747-237B          VT-EFO   329.0   
2726            McDonnell Douglas DC-10-10          TC-JAV   346.0   
4455  Boeing B-747-168B / Ilyushin IL-76TD  HZAIH/UN-76435   349.0   
3568                     Boeing B-747-SR46          JA8119   524.0   
2963  Boeing B-747-121 / Boeing B-747-206B   N736PA/PH-BUF   644.0   

      Fatalities  Ground  
3562       329.0     0.0  
2726       346.0     0.0  
4455       349.0     0.0  
3568       520.0     0.0  
2963       583.0     0.0  
  • 内特里费空难:两架波音-747相撞,死亡583人,又称世纪大空难
  • 日航123空难:波音747撞富士山,单架飞机失事最高死亡记录
  • 恰尔基达德里撞机事件,最严重的的空中撞机事件
  • 土耳其航空981号班机空难:货舱门未锁定导致爆炸性施压
  • 印度航空182号班机:恐怖袭击

伤亡概率

代码语言:javascript
复制
print(fatal_crash["Fatalities"].sum() / fatal_crash["Aboard"].sum())
代码语言:javascript
复制
0.729700936002
代码语言:javascript
复制
print(fatal_crash[fatal_crash["Fatalities"] != 0]["Fatalities"].sum() / fatal_crash[fatal_crash["Fatalities"] != 0]["Aboard"].sum())
代码语言:javascript
复制
0.754191781605    

机型处理

处理函数

代码语言:javascript
复制
type_crash = fatal_crash["Type"]
def type_handle(x):
    x = str(x)
    if "McDonnell Douglas" in x:
        return "McDonnell Douglas"
    elif "Douglas" in x:
        return "Douglas"
    elif "Boeing" in x:
        return "Boeing"
    elif "Airbus" in x:
        return "Airbus"
    elif "Embraer" in x:
        return "Embraer"
    elif "Ilyushin" in x:
        return "Ilyushin"
    else:
        return "other"
company_crash = type_crash.map(type_handle)
print(pd.value_counts(company_crash))
代码语言:javascript
复制
other                3581
Douglas               984
Boeing                376
McDonnell Douglas     123
Ilyushin               96
Embraer                61
Airbus                 35
Name: Type, dtype: int64
代码语言:javascript
复制
fatal_crash["company"] = company_crash
print(fatal_crash[:2])
代码语言:javascript
复制
            Date   Time          Operator               Type Registration  \
108   10/21/1926  13:15  Imperial Airways  Handley Page W-10       G-EBMS   
5178  11/08/2007   8:00    Juba Air Cargo         Antonov 12       ST-JUA   

      Aboard  Fatalities  Ground  year month company  
108     12.0         0.0     0.0  1926    10   other  
5178     4.0         0.0     2.0  2007    11   other  

处理结果

代码语言:javascript
复制
def airplane_counte(x):
    fatal_ratio = x["Fatalities"].sum() / x["Aboard"].sum()
    crash_time = x.shape[0]
    fatal_num = x["Fatalities"].sum()
    return pd.Series({"fatal_num":fatal_num,"crash_time":crash_time,"fatal_ratio":fatal_ratio})

company = fatal_crash.groupby(['company']).apply(airplane_counte)
print(company)
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
company['crash_time'].drop("other").plot(kind='bar',title="time")
plt.subplot(1,3,2)
company['fatal_num'].drop("other").plot(kind='bar',title="fatal_num")
plt.subplot(1,3,3)
company['fatal_ratio'].plot(kind='bar',title="fatal_ratio")
plt.show()
代码语言:javascript
复制
                   crash_time  fatal_num  fatal_ratio
company                                              
Airbus                   35.0     2980.0     0.510711
Boeing                  376.0    18705.0     0.649434
Douglas                 984.0    16899.0     0.794350
Embraer                  61.0      644.0     0.779661
Ilyushin                 96.0     4547.0     0.883084
McDonnell Douglas       123.0     6827.0     0.531946
other                  3581.0    54877.0     0.785854

分厂商分析结果

时间分析

代码语言:javascript
复制
def get_year(x):
    return x.split("/")[-1]
fatal_crash['year'] = fatal_crash["Date"].map(get_year)
year_fatal = fatal_crash[fatal_crash["year"] != np.NaN][["year","Fatalities","Aboard"]]
year_fatal.info()
代码语言:javascript
复制
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5256 entries, 108 to 2963
Data columns (total 3 columns):
year          5256 non-null object
Fatalities    5256 non-null float64
Aboard        5246 non-null float64
dtypes: float64(2), object(1)
memory usage: 164.2+ KB
代码语言:javascript
复制
def year_analysis(x):
    return pd.Series({"fatal_num":x["Fatalities"].sum(),"time":x.shape[0],"fatal_ratio":x["Fatalities"].sum() / x["Aboard"].sum()})
year = year_fatal.groupby(["year"]).apply(year_analysis)
year = year.sort_index()
year.info()
代码语言:javascript
复制
<class 'pandas.core.frame.DataFrame'>
Index: 98 entries, 1908 to 2009
Data columns (total 3 columns):
fatal_num      98 non-null float64
fatal_ratio    98 non-null float64
time           98 non-null float64
dtypes: float64(3)
memory usage: 3.1+ KB
代码语言:javascript
复制
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
year["fatal_num"].plot(title="fatal_num")
plt.subplot(1,3,2)
year["time"].plot(title="crash_time")
plt.subplot(1,3,3)
year["fatal_ratio"].plot(title="fatal_ratio")
plt.show()

按年分析结果

代码语言:javascript
复制
def get_month(x):
    return x.split("/")[0]
fatal_crash['month'] = fatal_crash["Date"].map(get_month)
month_fatal = fatal_crash[fatal_crash["month"] != np.NaN][["month","Fatalities","Aboard"]]
month_fatal.info()
代码语言:javascript
复制
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5256 entries, 108 to 2963
Data columns (total 3 columns):
month         5256 non-null object
Fatalities    5256 non-null float64
Aboard        5246 non-null float64
dtypes: float64(2), object(1)
memory usage: 164.2+ KB
代码语言:javascript
复制
def month_analysis(x):
    return pd.Series({"fatal_num":x["Fatalities"].sum(),"time":x.shape[0],"fatal_ratio":x["Fatalities"].sum() / x["Aboard"].sum()})
month = month_fatal.groupby(["month"]).apply(year_analysis)
month = month.sort_index()
print(month)
代码语言:javascript
复制
       fatal_num  fatal_ratio   time
month                               
01        8425.0     0.768354  494.0
02        7966.0     0.693057  395.0
03        8708.0     0.787057  453.0
04        6769.0     0.711852  378.0
05        7130.0     0.731807  370.0
06        7909.0     0.681399  385.0
07        9232.0     0.700349  427.0
08       10174.0     0.729162  474.0
09       10286.0     0.760349  458.0
10        8388.0     0.778758  452.0
11       10033.0     0.766522  454.0
12       10459.0     0.668478  516.0
代码语言:javascript
复制
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
month["fatal_num"].plot(kind="bar",title="fatal_num")
plt.subplot(1,3,2)
month["time"].plot(kind="bar",title="crash_time")
plt.subplot(1,3,3)
month["fatal_ratio"].plot(kind="bar",title="fatal_ratio")
plt.show()

按月份分析

小时

代码语言:javascript
复制
def get_hour(x):
    hour = x.split(":")[0]
    try:
        hour = float(hour)
        if int(hour) == hour and hour < 24:
            return hour
        else:
            return np.nan
    except:
        return np.nan
    
time_fatal = fatal_crash[fatal_crash["Time"].isnull() == False]
time_fatal["hour"] = time_fatal["Time"].map(get_hour)
time_fatal.info()
代码语言:javascript
复制
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3049 entries, 108 to 2963
Data columns (total 13 columns):
Date            3049 non-null object
Time            3049 non-null object
Operator        3046 non-null object
Type            3048 non-null object
Registration    2952 non-null object
Aboard          3049 non-null float64
Fatalities      3049 non-null float64
Ground          3046 non-null float64
airplane        3049 non-null object
company         3049 non-null object
year            3049 non-null object
month           3049 non-null object
hour            3036 non-null float64
dtypes: float64(4), object(9)
memory usage: 333.5+ KB


c:\users\qiank\appdata\local\programs\python\python35\lib\site-packages\ipykernel_launcher.py:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  del sys.path[0]
代码语言:javascript
复制
def hour_analysis(x):
    return pd.Series({"fatal_num":x["Fatalities"].sum(),"time":x.shape[0],"fatal_ratio":x["Fatalities"].sum() / x["Aboard"].sum()})
hour = time_fatal.groupby(["hour"]).apply(hour_analysis)
代码语言:javascript
复制
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
hour["fatal_num"].plot(title="fatal_num")
plt.subplot(1,3,2)
hour["time"].plot(title="crash_time")
plt.subplot(1,3,3)
hour["fatal_ratio"].plot(title="fatal_ratio")
plt.show()

按时间分析

本文参与 腾讯云自媒体同步曝光计划,分享自作者个人站点/博客。
原始发表:2018.01.10 ,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 导入数据集
  • 伤亡分析
    • 伤亡排序
      • 伤亡概率
      • 机型处理
        • 处理函数
          • 处理结果
          • 时间分析
                • 小时
                领券
                问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档