A histogram is mainly used to inspect how data is distributed.
In [55]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [56]:
df = pd.read_csv('/Users/spark/Downloads/nyc_fare.csv')
In [4]:
df.describe()
Out[4]:
| | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |
---|---|---|---|---|---|---|
count | 846945.000000 | 846945.000000 | 846945.000000 | 846945.00000 | 846945.000000 | 846945.000000 |
mean | 12.190578 | 0.320303 | 0.499305 | 1.34466 | 0.232142 | 14.587073 |
std | 9.514150 | 0.772642 | 0.057844 | 2.09149 | 1.109164 | 11.380950 |
min | -648.420000 | -1.000000 | -0.500000 | 0.00000 | 0.000000 | -52.500000 |
25% | 6.500000 | 0.000000 | 0.500000 | 0.00000 | 0.000000 | 8.000000 |
50% | 9.500000 | 0.000000 | 0.500000 | 1.00000 | 0.000000 | 11.000000 |
75% | 14.000000 | 0.500000 | 0.500000 | 2.00000 | 0.000000 | 16.500000 |
max | 620.010000 | 628.840000 | 41.490000 | 200.00000 | 100.660000 | 620.010000 |
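Although the maximum of fare_amount is 620, the 75th percentile is only 14, so most values are fairly small. We therefore use 50 as the upper bound when examining its distribution.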
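Before committing to a cutoff like 50, it can help to check how much of the data it actually covers. The following is a minimal sketch, not part of the original notebook: the `fares` series is a small hypothetical stand-in for `df.fare_amount`, and the names `cutoff`, `share_below`, and `p99` are chosen here purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df.fare_amount (the real column has ~847k rows)
fares = pd.Series([2.5, 6.5, 9.5, 14.0, 30.0, 52.0, 620.01])

cutoff = 50.0  # the upper bound chosen in the text

# Fraction of rows at or below the cutoff
share_below = (fares <= cutoff).mean()

# A high quantile shows how far the tail extends past the cutoff
p99 = fares.quantile(0.99)

print(share_below, p99)
```

On the real data, replacing `fares` with `df.fare_amount` would show what share of trips the 0 to 50 window keeps.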
In [57]:
bin_array = np.linspace(start=0., stop=50., num=100)
In [58]:
df.fare_amount.hist(bins=bin_array)
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x116bdff60>
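Two details of passing an explicit edge array to `bins` are worth noting: `num=100` edges produce 99 bins, and values outside the edge range are simply left out of the plot. The sketch below illustrates this with `np.histogram`, using a small hypothetical `fares` series in place of `df.fare_amount`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df.fare_amount, including out-of-range values
fares = pd.Series([-5.0, 6.5, 9.5, 14.0, 49.9, 75.0, 620.01])

# Same edges as bin_array in the notebook: 100 edges from 0 to 50
bin_edges = np.linspace(start=0., stop=50., num=100)

# np.histogram counts values per bin; values outside [0, 50] are ignored
counts, edges = np.histogram(fares, bins=bin_edges)

n_bins = len(edges) - 1          # number of bins is one less than the edges
bin_width = edges[1] - edges[0]  # width of each bin (50 / 99)
n_in_range = int(counts.sum())   # values that actually landed in a bin

print(n_bins, bin_width, n_in_range)
```

If bins exactly 0.5 wide are preferred, `num=101` would be one way to get 100 bins over the same range.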