
Datawhale Group Study: Hands-On Data Analysis, Chapter 1


1.1 Loading the Data

Task 1: Import numpy and pandas

import numpy as np

import pandas as pd

import os

Task 2: Load the data

(1) Load the data using a relative path

cwd = os.getcwd()

os.chdir(r"D:\datasets\Titanic")   # raw string so the backslashes are not treated as escape sequences

df = pd.read_csv('train.csv')

df.head()

[Output: the first 5 rows of train.csv, indexed 0-4, with columns PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked (passengers Braund, Cumings, Heikkinen, Futrelle, Allen)]

(2) Load the data using an absolute path

df = pd.read_csv('D:\\datasets\\Titanic\\train.csv')

df.head()

[Output: identical to the relative-path result above, the first 5 rows of train.csv]

Task 3: Read the file in chunks of 1000 rows

chunker = pd.read_csv('train.csv', chunksize=1000)

chunker

<pandas.io.parsers.TextFileReader at 0x1f6383329a0>
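pd.read_csv with chunksize returns a TextFileReader that yields one DataFrame per chunk. A minimal sketch of consuming it (assuming train.csv is in the current working directory):

chunker = pd.read_csv('train.csv', chunksize=1000)
total = 0
for chunk in chunker:          # each chunk is a DataFrame of at most 1000 rows
    total += len(chunk)
print(total)                   # 891 rows in total, so there is only a single chunk here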

Task 4: Change the column headers to Chinese and use 乘客ID (PassengerId) as the index

df = pd.read_csv('train.csv',
                 names=['乘客ID', '是否幸存', '仓位等级', '姓名', '性别', '年龄', '兄弟姐妹个数',
                        '父母子女个数', '船票信息', '票价', '客舱', '登船港口'],
                 index_col='乘客ID', header=0)

df.head()

[Output: the first 5 rows with Chinese headers, indexed by 乘客ID 1-5; columns 是否幸存, 仓位等级, 姓名, 性别, 年龄, 兄弟姐妹个数, 父母子女个数, 船票信息, 票价, 客舱, 登船港口]
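The same result can also be reached after loading, by renaming the columns and then setting the index. This is only a sketch of the alternative, not what the notebook above does:

df = pd.read_csv('train.csv')
df = df.rename(columns={'PassengerId': '乘客ID', 'Survived': '是否幸存', 'Pclass': '仓位等级',
                        'Name': '姓名', 'Sex': '性别', 'Age': '年龄', 'SibSp': '兄弟姐妹个数',
                        'Parch': '父母子女个数', 'Ticket': '船票信息', 'Fare': '票价',
                        'Cabin': '客舱', 'Embarked': '登船港口'})
df = df.set_index('乘客ID')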

1.2 A First Look at the Data

Task 1: View the basic information of the data

df.info   # note: without parentheses this returns the bound method; call df.info() to print the column/dtype summary

<bound method DataFrame.info of       是否幸存  仓位等级                                                 姓名      性别  \

乘客ID                                                                          

1        0     3                            Braund, Mr. Owen Harris    male   

2        1     1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   

3        1     3                             Heikkinen, Miss. Laina  female   

4        1     1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   

5        0     3                           Allen, Mr. William Henry    male   

...    ...   ...                                                ...     ...   

887      0     2                              Montvila, Rev. Juozas    male   

888      1     1                       Graham, Miss. Margaret Edith  female   

889      0     3           Johnston, Miss. Catherine Helen "Carrie"  female   

890      1     1                              Behr, Mr. Karl Howell    male   

891      0     3                                Dooley, Mr. Patrick    male   

        年龄  兄弟姐妹个数  父母子女个数              船票信息       票价    客舱 登船港口  

乘客ID                                                              

1     22.0       1       0         A/5 21171   7.2500   NaN    S  

2     38.0       1       0          PC 17599  71.2833   C85    C  

3     26.0       0       0  STON/O2. 3101282   7.9250   NaN    S  

4     35.0       1       0            113803  53.1000  C123    S  

5     35.0       0       0            373450   8.0500   NaN    S  

...    ...     ...     ...               ...      ...   ...  ...  

887   27.0       0       0            211536  13.0000   NaN    S  

888   19.0       0       0            112053  30.0000   B42    S  

889    NaN       1       2        W./C. 6607  23.4500   NaN    S  

890   26.0       0       0            111369  30.0000  C148    C  

891   32.0       0       0            370376   7.7500   NaN    Q  

[891 rows x 11 columns]>
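Written with parentheses, info() prints the structured summary instead of the bound method; a short sketch of the usual first checks:

df.info()     # index range, column names, non-null counts and dtypes
df.shape      # (891, 11): 891 rows, 11 columns besides the 乘客ID index
df.dtypes     # the dtype of each column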

Task 2: View the first 10 rows and the last 15 rows of the data

df.head(10)

[Output: rows 1-10 of the Chinese-header DataFrame (passengers Braund through Nasser)]

df.tail(15)

[Output: the last 15 rows, 乘客ID 877-891 (Gustafsson through Dooley)]

Task 3: Check whether the data contains missing values; return True where a value is null and False elsewhere

df.isnull()

[Output: a Boolean DataFrame of the same shape; True marks missing values, visible here in the 年龄 and 客舱 columns]

891 rows × 11 columns 
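The Boolean table is hard to read on its own; a common follow-up, shown here only as a sketch, is to count the missing values per column:

df.isnull().sum()                                 # number of NaN values in each column
df.isnull().sum().sort_values(ascending=False)    # sorted, so the emptiest columns come first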

1.3 Saving the Data

Task 1: Save the modified data to a new file, train_chinese.csv, in the working directory

df.to_csv('train_chinese.csv')
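to_csv writes the 乘客ID index as the first column by default. If the file is to be opened in Excel on Windows, an explicit encoding helps the Chinese headers display correctly; a hedged sketch:

df.to_csv('train_chinese.csv', encoding='utf-8-sig')   # UTF-8 with BOM, friendlier to Excel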

2.1 Know What Your Data Is Called

Task 1: pandas has two main data structures, DataFrame and Series. Look them up to get a basic understanding, then write a small example of each.

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

example_1 = pd.Series(sdata)

example_1

Ohio      35000

Texas     71000

Oregon    16000

Utah       5000

dtype: int64

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002, 2003],'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

example_2 = pd.DataFrame(data)

example_2

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
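The relationship between the two types in one small sketch: a single column of a DataFrame is a Series, and a Series is looked up by its index labels.

example_2['state']         # selecting one column of a DataFrame gives a Series
type(example_2['state'])   # <class 'pandas.core.series.Series'>
example_1['Ohio']          # Series values are accessed by index label -> 35000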

Task 2: Load the "train.csv" file using the method from the previous section (the notebook loads the Chinese-header file saved above)

df = pd.read_csv('train_chinese.csv')

df.head()

[Output: the first 5 rows of the Chinese-header file; because no index column was specified, 乘客ID appears as an ordinary column]

Task 3: View the columns of the DataFrame

df.columns

Index(['乘客ID', '是否幸存', '仓位等级', '姓名', '性别', '年龄', '兄弟姐妹个数', '父母子女个数', '船票信息',

       '票价', '客舱', '登船港口'],

      dtype='object')

Task 4: View every entry of the "Cabin" (客舱) column

dir(df['客舱'])   # dir() lists the Series object's attributes and methods, not the values in the column

['T',
 '_AXIS_ALIASES',
 '_AXIS_IALIASES',
 ...
 'value_counts',
 'values',
 'var',
 'view',
 'where',
 'xs']

(several hundred attribute and method names of the Series object, abridged here)

df['客舱'].head()

0     NaN

1     C85

2     NaN

3    C123

4     NaN

Name: 客舱, dtype: object
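head() only shows the first few entries; to actually see the distinct values of the column, a sketch:

df['客舱'].unique()         # the distinct cabin labels (NaN appears once among them)
df['客舱'].value_counts()   # how often each cabin occurs; NaN is excluded by default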

Task 5: Load "test_1.csv", compare it with "train.csv" to see which extra columns it has, then delete the extra columns

test_1 = pd.read_csv("C:\\Users\\Administrator\\Documents\\DataScience\\hands-on-data-analysis\\第一单元项目集合\\test_1.csv")

test_1

[Output: 891 rows × 14 columns; relative to train.csv there are two extra columns, 'Unnamed: 0' and 'a']

891 rows × 14 columns 

test_1.drop(['a'], axis=1)   # returns a new DataFrame; reassign or pass inplace=True to keep the change

[Output: 891 rows × 13 columns; the 'a' column is gone]

891 rows × 13 columns 
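Instead of spotting the extra columns by eye, the two column sets can be compared directly. A sketch, assuming the English-header train.csv is loaded for comparison (the notebook above drops only 'a'):

train = pd.read_csv('train.csv')
extra = set(test_1.columns) - set(train.columns)   # columns that only test_1 has
print(extra)                                       # {'Unnamed: 0', 'a'}
test_1 = test_1.drop(columns=list(extra))          # drop returns a new DataFrame, so reassign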

Task 6: Hide the columns ['PassengerId', 'Name', 'Age', 'Ticket'] and view only the remaining columns

df=pd.read_csv('train.csv')

df.drop(['PassengerId','Name','Age','Ticket'],axis=1)

[Output: 891 rows × 8 columns: Survived, Pclass, Sex, SibSp, Parch, Fare, Cabin, Embarked]

891 rows × 8 columns 
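As with the drop above, this call returns a new DataFrame and leaves df unchanged; to keep working with the reduced table, reassign the result or pass inplace=True. A sketch:

df_small = df.drop(['PassengerId', 'Name', 'Age', 'Ticket'], axis=1)   # keep the reduced table
# or, modifying df itself:
# df.drop(['PassengerId', 'Name', 'Age', 'Ticket'], axis=1, inplace=True)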

2.2 The Logic of Filtering

Task 1: Using "Age" as the filter condition, show the passengers younger than 10.

df[df['Age']<10]

[Output: truncated display of the rows where Age < 10]

62 rows × 12 columns 

Task 2: Using "Age" as the condition, show the passengers older than 10 and younger than 50, and name this data midage.

midage = df[(df['Age']>10)&(df['Age']<50)]

midage.head()

[Output: the first 5 rows of midage; they look the same as df.head() because the first five passengers are all between 10 and 50]

Note: each condition must be wrapped in its own parentheses when the two logical conditions are combined with &.
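A few more filtering sketches with the Boolean operators & (and), | (or) and ~ (not); each comparison needs its own parentheses because & and | bind more tightly than < and ==:

young_or_old = df[(df['Age'] < 10) | (df['Age'] > 60)]          # either condition holds
adult_women  = df[(df['Sex'] == 'female') & (df['Age'] >= 18)]  # both conditions hold
has_cabin    = df[~df['Cabin'].isnull()]                        # rows where Cabin is not missing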

Task 3: Show the "Pclass" and "Sex" values of the 100th row of midage.

print(midage.iloc[100]['Pclass'])

print(midage.iloc[100]['Sex'])

2

male

This could also be written as midage.loc[[100], ['Pclass', 'Sex']], but note that loc selects by index label while iloc selects by position; because midage keeps the original row labels of df, the two forms pick different rows here (compare Tasks 4 and 5 below).

Task 4: Use loc to show the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage.

midage.loc[[100,105,108],['Pclass','Name','Sex']]

     Pclass                     Name     Sex
100       3  Petranec, Miss. Matilda  female
105       3    Mionoff, Mr. Stoytcho    male
108       3          Rekic, Mr. Tido    male

Task 5: Use iloc to show the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage.

midage.iloc[[100, 105, 108], [2, 3, 4]]   # can column names be used here? no: iloc is purely position-based; use loc for labels

     Pclass                               Name   Sex
149       2  Byles, Rev. Thomas Roussel Davids  male
160       3           Cribb, Mr. John Hatfield  male
163       3                    Calic, Mr. Jovo  male
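Because midage keeps the original row labels of df, loc (label-based) and iloc (position-based) select different rows above. Resetting the index makes the two agree; a sketch:

midage = midage.reset_index(drop=True)                    # labels now run 0, 1, 2, ... again
midage.loc[[100, 105, 108], ['Pclass', 'Name', 'Sex']]    # now the same rows as the iloc call above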

3.1 Before starting, import the numpy and pandas packages and the data

text = pd.read_csv('train_chinese.csv')

text.head()

[Output: the first 5 rows of the Chinese-header data, with 乘客ID as an ordinary column]

Task 1: Use pandas to sort the example data in ascending order

frame = pd.DataFrame(np.arange(8).reshape((2, 4)), 

    index=['2', '1'], 

    columns=['d', 'a', 'b', 'c'])

frame.sort_index()

   d  a  b  c
1  4  5  6  7
2  0  1  2  3

frame.sort_index(axis=1)   # sort by column label in ascending order: a, b, c, d

   a  b  c  d
2  1  2  3  0
1  5  6  7  4

frame.sort_index(axis=1, ascending=False)   # sort the columns in descending order

   d  c  b  a
2  0  3  2  1
1  4  7  6  5

frame.sort_values(by=['a','c'])

   d  a  b  c
2  0  1  2  3
1  4  5  6  7
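sort_values also accepts one direction per key column, which is useful when the two keys should not be sorted the same way; a sketch:

frame.sort_values(by=['a', 'c'], ascending=[True, False])   # 'a' ascending, ties broken by 'c' descending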

Task 2: Sort the Titanic data (train.csv) by the fare and age columns together, in descending order. What do you notice in the result?

df.sort_values(['Age','Fare'],ascending=False)

[Output: truncated display sorted by Age and then Fare, both descending; the oldest passengers (80, 74, 71, ...) come first, rows with missing Age last]

891 rows × 12 columns 

df.sort_values(['Fare','Age'],ascending=False)

[Output: truncated display sorted by Fare and then Age, both descending; the three passengers who paid 512.3292 come first, zero-fare rows with missing Age last]

891 rows × 12 columns 

Task 3: Use pandas for arithmetic: compute the sum of two DataFrames

frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),

    columns=['a', 'b', 'c'],

    index=['one', 'two', 'three'])

frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),

    columns=['a', 'e', 'c'],

    index=['first', 'one', 'two', 'second'])

frame1_a

         a    b    c
one    0.0  1.0  2.0
two    3.0  4.0  5.0
three  6.0  7.0  8.0

frame1_b

          a     e     c
first   0.0   1.0   2.0
one     3.0   4.0   5.0
two     6.0   7.0   8.0
second  9.0  10.0  11.0

frame1_a + frame1_b   # the result is aligned automatically on both row and column labels

          a   b     c   e
first   NaN NaN   NaN NaN
one     3.0 NaN   7.0 NaN
second  NaN NaN   NaN NaN
three   NaN NaN   NaN NaN
two     9.0 NaN  13.0 NaN

Adding two DataFrames returns a new DataFrame: values with matching row and column labels are added, and positions that exist in only one of the two frames become NaN.
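If those NaN values are unwanted, add() with fill_value treats a value that is missing on one side as 0, so only labels absent from both frames stay NaN. A sketch:

frame1_a.add(frame1_b, fill_value=0)   # positions present in only one frame keep their value; only those missing from both remain NaN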

Task 4: Using the Titanic data, how would you work out how many people the largest family on board had?

max(text['兄弟姐妹个数']+text['父母子女个数'])

10
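The sum of the siblings/spouses and parents/children columns counts the relatives travelling with a passenger; if the passenger themselves should be counted as part of the family, the usual definition adds 1. A sketch:

family_size = text['兄弟姐妹个数'] + text['父母子女个数'] + 1   # +1 for the passenger themselves
family_size.max()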

Task 5: Learn to use the pandas describe() function to view basic statistics of the data

frame2 = pd.DataFrame([[1.4, np.nan], 

    [7.1, -4.5],

    [np.nan, np.nan], 

    [0.75, -1.3]

    ], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

frame2

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

frame.describe()   # note: this summarises the earlier frame; frame2.describe() would summarise the DataFrame just built

              d         a         b         c
count  2.000000  2.000000  2.000000  2.000000
mean   2.000000  3.000000  4.000000  5.000000
std    2.828427  2.828427  2.828427  2.828427
min    0.000000  1.000000  2.000000  3.000000
25%    1.000000  2.000000  3.000000  4.000000
50%    2.000000  3.000000  4.000000  5.000000
75%    3.000000  4.000000  5.000000  6.000000
max    4.000000  5.000000  6.000000  7.000000

count: number of non-missing observations

mean: mean of the values

std: standard deviation of the values

min: smallest value

25%: value at the 25th percentile (first quartile)

50%: value at the 50th percentile (median)

75%: value at the 75th percentile (third quartile)

max: largest value
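describe() skips missing values, so on frame2 built above the per-column counts differ; a sketch:

frame2.describe()   # count is 3 for 'one' and 2 for 'two', because the NaN entries are excluded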

Task 6: Look at the basic statistics of the fare (票价) and parents/children (父母子女个数) columns in the Titanic data. What do you notice?

text['票价'].describe()

count    891.000000

mean      32.204208

std       49.693429

min        0.000000

25%        7.910400

50%       14.454200

75%       31.000000

max      512.329200

Name: 票价, dtype: float64

text['父母子女个数'].describe()

count    891.000000

mean       0.381594

std        0.806057

min        0.000000

25%        0.000000

50%        0.000000

75%        0.000000

max        6.000000

Name: 父母子女个数, dtype: float64
