1.1 Loading the data
Task 1: Import numpy and pandas
import numpy as np
import pandas as pd
import os
Task 2: Load the data
(1) Load with a relative path
cwd = os.getcwd()
os.chdir(r'D:\datasets\Titanic')  # raw string avoids invalid escape sequences like \d
df = pd.read_csv('train.csv')
df.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
(2) Load with an absolute path
df = pd.read_csv('D:\\datasets\\Titanic\\train.csv')
df.head()
(same output as above: the first five rows of the data)
Task 3: Read the file in chunks of 1000 rows
chunker = pd.read_csv('train.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x1f6383329a0>
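The TextFileReader is an iterator of DataFrames rather than data itself. A minimal sketch of consuming it, using a small in-memory CSV as a stand-in for train.csv (the column names and values here are illustrative, not from the real file):

```python
import io
import pandas as pd

# A small in-memory CSV stands in for train.csv (hypothetical data).
csv_data = "PassengerId,Fare\n" + "\n".join(f"{i},{i * 1.5}" for i in range(1, 11))

# Read 4 rows at a time; each chunk is an ordinary DataFrame.
chunker = pd.read_csv(io.StringIO(csv_data), chunksize=4)

chunk_sizes = []
total_fare = 0.0
for chunk in chunker:
    chunk_sizes.append(len(chunk))
    total_fare += chunk["Fare"].sum()  # aggregate without loading everything at once
```

Processing one chunk at a time keeps memory bounded; if the full frame fits in memory, pd.concat over the reader would rebuild it.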
Task 4: Change the headers to Chinese and make the passenger ID the index
df = pd.read_csv('train.csv',
                 names=['乘客ID', '是否幸存', '仓位等级', '姓名', '性别', '年龄', '兄弟姐妹个数',
                        '父母子女个数', '船票信息', '票价', '客舱', '登船港口'],
                 index_col='乘客ID', header=0)
df.head()
      是否幸存  仓位等级                                                 姓名      性别    年龄  兄弟姐妹个数  父母子女个数              船票信息       票价    客舱  登船港口
乘客ID
1        0     3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN     S
2        1     1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85     C
3        1     3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN     S
4        1     1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123     S
5        0     3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN     S
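Re-reading the file with names= is one option; an already-loaded frame can also be relabelled with DataFrame.rename. A small sketch on a toy frame (the two columns stand in for the full header list):

```python
import pandas as pd

# Toy frame with English headers (a stand-in for train.csv).
df = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1]})

# Map old names to new ones; rename returns a new DataFrame by default.
df_cn = df.rename(columns={"Survived": "是否幸存", "Pclass": "仓位等级"})
```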
1.2 A first look
Task 1: View basic information about the data
df.info
<bound method DataFrame.info of 是否幸存 仓位等级 姓名 性别 \
乘客ID
1 0 3 Braund, Mr. Owen Harris male
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female
3 1 3 Heikkinen, Miss. Laina female
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
5 0 3 Allen, Mr. William Henry male
... ... ... ... ...
887 0 2 Montvila, Rev. Juozas male
888 1 1 Graham, Miss. Margaret Edith female
889 0 3 Johnston, Miss. Catherine Helen "Carrie" female
890 1 1 Behr, Mr. Karl Howell male
891 0 3 Dooley, Mr. Patrick male
年龄 兄弟姐妹个数 父母子女个数 船票信息 票价 客舱 登船港口
乘客ID
1 22.0 1 0 A/5 21171 7.2500 NaN S
2 38.0 1 0 PC 17599 71.2833 C85 C
3 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 35.0 1 0 113803 53.1000 C123 S
5 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ...
887 27.0 0 0 211536 13.0000 NaN S
888 19.0 0 0 112053 30.0000 B42 S
889 NaN 1 2 W./C. 6607 23.4500 NaN S
890 26.0 0 0 111369 30.0000 C148 C
891 32.0 0 0 370376 7.7500 NaN Q
[891 rows x 11 columns]>
Note that df.info without parentheses merely echoes the bound method (whose repr prints the whole frame); call df.info() to get the concise summary of column dtypes and non-null counts.
Task 2: Observe the first 10 rows and the last 15 rows of the data
df.head(10)
      是否幸存  仓位等级                                                 姓名      性别    年龄  兄弟姐妹个数  父母子女个数              船票信息       票价    客舱  登船港口
乘客ID
1        0     3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN     S
2        1     1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85     C
3        1     3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN     S
4        1     1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123     S
5        0     3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN     S
6        0     3                                   Moran, Mr. James    male   NaN      0      0            330877   8.4583   NaN     Q
7        0     1                            McCarthy, Mr. Timothy J    male  54.0      0      0             17463  51.8625   E46     S
8        0     3                     Palsson, Master. Gosta Leonard    male   2.0      3      1            349909  21.0750   NaN     S
9        1     3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0      2            347742  11.1333   NaN     S
10       1     2                Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1      0            237736  30.0708   NaN     C
df.tail(15)
      是否幸存  仓位等级                                             姓名      性别    年龄  兄弟姐妹个数  父母子女个数              船票信息       票价    客舱  登船港口
乘客ID
877      0     3                      Gustafsson, Mr. Alfred Ossian    male  20.0      0      0              7534   9.8458   NaN     S
878      0     3                               Petroff, Mr. Nedelio    male  19.0      0      0            349212   7.8958   NaN     S
879      0     3                                 Laleff, Mr. Kristo    male   NaN      0      0            349217   7.8958   NaN     S
880      1     1      Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female  56.0      0      1             11767  83.1583   C50     C
881      1     2       Shelley, Mrs. William (Imanita Parrish Hall)  female  25.0      0      1            230433  26.0000   NaN     S
882      0     3                                 Markun, Mr. Johann    male  33.0      0      0            349257   7.8958   NaN     S
883      0     3                       Dahlberg, Miss. Gerda Ulrika  female  22.0      0      0              7552  10.5167   NaN     S
884      0     2                      Banfield, Mr. Frederick James    male  28.0      0      0  C.A./SOTON 34068  10.5000   NaN     S
885      0     3                             Sutehall, Mr. Henry Jr    male  25.0      0      0   SOTON/OQ 392076   7.0500   NaN     S
886      0     3               Rice, Mrs. William (Margaret Norton)  female  39.0      0      5            382652  29.1250   NaN     Q
887      0     2                              Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000   NaN     S
888      1     1                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000   B42     S
889      0     3           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607  23.4500   NaN     S
890      1     1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000  C148     C
891      0     3                                Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500   NaN     Q
Task 3: Check which entries are missing — return True where a value is null and False elsewhere
df.isnull()
      是否幸存   仓位等级     姓名     性别     年龄  兄弟姐妹个数  父母子女个数   船票信息     票价     客舱   登船港口
乘客ID
1     False  False  False  False  False  False  False  False  False   True  False
2     False  False  False  False  False  False  False  False  False  False  False
3     False  False  False  False  False  False  False  False  False   True  False
4     False  False  False  False  False  False  False  False  False  False  False
5     False  False  False  False  False  False  False  False  False   True  False
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
887   False  False  False  False  False  False  False  False  False   True  False
888   False  False  False  False  False  False  False  False  False  False  False
889   False  False  False  False   True  False  False  False  False   True  False
890   False  False  False  False  False  False  False  False  False  False  False
891   False  False  False  False  False  False  False  False  False   True  False
891 rows × 11 columns
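A boolean frame from isnull() is most useful when aggregated: since True counts as 1, chaining .sum() gives the number of missing values per column. A sketch on a toy frame with the same kind of gaps:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the 年龄 / 客舱 gaps (illustrative data).
df = pd.DataFrame({
    "年龄": [22.0, np.nan, 26.0],
    "客舱": [np.nan, "C85", np.nan],
})

missing_per_column = df.isnull().sum()  # True counts as 1
```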
1.3 Saving data
Task 1: Save the modified data as a new file, train_chinese.csv, in the working directory
df.to_csv('train.chinese.csv')  # this name is reused when the file is read back below; encoding='utf-8-sig' can help if Excel garbles the Chinese headers
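Because the frame's index is the named 乘客ID column, to_csv writes it out as the first CSV column; pass index_col when reading the file back to restore it as the index. A round-trip sketch using an in-memory buffer instead of a file on disk:

```python
import io
import pandas as pd

# Two toy rows with a named index, mirroring the saved frame.
df = pd.DataFrame({"票价": [7.25, 71.28]}, index=pd.Index([1, 2], name="乘客ID"))

buf = io.StringIO()
df.to_csv(buf)   # the index is written as the first column
buf.seek(0)

restored = pd.read_csv(buf, index_col="乘客ID")  # restore it as the index
```

Without index_col, the saved index comes back as an ordinary column, which is exactly what happens in section 2.1 below.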
2.1 Know what your data is called
Task 1: pandas has two core data types, DataFrame and Series. Look them up briefly, then write a small example of each.
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
example_1 = pd.Series(sdata)
example_1
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002, 2003],'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
example_2 = pd.DataFrame(data)
example_2
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
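The two types are closely related: each column of a DataFrame is a Series, and a dict of Series (aligned on their shared index) builds a DataFrame. A small sketch with illustrative data:

```python
import pandas as pd

s = pd.Series([35000, 71000], index=["Ohio", "Texas"], name="pop2000")
t = pd.Series([1.5, 2.4], index=["Ohio", "Texas"], name="growth")

# A dict of Series becomes a DataFrame, aligned on the shared index.
df = pd.DataFrame({"pop2000": s, "growth": t})

col = df["pop2000"]  # selecting one column gives a Series back
```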
Task 2: Load the data using the method from the previous lesson (the Chinese-header file saved above is used here)
df=pd.read_csv('train.chinese.csv')
df.head()
(first five rows — the Chinese headers are back, with 乘客ID now an ordinary column because the saved index was read back as data)
Task 3: View the columns of the DataFrame
df.columns
Index(['乘客ID', '是否幸存', '仓位等级', '姓名', '性别', '年龄', '兄弟姐妹个数', '父母子女个数', '船票信息',
'票价', '客舱', '登船港口'],
dtype='object')
Task 4: View every entry in the "Cabin" column
dir(df['客舱'])  # note: dir() lists the Series' attributes and methods, not its values; the values themselves are shown below with head()
['T',
'_AXIS_ALIASES',
'_AXIS_IALIASES',
'_AXIS_LEN',
'_AXIS_NAMES',
'_AXIS_NUMBERS',
'_AXIS_ORDERS',
'_AXIS_REVERSED',
'_HANDLED_TYPES',
'__abs__',
'__add__',
'__and__',
'__annotations__',
'__array__',
'__array_priority__',
'__array_ufunc__',
'__array_wrap__',
'__bool__',
'__class__',
'__contains__',
'__copy__',
'__deepcopy__',
'__delattr__',
'__delitem__',
'__dict__',
'__dir__',
'__div__',
'__divmod__',
'__doc__',
'__eq__',
'__finalize__',
'__float__',
'__floordiv__',
'__format__',
'__ge__',
'__getattr__',
'__getattribute__',
'__getitem__',
'__getstate__',
'__gt__',
'__hash__',
'__iadd__',
'__iand__',
'__ifloordiv__',
'__imod__',
'__imul__',
'__init__',
'__init_subclass__',
'__int__',
'__invert__',
'__ior__',
'__ipow__',
'__isub__',
'__iter__',
'__itruediv__',
'__ixor__',
'__le__',
'__len__',
'__long__',
'__lt__',
'__matmul__',
'__mod__',
'__module__',
'__mul__',
'__ne__',
'__neg__',
'__new__',
'__nonzero__',
'__or__',
'__pos__',
'__pow__',
'__radd__',
'__rand__',
'__rdiv__',
'__rdivmod__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__rfloordiv__',
'__rmatmul__',
'__rmod__',
'__rmul__',
'__ror__',
'__round__',
'__rpow__',
'__rsub__',
'__rtruediv__',
'__rxor__',
'__setattr__',
'__setitem__',
'__setstate__',
'__sizeof__',
'__str__',
'__sub__',
'__subclasshook__',
'__truediv__',
'__weakref__',
'__xor__',
'_accessors',
'_add_numeric_operations',
'_add_series_or_dataframe_operations',
'_agg_by_level',
'_agg_examples_doc',
'_agg_see_also_doc',
'_aggregate',
'_aggregate_multiple_funcs',
'_align_frame',
'_align_series',
'_binop',
'_box_item_values',
'_builtin_table',
'_can_hold_na',
'_check_inplace_setting',
'_check_is_chained_assignment_possible',
'_check_label_or_level_ambiguity',
'_check_setitem_copy',
'_clear_item_cache',
'_clip_with_one_bound',
'_clip_with_scalar',
'_consolidate',
'_consolidate_inplace',
'_construct_axes_dict',
'_construct_axes_dict_from',
'_construct_axes_from_arguments',
'_constructor',
'_constructor_expanddim',
'_constructor_sliced',
'_convert',
'_convert_dtypes',
'_create_indexer',
'_cython_table',
'_deprecations',
'_dir_additions',
'_dir_deletions',
'_drop_axis',
'_drop_labels_or_levels',
'_find_valid_index',
'_from_axes',
'_get_axis',
'_get_axis_name',
'_get_axis_number',
'_get_axis_resolvers',
'_get_block_manager_axis',
'_get_bool_data',
'_get_cacher',
'_get_cleaned_column_resolvers',
'_get_cython_func',
'_get_index_resolvers',
'_get_item_cache',
'_get_label_or_level_values',
'_get_numeric_data',
'_get_value',
'_get_values',
'_get_values_tuple',
'_get_with',
'_gotitem',
'_iget_item_cache',
'_index',
'_indexed_same',
'_info_axis',
'_info_axis_name',
'_info_axis_number',
'_init_dict',
'_init_mgr',
'_internal_get_values',
'_internal_names',
'_internal_names_set',
'_is_builtin_func',
'_is_cached',
'_is_copy',
'_is_datelike_mixed_type',
'_is_label_or_level_reference',
'_is_label_reference',
'_is_level_reference',
'_is_mixed_type',
'_is_numeric_mixed_type',
'_is_view',
'_ix',
'_ixs',
'_map_values',
'_maybe_cache_changed',
'_maybe_update_cacher',
'_metadata',
'_ndarray_values',
'_needs_reindex_multi',
'_obj_with_exclusions',
'_protect_consolidate',
'_reduce',
'_reindex_axes',
'_reindex_indexer',
'_reindex_multi',
'_reindex_with_indexers',
'_repr_data_resource_',
'_repr_latex_',
'_reset_cache',
'_reset_cacher',
'_selected_obj',
'_selection',
'_selection_list',
'_selection_name',
'_set_as_cached',
'_set_axis',
'_set_axis_name',
'_set_is_copy',
'_set_item',
'_set_labels',
'_set_name',
'_set_subtyp',
'_set_value',
'_set_values',
'_set_with',
'_set_with_engine',
'_setup_axes',
'_slice',
'_stat_axis',
'_stat_axis_name',
'_stat_axis_number',
'_take_with_is_copy',
'_to_dict_of_blocks',
'_try_aggregate_string_function',
'_typ',
'_unpickle_series_compat',
'_update_inplace',
'_validate_dtype',
'_values',
'_where',
'_xs',
'abs',
'add',
'add_prefix',
'add_suffix',
'agg',
'aggregate',
'align',
'all',
'any',
'append',
'apply',
'argmax',
'argmin',
'argsort',
'array',
'asfreq',
'asof',
'astype',
'at',
'at_time',
'attrs',
'autocorr',
'axes',
'between',
'between_time',
'bfill',
'bool',
'clip',
'combine',
'combine_first',
'convert_dtypes',
'copy',
'corr',
'count',
'cov',
'cummax',
'cummin',
'cumprod',
'cumsum',
'describe',
'diff',
'div',
'divide',
'divmod',
'dot',
'drop',
'drop_duplicates',
'droplevel',
'dropna',
'dtype',
'dtypes',
'duplicated',
'empty',
'eq',
'equals',
'ewm',
'expanding',
'explode',
'factorize',
'ffill',
'fillna',
'filter',
'first',
'first_valid_index',
'floordiv',
'ge',
'get',
'groupby',
'gt',
'hasnans',
'head',
'hist',
'iat',
'idxmax',
'idxmin',
'iloc',
'index',
'infer_objects',
'interpolate',
'is_monotonic',
'is_monotonic_decreasing',
'is_monotonic_increasing',
'is_unique',
'isin',
'isna',
'isnull',
'item',
'items',
'iteritems',
'keys',
'kurt',
'kurtosis',
'last',
'last_valid_index',
'le',
'loc',
'lt',
'mad',
'map',
'mask',
'max',
'mean',
'median',
'memory_usage',
'min',
'mod',
'mode',
'mul',
'multiply',
'name',
'nbytes',
'ndim',
'ne',
'nlargest',
'notna',
'notnull',
'nsmallest',
'nunique',
'pct_change',
'pipe',
'plot',
'pop',
'pow',
'prod',
'product',
'quantile',
'radd',
'rank',
'ravel',
'rdiv',
'rdivmod',
'reindex',
'reindex_like',
'rename',
'rename_axis',
'reorder_levels',
'repeat',
'replace',
'resample',
'reset_index',
'rfloordiv',
'rmod',
'rmul',
'rolling',
'round',
'rpow',
'rsub',
'rtruediv',
'sample',
'searchsorted',
'sem',
'set_axis',
'shape',
'shift',
'size',
'skew',
'slice_shift',
'sort_index',
'sort_values',
'squeeze',
'std',
'str',
'sub',
'subtract',
'sum',
'swapaxes',
'swaplevel',
'tail',
'take',
'to_clipboard',
'to_csv',
'to_dict',
'to_excel',
'to_frame',
'to_hdf',
'to_json',
'to_latex',
'to_list',
'to_markdown',
'to_numpy',
'to_period',
'to_pickle',
'to_sql',
'to_string',
'to_timestamp',
'to_xarray',
'transform',
'transpose',
'truediv',
'truncate',
'tshift',
'tz_convert',
'tz_localize',
'unique',
'unstack',
'update',
'value_counts',
'values',
'var',
'view',
'where',
'xs']
df['客舱'].head()
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: 客舱, dtype: object
Task 5: Load "test_1.csv", compare it with "train.csv", find the extra columns, and drop them
test_1 = pd.read_csv("C:\\Users\\Administrator\\Documents\\DataScience\\hands-on-data-analysis\\第一单元项目集合\\test_1.csv")
test_1
(891 rows × 14 columns — the same Titanic data plus an unnamed index column and an extra column a whose value is 100 in every row)
891 rows × 14 columns
test_1.drop(['a'], axis=1)  # returns a new DataFrame; reassign or pass inplace=True to keep the change
(the same table without column a)
891 rows × 13 columns
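Since drop returns a new DataFrame and leaves the original untouched, the usual ways to make the removal stick can be sketched on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

dropped = df.drop(["a"], axis=1)      # new frame; df itself is unchanged
df.drop(["b"], axis=1, inplace=True)  # modifies df in place
del df["c"]                           # deletes the column directly
```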
Task 6: Hide the ['PassengerId', 'Name', 'Age', 'Ticket'] columns and observe only the remaining ones
df=pd.read_csv('train.csv')
df.drop(['PassengerId','Name','Age','Ticket'],axis=1)
     Survived  Pclass     Sex  SibSp  Parch     Fare Cabin Embarked
0           0       3    male      1      0   7.2500   NaN        S
1           1       1  female      1      0  71.2833   C85        C
2           1       3  female      0      0   7.9250   NaN        S
3           1       1  female      1      0  53.1000  C123        S
4           0       3    male      0      0   8.0500   NaN        S
..        ...     ...     ...    ...    ...      ...   ...      ...
886         0       2    male      0      0  13.0000   NaN        S
887         1       1  female      0      0  30.0000   B42        S
888         0       3  female      1      2  23.4500   NaN        S
889         1       1    male      0      0  30.0000  C148        C
890         0       3    male      0      0   7.7500   NaN        Q
891 rows × 8 columns
2.2 The logic of filtering
Task 1: Using "Age" as the filter condition, show the passengers younger than 10.
df[df['Age']<10]
(all passengers with Age below 10)
62 rows × 12 columns
Task 2: Using "Age" as the condition, show the passengers older than 10 and younger than 50, and name the result midage
midage = df[(df['Age']>10)&(df['Age']<50)]
midage.head()
(the familiar first five rows — every age falls between 10 and 50)
Each logical condition must be wrapped in its own parentheses when combining them with &.
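The same filter can also be written without the explicit parenthesized mask, for example with DataFrame.query. A small sketch with toy ages:

```python
import pandas as pd

df = pd.DataFrame({"Age": [5, 22, 38, 70]})

mask_style = df[(df["Age"] > 10) & (df["Age"] < 50)]
query_style = df.query("10 < Age < 50")  # chained comparison, no parentheses needed
```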
Task 3: Show the "Pclass" and "Sex" values of the 100th row of midage
print(midage.iloc[100]['Pclass'])
print(midage.iloc[100]['Sex'])
2
male
This is sometimes written as midage.loc[[100], ['Pclass', 'Sex']], but that is not equivalent: loc selects by index label, and midage keeps the labels of the original frame, so label 100 is generally not its 100th row (compare the loc and iloc results in Tasks 4 and 5 below).
Task 4: Use the loc method to display "Pclass", "Name" and "Sex" for rows 100, 105 and 108 of midage
midage.loc[[100,105,108],['Pclass','Name','Sex']]
     Pclass                     Name     Sex
100       3  Petranec, Miss. Matilda  female
105       3    Mionoff, Mr. Stoytcho    male
108       3          Rekic, Mr. Tido    male
Task 5: Use the iloc method to display "Pclass", "Name" and "Sex" for rows 100, 105 and 108 of midage
midage.iloc[[100, 105, 108], [2, 3, 4]]  # iloc takes positions only; to use column names, write midage.iloc[[100, 105, 108]][['Pclass', 'Name', 'Sex']]
     Pclass                               Name   Sex
149       2  Byles, Rev. Thomas Roussel Davids  male
160       3           Cribb, Mr. John Hatfield  male
163       3                    Calic, Mr. Jovo  male
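The discrepancy between Tasks 4 and 5 comes down to labels versus positions: after filtering, the subset keeps the original row labels, so loc and iloc pick different rows. A toy frame makes this visible:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40]})  # row labels 0..3
sub = df[df["x"] > 15]                      # keeps labels 1, 2, 3

by_label = sub.loc[1, "x"]      # label 1 -> the subset's first row
by_position = sub.iloc[1]["x"]  # position 1 -> the subset's second row
```

Calling sub.reset_index(drop=True) renumbers the rows so that labels and positions coincide again.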
3.1 Before starting: import numpy, pandas, and the data
text = pd.read_csv('train.chinese.csv')
text.head()
(first five rows — Chinese headers, with 乘客ID as an ordinary column)
Task 1: Use pandas to sort the sample data in ascending order
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=['2', '1'],
columns=['d', 'a', 'b', 'c'])
frame.sort_index()
   d  a  b  c
1  4  5  6  7
2  0  1  2  3
frame.sort_index(axis=1)  # sort the column index ascending: a-b-c-d
   a  b  c  d
2  1  2  3  0
1  5  6  7  4
frame.sort_index(axis=1, ascending=False)  # sort the column index descending: d-c-b-a
   d  c  b  a
2  0  3  2  1
1  4  7  6  5
frame.sort_values(by=['a','c'])
   d  a  b  c
2  0  1  2  3
1  4  5  6  7
Task 2: Sort the Titanic data (train.csv) by fare and age together, in descending order. What can you discover from the result?
df.sort_values(['Age','Fare'],ascending=False)
(sorted by Age, then Fare, descending — the oldest passenger, 80-year-old Mr. Barkworth, comes first, and rows with missing ages sink to the bottom)
891 rows × 12 columns
df.sort_values(['Fare','Age'],ascending=False)
(sorted by Fare, then Age, descending — the three 512.3292 tickets top the list)
891 rows × 12 columns
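sort_values can also mix directions per column (a list for ascending) and control where missing keys land (na_position). A sketch with toy fares and ages:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Fare": [7.25, 71.28, 7.25], "Age": [22.0, np.nan, 2.0]})

# Fare descending, Age ascending; na_position decides where NaN sort keys go.
out = df.sort_values(["Fare", "Age"], ascending=[False, True], na_position="first")
```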
Task 3: Use pandas arithmetic to compute the sum of two DataFrames
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
columns=['a', 'b', 'c'],
index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
columns=['a', 'e', 'c'],
index=['first', 'one', 'two', 'second'])
frame1_a
         a    b    c
one    0.0  1.0  2.0
two    3.0  4.0  5.0
three  6.0  7.0  8.0
frame1_b
          a     e     c
first   0.0   1.0   2.0
one     3.0   4.0   5.0
two     6.0   7.0   8.0
second  9.0  10.0  11.0
frame1_a + frame1_b  # the row and column indexes are aligned (unioned) automatically
          a   b     c   e
first   NaN NaN   NaN NaN
one     3.0 NaN   7.0 NaN
second  NaN NaN   NaN NaN
three   NaN NaN   NaN NaN
two     9.0 NaN  13.0 NaN
Adding two DataFrames returns a new DataFrame: values at matching row and column labels are added, while any label present on only one side yields NaN.
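When the NaN from non-overlapping labels is unwanted, the add method with fill_value treats a value missing on one side as 0; cells missing on both sides stay NaN. A sketch:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(4.).reshape(2, 2), columns=["a", "b"], index=["one", "two"])
b = pd.DataFrame(np.arange(4.).reshape(2, 2), columns=["a", "c"], index=["one", "three"])

plain = a + b                   # NaN wherever a label is missing on one side
filled = a.add(b, fill_value=0) # one-sided gaps treated as 0 before adding
```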
Task 4: Using the Titanic data, how many people are in the largest family on board?
max(text['兄弟姐妹个数'] + text['父母子女个数'])  # 10 relatives aboard, i.e. a family of 11 counting the passenger
10
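The same computation written out step by step, on a hypothetical stand-in for the two Titanic columns; idxmax additionally tells you which row holds the maximum:

```python
import pandas as pd

# Hypothetical stand-in for the two Titanic columns.
text = pd.DataFrame({
    "兄弟姐妹个数": [1, 0, 8],
    "父母子女个数": [0, 2, 2],
})

family = text["兄弟姐妹个数"] + text["父母子女个数"]  # relatives aboard per passenger
largest = family.max()
row_of_largest = family.idxmax()  # index label of that passenger
```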
Task 5: Learn to use the pandas describe() function to view basic statistics of the data
frame2 = pd.DataFrame([[1.4, np.nan],
[7.1, -4.5],
[np.nan, np.nan],
[0.75, -1.3]
], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
frame2
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
frame2.describe()
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
count: number of non-null observations
mean: the mean of the values
std: the standard deviation of the values
min: the smallest value
25%: the first quartile (25th percentile)
50%: the median (50th percentile)
75%: the third quartile (75th percentile)
max: the largest value
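describe() is also configurable: percentiles picks which quantiles are reported, and on a non-numeric Series it reports count/unique/top/freq instead. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "票价": [7.25, 71.28, 7.92, 53.1],
    "登船港口": ["S", "C", "S", "S"],
})

num = df["票价"].describe(percentiles=[0.1, 0.9])  # custom percentiles (50% is always included)
obj = df["登船港口"].describe()                     # count / unique / top / freq for object dtype
```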
Task 6: Look at the basic statistics of the fare and parents/children columns of the Titanic data. What can you discover?
text['票价'].describe()
count 891.000000
mean 32.204208
std 49.693429
min 0.000000
25% 7.910400
50% 14.454200
75% 31.000000
max 512.329200
Name: 票价, dtype: float64
text['父母子女个数'].describe()
count 891.000000
mean 0.381594
std 0.806057
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 6.000000
Name: 父母子女个数, dtype: float64
Both columns are heavily right-skewed: the median fare (14.45) sits far below the mean (32.20) because a few tickets were very expensive (max 512.33), and at least 75% of passengers travelled with no parents or children aboard.
This article is a repost; see the original at the source link. For takedown requests, contact cloudcommunity@tencent.com.