
Pandas 1.x Cookbook (Second Edition), Chapter 02: Essential DataFrame Operations

SeanCheney
Published 2021-02-05 14:47:02
This article is part of the column: SeanCheney's Column

Chapter 01: Pandas Foundations

Chapter 02: Essential DataFrame Operations


2.1 Selecting multiple DataFrame columns

Pass a list of column names to the indexing operator to select multiple columns from a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> movies = pd.read_csv("data/movie.csv")
>>> movie_actor_director = movies[
...     [
...         "actor_1_name",
...         "actor_2_name",
...         "actor_3_name",
...         "director_name",
...     ]
... ]
>>> movie_actor_director.head()
  actor_1_name actor_2_name actor_3_name director_name
0  CCH Pounder  Joel Dav...    Wes Studi  James Ca...
1  Johnny Depp  Orlando ...  Jack Dav...  Gore Ver...
2  Christop...  Rory Kin...  Stephani...   Sam Mendes
3    Tom Hardy  Christia...  Joseph G...  Christop...
4  Doug Walker   Rob Walker          NaN  Doug Walker

# Selecting a single column with a list or with a bare string returns different types.
>>> type(movies[["director_name"]])
<class 'pandas.core.frame.DataFrame'>   # a DataFrame
>>> type(movies["director_name"])
<class 'pandas.core.series.Series'>   # a Series

The .loc indexer can also select columns, with the same distinction between a list and a single string:

>>> type(movies.loc[:, ["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies.loc[:, "director_name"])
<class 'pandas.core.series.Series'>
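
The difference between the two forms can be checked on a tiny frame. The sketch below uses a made-up two-row DataFrame instead of the movie dataset; the names `df`, `as_frame`, `as_series`, `same_frame`, and `same_series` are illustrative and do not come from the original recipe.

```python
import pandas as pd

# Hypothetical stand-in for the movie data
df = pd.DataFrame(
    {
        "director_name": ["James Cameron", "Gore Verbinski"],
        "duration": [178.0, 169.0],
    }
)

as_frame = df[["director_name"]]  # list of names -> two-dimensional DataFrame
as_series = df["director_name"]   # single name -> one-dimensional Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)

# .loc follows the same rule for its column selector
same_frame = df.loc[:, ["director_name"]]
same_series = df.loc[:, "director_name"]
```

Keeping the list brackets is useful when later code expects DataFrame methods or a `columns` attribute, even if only one column is selected.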

Storing the column names in a list beforehand makes the code easier to read:

>>> cols = [
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
>>> movie_actor_director = movies[cols]

Omitting the list brackets raises a KeyError, because the comma-separated names are interpreted as a single tuple key:

>>> movies[
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
Traceback (most recent call last):
  ...
KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')
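
The error arises because Python packs comma-separated values inside square brackets into one tuple, so pandas looks for a single column labelled by that tuple. A minimal sketch on a made-up frame (`df` and the column names `a` and `b` are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

try:
    df["a", "b"]  # equivalent to df[("a", "b")]: one tuple key, not two names
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)

# Wrapping the names in a list selects both columns
both = df[["a", "b"]]
print(both.columns.tolist())
```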

2.2 Selecting columns with methods

After shortening the column names, count how many columns there are of each data type:

>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return (
...         str(col)
...         .replace("facebook_likes", "fb")
...         .replace("_for_reviews", "")
...     )
>>> movies = movies.rename(columns=shorten)
>>> movies.dtypes.value_counts()
float64    13
int64       3
object     12
dtype: int64
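
When `rename` receives a function for `columns`, it calls that function on every column label and uses the return value as the new label. The sketch below repeats the `shorten` idea on a made-up three-column frame; the data values and the names `df`, `short`, and `dtype_counts` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical frame with long column names similar to the movie dataset
df = pd.DataFrame(
    {
        "director_facebook_likes": [0.0, 563.0],
        "num_critic_for_reviews": [723.0, 302.0],
        "movie_title": ["Avatar", "Spectre"],
    }
)

def shorten(col):
    # Applied once per column label
    return str(col).replace("facebook_likes", "fb").replace("_for_reviews", "")

short = df.rename(columns=shorten)
print(short.columns.tolist())  # ['director_fb', 'num_critic', 'movie_title']

# Number of columns per dtype
dtype_counts = short.dtypes.value_counts()
print(dtype_counts)
```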

Use the .select_dtypes method to select only the integer columns:

>>> movies.select_dtypes(include="int").head()
   num_voted_users  cast_total_fb  movie_fb
0           886204           4834     33000
1           471220          48350         0
2           275868          11700     85000
3          1144337         106759    164000
4                8            143         0

Select all numeric columns:

>>> movies.select_dtypes(include="number").head()
   num_critics  duration  ...  aspect_ratio  movie_fb
0        723.0     178.0  ...         1.78      33000
1        302.0     169.0  ...         2.35          0
2        602.0     148.0  ...         2.35      85000
3        813.0     164.0  ...         2.35     164000
4          NaN       NaN  ...          NaN          0

Select the integer and object (string) columns:

>>> movies.select_dtypes(include=["int", "object"]).head()
   color      director_name  ... content_rating movie_fb
0  Color      James Cameron  ...          PG-13    33000
1  Color     Gore Verbinski  ...          PG-13        0
2  Color         Sam Mendes  ...          PG-13    85000
3  Color  Christopher Nolan  ...          PG-13   164000
4    NaN        Doug Walker  ...            NaN        0

Select all columns that are not floats:

>>> movies.select_dtypes(exclude="float").head()
   color director_name  ... content_rating movie_fb
0  Color  James Ca...   ...        PG-13      33000
1  Color  Gore Ver...   ...        PG-13          0
2  Color   Sam Mendes   ...        PG-13      85000
3  Color  Christop...   ...        PG-13     164000
4    NaN  Doug Walker   ...          NaN          0
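
The `include` and `exclude` arguments can be compared side by side on a small frame. This is a sketch with made-up data; the column names and the variables `ints`, `numbers`, `non_floats`, and `non_numbers` are assumptions. It spells the integer type as `"int64"` rather than `"int"`, since the width that `"int"` maps to can depend on the platform and NumPy version.

```python
import pandas as pd

# Hypothetical frame with one string, one integer, and one float column
df = pd.DataFrame(
    {
        "title": ["Avatar", "Spectre"],
        "votes": pd.Series([886204, 275868], dtype="int64"),
        "duration": [178.0, 148.0],
    }
)

ints = df.select_dtypes(include="int64")        # only the integer column
numbers = df.select_dtypes(include="number")    # integers and floats
non_floats = df.select_dtypes(exclude="float")  # everything except floats
non_numbers = df.select_dtypes(exclude="number")  # everything non-numeric

for name, part in [
    ("ints", ints),
    ("numbers", numbers),
    ("non_floats", non_floats),
    ("non_numbers", non_numbers),
]:
    print(name, part.columns.tolist())
```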

Use the .filter method to select all columns whose names contain fb:

>>> movies.filter(like="fb").head()
   director_fb  actor_3_fb  ...  actor_2_fb  movie_fb
0          0.0       855.0  ...       936.0     33000
1        563.0      1000.0  ...      5000.0         0
2          0.0       161.0  ...       393.0     85000
3      22000.0     23000.0  ...     23000.0    164000
4        131.0         NaN  ...        12.0         0

The items parameter selects columns by their exact names:

>>> cols = [
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
>>> movies.filter(items=cols).head()
      actor_1_name  ...      director_name
0      CCH Pounder  ...      James Cameron
1      Johnny Depp  ...     Gore Verbinski
2  Christoph Waltz  ...         Sam Mendes
3        Tom Hardy  ...  Christopher Nolan
4      Doug Walker  ...        Doug Walker

The regex parameter selects columns by regular expression. The code below selects the columns whose names contain a digit:

>>> movies.filter(regex=r"\d").head()
   actor_3_fb actor_2_name  ...  actor_3_name actor_2_fb
0       855.0  Joel Dav...  ...    Wes Studi       936.0
1      1000.0  Orlando ...  ...  Jack Dav...      5000.0
2       161.0  Rory Kin...  ...  Stephani...       393.0
3     23000.0  Christia...  ...  Joseph G...     23000.0
4         NaN   Rob Walker  ...          NaN        12.0
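
The three `.filter` parameters can be tried on a one-row frame. This is a sketch with made-up data, and the variable names are illustrative. It also shows one behaviour worth knowing: `items` silently skips names that are not present, whereas indexing with a list of names would raise a KeyError. Only one of `items`, `like`, and `regex` can be passed per call.

```python
import pandas as pd

# Hypothetical one-row frame using the shortened column names
df = pd.DataFrame(
    {
        "actor_1_name": ["CCH Pounder"],
        "actor_1_fb": [1000.0],
        "director_name": ["James Cameron"],
        "movie_fb": [33000],
    }
)

by_substring = df.filter(like="fb")  # names containing "fb"
by_regex = df.filter(regex=r"\d")    # names containing a digit
# "no_such_column" is not in df and is dropped without an error
by_items = df.filter(items=["director_name", "no_such_column"])

print(by_substring.columns.tolist())
print(by_regex.columns.tolist())
print(by_items.columns.tolist())
```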

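2.3 Ordering columns

Guidelines for ordering columns:

  • Classify each column as either categorical or continuous;
  • Within the categorical and the continuous columns, group related columns together;
  • Place the categorical columns before the continuous ones.

Here is an example. First read the data and shorten the column names:
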
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return col.replace("facebook_likes", "fb").replace(
...         "_for_reviews", ""
...     )
>>> movies = movies.rename(columns=shorten)

Sort the following column names into categorical and continuous groups:

>>> movies.columns
Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
       'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
       'movie_fb'],
      dtype='object')
>>> cat_core = [
...     "movie_title",
...     "title_year",
...     "content_rating",
...     "genres",
... ]
>>> cat_people = [
...     "director_name",
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
... ]
>>> cat_other = [
...     "color",
...     "country",
...     "language",
...     "plot_keywords",
...     "movie_imdb_link",
... ]
>>> cont_fb = [
...     "director_fb",
...     "actor_1_fb",
...     "actor_2_fb",
...     "actor_3_fb",
...     "cast_total_fb",
...     "movie_fb",
... ]
>>> cont_finance = ["budget", "gross"]
>>> cont_num_reviews = [
...     "num_voted_users",
...     "num_user",
...     "num_critic",
... ]
>>> cont_other = [
...     "imdb_score",
...     "duration",
...     "aspect_ratio",
...     "facenumber_in_poster",
... ]

Concatenate all of the lists above to form the final column order, and confirm that no column has been left out:

>>> new_col_order = (
...     cat_core
...     + cat_people
...     + cat_other
...     + cont_fb
...     + cont_finance
...     + cont_num_reviews
...     + cont_other
... )
>>> set(movies.columns) == set(new_col_order)
True
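
A set comparison confirms that the same names appear on both sides, but it cannot detect a name listed twice, because sets discard duplicates. The sketch below shows a stricter check; `check_column_order` is a hypothetical helper written for this illustration, not a pandas function, and the frame is made up.

```python
from collections import Counter

import pandas as pd

def check_column_order(df, order):
    """Return (missing, unknown, repeated) column names for a proposed order."""
    missing = [c for c in df.columns if c not in order]
    unknown = [c for c in order if c not in df.columns]
    repeated = [c for c, n in Counter(order).items() if n > 1]
    return missing, unknown, repeated

# Hypothetical frame and a proposed order that lists "gross" twice
df = pd.DataFrame({"title": ["Avatar"], "budget": [1.0], "gross": [2.0]})
order = ["title", "budget", "gross", "gross"]

# The set comparison passes even though "gross" is repeated
print(set(df.columns) == set(order))  # True

report = check_column_order(df, order)
print(report)
```

If all three lists in the report are empty, the proposed order is a true permutation of the existing columns.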

Pass the new column order to the indexing operator of movies to get a DataFrame with the reordered columns:

>>> movies[new_col_order].head()
   movie_title  title_year  ... aspect_ratio facenumber_in_poster
0       Avatar      2009.0  ...         1.78                  0.0
1  Pirates ...      2007.0  ...         2.35                  0.0
2      Spectre      2015.0  ...         2.35                  1.0
3  The Dark...      2012.0  ...         2.35                  0.0
4  Star War...         NaN  ...          NaN                  0.0
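
As a rough first pass, the split between categorical and continuous columns can be approximated from the dtypes. The sketch below is an assumption-laden shortcut rather than the recipe's method: it treats every non-numeric column as categorical, which would misplace a column such as title_year, which is stored as a number but listed among the categorical columns in this recipe. The data and variable names are made up.

```python
import pandas as pd

# Hypothetical frame with columns in no particular order
df = pd.DataFrame(
    {
        "gross": [100.0, 200.0],
        "movie_title": ["Avatar", "Spectre"],
        "duration": [178.0, 148.0],
        "color": ["Color", "Color"],
    }
)

# Non-numeric columns first, numeric columns after, original order kept within each group
categorical = df.select_dtypes(exclude="number").columns.tolist()
continuous = df.select_dtypes(include="number").columns.tolist()

reordered = df[categorical + continuous]
print(reordered.columns.tolist())
```

The resulting lists are a starting point that can then be rearranged by hand into more meaningful groups.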

2.4 Summarizing a DataFrame

Inspect the basic attributes of the dataset: shape, size, and ndim.

>>> movies = pd.read_csv("data/movie.csv")
>>> movies.shape
(4916, 28)
>>> movies.size
137648
>>> movies.ndim
2
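
The three attributes are related: size is the number of rows times the number of columns, which is why 4916 × 28 gives 137648 above. A minimal sketch on a made-up frame of zeros (the names `df`, `n_rows`, and `n_cols` are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical 5 x 3 frame
df = pd.DataFrame(np.zeros((5, 3)), columns=["a", "b", "c"])

n_rows, n_cols = df.shape
print(df.shape)  # (rows, columns)
print(df.size)   # rows * columns
print(df.ndim)   # 2 for a DataFrame
print(len(df))   # number of rows only

# A single column is a Series, which has one dimension
print(df["a"].ndim)
```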

The .count method returns the number of non-missing values in each column:

>>> movies.count()
color                      4897
director_name              4814
num_critic_for_reviews     4867
duration                   4901
director_facebook_likes    4814
                           ... 
title_year                 4810
actor_2_facebook_likes     4903
imdb_score                 4916
aspect_ratio               4590
movie_facebook_likes       4916
Length: 28, dtype: int64
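
The counts can be cross-checked against the number of missing values: for every column, the non-missing count plus the missing count equals the number of rows. A small sketch with made-up data (the frame and the names `non_missing` and `missing` are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing values
df = pd.DataFrame(
    {
        "color": ["Color", None, "Color"],
        "duration": [178.0, np.nan, np.nan],
        "score": [7.9, 7.1, 6.8],
    }
)

non_missing = df.count()    # non-missing values per column
missing = df.isna().sum()   # missing values per column

print(non_missing)
print(missing)

# Each column's two counts add up to the row count
print((non_missing + missing == len(df)).all())
```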