Translation | 10 Tips and Tricks for Data Scientists Vol.2

庄闪闪

Published 2021-06-25 17:08:40

This article is part of the column: 庄闪闪的R语言手册

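Original article: 10 Tips And Tricks For Data Scientists Vol.2[1]

Translator: 赵西西

About the original blog: Predictive Hacks is an online resource hub for everything related to data science. The blog is run by a group of data scientists and focuses on explaining how to apply big data techniques in a range of fields, from machine learning and artificial intelligence to business.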

1 Introduction

The first installment gave some data analysis tips (mainly in Python and R); see: Translation | 10 Tips and Tricks for Data Scientists Vol.1

2 R

2.1 Getting each row's value based on a column name

Consider the following data frame:

set.seed(5)
df<-as.data.frame(matrix(sample(1:100,12),ncol=3))
df$Selection<-c("V1","V3","V2","V3")
df

  V1 V2 V3 Selection
1 66 41 19        V1
2 57 85  3        V3
3 79 94 38        V2
4 75 71 58        V3

For each row, we want the value from the column named in Selection. Indexing the data frame with a two-column (row, column) matrix does this; as.numeric is needed because the character Selection column turns the extracted values into strings:

df$Value<-as.numeric(df[cbind(seq_len(nrow(df)), match(df$Selection,names(df)))])
df

  V1 V2 V3 Selection Value
1 66 41 19        V1    66
2 57 85  3        V3     3
3 79 94 38        V2    94
4 75 71 58        V3    58

2.2 Creating time attributes

When the data is time-related, you can create time attributes as features for a model. For example, we can create:

Year
Month
Weekday
Hour
Minute
Week of the year
Quarter

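The code below shows how to create these attributes in R from a DateTime object. It is advisable to convert features such as weekdays, months, and hours to factors. Further features can be derived in the same way, for example:

A boolean called isWeekend, taking 1 for weekends and 0 otherwise.
The period of the day (such as morning, afternoon, evening).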

library(tidyverse)
set.seed(5)
# ten random timestamps between June 2018 and December 2019
df <- tibble(my_date = lubridate::as_datetime(runif(10, 1530000000, 1577739600)))
df %>% mutate(
  Year         = format(my_date, '%Y'),
  Month_Number = as.factor(format(my_date, '%m')),
  Weekday      = as.factor(weekdays(my_date)),
  Hour         = as.factor(format(my_date, '%H')),
  Minute       = as.factor(format(my_date, '%M')),
  Week         = format(my_date, '%W'),
  Quarter      = lubridate::quarter(my_date, with_year = TRUE)
)
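
The same attributes, together with the isWeekend flag and the period of the day mentioned above, can be sketched with Python's standard library alone. This is only an illustration: the function name time_features and the hour boundaries for Morning, Afternoon, and Evening are assumptions, not part of the original article.

```python
from datetime import datetime, timezone


def time_features(ts):
    """Return a dict of calendar features for one datetime.

    The period-of-day boundaries below are an assumption for illustration.
    """
    if 5 <= ts.hour < 12:
        period = "Morning"
    elif 12 <= ts.hour < 18:
        period = "Afternoon"
    else:
        period = "Evening"
    return {
        "Year": ts.year,
        "Month_Number": ts.strftime("%m"),
        "Weekday": ts.strftime("%A"),       # name depends on the locale
        "Hour": ts.strftime("%H"),
        "Minute": ts.strftime("%M"),
        "Week": ts.strftime("%W"),          # week of year, Monday first
        "Quarter": (ts.month - 1) // 3 + 1,
        "isWeekend": 1 if ts.weekday() >= 5 else 0,  # Saturday=5, Sunday=6
        "Period": period,
    }


# 1530000000 is the lower bound used in the R example above
example = datetime.fromtimestamp(1530000000, tz=timezone.utc)
features = time_features(example)
print(features)
```

Returning a plain dictionary keeps the sketch independent of any data frame library; the same function could be applied row by row to a column of timestamps.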

3 Python

3.1 Creating a file from Jupyter

To write a file, type %%writefile filename at the top of a Jupyter cell. For example, to create a new file named myfile.py:

%%writefile myfile.py
def my_function():
    print("Hello from a function")

To view the file, type !cat myfile.py. To append new content, use the additional -a flag. For example, to add a call to my_function() to the file:

%%writefile -a myfile.py  
my_function()

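The file now contains the function definition followed by the call to my_function().

You can run the script with the !python myfile.py command or by typing %run -i myfile.py.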
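
Outside a notebook, the write, append, and run steps above can be approximated with ordinary file I/O. The following is a minimal sketch under that assumption; the helper write_file and the temporary directory are hypothetical and are not part of Jupyter.

```python
import contextlib
import io
import runpy
import tempfile
from pathlib import Path


def write_file(path, text, append=False):
    """Write text to path; append=True mimics the -a flag of %%writefile."""
    mode = "a" if append else "w"
    with open(path, mode, encoding="utf-8") as fh:
        fh.write(text)


# Work in a temporary directory so nothing in the current folder is touched
workdir = Path(tempfile.mkdtemp())
script = workdir / "myfile.py"

# Create the file, then append the function call to it
write_file(script, 'def my_function():\n    print("Hello from a function")\n')
write_file(script, "\nmy_function()\n", append=True)

# Run the script in-process and capture what it prints
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    runpy.run_path(str(script), run_name="__main__")
script_output = buffer.getvalue()
print(script_output, end="")
```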

3.2 Getting each row's value based on a column name

Build a data frame with the pandas DataFrame class:

import pandas as pd
df = pd.DataFrame.from_dict({"V1": [66, 57, 79,75], "V2": [41,85,94,71], 
                             "V3":[19,3,38,58], "Selection":['V1','V3', 'V2','V3']})
df

   V1  V2  V3 Selection
0  66  41  19        V1
1  57  85   3        V3
2  79  94  38        V2
3  75  71  58        V3

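Based on the Selection column, we want a new column whose first value is the corresponding value from column V1, whose second value is the corresponding value from column V3, and so on. The original article used df.lookup(df.index, df.Selection) for this, but DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. The replacement suggested in the pandas documentation combines factorize with NumPy indexing:

import numpy as np
idx, cols = pd.factorize(df['Selection'])
df['Value'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
df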

   V1  V2  V3 Selection  Value
0  66  41  19        V1     66
1  57  85   3        V3      3
2  79  94  38        V2     94
3  75  71  58        V3     58
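
To make the semantics of this row-wise lookup explicit, here is a small sketch in plain Python with no pandas involved. The list-of-dictionaries representation is an assumption chosen only for illustration.

```python
# Each row is a dict; "Selection" names the column whose value we want
rows = [
    {"V1": 66, "V2": 41, "V3": 19, "Selection": "V1"},
    {"V1": 57, "V2": 85, "V3": 3, "Selection": "V3"},
    {"V1": 79, "V2": 94, "V3": 38, "Selection": "V2"},
    {"V1": 75, "V2": 71, "V3": 58, "Selection": "V3"},
]

# For every row, read the column that the row itself points to
values = [row[row["Selection"]] for row in rows]
print(values)  # [66, 3, 94, 58]
```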

3.3 Creating a word cloud from a dictionary

You can create a word cloud by specifying the frequency of each word.

import matplotlib.pyplot as plt
from wordcloud import WordCloud
# assume that this is the dictionary (word -> frequency), feel free to change it
word_cloud_dict = {'Git':100, 'GitHub':100, 'push':50, 'pull':10, 'commit':80,
                   'add':30, 'diff':10, 'mv':5, 'log':8, 'branch':30, 'checkout':25}
wordcloud = WordCloud(width=1000, height=500).generate_from_frequencies(word_cloud_dict)
plt.figure(figsize=(15, 8))
plt.imshow(wordcloud)
plt.axis("off")  # hide the axes around the image
plt.show()
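
If the frequencies are not known in advance, a dictionary of this shape could be derived from raw text. The sketch below uses collections.Counter from the standard library; the sample text and the simple tokenizing pattern are assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical input text; replace it with your own corpus
text = "git add . git commit git push git pull git commit"

# Lowercase the text and keep runs of letters as tokens
tokens = re.findall(r"[a-z]+", text.lower())

# Count occurrences; the result maps each word to its frequency
word_freq = dict(Counter(tokens))
print(word_freq)
```

A dictionary built this way has the same word-to-frequency shape as the one passed to generate_from_frequencies above.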

3.4 Checking whether the columns of a pandas data frame contain a specific value

To check which columns of a DataFrame contain the character a:

import pandas as pd
df = pd.DataFrame({"A"  : ["a", "b", "c"], "B" : ["d", "e", "f"], "C" : ["x", "y" , "a"]})
df

   A  B  C
0  a  d  x
1  b  e  y
2  c  f  a

Just type:

(df=='a').any()

A     True
B    False
C     True

3.5 Saving multiple pandas data frames to a single Excel file

Suppose you have several data frames and want to save them to a single Excel file with one worksheet per data frame:

import pandas as pd

# it is convenient to store the pandas data frames in a
# dictionary, where the key is the worksheet name that you want to give
# and the value is the data frame (df1 to df4 are assumed to exist already)
df_dict = {'My_First_Tab': df1, 'My_Second_Tab': df2,
           'My_Third_Tab': df3, 'My_Fourth_Tab': df4}

# create the writer and give a name to the final Excel file,
# for example Final.xlsx; the with block saves and closes the file
# on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('Final.xlsx', engine='xlsxwriter') as writer:
    # iterate over the dictionary of data frames
    for my_sheet, dframe in df_dict.items():
        dframe.to_excel(writer, sheet_name=my_sheet, index=False)

4 Google Spreadsheets

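4.1 Version history in Google Docs and Sheets

Most data scientists are familiar with Git and GitHub, yet many are unaware of the version history feature in Google Docs, Sheets, and Slides. Here is how to use version history in a Google Doc:

Open the Google Doc.
At the top, click File > Version history.

In the version history panel you will see the dates of the changes and the names of the authors. For example, on July 16, 2019, at 4:15 PM, Julia Penny edited the document.

You can select any of the listed versions.

Finally, click the "Restore this version" button to return the document to that earlier state.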

5 Linux

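5.1 Copying a folder in Linux

On Linux and similar operating systems, to copy a folder from one location to another, run the following bash command:

cp -R /some/dir/ /some/other/dir/

If /some/other/dir/ does not exist, it is created (its parent directory must already exist). If it already exists, GNU cp places a copy of the source folder inside it.
-R copies directories recursively. GNU cp also accepts -r as an equivalent spelling; this is particular to cp, since command-line options are in general case-sensitive.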
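
For scripts that need the same recursive copy without calling the shell, Python's standard library offers shutil.copytree. The sketch below is an illustration only; the temporary directory and the names src_dir and dst_dir are assumptions.

```python
import shutil
import tempfile
from pathlib import Path

# Build a small source tree inside a temporary directory
base = Path(tempfile.mkdtemp())
src = base / "src_dir"
(src / "nested").mkdir(parents=True)
(src / "nested" / "data.txt").write_text("hello", encoding="utf-8")

# Recursively copy src to dst; dst must not exist yet unless
# dirs_exist_ok=True is passed (available since Python 3.8)
dst = base / "dst_dir"
shutil.copytree(src, dst)

copied = sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*"))
print(copied)  # ['nested', 'nested/data.txt']
```

Unlike cp -R, copytree by default refuses to copy into a destination that already exists, which avoids accidentally nesting the source inside it.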

References

[1] 10 Tips And Tricks For Data Scientists Vol.2: https://predictivehacks.com/10-tips-and-tricks-for-data-scientists-vol-2/

Originally published: 2021-05-25

This article was shared from the WeChat official account 庄闪闪的R语言手册.
