NumPy Is Dropping Python 2 Support: A Python 3 Guide for Data Scientists

Source: 企鹅号 - 唯物

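Python has become the dominant programming language for machine learning and data science, and Python 2 and Python 3 coexist within the Python ecosystem. However, NumPy will stop supporting Python 2.7 at the end of 2019, and new releases after 2018 will support only Python 3.

To help data scientists get up to speed with Python 3 quickly, this repository introduces a number of Python 3 features for people who work with data.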

Better path handling with pathlib

from pathlib import Path

dataset = 'wiki_images'
datasets_root = Path('/path/to/datasets/')

train_path = datasets_root / dataset / 'train'
test_path = datasets_root / dataset / 'test'

for image_path in train_path.iterdir():
    with image_path.open() as f:  # note, open is a method of Path object
        # do something with an image
        pass

Previously, developers tended to build paths by concatenating strings (concise, but clearly bad practice). With pathlib the code is concise, safe, and highly readable.

# p is a pathlib.Path instance
p.exists()
p.is_dir()
p.parts
p.with_name('sibling.png')  # only change the name, but keep the folder
p.with_suffix('.jpg')  # only change the extension, but keep the folder and the name
p.chmod(mode)
p.rmdir()

pathlib can save you a lot of time. For details, see the documentation and reference below:

https://docs.python.org/3/library/pathlib.html

https://pymotw.com/3/pathlib/

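Type hinting is now part of the language

[Figure: an example of type hinting in PyCharm]

Python is no longer just a language for writing small scripts. Today's data pipelines consist of many steps and involve different frameworks.

As program complexity grows, type hinting helps keep it under control by letting machines help verify the code.

Here is a simple example. This code works with different types of data (that is what we like about the Python data stack):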

import numpy

def repeat_each_entry(data):
    """ Each entry in the data is doubled
    """
    index = numpy.repeat(numpy.arange(len(data)), 2)
    return data[index]

This code works for numpy.array, astropy.Table, astropy.Column, bcolz, cupy, mxnet.ndarray, and others.

It also runs on a pandas.Series, but the result is wrong:

repeat_each_entry(pandas.Series(data=[0, 1, 2], index=[3, 4, 5]))  # returns Series with Nones inside
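One way to make the intended input explicit is to put it in the signature. The following is a minimal sketch, assuming the function is meant to support NumPy arrays only; the annotations and the isinstance check are illustrative additions and not part of the original snippet.

```python
import numpy


def repeat_each_entry(data: numpy.ndarray) -> numpy.ndarray:
    """Return an array in which every entry of `data` appears twice in a row."""
    # Assumption for this sketch: only numpy arrays are accepted.
    if not isinstance(data, numpy.ndarray):
        raise TypeError(f'expected numpy.ndarray, got {type(data).__name__}')
    index = numpy.repeat(numpy.arange(len(data)), 2)
    return data[index]


doubled = repeat_each_entry(numpy.array([10, 20, 30]))
print(doubled)
```

With the annotation in place, a static checker or an IDE can flag a call that passes a pandas.Series before the code is ever run, and the explicit check rejects it at runtime.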

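Type hinting: checking types at runtime

By default, function annotations do not affect how your code runs; they only document its intent.

However, you can enforce type checking at runtime with tools such as ..., which can help with debugging (there are many cases where type hinting does not work).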
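To illustrate the idea of runtime enforcement, here is a small hand-written sketch built only on the standard library. The decorator name check_types is hypothetical and this is not the tool referred to above; it only checks annotations that are plain classes and skips everything else.

```python
import functools
import inspect


def check_types(func):
    """Hypothetical decorator: verify annotated arguments with isinstance at call time."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = func.__annotations__.get(name)
            # Only plain classes are checked; typing constructs are skipped.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f'{name} must be {expected.__name__}, got {type(value).__name__}'
                )
        return func(*args, **kwargs)

    return wrapper


@check_types
def scale(values: list, factor: float) -> list:
    return [v * factor for v in values]


print(scale([1, 2, 3], 2.0))
```

Calling scale('abc', 2.0) raises a TypeError at the call site instead of silently producing a wrong result further down the pipeline.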

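Other uses of function annotations

As mentioned above, annotations do not affect code execution; they only provide meta-information, which you are free to use however you like.

For example, handling units of measurement is a common headache in scientific work. The astropy package provides a simple decorator that checks the units of input quantities and converts the output to the required units.

If you process tabular scientific data in Python, astropy is worth a look.

You can also define application-specific decorators that control and convert inputs and outputs in the same way.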
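As a rough sketch of such an application-specific decorator, the example below treats each annotation as a converter and passes the argument through it before the call. The names convert_inputs, metres, and travel_time are hypothetical, and this is not how astropy implements its decorator.

```python
import functools
import inspect


def convert_inputs(func):
    """Hypothetical decorator: pass each argument through its annotation, if callable."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            converter = signature.parameters[name].annotation
            # Skip parameters that carry no annotation at all.
            if converter is not inspect.Parameter.empty and callable(converter):
                bound.arguments[name] = converter(value)
        return func(*bound.args, **bound.kwargs)

    return wrapper


def metres(value):
    """Accept a plain number (taken as metres) or an (amount, unit) pair."""
    factors = {'m': 1.0, 'km': 1000.0, 'cm': 0.01}
    if isinstance(value, tuple):
        amount, unit = value
        return amount * factors[unit]
    return float(value)


@convert_inputs
def travel_time(distance: metres, speed: float):
    """Seconds needed to cover `distance` metres at `speed` metres per second."""
    return distance / speed


print(travel_time((1.5, 'km'), 3))
```

Here the annotation doubles as documentation and as the conversion rule, so the function body can assume consistent units.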

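Matrix multiplication with @

Consider one of the simplest ML models, linear regression with L2 regularization (a.k.a. ridge regression).

Code written with @ is more readable and easier to port between deep learning frameworks: the same single-layer perceptron code, X @ W + b[None, :], runs in numpy, cupy, pytorch, and tensorflow.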
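The ridge regression mentioned above has a closed-form solution that reads naturally with @. This is a minimal sketch under assumptions: the function name ridge_fit, the synthetic data, and the exact scaling of the penalty term are chosen for illustration and may differ from the original author's formulation.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge weights: solve (X^T X + alpha * I) W = X^T y."""
    n_features = X.shape[1]
    # Solving the linear system avoids forming an explicit inverse.
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
true_W = np.array([1.0, -2.0, 0.5])
y = X @ true_W  # noise-free targets, so alpha=0 should recover true_W

W = ridge_fit(X, y, alpha=0.0)
W_regularized = ridge_fit(X, y, alpha=10.0)

b = np.zeros(1)
predictions = X @ W[:, None] + b[None, :]  # the single-layer form from the text
print(W, W_regularized)
```

With alpha set to zero the fit reduces to ordinary least squares; a positive alpha shrinks the weights toward zero.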

Globbing with **

Recursive file globbing is not easy in Python 2, even though the glob2 module (https://github.com/miracle2k/python-glob2) was written to overcome this. Since version 3.5, Python supports a recursive flag:

import glob

# Python 2
found_images = \
    glob.glob('/path/*.jpg') \
    + glob.glob('/path/*/*.jpg') \
    + glob.glob('/path/*/*/*.jpg') \
    + glob.glob('/path/*/*/*/*.jpg') \
    + glob.glob('/path/*/*/*/*/*.jpg')

# Python 3
found_images = glob.glob('/path/**/*.jpg', recursive=True)

A better option in Python 3 is to use pathlib:

# Python 3
import pathlib

found_images = pathlib.Path('/path/').glob('**/*.jpg')
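To see that both approaches find the same files, here is a small self-contained check. It builds a throwaway directory tree (the file names are made up for this sketch) and compares the two recursive searches.

```python
import glob
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a small hypothetical tree with images at depth 0, 1, and 3.
    for relative in ['a.jpg', 'x/b.jpg', 'x/y/z/c.jpg', 'x/notes.txt']:
        target = pathlib.Path(root, relative)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()

    via_glob = glob.glob(f'{root}/**/*.jpg', recursive=True)
    via_pathlib = list(pathlib.Path(root).glob('**/*.jpg'))

    names_glob = sorted(pathlib.Path(p).name for p in via_glob)
    names_pathlib = sorted(p.name for p in via_pathlib)

print(names_glob)
print(names_pathlib)
```

Note that glob.glob returns a list of strings, while Path.glob returns a generator of Path objects, which is why the sketch wraps it in list().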

print is now a function

Yes, code now has those annoying parentheses, but there are some advantages:

Simple syntax for writing to a file descriptor:

print >>sys.stderr, "critical error"  # Python 2
print("critical error", file=sys.stderr)  # Python 3

Printing tab-aligned tables without str.join:

# Python 3
print(*array, sep='\t')
print(batch, epoch, loss, accuracy, time, sep='\t')
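The sep and file arguments can be tried out without a terminal or a real log file. This is a small sketch that writes a tab-separated table into an in-memory buffer; the metric values are made up for illustration.

```python
import io

buffer = io.StringIO()

# Hypothetical per-batch metrics: batch, epoch, loss, accuracy.
rows = [(1, 0, 0.91, 0.62), (2, 0, 0.74, 0.71)]

print('batch', 'epoch', 'loss', 'accuracy', sep='\t', file=buffer)
for row in rows:
    print(*row, sep='\t', file=buffer)

table = buffer.getvalue()
print(table, end='')
```

Any object with a write method can be passed as file, so the same calls work for sys.stderr or an open log file.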

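Hacky suppression or redirection of printing output:

# Python 3
_print = print  # store the original print function

def print(*args, **kargs):
    pass  # do something useful, e.g. store output to some file

In Jupyter it is best to log each output to a separate file (so you can trace what happened after a disconnect), and now you can override print to do so.

Below is a context manager that temporarily overrides the behavior of print:

import builtins
import contextlib

@contextlib.contextmanager
def replace_print():
    _print = print  # saving old print function
    # or use some other function here
    builtins.print = lambda *args, **kwargs: _print('new printing', *args, **kwargs)
    try:
        yield
    finally:
        builtins.print = _print  # restore the original even if the body raises

with replace_print():
    pass  # code here will invoke the replacement print function

print can participate in list comprehensions and other language constructs:

# Python 3
result = process(x) if is_valid(x) else print('invalid item: ', x)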

Underscores in numeric literals (thousands separator)

PEP-515 (https://www.python.org/dev/peps/pep-0515/) introduced underscores in numeric literals. In Python 3.6+, underscores can be used to group digits visually in integer, floating-point, and complex literals.

# grouping decimal numbers by thousands
one_million = 1_000_000

# grouping hexadecimal addresses by words
addr = 0xCAFE_F00D

# grouping bits into nibbles in a binary literal
flags = 0b_0011_1111_0100_1110

# same, for string conversions
flags = int('0b_1111_0000', 2)
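A quick way to convince yourself that the underscores are purely visual is to compare the values directly. The short sketch below also uses the '_' format option, which writes the separators back out when formatting a number.

```python
one_million = 1_000_000
addr = 0xCAFE_F00D
flags = int('0b_1111_0000', 2)
price = float('1_000.5')  # string conversions accept underscores too

# The '_' format option (Python 3.6+) inserts the separators when formatting.
formatted = format(one_million, '_')
print(one_million, hex(addr), flags, price, formatted)
```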

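f-strings for simple formatting

The default formatting system offers flexibility that data experiments do not need, and the resulting code is either too verbose or too fragile when anything changes.

Typically, a data scientist logs information repeatedly in a fixed format:

# Python 2
print('{batch:3} {epoch:3} / {total_epochs:3} accuracy: {acc_mean:0.4f}±{acc_std:0.4f} time: {avg_time:3.2f}'.format(
    batch=batch, epoch=epoch, total_epochs=total_epochs,
    acc_mean=numpy.mean(accuracies), acc_std=numpy.std(accuracies),
    avg_time=time / len(data_batch)
))

# Python 2 (too error-prone during fast modifications, please avoid):
print('{:3} {:3} / {:3} accuracy: {:0.4f}±{:0.4f} time: {:3.2f}'.format(
    batch, epoch, total_epochs, numpy.mean(accuracies), numpy.std(accuracies),
    time / len(data_batch)
))

Output:

120  12 / 300 accuracy: 0.8180±0.4649 time: 56.60

Python 3.6 introduced f-strings, a.k.a. formatted string literals:

# Python 3.6+
print(f'{batch:3} {epoch:3} / {total_epochs:3} accuracy: {numpy.mean(accuracies):0.4f}±{numpy.std(accuracies):0.4f} time: {time / len(data_batch):3.2f}')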
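The following sketch puts the positional str.format call and the f-string side by side with fixed, made-up values, so the two can be compared directly. The variable names mirror the snippets above; the numbers are hypothetical.

```python
batch, epoch, total_epochs = 120, 12, 300
acc_mean, acc_std, avg_time = 0.818, 0.4649, 56.6  # hypothetical values

with_format = '{:3} {:3} / {:3} accuracy: {:0.4f}±{:0.4f} time: {:3.2f}'.format(
    batch, epoch, total_epochs, acc_mean, acc_std, avg_time
)
with_fstring = (
    f'{batch:3} {epoch:3} / {total_epochs:3} '
    f'accuracy: {acc_mean:0.4f}±{acc_std:0.4f} time: {avg_time:3.2f}'
)
print(with_fstring)
```

In the f-string each value sits next to its format spec, so reordering or adding a field cannot silently shift the other fields.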

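The difference between "true division" and "floor division"

For data scientists this is a very handy change.

data = pandas.read_csv('timing.csv')
velocity = data['distance'] / data['time']

In Python 2, the result depends on whether 'time' and 'distance' are stored as integers. In Python 3, the result is correct in both cases, because the result of division is a float.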
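The two operators can be compared on plain integers. This is a small sketch with made-up measurements; it shows that / always produces a float in Python 3, while // has to be requested explicitly.

```python
distance_m = 7  # hypothetical integer measurements
time_s = 2

true_quotient = distance_m / time_s    # true division: a float in Python 3
floor_quotient = distance_m // time_s  # floor division: an int for int operands
negative_floor = -7 // 2               # floors toward negative infinity

print(true_quotient, floor_quotient, negative_floor)
```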

Strict ordering

# All these comparisons are illegal in Python 3
(3, 4) < (3, None)
(4, 5) < [4, 5]

# False in both Python 2 and Python 3
(4, 5) == [4, 5]
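In Python 3 an ordering comparison between unrelated types raises TypeError rather than returning an arbitrary answer. The helper below, is_orderable, is a hypothetical name used only for this sketch.

```python
def is_orderable(left, right):
    """Return True if `left < right` can be evaluated, False if it raises TypeError."""
    try:
        left < right
    except TypeError:
        return False
    return True


results = {
    "3 < '3'": is_orderable(3, '3'),
    '(4, 5) < [4, 5]': is_orderable((4, 5), [4, 5]),
    '(3, 4) < (3, 5)': is_orderable((3, 4), (3, 5)),
}
equal_across_types = (4, 5) == [4, 5]  # equality is still allowed
print(results, equal_across_types)
```

This matters when sorting mixed data: a stray None or string in a numeric column now fails loudly instead of ending up at an arbitrary position.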

Unicode for NLP

s = '您好'
print(len(s))
print(s[:2])

Output:

Python 2: 6\n��
Python 3: 2\n您好.

x = u'со'
x += 'co'  # ok
x += 'со'  # fail

This fails in Python 2 and works in Python 3.

In Python 3, str is a Unicode string, which makes NLP on non-English text much more convenient.
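The difference between characters and bytes can be made explicit by encoding the string. This short sketch assumes UTF-8, which is where the length of 6 in the Python 2 output comes from.

```python
s = '您好'
encoded = s.encode('utf-8')

print(len(s), len(encoded))  # characters vs. UTF-8 bytes
first_char = s[:1]           # slicing a str works on characters, not bytes
round_trip = encoded.decode('utf-8')
```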

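Preserving the order of dictionaries and **kwargs

In CPython 3.6+, dicts behave like OrderedDict by default. This preserves order in dict comprehensions (and in some other operations, such as json serialization and deserialization).

import json

x = {str(i): i for i in range(5)}
json.loads(json.dumps(x))

# Python 2
{u'1': 1, u'0': 0, u'3': 3, u'2': 2, u'4': 4}

# Python 3
{'0': 0, '1': 1, '2': 2, '3': 3, '4': 4}

The same applies to **kwargs (in Python 3.6+): they keep the order in which they appear among the arguments.

Order is crucial when it comes to data pipelines, and previously we had to write this in a more cumbersome way:

from collections import OrderedDict
from torch import nn

# Python 2
model = nn.Sequential(OrderedDict([
    ('conv1', nn.Conv2d(1, 20, 5)),
    ('relu1', nn.ReLU()),
    ('conv2', nn.Conv2d(20, 64, 5)),
    ('relu2', nn.ReLU())
]))

# Python 3.6+, how it *can* be done, not supported right now in pytorch
model = nn.Sequential(
    conv1=nn.Conv2d(1, 20, 5),
    relu1=nn.ReLU(),
    conv2=nn.Conv2d(20, 64, 5),
    relu2=nn.ReLU()
)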
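The ordering guarantee can be observed without any deep learning framework. In this sketch, build_pipeline is a hypothetical helper that simply reports the order in which keyword arguments arrived; the layer values are placeholder strings.

```python
import json


def build_pipeline(**steps):
    """Hypothetical helper: return step names in the order they were passed."""
    return list(steps)


order = build_pipeline(conv1='conv', relu1='relu', conv2='conv', relu2='relu')

squares = {str(i): i * i for i in range(5)}
restored = json.loads(json.dumps(squares))

print(order)
print(list(restored))
```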

Iterable unpacking

# handy when amount of additional stored info may vary between experiments, but the same code can be used in all cases
model_parameters, optimizer_parameters, *other_params = load(checkpoint_name)

# picking two last values from a sequence
*prev, next_to_last, last = values_history

# This also works with any iterables, so if you have a function that yields e.g. qualities,
# below is a simple way to take only last two values from a list
*prev, next_to_last, last = iter_train(args)
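Here is a runnable version of the same pattern. The generator iter_losses is a hypothetical stand-in for a training loop, with made-up loss values.

```python
def iter_losses():
    """Hypothetical generator standing in for a training loop that yields losses."""
    for loss in [0.9, 0.7, 0.55, 0.5, 0.48]:
        yield loss


*prev, next_to_last, last = iter_losses()
first, *rest = 'abc'  # the starred name can appear in any position

print(prev, next_to_last, last)
print(first, rest)
```

The starred target always receives a list, even when the source is a generator or a string.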

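The default pickle protocol provides better compression for arrays

Less space and more speed. In fact, similar compression can be achieved with the protocol=2 parameter, but users usually overlook this option.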
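The size difference between protocols can be measured directly. This sketch uses a plain list of random floats instead of a NumPy array to stay within the standard library, so the exact ratios are only indicative; protocol 0 is the old text protocol, and the call without a protocol argument uses the Python 3 default.

```python
import pickle
import random

rng = random.Random(0)
values = [rng.random() for _ in range(10_000)]  # hypothetical data

text_size = len(pickle.dumps(values, protocol=0))    # ASCII-based protocol
binary_size = len(pickle.dumps(values, protocol=2))  # binary, readable by Python 2
default_size = len(pickle.dumps(values))             # Python 3 default protocol

restored = pickle.loads(pickle.dumps(values))
print(text_size, binary_size, default_size)
```

Both binary variants store each float in a fixed number of bytes, whereas the text protocol writes out its full decimal representation.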

Safer comprehensions

labels = ...  # some initial value
predictions = [model.predict(data) for data, labels in dataset]

# labels are overwritten in Python 2
# labels are not affected by comprehension in Python 3
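The scoping rule is easy to verify in isolation. In this sketch the dataset and the initial value of labels are made up; the point is only that the loop variable of the comprehension does not leak.

```python
labels = ['cat', 'dog']  # hypothetical initial value
dataset = [([0.1, 0.2], 'a'), ([0.3, 0.4], 'b')]

# `labels` inside the comprehension is a separate, local variable in Python 3.
lengths = [len(data) for data, labels in dataset]

print(labels)
```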

A simpler super()

Python 2's super(...) was a frequent source of mistakes.

# Python 2
class MySubClass(MySuperClass):
    def __init__(self, name, **options):
        super(MySubClass, self).__init__(name='subclass', **options)

# Python 3
class MySubClass(MySuperClass):
    def __init__(self, name, **options):
        super().__init__(name='subclass', **options)

For more on super, see:

https://stackoverflow.com/questions/576169/understanding-python-super-with-init-methods

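Better IDE suggestions with variable annotations

One of the most pleasant things about programming in languages such as Java and C# is that the IDE can make very good suggestions, because the type of every identifier is known before the program runs.

This is hard to achieve in Python, but annotations help: you write down your expectations in a clear form and get good suggestions from the IDE.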
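As an illustration, the sketch below annotates both a function signature and a local variable. The function name mean_accuracy and its data are hypothetical; the annotations themselves are standard typing constructs available since Python 3.6.

```python
from typing import Dict, List


def mean_accuracy(history: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the recorded accuracies for every model name."""
    # Variable annotation: tells the reader and the IDE what `result` holds.
    result: Dict[str, float] = {}
    for name, accuracies in history.items():
        result[name] = sum(accuracies) / len(accuracies)
    return result


averages = mean_accuracy({'svm': [0.5, 1.0], 'tree': [0.25, 0.75]})
print(averages)
```

With these hints an IDE can complete dict and float methods on result and its values, and a static checker can flag a call that passes the wrong structure.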

Multiple unpacking

Here is how you can now merge two dicts:

x = dict(a=1, b=2)
y = dict(b=3, d=4)

# Python 3.5+
z = {**x, **y}
# z = {'a': 1, 'b': 3, 'd': 4}, note that value for `b` is taken from the latter dict.

For a comparison with Python 2, see:

https://stackoverflow.com/questions/38987/how-to-merge-two-dictionaries-in-a-single-expression

The same approach also works for lists, tuples, and sets (a, b, c are any iterables):

[*a, *b, *c]  # list, concatenating
(*a, *b, *c)  # tuple, concatenating
{*a, *b, *c}  # set, union

Function calls also support multiple *args and **kwargs unpackings:

# Python 3.5+
do_something(**{**default_settings, **custom_settings})

# Also possible, this code also checks there is no intersection between keys of dictionaries
do_something(**first_args, **second_args)
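The difference between the two call styles shows up when the dictionaries share a key. In this sketch do_something is a hypothetical function that just returns what it received, and the settings are made up.

```python
def do_something(**settings):
    """Hypothetical function that simply returns the settings it received."""
    return settings


default_settings = {'lr': 0.1, 'epochs': 10}
custom_settings = {'epochs': 20}

# Merging first: the later dict wins for the shared key.
merged = do_something(**{**default_settings, **custom_settings})

# Unpacking both directly: a shared key is reported as an error.
try:
    do_something(**default_settings, **custom_settings)
except TypeError:
    duplicate_detected = True  # 'epochs' appears in both dicts
else:
    duplicate_detected = False

print(merged, duplicate_detected)
```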

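Future-proof APIs with keyword-only arguments

Consider this code snippet:

model = sklearn.svm.SVC(2, 'poly', 2, 4, 0.5)

Clearly, the author of this code has not yet picked up Python coding style (most likely having just moved to Python from C++ or Rust). Unfortunately, this is not merely a matter of style: changing the order of the parameters in SVC will break this code. In particular, sklearn reorders or renames the many parameters of its algorithms from time to time to keep the API consistent, and every such refactoring can break code like this.

In Python 3, library authors can require explicitly named parameters by using *:

class SVC(BaseSVC):
    def __init__(self, *, C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, ... )

Users then have to name the parameters: sklearn.svm.SVC(C=2, kernel='poly', degree=2, gamma=4, coef0=0.5)

This mechanism combines API reliability with flexibility.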
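The effect of the bare * can be demonstrated on a toy class. The class Model below is hypothetical and only borrows a few parameter names from the SVC example; it is not sklearn's implementation.

```python
class Model:
    """Hypothetical estimator whose constructor accepts keyword arguments only."""

    def __init__(self, *, C=1.0, kernel='rbf', degree=3):
        self.C = C
        self.kernel = kernel
        self.degree = degree


model = Model(C=2, kernel='poly', degree=2)

try:
    Model(2, 'poly', 2)  # positional arguments are rejected
except TypeError:
    positional_rejected = True
else:
    positional_rejected = False

print(model.kernel, positional_rejected)
```

Because callers must name every argument, the library author can later reorder or insert parameters without breaking existing calls.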

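Conclusion

Python 2 and Python 3 have coexisted for almost ten years, and it is now time to move to Python 3. Research and production code becomes shorter, safer, and more readable once it moves to a Python 3-only codebase.

Via: https://github.com/arogozhnikov/python3_with_pleasure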


Published: 2018-02-06 07:00:35
Original link: http://kuaibao.qq.com/s/20180206A03YOQ00?refer=cp_1026