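Polars is a high-performance data processing library designed for fast data manipulation, especially on large datasets. It is written in Rust, which gives it notable advantages in performance and memory safety.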
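The key characteristics and advantages of Polars are illustrated below through a performance comparison with pandas, a commonly used data analysis library, on a dataset of 10 million rows.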
import numpy as np
import pandas as pd
import polars as pl
import time
# Set the random seed for reproducible results
np.random.seed(0)
# Generate a large dataset
data_size = 10_000_000  # 10 million rows
columns = ['col1', 'col2', 'col3', 'col4', 'col5']
data = np.random.rand(data_size, len(columns))  # random floats in [0, 1)
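Creating a DataFrame
# Convert the NumPy array to a pandas DataFrame and a Polars DataFrame
start_time = time.time()
df_pandas = pd.DataFrame(data, columns=columns)
print(f"pandas DataFrame took: {time.time() - start_time:.2f} seconds")
start_time = time.time()
# Pass the column names as the schema and state that each array row is a record
df_polars = pl.DataFrame(data, schema=columns, orient="row")
print(f"polars DataFrame took: {time.time() - start_time:.2f} seconds")
Output:
pandas DataFrame took: 0.00 seconds
polars DataFrame took: 0.64 seconds
pandas is faster in this step, most likely because it can wrap the existing NumPy array without copying it, whereas Polars copies the data into its own columnar layout.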
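Each measurement in this comparison is a single run timed with `time.time()`, so very fast operations can print as 0.00 seconds. A minimal sketch of a reusable timer based on `time.perf_counter` is shown here; the helper name `timed` and the `results` dictionary are hypothetical and not part of the original benchmark.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block with a monotonic, high-resolution clock.

    `timed` and `results` are illustrative names, not from the original code.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the raw value for later comparison
        print(f"{label} took: {elapsed:.4f} seconds")


results = {}
with timed("sum of squares", results):
    total = sum(i * i for i in range(100_000))
```

Wrapping each benchmark step in `with timed(...)` keeps the measured region explicit and collects the timings in one place, which makes it easier to repeat runs and compare them.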
Writing CSV
# Save each DataFrame to a CSV file
start_time = time.time()
df_pandas.to_csv('pandas_data.csv', index=False)
print(f"Saving pandas DataFrame to CSV took: {time.time() - start_time:.2f} seconds")
start_time = time.time()
df_polars.write_csv('polars_data.csv')
print(f"Saving polars DataFrame to CSV took: {time.time() - start_time:.2f} seconds")
Output:
Saving pandas DataFrame to CSV took: 116.20 seconds
Saving polars DataFrame to CSV took: 9.09 seconds
Polars is about 12.8 times as fast as pandas here.
Reading CSV
# Load the CSV files
start_time = time.time()
df_pandas = pd.read_csv('pandas_data.csv')
end_time = time.time()
print(f"Loading pandas DataFrame from CSV took: {end_time - start_time:.2f} seconds")
start_time = time.time()
df_polars = pl.read_csv('polars_data.csv')
end_time = time.time()
print(f"Loading polars DataFrame from CSV took: {end_time - start_time:.2f} seconds")
Output:
Loading pandas DataFrame from CSV took: 10.06 seconds
Loading polars DataFrame from CSV took: 0.95 seconds
Polars is about 10.6 times as fast as pandas here.
# Measure pandas filtering performance
start_time = time.time()
df_filtered_pandas = df_pandas[df_pandas['col1'] > 0.5]
end_time = time.time()
print(f"Pandas data filtering took: {end_time - start_time:.2f} seconds")
# Measure Polars filtering performance
start_time = time.time()
df_filtered_polars = df_polars.filter(df_polars['col1'] > 0.5)
end_time = time.time()
print(f"Polars data filtering took: {end_time - start_time:.2f} seconds")
Output:
Pandas data filtering took: 0.42 seconds
Polars data filtering took: 0.23 seconds
Polars is about 1.8 times as fast as pandas here.
# Measure pandas group-by performance
start_time = time.time()
# 'mean' replaces np.mean, which newer pandas versions deprecate as an agg argument
grouped_pandas = df_pandas.groupby('col1').agg('mean')
end_time = time.time()
print(f"Pandas data grouping took: {end_time - start_time:.2f} seconds")
# Measure Polars group-by performance
# Note: this aggregates only col1, whereas the pandas call above averages every non-key column
start_time = time.time()
grouped_polars = df_polars.group_by('col1').agg(
    col1_mean=pl.col('col1').mean()
)
end_time = time.time()
print(f"Polars data grouping took: {end_time - start_time:.2f} seconds")
Output:
Pandas data grouping took: 20.08 seconds
Polars data grouping took: 1.92 seconds
Polars is about 10.5 times as fast as pandas here.
# Measure pandas sorting performance
start_time = time.time()
sorted_pandas = df_pandas.sort_values(by='col1')
end_time = time.time()
print(f"Pandas data sorting took: {end_time - start_time:.2f} seconds")
# Measure Polars sorting performance
start_time = time.time()
sorted_polars = df_polars.sort('col1')
end_time = time.time()
print(f"Polars data sorting took: {end_time - start_time:.2f} seconds")
Output:
Pandas data sorting took: 7.59 seconds
Polars data sorting took: 1.17 seconds
Polars is about 6.5 times as fast as pandas here.
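The speedup figures quoted after each test are simply the pandas time divided by the Polars time. The sketch below recomputes them from the timings printed in the outputs, rounded to one decimal place; the function name `speedup` and the dictionary labels are hypothetical.

```python
def speedup(pandas_seconds, polars_seconds):
    """Return how many times faster Polars was than pandas."""
    return pandas_seconds / polars_seconds


# (pandas seconds, Polars seconds), copied from the printed outputs
timings = {
    "write CSV": (116.20, 9.09),
    "read CSV": (10.06, 0.95),
    "filter": (0.42, 0.23),
    "group by": (20.08, 1.92),
    "sort": (7.59, 1.17),
}

ratios = {name: round(speedup(p, q), 1) for name, (p, q) in timings.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```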
At the time of writing, libraries such as Scikit-learn do not yet support Polars DataFrames but do support pandas, so Polars provides a conversion interface:
start_time = time.time()
df_pandas = df_polars.to_pandas()
end_time = time.time()
print(f"Polars to pandas conversion took: {end_time - start_time:.2f} seconds")
Output:
Polars to pandas conversion took: 0.36 seconds
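| Feature | Polars | Pandas |
|---|---|---|
| Performance | Core written in Rust; high performance | Built on Python and C; comparatively slower |
| Parallelism | Supports parallel execution of operations | Constrained by Python's GIL; cannot fully use multi-core processors |
| Maturity and ecosystem | Relatively new; smaller ecosystem | Mature and widely used; rich ecosystem |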