Polars Rust 第 5 课：数据清洗与转换

不吃草的牛德

发布于 2026-04-23 13:05:29

460

文章被收录于专栏：RustRust

开篇引入

欢迎来到 Polars Rust 系列的第 5 课！🎉

在真实的数据工程世界里，有一句广为流传的话："数据科学家 80% 的时间都在清洗数据"。这不是玩笑——无论你的模型多么精妙，输入的是垃圾，输出的也只能是垃圾（Garbage In, Garbage Out）。

数据清洗是 ETL（Extract-Transform-Load）流程中最耗时也最重要的环节。原始数据往往充满了缺失值、类型混乱、格式不统一、重复记录等问题。如何高效、优雅地处理这些问题，是每个数据工程师的必修课。

好消息是，Polars 的表达式系统（Expression System）让数据清洗变得异常优雅 🌟。通过链式调用和惰性求值，你可以像搭积木一样组合各种清洗操作，既清晰又高效。今天，我们就来系统学习 Polars 中的数据清洗与转换技巧！

本课的完整代码依赖如下：

# Cargo.toml
[dependencies]
polars = { version = "0.53", features = ["lazy", "csv", "strings", "temporal", "dtype-date", "dtype-datetime","polars-ops","regex"] }

准备好了吗？让我们开始吧！🚀

一、类型转换 🔄

1.1 为什么类型转换如此重要？

在实际场景中，CSV 文件读取时所有列默认会被解析为字符串（Utf8）。你拿到的年龄可能是 "25" 而不是 25，日期可能是 "2024-01-15" 而不是 Date 类型。类型不对，计算全废——字符串之间无法做数值运算，日期字符串也无法做时间差计算。

1.2 `.cast()` 方法：最常用的类型转换

.cast() 是 Polars 中进行类型转换的核心方法，它接受一个 DataType 参数，返回转换后的表达式。

use polars::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 创建示例数据：注意 age 列是字符串类型
    let df = df![
        "name"    => &["Alice", "Bob", "Charlie", "Diana"],
        "age"     => &["25", "30", "35", "28"],       // 字符串，需要转为整数
        "salary"  => &["8500.50", "9200.00", "11000.75", "7800.30"], // 字符串，需要转为浮点
        "is_vip"  => &["true", "false", "true", "true"],  // 字符串，需要转为布尔
    ]?;

    // 使用 Lazy API 进行类型转换
    let result = df.lazy()
        // 将 age 列从字符串转为 Int32
        .with_column(col("age").cast(DataType::Int32))
        // 将 salary 列从字符串转为 Float64
        .with_column(col("salary").cast(DataType::Float64))
        // 将 is_vip 列从字符串转为 Boolean
        .with_column(when(col("is_vip").eq(lit("true")))
                        .then(lit(true))
                        .otherwise(lit(false))
                        .alias("is_vip"))
        .collect()?;

    println!("{}", result);
    Ok(())
}

输出结果：

shape: (4, 4)
┌─────────┬─────┬──────────┬────────┐
│ name    ┆ age ┆ salary   ┆ is_vip │
│ ---     ┆ --- ┆ ---      ┆ ---    │
│ str     ┆ i32 ┆ f64      ┆ bool   │
╞═════════╪═════╪══════════╪════════╡
│ Alice   ┆ 25  ┆ 8500.5   ┆ true   │
│ Bob     ┆ 30  ┆ 9200.0   ┆ false  │
│ Charlie ┆ 35  ┆ 11000.75 ┆ true   │
│ Diana   ┆ 28  ┆ 7800.3   ┆ true   │
└─────────┴─────┴──────────┴────────┘

1.3 Schema Override：读取时指定类型

如果你提前知道数据的类型，可以在读取时直接指定 Schema，避免后续的转换步骤。这不仅代码更简洁，而且性能更好——Polars 在读取阶段就完成了类型推断。

use polars::prelude::*;
use std::fs::File;

fn read_with_schema() -> Result<(), Box<dyn std::error::Error>> {
    // 定义 Schema：在读取时就指定每列的类型
    let schema = Schema::from_iter(vec![
        Field::new("id".into(), DataType::Int32),
        Field::new("name".into(), DataType::String),
        Field::new("age".into(), DataType::Int32),
        Field::new("salary".into(), DataType::Float64),
        Field::new("hire_date".into(), DataType::Date),
    ]);

    // 读取 CSV 时应用 Schema
    let file = File::open("employees.csv")?;
    let df = CsvReadOptions::default()
        .with_schema(Some(Arc::new(schema)))
        .into_reader_with_file_handle(file)
        .finish()?;

    println!("{}", df);
    Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    read_with_schema()?;
    Ok(())
}

💡 小贴士：Schema Override 不仅能提升性能，还能避免 Polars 自动推断类型时可能出现的错误。对于大型数据集，强烈推荐在读取时指定 Schema！

1.4 安全转换与错误处理

类型转换并非总是成功的。比如将 "hello" 转为 Int32 就会报错。Polars 提供了 Strict 和 NonStrict 两种转换策略：

use polars::prelude::*;
use polars::chunked_array::cast::CastOptions;
fn safe_cast_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "value" => &["42", "hello", "100", "world"],
    ]?;

    // 严格模式（默认）：遇到无法转换的值会报错
    // col("value").cast(DataType::Int32)  // 这会 panic！

    // 非严格模式：无法转换的值变为 null
    let result = df.lazy()
        .with_column(
            col("value").cast_with_options(
                        DataType::Int32,
                        CastOptions::NonStrict   // ← this is the correct way
                    )
        )
        .collect()?;

    println!("{}", result);
    // 输出：42, null, 100, null
    Ok(())
}

1.5 常见数据类型一览

类型	Polars DataType	说明
字符串	DataType::Utf8	变长字符串
32位整数	DataType::Int32	常规整数
64位整数	DataType::Int64	大数值整数
64位浮点	DataType::Float64	浮点数
布尔	DataType::Boolean	true/false
日期	DataType::Date	年月日，无时区
日期时间	DataType::Datetime(TimeUnit, None)	含时分秒

⚠️ 注意：选择整数类型时，Int32 比 Int64 更节省内存。如果你的数据值在 ±21 亿以内，优先使用 Int32。

二、空值处理 💨

2.1 空值：数据中的"沉默杀手"

空值（Null）是数据清洗中最常见的问题之一。它们可能来自数据采集失败、用户未填写、系统错误等各种原因。忽视空值可能导致计算结果偏差、程序崩溃等严重问题。

Polars 提供了丰富的空值处理工具，让我们来逐一掌握。

2.2 `fill_null`：灵活填充空值

fill_null 可以用固定值、表达式结果来填充空值：

use polars::prelude::*;

fn fill_null_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "name"   => &["Alice", "Bob", "Charlie", "Diana", "Eve"],
        "age"    => &[Some(25), None, Some(35), None, Some(28)],
        "score"  => &[Some(88.5), Some(92.0), None, Some(76.3), None],
        "city"   => &[Some("北京"), Some("上海"), None, Some("广州"), None],
    ]?;

    let result = df.lazy()
        // 用固定值 0 填充 age 的空值
        .with_column(col("age").fill_null(lit(0)))
        // 用 score 列的均值填充空值
        .with_column(
            col("score").fill_null(col("score").mean())
        )
        // 用字符串 "未知" 填充 city 的空值
        .with_column(col("city").fill_null(lit("未知")))
        .collect()?;

    println!("{}", result);
    Ok(())
}

输出：

shape: (5, 4)
┌─────────┬─────┬───────┬──────┐
│ name    ┆ age ┆ score ┆ city │
│ ---     ┆ --- ┆ ---   ┆ ---  │
│ str     ┆ i32 ┆ f64   ┆ str  │
╞═════════╪═════╪═══════╪══════╡
│ Alice   ┆ 25  ┆ 88.5  ┆ 北京 │
│ Bob     ┆ 0   ┆ 92.0  ┆ 上海 │
│ Charlie ┆ 35  ┆ 85.6  ┆ 未知 │
│ Diana   ┆ 0   ┆ 76.3  ┆ 广州 │
│ Eve     ┆ 28  ┆ 85.6  ┆ 未知 │
└─────────┴─────┴───────┴──────┘

2.3 `fill_null_with_strategy`：前向填充与后向填充

在时间序列数据中，前向填充（Forward Fill） 和 后向填充（Backward Fill） 是非常常用的策略——用前一个有效值或后一个有效值来填充空值。

use polars::prelude::*;
use polars::series::Series;
use polars::chunked_array::ops::FillNullStrategy;
fn strategy_fill_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "date"  => &["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"],
        "temp"  => &[Some(5.0), None, None, Some(8.0), None],
    ]?;

    // 前向填充：用前一个非空值填充
    let forward = df.clone().lazy()
        .with_column(
            col("temp").fill_null_with_strategy(FillNullStrategy::Forward(None))
        )
        .collect()?;
    println!("前向填充结果：\n{}", forward);
    // 结果：5.0, 5.0, 5.0, 8.0, 8.0

    // 后向填充：用后一个非空值填充
    let backward = df.lazy()
        .with_column(
            col("temp").fill_null_with_strategy(FillNullStrategy::Backward(None))
        )
        .collect()?;
    println!("后向填充结果：\n{}", backward);
    // 结果：5.0, 8.0, 8.0, 8.0, null（最后一个没有后值，仍为 null）

    Ok(())
}

💡 实战技巧：在金融数据（如股票价格）中，前向填充是最常见的选择——假设价格在缺失期间保持不变。

2.4 `drop_nulls`：果断删除空值行

当空值数量较少，或者你不想用填充值引入偏差时，可以直接删除含空值的行：

fn drop_nulls_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "name"  => &["Alice", "Bob", "Charlie", "Diana"],
        "age"   => &[Some(25), None, Some(35), Some(28)],
        "email" => &[Some("a@test.com"), Some("b@test.com"), None, Some("d@test.com")],
    ]?;

    // 删除任何含空值的行
    let clean = df.clone().lazy()
        .filter(
                col("age").is_not_null()
                    .and(col("email").is_not_null())
            )
        .collect()?;
    println!("删除空值后：\n{}", clean);
    // 只保留 Alice 和 Diana（两列都没有空值的行）

    // 也可以只检查特定列
    let clean_age = df.clone().lazy()
        .filter(col("age").is_not_null())   // 只删除 age 为空的行
        .collect()?;
    println!("只删除 age 为空的行：\n{}", clean_age);
    // 保留 Alice, Charlie, Diana
    // 检查所有列（效果比只检查 age 更强）
    //
    let clean_age = df.clone().lazy()
        .drop_nulls(None)
        .collect()?;
    println!("删除任意列为空的行：\n{}", clean_age);
 // 只留Alice 和 Diana
    Ok(())
}

运行结果

删除空值后：
shape: (2, 3)
┌───────┬─────┬────────────┐
│ name  ┆ age ┆ email      │
│ ---   ┆ --- ┆ ---        │
│ str   ┆ i32 ┆ str        │
╞═══════╪═════╪════════════╡
│ Alice ┆ 25  ┆ a@test.com │
│ Diana ┆ 28  ┆ d@test.com │
└───────┴─────┴────────────┘
只删除 age 为空的行：
shape: (3, 3)
┌─────────┬─────┬────────────┐
│ name    ┆ age ┆ email      │
│ ---     ┆ --- ┆ ---        │
│ str     ┆ i32 ┆ str        │
╞═════════╪═════╪════════════╡
│ Alice   ┆ 25  ┆ a@test.com │
│ Charlie ┆ 35  ┆ null       │
│ Diana   ┆ 28  ┆ d@test.com │
└─────────┴─────┴────────────┘
删除任意列为空的行：
shape: (2, 3)
┌───────┬─────┬────────────┐
│ name  ┆ age ┆ email      │
│ ---   ┆ --- ┆ ---        │
│ str   ┆ i32 ┆ str        │
╞═══════╪═════╪════════════╡
│ Alice ┆ 25  ┆ a@test.com │
│ Diana ┆ 28  ┆ d@test.com │
└───────┴─────┴────────────┘

Polars 0.53 的 drop_nulls 在 Rust 中对 subset 参数的支持还不够友好（需要 Selector::ByName { ... } 这种内部构造，非常麻烦）。 使用 .filter() 是这个版本下最稳定、最清晰的方式。

2.5 `is_null()` / `is_not_null()`：检测空值

有时你不想删除或填充空值，而是想标记它们，或者根据空值做条件过滤：

use polars::prelude::*;

fn detect_nulls_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "product"  => &["手机", "笔记本", "平板", "耳机"],
        "price"    => &[Some(2999), None, Some(1599), None],
    ]?;

    // 筛选出 price 为空的产品（需要补录价格）
    let missing_price = df.lazy()
        .filter(col("price").is_null())
        .collect()?;
    println!("需要补录价格的产品：\n{}", missing_price);

    // 添加一个标记列：是否有价格
    let tagged = df.lazy()
        .with_column(
            col("price").is_not_null().alias("has_price")
        )
        .collect()?;
    println!("标记结果：\n{}", tagged);

    Ok(())
}

2.6 实际案例：脏数据缺失值处理

fn dirty_data_null_handling() -> Result<(), Box<dyn std::error::Error>> {
    // 模拟一份脏数据：包含各种缺失值
    let df = df![
        "order_id" => &[1, 2, 3, 4, 5, 6],
        "customer" => &[Some("张三"), Some("李四"), None, Some("王五"), None, Some("赵六")],
        "amount"   => &[Some(100.0), Some(250.5), None, Some(80.0), None, Some(320.0)],
        "status"   => &[Some("已完成"), None, Some("已发货"), Some("已完成"), None, Some("待发货")],
    ]?;

    let cleaned = df.lazy()
        // 1. 删除 customer 为空的行（没有客户信息的订单无意义）
        .filter(col("customer").is_not_null())
        // 2. 用均值填充 amount 的空值
        .with_column(col("amount").fill_null(col("amount").mean()))
        // 3. 用 "未知" 填充 status 的空值
        .with_column(col("status").fill_null(lit("未知")))
        .collect()?;

    println!("清洗后的数据：\n{}", cleaned);
    Ok(())
}

运行结果：

清洗后的数据：
shape: (4, 4)
┌──────────┬──────────┬────────┬────────┐
│ order_id ┆ customer ┆ amount ┆ status │
│ ---      ┆ ---      ┆ ---    ┆ ---    │
│ i32      ┆ str      ┆ f64    ┆ str    │
╞══════════╪══════════╪════════╪════════╡
│ 1        ┆ 张三     ┆ 100.0  ┆ 已完成 │
│ 2        ┆ 李四     ┆ 250.5  ┆ 未知   │
│ 4        ┆ 王五     ┆ 80.0   ┆ 已完成 │
│ 6        ┆ 赵六     ┆ 320.0  ┆ 待发货 │
└──────────┴──────────┴────────┴────────┘

三、字符串操作 🔤

3.1 `.str` 命名空间：字符串操作的百宝箱

Polars 通过 .str 命名空间提供了丰富的字符串处理方法。在 Lazy API 中，你只需在列表达式后面加上 .str.xxx() 即可调用各种字符串方法。所有字符串操作都是向量化执行的，无需手动循环，性能极佳！

fn str_namespace_intro() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "raw_name" => &["  ALICE  ", "  bob  ", "  CHARLIE  ", "  diana  "],
    ]?;

    let result = df.lazy()
        .with_column(
            col("raw_name")
                .str()
                .strip_chars(" ".into())       // 去除两端空白
                .str()
                .to_lowercase()          // 转小写
                .alias("clean_name")
        )
        .collect()?;

    println!("{}", result);
    // 输出：alice, bob, charlie, diana
    Ok(())
}

3.2 `.str.contains()`：正则匹配

contains() 方法支持正则表达式，是文本筛选的利器：

use polars::prelude::*;

fn str_contains_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "email" => &[
            "alice@gmail.com",
            "bob@qq.com",
            "charlie@163.com",
            "diana@gmail.com",
            "eve@hotmail.com",
        ],
    ]?;

    // 筛选所有 Gmail 邮箱
    let gmail_users = df.lazy()
        .filter(col("email").str().contains(lit("@gmail"), true))
        .collect()?;
    println!("Gmail 用户：\n{}", gmail_users);

    // 使用正则表达式：匹配手机号（中国大陆格式）
    let contacts = df![
        "info" => &["张三 13800138000", "李四 13912345678", "王五 abcdefg", "赵六 15088886666"],
    ]?;
    let phone_valid = contacts.lazy()
        .filter(
            col("info").str().contains(lit(r"1[3-9]\d{9}"), true)
        )
        .collect()?;
    println!("包含有效手机号的记录：\n{}", phone_valid);

    Ok(())
}

3.3 `.str.replace()` / `.str.replace_all()`：替换

use polars::prelude::*;

fn str_replace_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "address" => &[
            "北京市朝阳区建国路100号",
            "上海市浦东新区陆家嘴环路200号",
            "广东省深圳市南山区科技园路300号",
        ],
    ]?;

    // replace：只替换第一个匹配项
    // replace_all：替换所有匹配项
    let result = df.lazy()
        .with_column(
            col("address")
                .str()
                .replace_all(lit("号"), lit("号（大厦）"), true)  // true 表示使用正则
                .alias("formatted_address")
        )
        .collect()?;
    println!("{}", result);

    // 实际场景：脱敏处理手机号
    let phones = df![
        "phone" => &["13800138000", "13912345678", "15088886666"],
    ]?;
    let masked = phones.lazy()
        .with_column(
            col("phone")
                .str()
                .replace_all(
                    lit(r"(\d{3})\d{4}(\d{4})"),
                    lit("$1****$2"),
                    false
                )
                .alias("masked_phone")
        )
        .collect()?;
    println!("脱敏结果：\n{}", masked);
    // 输出：138****8000, 139****5678, 150****6666

    Ok(())
}

🔒 安全提示：在处理用户隐私数据时，字符串替换是实现数据脱敏的重要手段。上面的手机号脱敏就是一个典型应用场景。

3.4 大小写转换

use polars::prelude::*;

fn str_case_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "product" => &["iPhone Pro", "MacBook AIR", "iPad MINI", "airpods MAX"],
    ]?;

    let result = df.lazy()
        .with_column(col("product").str().to_lowercase().alias("lower"))
        .with_column(col("product").str().to_uppercase().alias("upper"))
        .collect()?;
    println!("{}", result);

    Ok(())
}

3.5 去除字符

use polars::prelude::*;

fn str_strip_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "code" => &["  ABC123  ", "  DEF456  ", "GHI789  ", "  JKL012"],
        "url"  => &["https://example.com", "http://test.org", "ftp://files.net","http://test.com"],
    ]?;

    let result = df.lazy()
        // strip_chars(" ")：去除两端所有空白字符
        .with_column(col("code").str().strip_chars(lit(" ")).alias("trimmed"))
        // strip_prefix：去除指定前缀
        .with_column(
            col("url")
                .str()
                .strip_prefix(lit("https://"))
                .alias("domain")
        )
        .collect()?;
    println!("{}", result);

    Ok(())
}

3.6 `.str.split()` / `.str.extract()`：分割与提取

use polars::prelude::*;

fn str_split_extract_example() -> Result<(), Box<dyn std::error::Error>> {
    // 场景：从 "姓名:年龄:城市" 格式中提取各字段
    let df = df![
        "info" => &["Alice:25:北京", "Bob:30:上海", "Charlie:35:广州"],
    ]?;

    // 方法一：使用 split 拆分为多列
    let split_result = df.clone().lazy()
        .with_column(
            col("info")
                .str()
                .split(lit(":"))
                .alias("parts")
        )
        .collect()?;
    println!("分割结果：\n{}", split_result);

    // 方法二：使用 extract 用正则提取特定部分
    let extracted = df.clone().lazy()
        .with_column(
            // extract(正则, 分组索引)
            col("info")
                .str()
                .extract(lit(r"(\w+):(\d+):(\w+)"), 1)  // 提取第2组（年龄）
                .cast(DataType::Int32)
                .alias("age")
        )
        .with_column(
            col("info")
                .str()
                .extract(lit(r"(\w+):(\d+):(\w+)"), 2)  // 提取第3组（城市）
                .alias("city")
        )
        .collect()?;
    println!("提取结果：\n{}", extracted);

    Ok(())
}

3.7 `.str.len()` / `.str.slice()`：长度与切片

use polars::prelude::*;

fn str_len_slice_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "title" => &["Polars数据分析入门", "Rust编程实战指南", "深度学习从零开始"],
    ]?;

    let result = df.lazy()
        // 计算每个标题的字符长度
        .with_column(col("title").str().len_chars().alias("title_len"))
        // 截取前6个字符作为短标题
        .with_column(col("title").str().slice(lit(0), lit(6)).alias("short_title"))
        .collect()?;
    println!("{}", result);

    Ok(())
}

3.8 组合多个字符串操作的 Pipeline

字符串操作的真正威力在于链式组合。下面是一个综合案例——清洗用户输入的地址信息：

use polars::prelude::*;

fn string_pipeline_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "raw_address" => &[
            "  13800138000  ",
            " 139-1234-5678 ",
            " 15088886666 ",
            " 186-0000-1234 ",
        ],
    ]?;

    let cleaned = df.lazy()
        .with_column(
            col("raw_address")
                // 步骤1：去除两端空白
                .str()
                .strip_chars(lit(" "))
                // 步骤2：去除所有横线
                .str()
                .replace_all(lit("-"), lit(""), false)

        )
        .with_column(
                // 步骤3：加上 +86- 前缀
                (lit("+86-") + col("raw_address"))
                    .alias("masked_address")
            )
        .collect()?;
    println!("格式化后的电话号码：\n{}", cleaned);

    Ok(())
}

四、日期时间处理 📅

4.1 `.dt` 命名空间：时间数据的瑞士军刀

日期时间处理在数据清洗中极为常见。日志分析、金融数据、用户行为分析……几乎所有涉及时间的数据都需要进行日期解析、格式化和计算。

Polars 通过 .dt 命名空间提供了全面的日期时间操作方法。

4.2 日期解析：从字符串到日期

use polars::prelude::*;

fn datetime_parse_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "date_str"     => &["2024-01-15", "2024-03-20", "2024-06-01", "2024-12-25"],
        "datetime_str" => &["2024-01-15 08:30:00", "2024-03-20 14:45:30", "2024-06-01 09:00:00", "2024-12-25 23:59:59"],
    ]?;

    let result = df.lazy()
        // ===== 字符串 → Date（修正后）=====
        .with_column(
            col("date_str")
                .str()
                .to_date(StrptimeOptions {
                    format: Some("%Y-%m-%d".into()),
                    ..Default::default()
                })
                .alias("date")
        )
        .with_column(
            col("datetime_str")
                .str()
                .strptime(
                    DataType::Datetime(TimeUnit::Microseconds, None),
                    StrptimeOptions {
                        format: Some("%Y-%m-%d %H:%M:%S".into()),
                        ..Default::default()
                    },
                    lit("raise")          // 或 lit("earliest") / lit("latest")
                )
                .alias("datetime")
        )
        .collect()?;

    println!("{}", result);
    Ok(())
}

💡 格式化符号速查：%Y=四位年份，%m=两位月份，%d=两位日期，%H=24小时制小时，%M=分钟，%S=秒。

4.3 提取日期时间组件

use polars::prelude::*;

fn datetime_components_example() -> Result<(), Box<dyn std::error::Error>> {
    // 创建带日期时间的数据
    let df = df![
        "timestamp" => &[
            "2024-01-15 08:30:00",
            "2024-03-20 14:45:30",
            "2024-06-01 09:00:00",
            "2024-12-25 23:59:59",
        ],
        "amount" => &[150.0, 200.5, 80.0, 320.0],
    ]?;

    let result = df.lazy()
        // 先解析为 Datetime 类型
        .with_column(
            col("timestamp")
                .str()
                .strptime(
                    DataType::Datetime(TimeUnit::Microseconds, None),
                    StrptimeOptions {
                        format: Some("%Y-%m-%d %H:%M:%S".into()),
                        ..Default::default()
                    },
                    lit("raise")          // 或 lit("earliest") / lit("latest")
                )
                .alias("datetime")
        )
        // 提取各个时间组件
        .with_column(col("datetime").dt().year().alias("year"))
        .with_column(col("datetime").dt().month().alias("month"))
        .with_column(col("datetime").dt().day().alias("day"))
        .with_column(col("datetime").dt().hour().alias("hour"))
        .with_column(col("datetime").dt().minute().alias("minute"))
        // 提取星期几（1=周一，7=周日）
        .with_column(col("datetime").dt().weekday().alias("weekday"))
        .collect()?;

    println!("{}", result);
    Ok(())
}

4.4 日期格式化输出

use polars::prelude::*;

fn datetime_format_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "event" => &["系统上线", "版本发布", "用户突破百万"],
        "date"  => &["2024-01-15", "2024-06-01", "2024-12-25"],
    ]?;

    let result = df.lazy()
        .with_column(
            col("date")
                .str()
                .to_date(StrptimeOptions {
                    format: Some("%Y-%m-%d".into()),
                    ..Default::default()
                })
                .alias("parsed_date")
        )
        // 格式化为中文友好的日期格式
        .with_column(
            col("parsed_date")
                .dt()
                .strftime("%Y年%m月%d日")
                .alias("formatted_date")
        )
        .collect()?;
    println!("{}", result);
    // 输出：2024年01月15日, 2024年06月01日, 2024年12月25日

    Ok(())
}

4.5 日期运算：差值计算

use polars::prelude::*;

fn datetime_diff_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "task"       => &["需求分析", "开发", "测试", "上线"],
        "start_date" => &["2024-01-01", "2024-01-15", "2024-02-01", "2024-03-01"],
        "end_date"   => &["2024-01-14", "2024-01-31", "2024-02-28", "2024-03-15"],
    ]?;

    let result = df.lazy()
        // 解析日期列
        .with_column(
            col("start_date")
                .str()
                .to_date(StrptimeOptions {
                    format: Some("%Y-%m-%d".into()),
                    ..Default::default()
                })
                .alias("start")
        )
        .with_column(
            col("end_date")
                .str()
                .to_date(StrptimeOptions {
                    format: Some("%Y-%m-%d".into()),
                    ..Default::default()
                })
                .alias("end")
        )
        // 计算天数差
        .with_column(
            (col("end") - col("start"))
                .dt()
                .total_days(true)
                .alias("duration_days")
        )
        .collect()?;
    println!("{}", result);
    // 输出：13天, 16天, 27天, 14天

    Ok(())
}

🎯 实用技巧：日期差值计算在项目排期、用户留存分析、A/B 测试等场景中非常常用。掌握这个技巧，能帮你解决大量时间相关的分析需求！

五、其他常用操作 🛠️

5.1 `drop_duplicates()` / `unique()`：去重

重复数据是脏数据的常见形式。Polars 提供了两种去重方式：

use polars::prelude::*;

fn dedup_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "name"  => &["Alice", "Bob", "Alice", "Charlie", "Bob"],
        "score" => &[85, 92, 85, 78, 88],
    ]?;

    // drop_duplicates：基于所有列去重，保留第一个出现的行
    let deduped = df.clone().lazy()
        .unique(None, UniqueKeepStrategy::First)
        .collect()?;
    println!("去重结果：\n{}", deduped);

    // 基于特定列去重
    let deduped_name = df.clone().lazy()
        .unique(
            Some(Selector::ByName {
                        names: vec!["name".into()].into(),
                        strict: true
                    }),
                    UniqueKeepStrategy::First,
        )
        .collect()?;
    println!("按 name 去重：\n{}", deduped_name);

    Ok(())
}

5.2 `sort()` 排序

use polars::prelude::*;

fn sort_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "name"   => &["Charlie", "Alice", "Eve", "Bob", "Diana"],
        "score"  => &[78, 95, 88, 62, 91],
    ]?;

    // 按分数降序排列
    let sorted = df.clone().lazy()
        .sort(
            vec!["score"],                          // ← 用列名字符串
            SortMultipleOptions::default()
                .with_order_descending(true)
        )
        .collect()?;
    println!("按分数降序：\n{}", sorted);

    // 多列排序：先按 score 降序，再按 name 升序
    let multi_sorted = df.clone().lazy()
            .sort(
                vec!["score", "name"],                  // ← 列名列表
                SortMultipleOptions::default()
                    .with_order_descending_multi(vec![true, false])   // true=降序, false=升序
            )
            .collect()?;
    println!("多列排序：\n{}", multi_sorted);

    Ok(())
}

5.3 `rechunk()` 性能优化提示

Polars 内部将数据存储在连续的内存块（Chunk）中。经过多次操作后，DataFrame 可能会包含多个零散的 Chunk，影响后续计算性能。rechunk() 可以将所有列合并为单个连续的 Chunk：

use polars::prelude::*;

fn rechunk_example() -> Result<(), Box<dyn std::error::Error>> {
    let mut df = df![
        "name"  => &["Alice", "Bob", "Charlie", "Diana"],
        "score" => &[90, 85, 95, 88],
    ]?;

    println!("rechunk 前 chunk 数：{}", df.n_chunks());

    // 执行 rechunk（推荐方式）
    let rechunked = df.rechunk_mut();   // 大多数情况下这个可用

    println!("rechunk 后 chunk 数：{}", rechunked.n_chunks());
    println!("rechunked 数据：\n{}", rechunked);

    Ok(())
}

⚡ 性能提示：在完成一系列数据清洗操作后，调用一次 rechunk() 可以提升后续计算的性能。不过在 Lazy API 中，Polars 的查询优化器通常会自动处理这个问题，所以大多数情况下你不需要手动调用。

5.4 `rename()` 列重命名

use polars::prelude::*;

fn rename_example() -> Result<(), Box<dyn std::error::Error>> {
    let df = df![
        "a" => &["Alice", "Bob"],
        "b" => &[25, 30],
        "c" => &[85.5, 92.0],
    ]?;

    // 批量重命名：使用选择器
    let renamed = df.lazy()
        .rename(
            ["a", "b", "c"],
            ["name", "age", "score"],
            true,  // 覆盖已存在的名称
        )
        .collect()?;
    println!("重命名结果：\n{}", renamed);

    Ok(())
}

六、实战练习：清洗一份脏数据 🏋️

现在，让我们把前面学到的所有知识综合起来，完成一个完整的数据清洗实战！

场景描述

我们拿到了一份电商平台的用户行为日志，数据质量堪忧：

• 用户名有大小写混乱和多余空格
• 手机号格式不统一
• 年龄有缺失值
• 注册时间是字符串格式
• 有重复记录

完整可运行代码

use polars::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("=== 原始脏数据 ===");
    let dirty_df = df![
        "user_id"     => &[1, 2, 3, 4, 5, 1],
        "user_name"   => &["  ALICE  ", "bob", "  Charlie", "diana", "EVE  ", "  ALICE  "],
        "phone"       => &["13800138000", "139-1234-5678", "15088886666", "18600001234", "invalid", "13800138000"],
        "age"         => &[Some(25), None, Some(35), None, Some(28), Some(25)],
        "register_at" => &["2024-01-15", "2024-03-20", "2024-06-01", "2024-12-25", "2024-08-10", "2024-01-15"],
        "amount"      => &[Some(150.0), Some(200.5), None, Some(80.0), None, Some(150.0)],
    ]?;
    println!("{}", dirty_df);

    println!("\n=== 清洗后的数据 ===");

    let clean_df = dirty_df.clone().lazy()
        // 第1步：按 user_id 去重（保留第一条）
        .group_by([col("user_id")])
        .agg([
            col("user_name").first(),
            col("phone").first(),
            col("age").first(),
            col("register_at").first(),
            col("amount").first(),
        ])

        // ===== 第2步：清洗用户名（去空格 + 首字母大写） =====
        .with_column(
            col("user_name")
                .str()
                .strip_chars(lit(" "))
                .str()
                .to_lowercase()
                .alias("user_name")
        )
        // 首字母大写（v0.53 推荐写法，使用 + 拼接）
        .with_column(
            when(col("user_name").is_not_null())
                .then(
                    col("user_name").str().slice(lit(0), lit(1)).str().to_uppercase()
                    + col("user_name").str().slice(lit(1), lit(100))
                )
                .otherwise(lit(NULL))
                .alias("user_name")
        )

        // 第3步：清理手机号（去掉横线）
        .with_column(
            col("phone")
                .str()
                .replace_all(lit("-"), lit(""), false)
                .alias("phone")
        )

        // 第4步：填充年龄缺失值（用平均年龄）
        .with_column(
            col("age")
                .fill_null(col("age").mean())
                .alias("age")
        )

        // 第5步：解析注册日期
        .with_column(
            col("register_at")
                .str()
                .to_date(StrptimeOptions {
                    format: Some("%Y-%m-%d".into()),
                    ..Default::default()
                })
                .alias("register_date")
        )

        // 第6步：填充消费金额缺失值
        .with_column(
            col("amount")
                .fill_null(lit(0.0))
                .alias("amount")
        )

        // 第7步：格式化日期为中文
        .with_column(
            col("register_date")
                .dt()
                .strftime("%Y年%m月%d日")
                .alias("注册日期")
        )

        // 第8步：按 user_id 升序排序
        .sort(
            vec!["user_id"],
            SortMultipleOptions::default().with_order_descending(false)
        )

        // 第9步：选择最终需要的列
        .select([
            col("user_id"),
            col("user_name"),
            col("phone"),
            col("age"),
            col("注册日期"),
            col("amount"),
        ])
        .collect()?;

    println!("{}", clean_df);

    // 统计信息
    println!("\n=== 清洗统计 ===");
    println!("原始行数：{}", dirty_df.height());
    println!("清洗后行数：{}", clean_df.height());
    println!("删除重复行数：{}", dirty_df.height() - clean_df.height());

    Ok(())
}

预期输出：

=== 原始脏数据 ===
shape: (6, 6)
┌─────────┬───────────┬───────────────┬──────┬─────────────┬────────┐
│ user_id ┆ user_name ┆ phone         ┆ age  ┆ register_at ┆ amount │
│ ---     ┆ ---       ┆ ---           ┆ ---  ┆ ---         ┆ ---    │
│ i32     ┆ str       ┆ str           ┆ i32  ┆ str         ┆ f64    │
╞═════════╪═══════════╪═══════════════╪══════╪═════════════╪════════╡
│ 1       ┆   ALICE   ┆ 13800138000   ┆ 25   ┆ 2024-01-15  ┆ 150.0  │
│ 2       ┆ bob       ┆ 139-1234-5678 ┆ null ┆ 2024-03-20  ┆ 200.5  │
│ 3       ┆   Charlie ┆ 15088886666   ┆ 35   ┆ 2024-06-01  ┆ null   │
│ 4       ┆ diana     ┆ 18600001234   ┆ null ┆ 2024-12-25  ┆ 80.0   │
│ 5       ┆ EVE       ┆ invalid       ┆ 28   ┆ 2024-08-10  ┆ null   │
│ 1       ┆   ALICE   ┆ 13800138000   ┆ 25   ┆ 2024-01-15  ┆ 150.0  │
└─────────┴───────────┴───────────────┴──────┴─────────────┴────────┘

=== 清洗后的数据 ===
shape: (5, 6)
┌─────────┬───────────┬─────────────┬───────────┬────────────────┬────────┐
│ user_id ┆ user_name ┆ phone       ┆ age       ┆ 注册日期       ┆ amount │
│ ---     ┆ ---       ┆ ---         ┆ ---       ┆ ---            ┆ ---    │
│ i32     ┆ str       ┆ str         ┆ f64       ┆ str            ┆ f64    │
╞═════════╪═══════════╪═════════════╪═══════════╪════════════════╪════════╡
│ 1       ┆ Alice     ┆ 13800138000 ┆ 25.0      ┆ 2024年01月15日 ┆ 150.0  │
│ 2       ┆ Bob       ┆ 13912345678 ┆ 29.333333 ┆ 2024年03月20日 ┆ 200.5  │
│ 3       ┆ Charlie   ┆ 15088886666 ┆ 35.0      ┆ 2024年06月01日 ┆ 0.0    │
│ 4       ┆ Diana     ┆ 18600001234 ┆ 29.333333 ┆ 2024年12月25日 ┆ 80.0   │
│ 5       ┆ Eve       ┆ invalid     ┆ 28.0      ┆ 2024年08月10日 ┆ 0.0    │
└─────────┴───────────┴─────────────┴───────────┴────────────────┴────────┘

=== 清洗统计 ===
原始行数：6
清洗后行数：5
删除重复行数：1

🎉 恭喜！ 你已经完成了一个完整的数据清洗 Pipeline！从去重、字符串清洗、空值填充到日期解析，一气呵成。这就是 Polars 表达式系统的魅力——代码即文档，链式即逻辑。

七、课后作业 📝

作业题目：通用清洗 Pipeline 函数

请编写一个通用的数据清洗函数 clean_dataframe，接受一个 DataFrame 和清洗配置，返回清洗后的 DataFrame。

要求：

1. 支持配置哪些列需要去除空白
2. 支持配置哪些列需要填充空值及填充策略
3. 支持配置日期列的解析格式
4. 支持去重配置

参考框架：

use polars::prelude::*;
use std::collections::HashMap;

/// 清洗配置结构体
struct CleanConfig {
    /// 需要去除空白的字符串列
    strip_columns: Vec<String>,
    /// 需要填充空值的列及其填充值（列名 -> 填充值表达式）
    fill_null_columns: HashMap<String, String>,
    /// 需要解析为日期的列及其格式（列名 -> 日期格式）
    date_columns: HashMap<String, String>,
    /// 去重依据的列
    dedup_columns: Option<Vec<String>>,
}

/// 通用数据清洗函数
///
/// 根据配置对 DataFrame 进行清洗操作
fn clean_dataframe(
    df: DataFrame,
    config: &CleanConfig,
) -> Result<DataFrame, Box<dyn std::error::Error>> {
    let mut lazy_df = df.lazy();

    // TODO: 1. 根据 strip_columns 去除字符串列两端空白
    // 提示：遍历 config.strip_columns，对每列应用 .str().strip_chars(None)

    // TODO: 2. 根据 fill_null_columns 填充空值
    // 提示：遍历 config.fill_null_columns，对每列应用 .fill_null()

    // TODO: 3. 根据 date_columns 解析日期
    // 提示：遍历 config.date_columns，对每列应用 .str().to_date()

    // TODO: 4. 根据 dedup_columns 去重
    // 提示：使用 .unique() 方法

    lazy_df.collect().map_err(Into::into)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 创建测试数据并使用你的清洗函数
    let df = df![
        "name"  => &["  Alice  ", "  Bob  ", "  Charlie  "],
        "age"   => &[Some(25), None, Some(35)],
        "date"  => &["2024-01-15", "2024-03-20", "2024-06-01"],
    ]?;

    let config = CleanConfig {
        strip_columns: vec!["name".to_string()],
        fill_null_columns: HashMap::from([
            ("age".to_string(), "0".to_string()),
        ]),
        date_columns: HashMap::from([
            ("date".to_string(), "%Y-%m-%d".to_string()),
        ]),
        dedup_columns: None,
    };

    let cleaned = clean_dataframe(df, &config)?;
    println!("{}", cleaned);
    Ok(())
}

加分项：

• 支持前向/后向填充策略
• 支持列重命名配置
• 添加清洗日志输出

💡 提示：你可以使用 col() 宏动态构建表达式，配合 with_column 逐步应用清洗规则。Polars 的 Lazy API 天然适合这种动态 Pipeline 的构建方式！

总结与下节预告 📌

今天我们学习了 Polars 中数据清洗的五大核心技能：

技能	核心方法	应用场景
类型转换	.cast(), Schema Override	读取数据后的类型修正
空值处理	fill_null, drop_nulls, is_null	缺失数据补全与过滤
字符串操作	.str.contains(), .str.replace(), .str.split()	文本清洗与格式化
日期时间	.dt.year(), .str.to_date(), strftime()	时间解析与计算
去重排序	unique(), sort(), rename()	数据整理

这些工具覆盖了日常数据清洗 90% 以上的需求。记住 Polars 的核心理念：用表达式描述"做什么"，让引擎决定"怎么做"。

下一课预告：第 6 课我们将学习 分组聚合（GroupBy）——这是数据分析中最强大的操作之一。从简单的 group_by().agg() 到复杂的窗口函数，带你玩转数据聚合！📊

有任何问题欢迎在评论区留言，我们下期见！👋

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2026-04-11，如有侵权请联系 cloudcommunity@tencent.com 删除

配置

本文分享自 Rust火箭工坊微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！