How can I convert these text fields into numbers?

Stack Overflow user
Asked 2022-10-04 18:27:42
Answers: 2 · Views: 86 · Followers: 0 · Votes: 1

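My data has many text entries like these: 7-8 business days, 10 days, 10-12 days, 2-3 weeks, 1 year, 8-12 days, 2 weeks, and so on.

I want to train a model to convert these text entries into numbers. I assume this is a fairly easy problem for an NLP model, but I am not confident about which model to use.

When a range is given, I want to take the larger of the two numbers. For example, 2-3 weeks should become 21 days, and 8-10 days should become 10 days.

I think I could hand-code about 100 records to train the model. Can anyone recommend an NLP model to use, or even a script that I could edit? If this is not a good use case for NLP, please say so.

(I have only practiced in the R programming language, but I can adapt Python code if needed.)

If all else fails, I could split the data into separate columns, for example with strsplit(as.character(df$column), "-") using the hyphen as the delimiter, and then compute the values with if-statements, for example: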

date_conversion = if_else(is.numeric(columnA) & columnB == "weeks", columnA*7, if_else(is.numeric(columnA) & columnB == "days", columnA, NA_real_))

Ideally, though, I would like to train a model, because I have a lot of data.

Some of the data, from dput:

text_fields <- c("10 - 14 days", "2-3 Weeks", "10-12 days", "8 days", "8-12 days", 
                 "10 days", "7-10 days", "5 days", "7-10 days", "5-7 days", "10 days", 
                 "7 days", "7 - 10 days", "1 week", "5-7 days", "2 weeks", "2-4 weeks", 
                 "10 days", "7-10 days", "8-10 days", "1 week", "10 days", "8-10 days", 
                 "2 weeks", "10-12 days", "7-10 days", "2-3 weeks", "7-10 days", 
                 "10 days", "2 weeks", "8-12 days", "12 days", "10 Days", "7 Days", 
                 "2 weeks", "5-8 days", "8-12 days", "8-12 days", "10 days", "12-14 days", 
                 "10-12 days", "7 days", "5-7 days", "2 weeks", "2-3 weeks", "5-7 days", 
                 "5-7 business days", "5-7 days", "5-7 business days", "7 days", 
                 "2-3 weeks", "7-10 days", "8-12 Days", "10 days", "10 days", 
                 "10 days", "10 days", "10 days", "14", "2 weeks", "10 business days", 
                 "2-3 weeks", "4 days", "1 month", "7-10 days", "8-12 days", "2-3 weeks", 
                 "3-5 days", "10 days", "3-5 days", "2-3 days", "2-3 days", "3-5", 
                 "5-7 days", "7-10 days", "5-7 days", "8-12 days", "7-10 days", 
                 "7-10 days", "7-10 days", "2.5 weeks", "2 Weeks", "10-12 days", 
                 "10-12 days", "7-10 days", "7-10 days", "7-10 days", "7-10 days", 
                 "7-10 days", "7-10 days", "2 weeks", "1 month", "1 month", "1 week"
)

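2 Answers

Stack Overflow user

Accepted answer

Answered 2022-10-04 20:29:03

For a case like this, where every kind of value in columnA is predictable, you most likely do not need any deep learning, if that is what you mean by NLP.

So, in my experience, your if-statement suggestion is actually the right general idea.

Here is one way to do it using the tidyverse library and regular expressions: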

library(tidyverse)

text_fields <- c("10 - 14 days", "2-3 Weeks", "10-12 days", "8 days", "8-12 days", 
                 "10 days", "7-10 days", "5 days", "7-10 days", "5-7 days", "10 days", 
                 "7 days", "7 - 10 days", "1 week", "5-7 days", "2 weeks", "2-4 weeks", 
                 "10 days", "7-10 days", "8-10 days", "1 week", "10 days", "8-10 days", 
                 "2 weeks", "10-12 days", "7-10 days", "2-3 weeks", "7-10 days", 
                 "10 days", "2 weeks", "8-12 days", "12 days", "10 Days", "7 Days", 
                 "2 weeks", "5-8 days", "8-12 days", "8-12 days", "10 days", "12-14 days", 
                 "10-12 days", "7 days", "5-7 days", "2 weeks", "2-3 weeks", "5-7 days", 
                 "5-7 business days", "5-7 days", "5-7 business days", "7 days", 
                 "2-3 weeks", "7-10 days", "8-12 Days", "10 days", "10 days", 
                 "10 days", "10 days", "10 days", "14", "2 weeks", "10 business days", 
                 "2-3 weeks", "4 days", "1 month", "7-10 days", "8-12 days", "2-3 weeks", 
                 "3-5 days", "10 days", "3-5 days", "2-3 days", "2-3 days", "3-5", 
                 "5-7 days", "7-10 days", "5-7 days", "8-12 days", "7-10 days", 
                 "7-10 days", "7-10 days", "2.5 weeks", "2 Weeks", "10-12 days", 
                 "10-12 days", "7-10 days", "7-10 days", "7-10 days", "7-10 days", 
                 "7-10 days", "7-10 days", "2 weeks", "1 month", "1 month", "1 week"
)

# putting the values in a dataframe. `tibble()` also works
df <- data.frame(text = text_fields)

# The %>% is a special operator that pipes the result into the first argument of the next function. I use it to keep things clean.
df <- df %>% 
  # I capture the last instance of a number in each value
  # then save the captured values to a new column called days
  mutate(days = str_match_all(text, "\\d*\\.?\\d+") %>% 
           # take the last match only
           lapply(tail, 1) %>% 
           # collapse this list into a simple vector
           unlist() %>% 
           # change text to number
           as.numeric()
         ) %>% 
  # I update the days column according to the type of unit
  mutate(days = case_when(
    # (?i) makes the search case insensitive.
    str_detect(text, "(?i)business") ~ days + 2 * ceiling(days / 7),
    str_detect(text, "(?i)week") ~ days * 7,
    str_detect(text, "(?i)month") ~ ceiling(days * 30.5),
    str_detect(text, "(?i)year") ~ ceiling(days * 365.25),
    TRUE ~ days
  ))

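Once this is done, you can check for edge cases that were missed and extend the function to support them. Applying your idea of writing training examples, you can create a batch of tricky cases and test your code against them. If the code fails on one, update it so that it handles that case.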
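As a rough illustration of that test-and-extend loop, here is a minimal Python sketch. It mirrors the logic of the R pipeline above (last number in the text, then a unit multiplier); the names `to_days`, `CASES`, and `failing_cases` are hypothetical, and the expected values are hand-labelled examples, not part of the original answer.

```python
import math
import re

# Same pattern as the R answer: an optional integer part, optional dot, digits.
NUMBER = re.compile(r"\d*\.?\d+")

def to_days(text):
    """Return the upper bound of a duration string in days, or None if no number is found."""
    numbers = NUMBER.findall(text)
    if not numbers:
        return None
    value = float(numbers[-1])  # last number = upper end of a range
    lowered = text.lower()
    if "business" in lowered:
        # Add two weekend days per started week, as in the R code.
        return value + 2 * math.ceil(value / 7)
    if "week" in lowered:
        return value * 7
    if "month" in lowered:
        return math.ceil(value * 30.5)
    if "year" in lowered:
        return math.ceil(value * 365.25)
    return value  # plain days, or no unit given

# Hand-labelled cases (illustrative); add a new entry whenever a new edge case shows up.
CASES = {
    "10 - 14 days": 14,
    "2-3 Weeks": 21,
    "2.5 weeks": 17.5,
    "5-7 business days": 9,
    "1 month": 31,
    "14": 14,
    "3-5": 5,
}

def failing_cases(cases):
    """Map each failing input to (actual, expected); an empty dict means all cases pass."""
    return {
        text: (to_days(text), expected)
        for text, expected in cases.items()
        if to_days(text) != expected
    }

print(failing_cases(CASES))
```

Whenever a new input format turns up in the data, adding it to `CASES` first and then adjusting `to_days` until `failing_cases` comes back empty keeps the earlier cases from silently breaking.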

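Here is a good resource for testing and learning regular expressions: https://regex101.com/

Note that when you put a regular expression into an R string, you must escape the escape character: write \\ instead of \.

Votes: 1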

Stack Overflow user

Answered 2022-10-04 20:08:01

If your input is a simple list, you can work out the correct number of days with simple logic and without NLP:

input_list = ["10-12 days", "7 days", "5-7 days", "2 weeks", "2-3 weeks", "5-7 days"]

for inp in input_list:
    relevantPart = inp.split("-")[-1]
    relevantPart = relevantPart.strip().replace("  ", " ").split(" ")

    if "day" in relevantPart[-1].lower():
        print(float(relevantPart[0]))
    elif "week" in relevantPart[-1].lower():
        print(float(relevantPart[0]) * 7)
    elif "month" in relevantPart[-1].lower():
        print(float(relevantPart[0]) * 31)
    else:
        print("No valid period of time.")

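Basically, split("-")[-1] simply ignores everything before the -. Add some string handling, cleanup, and simple arithmetic, and you get what you want.

You may want to add more specific cases (years, for example) and handle months differently (count 28, 29, 30, or 31 days?).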
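One possible way to make those choices explicit is to keep the conversion factors in a table that the caller can override. The sketch below is only an illustration: `DAYS_PER_UNIT`, its values (30 days per month, 365 per year), and `upper_bound_in_days` are assumptions rather than part of the answer above, and, like the loop above, it treats business days as ordinary days.

```python
import re

# Assumed conversion factors; override them to count months or years differently.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

def upper_bound_in_days(text, days_per_unit=DAYS_PER_UNIT, default_unit="day"):
    """Convert the part after the last hyphen to days; return None if it has no number."""
    upper = text.split("-")[-1].strip().lower()
    match = re.search(r"\d+(?:\.\d+)?", upper)
    if match is None:
        return None
    value = float(match.group())
    for unit, factor in days_per_unit.items():
        if unit in upper:
            return value * factor
    # No recognised unit (e.g. "14" or "3-5"): fall back to the default unit.
    return value * days_per_unit[default_unit]

for entry in ["10-12 days", "2-3 weeks", "1 year", "3-5", "1 month"]:
    print(entry, "->", upper_bound_in_days(entry))

# Counting a month as 31 days instead of 30:
print(upper_bound_in_days("1 month", {**DAYS_PER_UNIT, "month": 31}))
```

Keeping the factors in one dictionary means the month question is answered in a single place, and entries without a unit no longer fall through to an error message.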

Votes: 1
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/73952134
