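My data has many text entries like the following: 7-8 business days, 10 days, 10-12 days, 2-3 weeks, 1 year, 8-12 days, 2 weeks, and so on.
I want to train a model to convert these text entries into numbers. I think this is a fairly tractable problem for an NLP model, but I'm not confident about which model to use.
If a range is given, I want to take the larger of the two numbers. For example, 2-3 weeks should become 21 days, and 8-10 days should become 10 days.
I think I can hand-code about 100 records to train the model. Can anyone recommend an NLP model to use, or even a script I can edit? If this is not a good use case for NLP, please suggest alternatives.
(I only have practice in R, but I can adapt Python code if needed.)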
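If all else fails, I was going to call strsplit(as.character(df$column), "-") twice, using the hyphen as the delimiter, to split the data into separate columns, and then compute the value with if statements, for example:
date_conversion = if_else(is.numeric(columnA) & columnB == "weeks", columnA * 7, if_else(is.numeric(columnA) & columnB == "days", columnA, NA_real_))
But ideally I would like to train a model, since I have a lot of data.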
Some data, from dput:
text_fields <- c("10 - 14 days", "2-3 Weeks", "10-12 days", "8 days", "8-12 days",
"10 days", "7-10 days", "5 days", "7-10 days", "5-7 days", "10 days",
"7 days", "7 - 10 days", "1 week", "5-7 days", "2 weeks", "2-4 weeks",
"10 days", "7-10 days", "8-10 days", "1 week", "10 days", "8-10 days",
"2 weeks", "10-12 days", "7-10 days", "2-3 weeks", "7-10 days",
"10 days", "2 weeks", "8-12 days", "12 days", "10 Days", "7 Days",
"2 weeks", "5-8 days", "8-12 days", "8-12 days", "10 days", "12-14 days",
"10-12 days", "7 days", "5-7 days", "2 weeks", "2-3 weeks", "5-7 days",
"5-7 business days", "5-7 days", "5-7 business days", "7 days",
"2-3 weeks", "7-10 days", "8-12 Days", "10 days", "10 days",
"10 days", "10 days", "10 days", "14", "2 weeks", "10 business days",
"2-3 weeks", "4 days", "1 month", "7-10 days", "8-12 days", "2-3 weeks",
"3-5 days", "10 days", "3-5 days", "2-3 days", "2-3 days", "3-5",
"5-7 days", "7-10 days", "5-7 days", "8-12 days", "7-10 days",
"7-10 days", "7-10 days", "2.5 weeks", "2 Weeks", "10-12 days",
"10-12 days", "7-10 days", "7-10 days", "7-10 days", "7-10 days",
"7-10 days", "7-10 days", "2 weeks", "1 month", "1 month", "1 week"
)
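Answered 2022-10-04 20:29:03
For a case like this, where all the kinds of values in columnA are predictable, you most likely don't need any deep learning, if that is what you mean by NLP.
So, in my experience, your if-statement suggestion is actually the right general idea.
Here is one way to do it using the tidyverse library and regular expressions: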
library(tidyverse)
text_fields <- c("10 - 14 days", "2-3 Weeks", "10-12 days", "8 days", "8-12 days",
"10 days", "7-10 days", "5 days", "7-10 days", "5-7 days", "10 days",
"7 days", "7 - 10 days", "1 week", "5-7 days", "2 weeks", "2-4 weeks",
"10 days", "7-10 days", "8-10 days", "1 week", "10 days", "8-10 days",
"2 weeks", "10-12 days", "7-10 days", "2-3 weeks", "7-10 days",
"10 days", "2 weeks", "8-12 days", "12 days", "10 Days", "7 Days",
"2 weeks", "5-8 days", "8-12 days", "8-12 days", "10 days", "12-14 days",
"10-12 days", "7 days", "5-7 days", "2 weeks", "2-3 weeks", "5-7 days",
"5-7 business days", "5-7 days", "5-7 business days", "7 days",
"2-3 weeks", "7-10 days", "8-12 Days", "10 days", "10 days",
"10 days", "10 days", "10 days", "14", "2 weeks", "10 business days",
"2-3 weeks", "4 days", "1 month", "7-10 days", "8-12 days", "2-3 weeks",
"3-5 days", "10 days", "3-5 days", "2-3 days", "2-3 days", "3-5",
"5-7 days", "7-10 days", "5-7 days", "8-12 days", "7-10 days",
"7-10 days", "7-10 days", "2.5 weeks", "2 Weeks", "10-12 days",
"10-12 days", "7-10 days", "7-10 days", "7-10 days", "7-10 days",
"7-10 days", "7-10 days", "2 weeks", "1 month", "1 month", "1 week"
)
# Put the values in a data frame. `tibble()` also works.
df <- data.frame(text = text_fields)
# %>% pipes the result on its left into the first argument of the next function. I use it to keep things clean.
df <- df %>%
  # Capture the last number in each value,
  # then save the captured values to a new column called days
  mutate(days = str_match_all(text, "\\d*\\.?\\d+") %>%
    # take the last match only
    lapply(tail, 1) %>%
    # collapse this list into a simple vector
    unlist() %>%
    # change text to number
    as.numeric()
  ) %>%
  # Update the days column according to the type of unit
  mutate(days = case_when(
    # (?i) makes the search case insensitive.
    str_detect(text, "(?i)business") ~ days + 2 * ceiling(days / 7),
    str_detect(text, "(?i)week") ~ days * 7,
    str_detect(text, "(?i)month") ~ ceiling(days * 30.5),
    str_detect(text, "(?i)year") ~ ceiling(days * 365.25),
    TRUE ~ days
  ))
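Once this is done, you can check for edge cases that were missed and extend the function to support them. Applying your idea of hand-writing training examples, you can create a batch of tricky cases and test the code against them. If the code fails on one, update it so that it handles that case.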
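The case-by-case testing idea can be sketched as a small table-driven check. This is only an illustration in Python, not the R code above: `to_days`, `CASES`, and `run_cases` are hypothetical names, the parser is a rough port of the same rules (last number, then unit, with the same weekend adjustment for business days), and the expected values are hand-computed under those rules.

```python
import math
import re

# Same pattern as the R answer: optional integer part, optional dot, digits
NUMBER = re.compile(r"\d*\.?\d+")


def to_days(text):
    """Return the upper bound of a duration string in calendar days, or None."""
    numbers = NUMBER.findall(text)
    if not numbers:
        return None  # nothing numeric to convert
    value = float(numbers[-1])  # the last number is the upper end of a range
    lowered = text.lower()
    if "business" in lowered:
        # assumption carried over from the R answer: 2 extra days per started week
        return value + 2 * math.ceil(value / 7)
    if "week" in lowered:
        return value * 7
    if "month" in lowered:
        return math.ceil(value * 30.5)
    if "year" in lowered:
        return math.ceil(value * 365.25)
    return value  # no unit given: treated as days


# Hand-written (input, expected) pairs, including a few awkward ones
CASES = [
    ("10 - 14 days", 14),
    ("2-3 Weeks", 21),
    ("2.5 weeks", 17.5),
    ("5-7 business days", 9),
    ("1 month", 31),
    ("14", 14),
    ("soon", None),
]


def run_cases(cases):
    """Return (text, expected, actual) for every case the parser gets wrong."""
    return [(t, e, to_days(t)) for t, e in cases if to_days(t) != e]


failures = run_cases(CASES)
print(failures)  # an empty list means every case passed
```

Each time a new failure turns up in the real data, add it to the table first and then adjust the parser until the list of failures is empty again.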
Here is a good resource for testing and learning regular expressions: https://regex101.com/
Note that when you put a regular expression into R, you must escape the escape character: instead of \, use \\.
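If you end up porting the pattern to Python, the same escaping issue comes up there. A minimal sketch: the doubled-backslash form and a raw string (the `r` prefix) spell the same pattern, and the raw form can be pasted straight from regex101. The variable names are only for illustration.

```python
import re

escaped = "\\d*\\.?\\d+"  # backslashes doubled, as in the R code
raw = r"\d*\.?\d+"        # raw string: backslashes are taken literally

print(escaped == raw)                      # True
print(re.findall(raw, "2.5 - 3 weeks"))    # ['2.5', '3']
```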
Answered 2022-10-04 20:08:01
If your input is a simple list, you can determine the correct number of days with simple logic and without NLP:
input_list = ["10-12 days", "7 days", "5-7 days", "2 weeks", "2-3 weeks", "5-7 days"]
for inp in input_list:
    # keep only the part after the last hyphen, i.e. the upper end of a range
    relevantPart = inp.split("-")[-1]
    # split() without arguments trims the ends and collapses repeated spaces
    relevantPart = relevantPart.split()
    if "day" in relevantPart[-1].lower():
        print(float(relevantPart[0]))
    elif "week" in relevantPart[-1].lower():
        print(float(relevantPart[0]) * 7)
    elif "month" in relevantPart[-1].lower():
        print(float(relevantPart[0]) * 31)
    else:
        print("No valid period of time.")
Basically, with split("-")[-1] you simply ignore everything before the -. Add a bit of string handling, cleanup, and simple math and you get what you want.
You may want to add more specific cases (years, for example) and handle months differently (count 28, 29, 30, or 31 days?).
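One way to make those extensions easy is to move the unit factors into a lookup table, so that adding years or changing the month length is a one-line change. The sketch below keeps the same split("-")[-1] idea. The names `DAYS_PER_UNIT` and `upper_bound_days` are hypothetical, and the values 30 and 365 and the rule that a bare number means days are assumptions you may want to change.

```python
# Assumed conversion factors; adjust "month" and "year" to taste
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}


def upper_bound_days(text, days_per_unit=DAYS_PER_UNIT):
    """Convert e.g. '2-3 weeks' to days using the upper end of the range."""
    # keep only what follows the last hyphen, split on any whitespace
    tail = text.split("-")[-1].split()
    if not tail:
        return None
    try:
        amount = float(tail[0])
    except ValueError:
        return None  # no leading number, e.g. "soon"
    if len(tail) == 1:
        return amount  # assumption: a bare number such as "14" means days
    unit_word = tail[-1].lower()
    for unit, factor in days_per_unit.items():
        if unit in unit_word:  # "weeks" contains "week", and so on
            return amount * factor
    return None  # unit not recognised


for entry in ["10-12 days", "2-3 weeks", "1 month", "1 year", "3-5"]:
    print(entry, "->", upper_bound_days(entry))

# Counting a month as 31 days instead, without touching the function
print(upper_bound_days("1 month", {**DAYS_PER_UNIT, "month": 31}))
```

Returning None instead of printing a message lets the caller decide what to do with entries that cannot be parsed, for example collecting them for manual review.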
https://stackoverflow.com/questions/73952134