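I am trying to categorize messy data in a four-column dataframe, assigning each row the correct company type:
"company_name" "categories" "search" "company_type"
John landscaping lawn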
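I would like my final result to look like this:
"company_name" "categories" "search" "company_type"
John landscaping lawn landscaping brothers lawn cleaning cleaning landscaping paint paint cleaning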
I am using the function Chris created here: https://r-dir.com/blog/2015/01/quickly-categorize-messy-data.html
Here is the code:
df$company_type <- NA
categorizeDF <- function(df, searchColName, searchList, catList, newColName="Category") {
  catDF <- data.frame(matrix(ncol=ncol(df), nrow=0))
  colnames(catDF) <- paste0(names(df))
  df$sequence <- seq(nrow(df))
  for (i in seq_along(searchList)) {
    rownames(df) <- NULL
    index <- grep(searchList[i], df[,which(colnames(df) == searchColName)], ignore.case=TRUE)
    tempDF <- df[index,]
    tempDF$newCol <- catList[i]
    catDF <- rbind(catDF, tempDF)
    df <- df[-index,]
  }
  if (nrow(df) > 0) {
    df$newCol <- "OTHER"
    catDF <- rbind(catDF, df)
  }
  catDF <- catDF[order(catDF$sequence),]
  catDF$sequence <- NULL
  rownames(catDF) <- NULL
  catDF$newCol <- as.factor(catDF$newCol)
  colnames(catDF)[which(colnames(catDF) == "newCol")] <- newColName
  catDF
}
sorted <- categorizeDF(df, "company_name", "search", "categories", "company_type")
However, I get an error (with traceback):
Error in `$<-.data.frame`(`*tmp*`, "newCol", value = "categories") :
replacement has 1 row, data has 0
4. stop(sprintf(ngettext(N, "replacement has %d row, data has %d",
       "replacement has %d rows, data has %d"), N, nrows), domain = NA)
3. `$<-.data.frame`(`*tmp*`, "newCol", value = "categories")
2. `$<-`(`*tmp*`, "newCol", value = "categories")
1. categorizeDF(df, "company_name", "search", "categories", "company_type")
Any help would be greatly appreciated.
Posted on 2022-07-28 22:01:46
This was caused by a search string that did not appear anywhere in the messy data column.
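To illustrate the cause, here is a minimal sketch that reproduces the same error outside the function. The toy data frame and its values are hypothetical, and the pattern "search" is used only because the traceback shows value = "categories", which suggests the literal strings were passed rather than the column contents.

```r
# Hypothetical toy data; the names and values are illustrative only.
df <- data.frame(company_name = c("Johns Landscaping", "Brothers Lawn"),
                 stringsAsFactors = FALSE)

# A pattern that matches no row yields an empty integer vector.
index <- grep("search", df$company_name, ignore.case = TRUE)

# Subsetting with the empty index gives a zero-row data frame.
tempDF <- df[index, , drop = FALSE]

# Assigning a length-1 value to a column of a zero-row data frame fails;
# tryCatch captures the message instead of stopping the script.
msg <- tryCatch({
  tempDF$newCol <- "categories"
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# A second pitfall: negative indexing with an empty vector drops every row,
# because -integer(0) is still integer(0).
remaining <- df[-index, , drop = FALSE]
print(nrow(remaining))
```

Both problems go away once iterations with an empty index are skipped.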
I updated the function as follows, and it worked:
categorizeDF <- function(df, searchColName, searchList, catList, newColName="Category") {
  catDF <- data.frame(matrix(ncol=ncol(df), nrow=0))
  colnames(catDF) <- paste0(names(df))
  # Remember the original row order so it can be restored at the end
  df$sequence <- seq(nrow(df))
  for (i in seq_along(searchList)) {
    rownames(df) <- NULL
    index <- grep(searchList[i], df[,which(colnames(df) == searchColName)], ignore.case=TRUE)
    # Skip search strings that match no rows; otherwise the assignment
    # below fails on a zero-row data frame
    if (identical(index,integer(0))){
      next
    }
    tempDF <- df[index,]
    tempDF$newCol <- catList[i]
    catDF <- rbind(catDF, tempDF)
    # Remove the matched rows so later search strings cannot re-categorize them
    df <- df[-index,]
  }
  # Rows that matched no search string are labeled "OTHER"
  if (nrow(df) > 0) {
    df$newCol <- "OTHER"
    catDF <- rbind(catDF, df)
  }
  catDF <- catDF[order(catDF$sequence),]
  catDF$sequence <- NULL
  rownames(catDF) <- NULL
  catDF$newCol <- as.factor(catDF$newCol)
  colnames(catDF)[which(colnames(catDF) == "newCol")] <- newColName
  catDF
}
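As an alternative, the same first-match-wins idea can be sketched with logical masks from grepl, which avoids the empty-index case entirely. This is not the function from the linked blog post: the helper name categorize, the toy data, and the search and category vectors below are all hypothetical. The sketch also assumes the patterns and categories are passed as character vectors, not as the quoted names "search" and "categories".

```r
# Hypothetical helper: label each element of x with the category of the
# first pattern it matches, or with `default` when nothing matches.
categorize <- function(x, patterns, categories, default = "OTHER") {
  out <- rep(default, length(x))
  assigned <- rep(FALSE, length(x))
  for (i in seq_along(patterns)) {
    # Logical mask: TRUE where the pattern matches and no label is set yet
    hit <- grepl(patterns[i], x, ignore.case = TRUE) & !assigned
    out[hit] <- categories[i]  # harmless when `hit` is all FALSE
    assigned <- assigned | hit
  }
  out
}

# Illustrative data only
df <- data.frame(company_name = c("Johns Lawn Care", "Brothers Cleaning",
                                  "Acme Paint", "Zed Corp"),
                 stringsAsFactors = FALSE)
search_terms <- c("lawn", "cleaning", "plumbing")   # "plumbing" matches nothing
category_names <- c("landscaping", "cleaning", "plumbing")

df$company_type <- categorize(df$company_name, search_terms, category_names)
print(df)
```

Because assignment through a logical mask does nothing when no element is TRUE, a search term with no matches needs no special handling here.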
https://stackoverflow.com/questions/73157588