这是线程Extract rows with duplicate values in two or more fields but different values in another field的后续,目标是用DuckDB对大于RAM的数据执行相同的操作。上面提到的线程中的dupKeyEx()
函数代码只适用于适合于RAM的数据。
下面是这条线索的要点:
library(data.table)
customers <- structure(list(
NAME = c("GEETA SHYAM RAO", "B V RAMANA", "GONTU VENKATARAMANAIAH",
"DAMAT RAMAKRISHNA", "MARIAM SUDHAKAR", "VELPURI LAKSHMI SUJATHA",
"MOHAMMED LIYAKHAT ALI", "VENKATESHWARAN PONNAMBALAM",
"DEVARAKONDA SATISH BABU", "GEEDI RAMULU", "KANDU OBULESU",
"B V RAMANA", "DOKKA RAJESH", "G TULASIRAM REDDY",
"MALLELA CHIRANJEEVI", "MANEPALLI VENKATA RAVAMMA",
"DOKKA JAGADEESHWAR", "K KRISHNA", "B SUDARSHAN", "B ANNAPURNA",
"CHAVVA SHIVA RAMULU", "BIKASH BAHADUR CHITRE", "DARBAR ASHOK",
"VEMULAPALLY SANGAMESHWAR RAO", "MOHAMMED ABDUL HAKEEM ANWAR",
"MANEPALLI SHIV SHANKAR RAO", "MOHD MISKEEN MOHIUDDIN",
"KOTLA CHENNAMMA", "NAYAK SURYAKANTH", "GOPIREDDY INDIRA",
"MEKALA SREEDEVI", "K KRISHNA", "B V RAMANA",
"KUMMARI VENKATESHAM", "BHAVANI CONSRUCTIONS",
"UPPUTHOLLA KOTAIAH", "YEDIDHA NIRMALA DEVI", "MARIAM SUDHAKAR",
"B ANNAPURNA", "VELPURI LAKSHMI SUJATHA", "DARBAR ASHOK",
"AMMANA VISHNU VARDHAN REDDY", "ZAITOON BEE", "MOHD CHAND PASHA",
"PALERELLA RAMESH", "GEEDI SRINIVAS", "RAMAIAH SADU",
"BIMAN BALAIAH", "KOTLA CHENNAMMA", "VENKATESHWARAN PONNAMBALAM"),
DOB = c("13-02-1971", "15-01-1960", "01-07-1970", "10-03-1977",
"24-01-1954", "28-06-1971", "26-01-1980", "14-04-1969",
"23-09-1978", "15-08-1954", "09-10-1984", "15-01-1960",
"29-09-1984", "03-03-1975", "26-01-1979", "01-01-1964",
"21-01-1954", "01-05-1964", "12-03-1975", "12-12-1962",
"10-03-1982", "14-05-1983", "03-01-1950", "04-03-1962",
"12-05-1966", "01-06-1960", "10-03-1964", "15-07-1958",
"26-06-1979", "02-04-1974", "10-01-1975", "01-05-1964",
"15-01-1960", "08-08-1977", NA, "05-04-1981", "29-08-1971",
"24-01-1954", "12-12-1962", "28-06-1971", "03-01-1950",
"23-06-1970", "20-02-1960", "05-07-1975", "10-01-1979",
"31-08-1982", "10-08-1983", "10-03-1964", "15-07-1958",
"14-04-1969"),
ID = c(502969, 502902, 502985, 502981, 502475, 502267, 502976,
502272, 502977, 502973, 502986, 502910, 502989, 502998, 502967,
502971, 502988, 502737, 502995, 502878, 502972, 502984, 502639,
502968, 502975, 502970, 502997, 502466, 502991, 502982, 502980,
502737, 502902, 502999, 502994, 502987, 502990, 502047, 502877,
502251, 502548, 502992, 503000, 502993, 502983, 502974, 502996,
502979, 502467, 502290),
PIN = c(500082, 500032, 500032, 500032, 500032, 500084, 500032, 500032,
500032, 500032, 500032, 500033, 500032, 500084, 500084, 500032,
5e+05, 500050, 500032, 500084, 500032, 500032, 500032, 500050,
500032, 500032, 500045, 500032, 500084, 500032, 500032, 500084,
500035, 500084, 500032, 500032, 500032, 500032, 500084, 500032,
500084, 500033, 500084, 500032, 500032, 500032, 500084, 500032,
500032, 500032)),
.Names = c("NAME", "DOB", "ID", "PIN"),
row.names = c(NA, -50L), class = c("data.table", "data.frame"))
目标是从上表中提取具有以下内容的行:
。
工作职能是:
dupKeyEx <- function(DT, dup_cols, unique_cols) {
cols <- c(dup_cols, unique_cols)
chr_cols <- cols[sapply(DT[, ..cols], is.character)]
DT[, (chr_cols) := lapply(.SD, stringr::str_trim), .SDcols=chr_cols]
mDT <- DT[!duplicated(DT, by=cols), .N, by=dup_cols][N > 1L]
ans <- unique(DT[mDT[, !"N"], on=dup_cols], by=cols)
setorderv(ans, c(dup_cols, unique_cols))
return(ans)
}
它提供了所需的结果
dupKeyEx(customers, c("NAME", "DOB"), "ID")
NAME DOB ID PIN
1: B ANNAPURNA 12-12-1962 502877 500084
2: B ANNAPURNA 12-12-1962 502878 500084
3: B V RAMANA 15-01-1960 502902 500032
4: B V RAMANA 15-01-1960 502910 500033
5: DARBAR ASHOK 03-01-1950 502548 500084
6: DARBAR ASHOK 03-01-1950 502639 500032
7: KOTLA CHENNAMMA 15-07-1958 502466 500032
8: KOTLA CHENNAMMA 15-07-1958 502467 500032
9: MARIAM SUDHAKAR 24-01-1954 502047 500032
10: MARIAM SUDHAKAR 24-01-1954 502475 500032
11: VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
12: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
13: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
14: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032
但是,此功能不适用于不太适合RAM的非常大的表。因此,我想知道是否可以设计一些SQL查询来对非常大的表执行相同的工作。来自DuckDB的https://duckdb.org/是一个非常高效的开放源码分析数据库,并且与R.很好地合作,所以我们尝试将上面的表导入DuckDB数据库,如下所示:
library(DBI)
library(duckdb)
dbFile <- "...path..to..database.../experimental_database.db"
con <- dbConnect(duckdb::duckdb(), dbFile)
# import the customers table into the database (in actual scenario, the very large bigger-than-RAM table will be imported into the database from a large csv or we may run SQL query directly on chunked parquet files, as mentioned at https://duckdb.org/docs/data/parquet) - for simplicity, here we just using the customers table to try to compose functional SQL query
dbWriteTable(con, "customers", customers)
# check if table is successfully imported
dbListTables(con)
# try reading the table
dbReadTable(con, "customers")
现在,需要组合一个functional,它将提供与上面提到的dupKeEx()
函数相同的结果。如果该SQL查询运行良好,并使用大于-RAM的表提供有效的结果,则该线程的目标将被认为是完成的。
任何帮助都将不胜感激。
发布于 2021-11-05 10:34:18
下面是如何在某些列中获得具有重复值的表行的方法:https://chartio.com/learn/databases/how-to-find-duplicate-values-in-a-sql-table/
如果您有一个较大的CSV,并且您已经知道将用于搜索/分组的列,您可以拆分CSV,对块进行排序,然后合并、排序和保存为块。
https://stackoverflow.com/questions/69566587
复制相似问题