I have a 30 GB tab-delimited text file with more than 100 million rows. When I import this file into a PostgreSQL table with the \copy command, some rows cause errors. How can I skip those rows during the import and log the rows that were skipped?

I connect to my machine over SSH, so I cannot use pgAdmin!

Editing the text file before importing is very difficult, because many different rows have different problems. It would help if there were a way to check the rows one by one before importing, and then run the \copy command on the individual rows.
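Since \copy aborts the whole import on the first bad row, one workaround is to pre-screen the file before importing. Below is a minimal Python sketch of that idea (the function and file names are hypothetical, and it assumes a "bad" row is one with the wrong number of tab-separated fields): it writes clean rows to one file, which can then be imported with \copy, and logs rejected rows with their line numbers.

```python
# Split a tab-delimited file into clean rows and rejected rows, so the
# clean file can be imported with \copy and the rejects inspected later.
EXPECTED_FIELDS = 11  # number of columns in the Papers table


def split_rows(in_path, clean_path, reject_path, expected=EXPECTED_FIELDS):
    kept = rejected = 0
    with open(in_path, encoding="utf-8", errors="replace") as src, \
         open(clean_path, "w", encoding="utf-8") as clean, \
         open(reject_path, "w", encoding="utf-8") as reject:
        for lineno, line in enumerate(src, start=1):
            # strip only the trailing newline, not tabs marking empty fields
            row = line.rstrip("\n")
            if len(row.split("\t")) == expected:
                clean.write(row + "\n")
                kept += 1
            else:
                # prefix the line number so the bad row can be located later
                reject.write(f"{lineno}\t{row}\n")
                rejected += 1
    return kept, rejected
```

This only catches rows with the wrong field count; type errors inside individual fields (a bad date, a non-numeric year) would still need a separate check or the staging-table approach described in the answer below.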
Here is the code that creates the table:

```sql
CREATE TABLE Papers(
    Paper_ID CHARACTER(8) PRIMARY KEY,
    Original_paper_title TEXT,
    Normalized_paper_title TEXT,
    Paper_publish_year INTEGER,
    Paper_publish_date DATE,
    Paper_Document_Object_Identifier TEXT,
    Original_venue_name TEXT,
    Normalized_venue_name TEXT,
    Journal_ID_mapped_to_venue_name CHARACTER(8),
    Conference_ID_mapped_to_venue_name CHARACTER(8),
    Paper_rank BIGINT,
    FOREIGN KEY(Journal_ID_mapped_to_venue_name) REFERENCES Journals(Journal_ID),
    FOREIGN KEY(Conference_ID_mapped_to_venue_name) REFERENCES Conferences(Conference_ID));
```

Posted on 2016-12-29 02:22:56
Instead of loading directly into the target table, load into a single-column staging table:

```sql
create table Papers_stg (rec text);
```

Once all the data is loaded, you can validate it with SQL.
Find records with the wrong number of fields (the file is tab-delimited, so split on the tab character):

```sql
select rec
from   Papers_stg
where  cardinality(string_to_array(rec, e'\t')) <> 11;
```

Create a table with all the fields as text:
```sql
create table Papers_fields_text
as
select  fields[1]  as Paper_ID
       ,fields[2]  as Original_paper_title
       ,fields[3]  as Normalized_paper_title
       ,fields[4]  as Paper_publish_year
       ,fields[5]  as Paper_publish_date
       ,fields[6]  as Paper_Document_Object_Identifier
       ,fields[7]  as Original_venue_name
       ,fields[8]  as Normalized_venue_name
       ,fields[9]  as Journal_ID_mapped_to_venue_name
       ,fields[10] as Conference_ID_mapped_to_venue_name
       ,fields[11] as Paper_rank
from   (select string_to_array(rec, e'\t') as fields
        from   Papers_stg
       ) t
where  cardinality(fields) = 11;
```

For the field conversion checks, you may want to use the concepts described here:
https://stackoverflow.com/questions/41332493
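Those per-field conversion checks can also be done outside the database before importing. Below is a minimal Python sketch (the function names are hypothetical; the checks are derived from the column types in the CREATE TABLE statement above) that reports which fields of an 11-field row would fail a cast:

```python
from datetime import datetime


def is_int(s):
    """True if s can be loaded into an INTEGER/BIGINT column."""
    try:
        int(s)
        return True
    except ValueError:
        return False


def is_date(s, fmt="%Y-%m-%d"):
    """True if s can be loaded into a DATE column (ISO format assumed)."""
    try:
        datetime.strptime(s, fmt)
        return True
    except ValueError:
        return False


def check_row(fields):
    """Return a list of (column position, reason) problems for one row."""
    problems = []
    if len(fields[0]) > 8:                       # Paper_ID CHARACTER(8)
        problems.append((1, "Paper_ID longer than 8 characters"))
    if fields[3] and not is_int(fields[3]):      # Paper_publish_year INTEGER
        problems.append((4, "Paper_publish_year is not an integer"))
    if fields[4] and not is_date(fields[4]):     # Paper_publish_date DATE
        problems.append((5, "Paper_publish_date is not a valid date"))
    if fields[10] and not is_int(fields[10]):    # Paper_rank BIGINT
        problems.append((11, "Paper_rank is not an integer"))
    return problems
```

Empty fields are treated as NULLs and skipped; only the non-text columns need a cast check, since TEXT columns accept anything.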