文章/答案/技术大牛

发布

社区首页 >问答首页 >Postgresql + psycopg:使用POSTGRESQL函数调用批量插入大数据

问Postgresql + psycopg:使用POSTGRESQL函数调用批量插入大数据
EN

Stack Overflow用户

提问于 2022-10-17 10:20:06

回答 2查看 140关注 0票数 0

我正在处理大量非常简单的数据(点云)。我想使用Python将这些数据插入Postgresql数据库中的一个简单表中。

我需要执行的insert语句的示例如下：

INSERT INTO points_postgis (id_scan, scandist, pt) VALUES (1, 32.656, **ST_MakePoint**(1.1, 2.2, 3.3));

注意到在INSERT语句.中对Postgresql函数ST_MakePoint的调用

我必须调用数十亿次(是的，数十亿次)，因此显然我必须以更优化的方式将数据插入Postgresql中。如本文所述，有许多策略可以大容量地插入数据(插入、复制等)。https://hakibenita.com/fast-load-data-python-postgresql

但是，在需要调用服务器端的函数时，没有示例说明如何执行这些插入。我的问题是:当我需要在Postgresql数据库的服务器端调用函数时，如何才能批量插入数据？

任何帮助都是非常感谢的！谢谢!

请注意，使用CSV没有多大意义，因为我的数据是巨大的。或者，我已经尝试用简单的列填充ST_MakePoint函数的3个输入的临时表，然后，在所有数据都进入这个临时函数之后，调用一个INSERT/SELECT。问题是，这需要大量的时间，我需要的磁盘空间的数量是荒谬的。

postgresql

psycopg2

回答 2

Stack Overflow用户

回答已采纳

发布于 2022-10-17 14:44:40

最重要的是，为了在合理的时间内做到这一点，并且以最小的努力，是将这个任务分解成组件部分，这样您就可以单独地利用不同的Postgres特性。

首先，您需要首先创建表，减去几何转换。例如：

create table temp_table (
    id_scan bigint, 
    scandist numeric, 
    pt_1 numeric, 
    pt_2 numeric, 
    pt_3 numeric
);

由于我们没有添加任何索引和约束，这很可能是将“原始”数据输入RDBMS的最快方法。

最好的方法是使用COPY方法，您可以直接从Postgres (如果有足够的访问权限)使用它，也可以通过专家通过Python接口使用。

下面是实现这一目标的示例代码：

iconn_string = "host={0} user={1} dbname={2} password={3} sslmode={4}".format(target_host, target_usr, target_db, target_pw, "require")
iconn = psycopg2.connect(iconn_string)
import_cursor = iconn.cursor()
csv_filename = '/path/to/my_file.csv'
copy_sql = "COPY temp_table (id_scan, scandist, pt_1, pt_2, pt_3) FROM STDIN WITH CSV HEADER DELIMITER ',' QUOTE '\"' ESCAPE '\\' NULL AS 'null'"
with open(csv_filename, mode='r', encoding='utf-8', errors='ignore') as csv_file:
    import_cursor.copy_expert(copy_sql, csv_file)
iconn.commit()

下一步将是从现有的原始数据中高效地创建所需的表。然后，您将能够使用单个SQL语句创建实际目标表，并让RDBMS发挥其神奇的作用。

一旦数据在RDBMS中，就有必要对其进行少量优化，并在适用的情况下添加一两个索引(主索引或唯一索引最好是加快转换速度)。

这将取决于您的数据/用例，但类似的内容应该会有所帮助：

alter table temp_table add primary key (id_scan); --if unique
-- or
create index idx_temp_table_1 on temp_table(id_scan); --if not unique

若要将数据从raw移动到目标表：

with temp_t as (
    select id_scan, scandist, ST_MakePoint(pt_1, pt_2, pt_3) as pt from temp_table
)
INSERT INTO points_postgis (id_scan, scandist, pt)
SELECT temp_t.id_scan, temp_t.scandist, temp_t.pt
FROM temp_t;

这将在一次中从上一个表中选择所有数据并对其进行转换。

您可以使用的第二个选项类似。您可以直接将所有原始数据加载到points_postgis，同时将其分隔为3列。然后使用alter table points_postgis add column pt geometry;并进行更新，删除临时列：update points_postgis set pt = ST_MakePoint(pt_1, pt_2, pt_3); & alter table points_postgis drop column pt_1, drop column pt_2, drop column pt_3;

主要的优点是，最有表现力的选择不是专注于最终的表状态，而是将其分解为容易实现的块。Postgres将很容易地处理数十亿行的导入，以及随后的转换。

票数 0

Stack Overflow用户

发布于 2022-10-21 22:01:29

使用一个生成UPC的函数的一些简单示例--带有检查数字的条形码：

使用批处理。execute_batch有page_size参数，允许您使用多行语句对插入进行批处理。默认情况下，这是在100设置的，它一次将插入100行。您可以通过增加这一点来减少到服务器的往返次数。
只使用execute并从另一个表中选择数据。

import psycopg2
from psycopg2.extras import execute_batch

con = psycopg2.connect(dbname='test', host='localhost', user='postgres', 
port=5432)
cur = con.cursor()

cur.execute('create table import_test(id integer, suffix_val varchar, upca_val 
varchar)')
con.commit()

# Input data as a list of tuples. Means some data is duplicated.
input_list = [(1, '12345', '12345'), (2, '45278', '45278'), (3, '61289', 
'61289')]
execute_batch(cur, 'insert into import_test values(%s, %s, 
upc_check_digit(%s))', input_list)
con.commit()

select * from import_test ;
 id | suffix_val |   upca_val   
----+------------+--------------
  1 | 12345      | 744835123458
  2 | 45278      | 744835452787
  3 | 61289      | 744835612891

# Input data as list of dicts and using named parameters to avoid duplicating data.
input_list_dict = [{'id': 50, 'suffix_val': '12345'}, {'id': 51, 'suffix_val': '45278'}, {'id': 52, 'suffix_val': '61289'}]  


execute_batch(cur, 'insert into import_test values(%(id)s, 
%(suffix_val)s, upc_check_digit(%(suffix_val)s))', input_list_dict)
con.commit()

select * from import_test ;
 id | suffix_val |   upca_val   
----+------------+--------------
  1 | 12345      | 744835123458
  2 | 45278      | 744835452787
  3 | 61289      | 744835612891
 50 | 12345      | 744835123458
 51 | 45278      | 744835452787
 52 | 61289      | 744835612891



# Create a table with values to be used for inserting into final table
cur.execute('create table input_vals (id integer, suffix_val varchar)')
con.commit()

execute_batch(cur, 'insert into input_vals values(%s, %s)', [(100, '76234'), 
(101, '92348'), (102, '16235')])
con.commit()
cur.execute('insert into import_test select id, suffix_val, 
upc_check_digit(suffix_val) from input_vals')
con.commit()

 select * from import_test ;
  id   | suffix_val |   upca_val   
-------+------------+--------------
     1 | 12345      | 744835123458
     2 | 45278      | 744835452787
     3 | 61289      | 744835612891
 12345 | 12345      | 744835123458
 45278 | 45278      | 744835452787
 61289 | 61289      | 744835612891
   100 | 76234      | 744835762343
   101 | 92348      | 744835923485
   102 | 16235      | 744835162358

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/74095783

复制

相似问题

问Postgresql + psycopg:使用POSTGRESQL函数调用批量插入大数据
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Postgresql + psycopg:使用POSTGRESQL函数调用批量插入大数据EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Postgresql + psycopg:使用POSTGRESQL函数调用批量插入大数据
EN