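I have a 5-node Hortonworks cluster (version 2.4.2) on which I have installed HAWQ 2.0.0.
The 5 nodes are: edge, master (name node), node1 (data node 1), node2 (data node 2), node3 (data node 3).
I installed HAWQ on HDP by following this link: http://hdb.docs.pivotal.io/hdb/install/install-ambari.html
The HAWQ components are installed on these nodes:
HAWQ master - node1, HAWQ standby master - node2
HAWQ segments - node1, node2, node3
During installation, the HAWQ master, HAWQ standby master, and HAWQ segments were installed successfully, but the basic HAWQ test run by the Ambari HAWQ installer failed.
Below is the log of the operations the installer executed: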
2016-06-30 00:24:22,513 - --- Check state of HAWQ cluster ---
2016-06-30 00:24:22,513 - Executing hawq status check...
2016-06-30 00:24:22,514 - Command executed: su - gpadmin -c "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null node1.localdomain \"source /usr/local/hawq/greenplum_path.sh && hawq state -d /data/hawq/master \" "
2016-06-30 00:24:23,343 - Output of command:
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:--HAWQ instance status summary
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:------------------------------------------------------
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:-- Master instance = Active
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:-- Master standby = node2.localdomain
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:-- Standby master state = Standby host passive
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:-- Total segment instance count from config file = 3
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:------------------------------------------------------
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:-- Segment Status
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:------------------------------------------------------
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:-- Total segments count from catalog = 1
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:-- Total segment valid (at master) = 0
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:-- Total segment failures (at master) = 3
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:-- Total number of postmaster.pid files missing = 0
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:-- Total number of postmaster.pid files found = 3
2016-06-30 00:24:23,344 - --- Check if HAWQ can write and query from a table ---
2016-06-30 00:24:23,344 - Dropping ambari_hawq_test table if exists
2016-06-30 00:24:23,344 - Command executed: su - gpadmin -c "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null node1.localdomain \"export PGPORT=5432 && source /usr/local/hawq/greenplum_path.sh && psql -d template1 -c \\\"DROP TABLE IF EXISTS ambari_hawq_test;\\\" \" "
2016-06-30 00:24:23,436 - Output:
DROP TABLE
2016-06-30 00:24:23,436 - Creating table ambari_hawq_test
2016-06-30 00:24:23,436 - Command executed: su - gpadmin -c "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null node1.localdomain \"export PGPORT=5432 && source /usr/local/hawq/greenplum_path.sh && psql -d template1 -c \\\"CREATE TABLE ambari_hawq_test (col1 int) DISTRIBUTED RANDOMLY;\\\" \" "
2016-06-30 00:24:23,693 - Output:
CREATE TABLE
2016-06-30 00:24:23,693 - Inserting data to table ambari_hawq_test
2016-06-30 00:24:23,693 - Command executed: su - gpadmin -c "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null node1.localdomain \"export PGPORT=5432 && source /usr/local/hawq/greenplum_path.sh && psql -d template1 -c \\\"INSERT INTO ambari_hawq_test SELECT * FROM generate_series(1,10);\\\" \" "
As shown above, the DROP TABLE and CREATE TABLE statements executed, but the INSERT did not succeed.
So I ran an insert manually on the HAWQ master node, i.e. node1.
These are the steps I ran manually:
[root@node1 ~]# su - gpadmin
[gpadmin@node1 ~]$ psql
psql (8.4.20, server 8.2.15)
WARNING: psql version 8.4, server version 8.2.
Some psql features might not work.
Type "help" for help.
gpadmin=#
gpadmin=# \c gpadmin
psql (8.4.20, server 8.2.15)
WARNING: psql version 8.4, server version 8.2.
Some psql features might not work.
You are now connected to database "gpadmin".
gpadmin=# create table test(name varchar);
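gpadmin=# insert into test values('vikash');
After a long wait, the insert above raised this error:
ERROR:  failed to acquire resource from resource manager, resource request is timed out due to no available cluster (pquery.c:804)
In addition, the HAWQ segment log on node1 shows the following: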
[root@node1 ambari-agent]# tail -f /data/hawq/segment/pg_log/hawq-2016-06-30_045853.csv
2016-06-30 05:10:24.522688 EDT,,,p248618,th-1357371264,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 192.168.122.1"
,,,,,,,0,,"network_utils.c",210,
2016-06-30 05:10:54.603726 EDT,,,p248618,th-1357371264,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 127.0.0.1",,,,
,,,0,,"network_utils.c",210,
2016-06-30 05:10:54.603769 EDT,,,p248618,th-1357371264,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 2.10.1.71",,,,
,,,0,,"network_utils.c",210,
2016-06-30 05:10:54.603778 EDT,,,p248618,th-1357371264,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 192.168.122.1"
,,,,,,,0,,"network_utils.c",210,
2016-06-30 05:11:24.625919 EDT,,,p248618,th-1357371264,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 127.0.0.1",,,,
,,,0,,"network_utils.c",210,
2016-06-30 05:11:24.626088 EDT,,,p248618,th-1357371264,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 2.10.1.71",,,,
,,,0,,"network_utils.c",210,
2016-06-30 05:11:24.626129 EDT,,,p248618,th-1357371264,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 192.168.122.1"
,,,,,,,0,,"network_utils.c",210,
I also checked gp_segment_configuration:
gpadmin=# select * from gp_segment_configuration
gpadmin-# ;
 registration_order | role | status | port  |     hostname      |  address  |            description
--------------------+------+--------+-------+-------------------+-----------+------------------------------------
                 -1 | s    | u      |  5432 | node2.localdomain | 2.10.1.72 |
                  0 | m    | u      |  5432 | node1             | node1     |
                  1 | p    | d      | 40000 | node1.localdomain | 2.10.1.71 | resource manager process was reset
(3 rows)
Note: in hawq-site.xml, the resource management type is set to "Standalone" rather than "YARN" in the dropdown.
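The mismatch between the 3 segments reported from the config file and the single segment row in the catalog can be checked mechanically. The following is only a sketch, assuming the psql output is pasted in as text; `parse_segment_rows`, `count_segments`, and `EXPECTED_SEGMENTS` are hypothetical names used for illustration, not part of HAWQ.

```python
# Sketch: count segment rows in pasted gp_segment_configuration output.
# The role/status letters are read off the output above (p = segment row,
# u = up, d = down); all function names here are illustrative.

SAMPLE = """\
 registration_order | role | status | port  |     hostname      |  address  |            description
--------------------+------+--------+-------+-------------------+-----------+------------------------------------
                 -1 | s    | u      |  5432 | node2.localdomain | 2.10.1.72 |
                  0 | m    | u      |  5432 | node1             | node1     |
                  1 | p    | d      | 40000 | node1.localdomain | 2.10.1.71 | resource manager process was reset
(3 rows)
"""

EXPECTED_SEGMENTS = 3  # "Total segment instance count from config file = 3"


def parse_segment_rows(psql_output):
    """Return one dict per data row of the psql table."""
    columns = ["registration_order", "role", "status", "port",
               "hostname", "address", "description"]
    rows = []
    for line in psql_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Skip the header, the separator line, and the "(3 rows)" footer.
        if len(parts) != len(columns) or parts[0] == "registration_order":
            continue
        rows.append(dict(zip(columns, parts)))
    return rows


def count_segments(rows):
    """Return (registered segments, segments marked up)."""
    segments = [r for r in rows if r["role"] == "p"]
    up = [r for r in segments if r["status"] == "u"]
    return len(segments), len(up)


registered, up = count_segments(parse_segment_rows(SAMPLE))
print(f"registered={registered} up={up} expected={EXPECTED_SEGMENTS}")
```

With the output shown above, this reports one registered segment, none of them up, against three expected.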
Does anyone have a clue what is wrong here? Thanks in advance!
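Answered on 2016-07-01 03:02:05
I have run into this problem before. In that environment, every segment had one IP address in common. So check whether your segment nodes share an identical IP address. HAWQ 2.0.0 treats segments that have the same IP address as a single node, which is why you have 3 segment nodes but only one of them is registered in gp_segment_configuration. Remove the duplicate IP address and try again.
This issue has been fixed in the latest HAWQ code.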
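One way to run the suggested check is to collect the addresses of each segment host and look for any address that appears on more than one host. This is a rough sketch, not HAWQ's own logic: `shared_addresses` is a hypothetical helper, only node1's address list comes from the log in the question, and the node2 and node3 lists are assumed for illustration. Skipping loopback addresses is also an assumption.

```python
import ipaddress
from collections import defaultdict


def shared_addresses(host_addrs):
    """Map each non-loopback address seen on 2+ hosts to those hosts."""
    owners = defaultdict(set)
    for host, addrs in host_addrs.items():
        for addr in addrs:
            # Assumption: loopback (127.0.0.1) is expected on every host
            # and is not what causes the collision, so it is ignored.
            if ipaddress.ip_address(addr).is_loopback:
                continue
            owners[addr].add(host)
    return {a: sorted(h) for a, h in owners.items() if len(h) > 1}


hosts = {
    # From the node1 segment log in the question.
    "node1": ["127.0.0.1", "2.10.1.71", "192.168.122.1"],
    # Assumed for illustration; collect the real lists on each host.
    "node2": ["127.0.0.1", "2.10.1.72", "192.168.122.1"],
    "node3": ["127.0.0.1", "2.10.1.73", "192.168.122.1"],
}

print(shared_addresses(hosts))
# {'192.168.122.1': ['node1', 'node2', 'node3']}
```

Any address in the result is a candidate for the duplicate that makes several segments look like one node.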
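Posted on 2016-07-01 07:27:38
Thanks for the replies.
The underlying OS is CentOS, running on vCloud. As suggested, I went through the IP configuration of all 3 data nodes that host the 3 segments. The nodes do not share the same NIC addresses (IPs). On further investigation, however, ifconfig showed that alongside "eth1" and "lo" there was another interface configured as "virbr0".
"virbr0" had the same address on every segment node, and that is what caused the problem. I removed it from all nodes, and then the insert query worked.
Below is the ifconfig output. Removing "virbr0" from all segment nodes resolved the issue.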
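eth1      Link encap:Ethernet  HWaddr 00:50:56:01:31:26
          errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:361465764 (344.7 MiB)  TX bytes:216951933 (206.9 MiB)
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:6 errors:0 dropped:0 frame:0
          TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:416 (416.0 b)  TX bytes:416 (416.0 b)
virbr0    Link encap:Ethernet  HWaddr 52:54:00:DC:EE:00
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 dropped:0 overruns:0 frame:0
          dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)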
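The address on virbr0, 192.168.122.1, is the same one the resource manager reports in the segment log quoted in the question. To avoid reading that log by hand, the reported addresses can be pulled out with a small script. This is only a sketch that matches the log message text as it appears in the question; `discovered_addresses` is a hypothetical helper name.

```python
import re

# Matches the message text seen in the segment log in the question.
PATTERN = re.compile(
    r"Resource manager discovered local host IPv4 address "
    r"(\d{1,3}(?:\.\d{1,3}){3})"
)


def discovered_addresses(log_text):
    """Return the distinct addresses reported in the log, in first-seen order."""
    seen = []
    for match in PATTERN.finditer(log_text):
        addr = match.group(1)
        if addr not in seen:
            seen.append(addr)
    return seen


SAMPLE_LOG = '''
"LOG","00000","Resource manager discovered local host IPv4 address 127.0.0.1",,,,
"LOG","00000","Resource manager discovered local host IPv4 address 2.10.1.71",,,,
"LOG","00000","Resource manager discovered local host IPv4 address 192.168.122.1"
"LOG","00000","Resource manager discovered local host IPv4 address 127.0.0.1",,,,
'''

print(discovered_addresses(SAMPLE_LOG))
# ['127.0.0.1', '2.10.1.71', '192.168.122.1']
```

Running this against the segment log on each node and comparing the lists shows which address is common to all of them.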
https://stackoverflow.com/questions/38120486