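I am experiencing an issue that has been raised before here and here, but in a different context.
Suppose I have an R script file called psock.R that contains the following code: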
cat("Setup cluster...\n")
cluster <- parallel::makePSOCKcluster(
    rep("localhost", 2),
    master = "localhost",
    port = 11234,
    manual = FALSE,
    outfile = ""
)
cat("Sleep...\n")
Sys.sleep(1)
cat("Teardown cluster...\n")
parallel::stopCluster(cluster)
When I run the script via Rscript --vanilla psock.R, everything works as expected and I see:
Setup cluster...
starting worker pid=11557 on localhost:11234 at 13:55:52.818
starting worker pid=11556 on localhost:11234 at 13:55:52.818
Sleep...
Teardown cluster...
However, when I try to do the same thing from within an external R process, parallel::makePSOCKcluster hangs. For example, suppose psock.R now contains the following code:
# Create a session.
session <- callr::r_session$new()
cat("Setup cluster...\n")
session$run(function() {
    # Create a cluster in the `.GlobalEnv`.
    cluster <<- parallel::makePSOCKcluster(
        rep("localhost", 2),
        master = "localhost",
        port = 11234,
        manual = FALSE,
        outfile = ""
    )
})
cat("Sleep...\n")
Sys.sleep(5)
cat("Teardown cluster...\n")
session$run(function() {
    # Stop it.
    parallel::stopCluster(cluster)
})
# Close session.
session$close()
Running psock.R now hangs at Setup cluster... for a few minutes and then exits with the following error:
Cluster setup failed. 1 worker of 2 failed to connect.
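Furthermore, running htop --filter /exec/R confirms that the first worker managed to connect, but the second one did not. More specifically, I see the following processes (the numbers were added by me):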
(1) └─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore --vanilla --file=psock.R
(2) └─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-readline --slave --no-save --no-restore
(3) ├─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore -e tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)() --args MASTER=localhost PORT=11234 OUT= SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE SETUPSTRATEGY=parallel
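where (1) is the psock.R script invocation, (2) is the external process created by callr::r_session$new(), and (3) is the first worker spawned by parallel::makePSOCKcluster, which connected successfully.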
I first adjusted psock.R to write the output of the external process to a file and enabled manual mode by setting manual = TRUE, i.e.:
session$run(function() {
    # Connection.
    connection <- file("/some/path/log.txt", open = "wt")
    # Write anything to a log file.
    sink(connection, append = TRUE)
    sink(connection, append = TRUE, type = "message")
    # Create a cluster in the `.GlobalEnv`.
    cluster <<- parallel::makePSOCKcluster(
        rep("localhost", 2),
        master = "localhost",
        port = 11234,
        manual = TRUE,
        outfile = ""
    )
})
Running Rscript --vanilla psock.R after the code change above logs the following command to log.txt:
Manually start worker on localhost with
'/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)()' MASTER=localhost PORT=11234 OUT='' SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE SETUPSTRATEGY=sequential
At this point no workers have been created yet, as the output of htop --filter /exec/R shows:
└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore --vanilla --file=psock.R
└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-readline --slave --no-save --no-restore
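Now I can manually run the command above, twice, to create the workers and connect them to the main process (i.e., the external process). This produces the following output in the terminal, confirming that the workers were created and connected: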
starting worker pid=13065 on localhost:11234 at 14:59:38.185
starting worker pid=13201 on localhost:11234 at 14:59:38.185
I can also verify this via htop --filter /exec/R, which now shows the following:
└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore --vanilla --file=psock.R
└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-readline --slave --no-save --no-restore
└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore -e tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)() --args MASTER=localhost PORT=11234 OUT= SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE SETUPSTRATEGY=sequential
└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore -e tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)() --args MASTER=localhost PORT=11234 OUT= SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE SETUPSTRATEGY=sequential
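At this point I am confused, because I expected something to fail. The only explanation I can think of is that the process spawned to launch the PSOCK cluster behaves differently from the same command run manually in a terminal. Could it be different permissions or missing environment variables? Is this to be expected?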
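In the command printed in manual mode, I also noticed that SETUPSTRATEGY=sequential is used. However, when manual = FALSE, the htop process listing shows that SETUPSTRATEGY=parallel is used. In fact, this is consistent with the documentation of parallel::makePSOCKcluster, which reads: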
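If "parallel" (the default) workers will be started in parallel during cluster setup when this is possible, which is now for homogeneous "PSOCK" clusters with all workers started automatically (manual = FALSE) on the local machine. Workers will be started sequentially on other clusters, on all clusters with setup_strategy = "sequential" and on R 3.6.0 and older.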
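Out of curiosity, I tried running the command printed in manual mode, but with the parallel strategy instead (i.e., SETUPSTRATEGY=parallel). After running it only once, the script continued executing and stopped at the Sys.sleep(5) part. This is also reflected in the htop process list, where I can see only one worker.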
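If I disable manual mode and add setup_strategy = "sequential" to the parallel::makePSOCKcluster function call, everything works as expected. However, I would really like to know why it fails with the parallel setup strategy. Also, it runs fine on one machine, but hangs in exactly the same way on a Debian-based system.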
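For reference, the sequential workaround mentioned above can be sketched as follows. This is a minimal example, not a fix for the underlying problem: it assumes an R version whose parallel::makePSOCKcluster accepts setup_strategy and that port 11234 is free. The worker_pids and n_workers variables are only there to illustrate that both workers connected.

```r
# Same call as in the question, with one added argument that makes the
# workers start one after another instead of in parallel.
cluster <- parallel::makePSOCKcluster(
    rep("localhost", 2),
    master = "localhost",
    port = 11234,
    manual = FALSE,
    outfile = "",
    setup_strategy = "sequential"
)

# Ask each worker for its process ID to confirm that both connected.
worker_pids <- unlist(parallel::clusterCall(cluster, Sys.getpid))
n_workers <- length(cluster)

parallel::stopCluster(cluster)
```

When run inside session$run(function() { ... }), the same call would need cluster <<- instead of cluster <- so that the cluster object survives in the external session, as in the scripts above.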
As for what I am running, this is what my R session looks like:
sessionInfo()
# R version 4.2.1 (2022-06-23)
# Platform: aarch64-apple-darwin20 (64-bit)
# Running under: macOS Monterey 12.6
#
# Matrix products: default
# BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# loaded via a namespace (and not attached):
# [1] compiler_4.2.1 cli_3.3.0 jsonlite_1.8.0 rlang_1.0.5
And, since other answers pointed to host-related issues, my /etc/hosts looks as follows:
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
Also, my .Rprofile is empty.
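Finally, I am not sure whether this is relevant, but when running netstat -i, the values in the Address column for my network interfaces are in IPv6 format, or more precisely, IPv4-mapped IPv6.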
Before I give in to setup_strategy = "sequential", do you have any idea what is going on?
Edit 1.
Following HenrikB's suggestion, I replaced the parallel::makePSOCKcluster call with parallelly::makeClusterPSOCK and enabled debug logging with options(parallelly.debug = TRUE) before the call.
session$run(function() {
    # Enable logging for `parallelly`.
    options(parallelly.debug = TRUE)
    # Connection.
    connection <- file("/some/path/log.txt", open = "wt")
    # Write anything to a log file.
    sink(connection, append = TRUE)
    sink(connection, append = TRUE, type = "message")
    # Create a cluster in the `.GlobalEnv`.
    cluster <<- parallelly::makeClusterPSOCK(
        rep("localhost", 2),
        master = "localhost",
        port = 11234,
        manual = FALSE,
        outfile = ""
    )
})
The output in log.txt contains the following:
[09:27:19.936] Set package option ‘parallelly.availableCores.methods’
[09:27:19.941] Environment variable ‘R_PARALLELLY_AVAILABLECORES_METHODS’ not set
[09:27:19.941] Set package option ‘parallelly.availableCores.fallback’
[09:27:19.941] Environment variable ‘R_PARALLELLY_AVAILABLECORES_FALLBACK’ not set
[09:27:19.941] Set package option ‘parallelly.availableCores.min’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLECORES_MIN’ not set
[09:27:19.942] Set package option ‘parallelly.availableCores.system’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLECORES_SYSTEM’ not set
[09:27:19.942] Set package option ‘parallelly.availableCores.logical’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLECORES_LOGICAL’ not set
[09:27:19.942] Set package option ‘parallelly.availableCores.omit’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLECORES_OMIT’ not set
[09:27:19.942] Set package option ‘parallelly.availableWorkers.methods’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLEWORKERS_METHODS’ not set
[09:27:19.942] Set package option ‘parallelly.fork.enable’
[09:27:19.942] Environment variable ‘R_PARALLELLY_FORK_ENABLE’ not set
[09:27:19.942] Set package option ‘parallelly.supportsMulticore.unstable’
[09:27:19.942] Environment variable ‘R_PARALLELLY_SUPPORTSMULTICORE_UNSTABLE’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.setup_strategy’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_SETUP_STRATEGY’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.validate’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_VALIDATE’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.connectTimeout’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_CONNECTTIMEOUT’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.timeout’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_TIMEOUT’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.useXDR’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_USEXDR’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.socketOptions’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_SOCKETOPTIONS’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.rshcmd’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_RSHCMD’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.rshopts’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_RSHOPTS’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.tries’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_TRIES’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.tries.delay’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_TRIES_DELAY’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.rscript_label’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_RSCRIPT_LABEL’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.sessionInfo.pkgs’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_SESSIONINFO_PKGS’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.autoKill’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_AUTOKILL’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.master.localhost.hostname’
[09:27:19.945] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_MASTER_LOCALHOST_HOSTNAME’ not set
[09:27:19.945] Set package option ‘parallelly.makeNodePSOCK.port.increment’
[09:27:19.945] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_PORT_INCREMENT’ not set
[09:27:19.945] parallelly-specific environment variables:
[09:27:19.961] [local output] Workers: [n = 2] ‘localhost’, ‘localhost’
[09:27:19.962] [local output] Base port: 11234
[09:27:19.962] [local output] Getting setup options for 2 cluster nodes ...
[09:27:19.963] [local output] - Node 1 of 2 ...
[09:27:19.963] [local output] localMachine=TRUE => revtunnel=FALSE
[09:27:19.963] Testing if worker's PID can be inferred: ‘'/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")), silent = TRUE)' -e 'file.exists("/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")'’
[09:27:19.978] - Possible to infer worker's PID: FALSE
[09:27:19.978] [local output] Rscript port: 11234
[09:27:19.978] [local output] - Node 2 of 2 ...
[09:27:19.979] [local output] localMachine=TRUE => revtunnel=FALSE
[09:27:19.979] Testing if worker's PID can be inferred: ‘'/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=2.parallelly.parent=3664.e502491c675.pid")), silent = TRUE)' -e 'file.exists("/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=2.parallelly.parent=3664.e502491c675.pid")'’
[09:27:19.995] - Possible to infer worker's PID: FALSE
[09:27:19.995] [local output] Rscript port: 11234
[09:27:19.995] [local output] Getting setup options for 2 cluster nodes ... done
[09:27:19.995] [local output] - Parallel setup requested for some PSOCK nodes
[09:27:19.996] [local output] Setting up PSOCK nodes in parallel
[09:27:19.996] List of 20
[09:27:19.996] $ local_cmd : chr "'/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,st"| __truncated__
[09:27:19.996] $ worker : chr "localhost"
[09:27:19.996] ..- attr(*, "localhost")= logi TRUE
[09:27:19.996] $ rank : int 1
[09:27:19.996] $ rshlogfile : NULL
[09:27:19.996] $ port : int 11234
[09:27:19.996] $ connectTimeout: num 120
[09:27:19.996] $ timeout : num 2592000
[09:27:19.996] $ useXDR : logi FALSE
[09:27:19.996] $ pidfile : chr "/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid"
[09:27:19.996] $ setup_strategy: chr "parallel"
[09:27:19.996] $ outfile : chr ""
[09:27:19.996] $ rshcmd_label : NULL
[09:27:19.996] $ rsh_call : NULL
[09:27:19.996] $ cmd : chr "'/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,st"| __truncated__
[09:27:19.996] $ localMachine : logi TRUE
[09:27:19.996] $ manual : logi FALSE
[09:27:19.996] $ dryrun : logi FALSE
[09:27:19.996] $ quiet : logi FALSE
[09:27:19.996] $ rshcmd : NULL
[09:27:19.996] $ revtunnel : logi FALSE
[09:27:19.996] - attr(*, "class")= chr [1:2] "makeNodePSOCKOptions" "makeNodeOptions"
[09:27:20.001] [local output] System call to launch all workers:
[09:27:20.001] [local output] '/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")), silent = TRUE)' -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11234 OUT= TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=parallel
[09:27:20.001] [local output] Starting PSOCK main server
[09:27:20.003] [local output] Workers launched
[09:27:20.003] [local output] Waiting for workers to connect back
[09:27:20.003] [local output] 0 workers out of 2 ready
[09:27:20.120] [local output] 0 workers out of 2 ready
[09:27:20.121] [local output] 1 workers out of 2 ready
[09:29:20.126] [local output] 1 workers out of 2 ready
[09:31:20.133] [local output] 1 workers out of 2 ready
Error in parallelly::makeClusterPSOCK(rep("localhost", 2), master = "localhost", :
Cluster setup failed. 1 worker of 2 failed to connect.
In addition: Warning messages:
1: In system(test_cmd, intern = TRUE, input = input) :
running command ''/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")), silent = TRUE)' -e 'file.exists("/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")'' had status 2
2: In system(test_cmd, intern = TRUE, input = input) :
running command ''/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=2.parallelly.parent=3664.e502491c675.pid")), silent = TRUE)' -e 'file.exists("/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=2.parallelly.parent=3664.e502491c675.pid")'' had status 2
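It looks like the Rscript command exits with status 2, and judging by the documentation on exit statuses, it sounds like something mysterious happened: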
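Error status 2 is used for R 'suicide', that is a catastrophic failure, and other small numbers are used by specific ports for initialization failures.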
Posted on 2022-10-07 19:24:25
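Update 2022-10-10: The underlying problem was due to a bug in R itself, which has been fixed in R-devel r83051 (2022-10-10), https://github.com/wch/r-source/commit/97b3dfb71aeff4a6acb72d400bb1fba8e6b2ed37. I am not sure, but I suspect this fix will also make it into R v4.2.2 (end of October 2022).
With a fixed version of R, the OP's example also works with processx 3.7.0, which is currently on CRAN.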
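Update 2022-10-09:
The problem is related to the processx package and to R itself, which requires a working standard input (stdin) stream. Specifically, callr (when using processx <= 3.7.0) marks all file descriptors as close-on-exec, while system2() assumes that stdin (etc.) will be inherited by the child process rather than duplicating it explicitly. This causes system() to fail, cf. https://github.com/r-lib/callr/issues/236. Because of this, the processx package (>= 3.7.0-9000) was updated to not close the standard streams (stdin, stdout, and stderr) by default. This resolves the problem reported by the OP (confirmed by the OP on 2022-10-08T21:05:46).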
Until the fixed version of the processx package is on CRAN, the developer version can be installed as follows:
remotes::install_github("r-lib/processx")
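To decide at run time whether the fixed processx is present, a version check along these lines may help. It is only a sketch: has_fixed_processx is a hypothetical helper, and the threshold (anything newer than 3.7.0, such as 3.7.0-9000) is taken from the version numbers quoted in this answer. Per the 2022-10-10 update, a fixed R also works with processx 3.7.0, so falling back to "sequential" here is a conservative heuristic rather than a requirement.

```r
# Hypothetical helper: TRUE when the given processx version is newer than
# 3.7.0 (for example the development version 3.7.0-9000).
has_fixed_processx <- function(installed) {
    as.package_version(installed) > "3.7.0"
}

# Illustration with a literal version string. With a real installation one
# would pass utils::packageVersion("processx") instead.
strategy <- if (has_fixed_processx("3.7.0")) "parallel" else "sequential"
```

The resulting strategy value could then be passed as setup_strategy = strategy to parallel::makePSOCKcluster.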
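Thanks to your updated debug output (*) from parallelly::makeClusterPSOCK(), I have narrowed this down to a bug in the current callr. I reported it upstream at https://github.com/r-lib/callr/issues/236, which also illustrates the core of the problem.
(*) The exact thing that caught my attention was: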
[09:27:19.963] Testing if worker's PID can be inferred: ‘'/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")), silent = TRUE)' -e 'file.exists("/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")'’
[09:27:19.978] - Possible to infer worker's PID: FALSE
More precisely, it fails to infer the worker's PID when called from a background callr process. If you do the same in your main R session, you get:
[09:27:19.978] - Possible to infer worker's PID: TRUE
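A similar check can be done by hand, without parallelly. The sketch below is not the exact test that parallelly runs; it only launches Rscript through system2() and looks at the exit status, which is the step that came back with status 2 in the log above. The variable names are illustrative.

```r
# Path to the Rscript binary that belongs to the running R installation.
rscript <- file.path(R.home("bin"), "Rscript")

# Launch a child R process that prints its own PID, and capture stdout.
out <- system2(rscript, c("-e", shQuote("cat(Sys.getpid())")), stdout = TRUE)

# With stdout = TRUE, a non-zero exit status is attached as an attribute.
# NULL means the child exited with status 0.
status <- attr(out, "status")
```

In a regular R session status should be NULL. Running the same lines inside session$run() on an affected setup would be expected to show a non-zero status instead, matching the warnings in the log.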
https://stackoverflow.com/questions/73962109