首页
学习
活动
专区
圈层
工具
发布
首页
学习
活动
专区
圈层
工具
MCP广场
社区首页 >问答首页 >如何停止本地登录,但只保存到wandb的服务器并使用软链接进行wandb工作?

如何停止本地登录,但只保存到wandb的服务器并使用软链接进行wandb工作?
EN

Stack Overflow用户
提问于 2022-10-23 23:04:20
回答 2查看 219关注 0票数 1

我遇到了一个奇怪的问题,我将所有代码和数据的位置更改为具有更多磁盘空间的不同位置,然后将我的项目和数据软链接到具有更多空间的位置。我想一定有一些文件处理问题,因为wandb的记录器正在向我抛出问题。所以我的问题是:

  1. 我如何让wandb只在网上登录,而不是在本地登录?(例如,停止尝试将任何内容记录到./wandb或它可能登录到的任何秘密位置,因为这会造成问题)。注意,我的代码在我停止登录到wandb之后运行良好,所以我认为这就是问题所在。注意,dir=None是wandb的param的默认值。
  2. 我如何完全解决这个问题,使它似乎与我的所有项目软链接到其他地方?

关于错误的更多细节

代码语言:javascript
复制
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1087, in emit
    self.flush()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1067, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle
Call stack:
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/vendor/watchdog/observers/api.py", line 199, in run
    self.dispatch_events(self.event_queue, self.timeout)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/vendor/watchdog/observers/api.py", line 368, in dispatch_events
    handler.dispatch(event)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/vendor/watchdog/events.py", line 454, in dispatch
    _method_map[event_type](event)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/filesync/dir_watcher.py", line 275, in _on_file_created
    logger.info("file/dir created: %s", event.src_path)
Message: 'file/dir created: %s'
Arguments: ('/shared/rsaas/miranda9/diversity-for-predictive-success-of-meta-learning/wandb/run-20221023_170722-1tfzh49r/files/output.log',)
--- Logging error ---
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1087, in emit
    self.flush()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1067, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle
Call stack:
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/internal_util.py", line 50, in run
    self._run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/internal_util.py", line 101, in _run
    self._process(record)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/internal.py", line 263, in _process
    self._hm.handle(record)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/handler.py", line 130, in handle
    handler(record)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/handler.py", line 138, in handle_request
    logger.debug(f"handle_request: {request_type}")
Message: 'handle_request: stop_status'
Arguments: ()
N/A% (0 of 100000) |      | Elapsed Time: 0:00:00 | ETA:  --:--:-- |   0.0 s/it

Traceback (most recent call last):
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1814, in <module>
    main()
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1747, in main
    train(args=args)
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1794, in train
    meta_train_iterations_ala_l2l(args, args.agent, args.opt, args.scheduler)
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/training/meta_training.py", line 167, in meta_train_iterations_ala_l2l
    log_zeroth_step(args, meta_learner)
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/meta_learning.py", line 92, in log_zeroth_step
    log_train_val_stats(args, args.it, step_name, train_loss, train_acc, training=True)
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/supervised_learning.py", line 55, in log_train_val_stats
    _log_train_val_stats(args=args,
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/supervised_learning.py", line 116, in _log_train_val_stats
    args.logger.log('\n')
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logger.py", line 89, in log
    print(msg, flush=flush)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/redirect.py", line 640, in write
    self._old_write(data)
OSError: [Errno 116] Stale file handle
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: Synced vit_mi Adam_rfs_cifarfs Adam_cosine_scheduler_rfs_cifarfs 0.001: args.jobid=101161: https://wandb.ai/brando/entire-diversity-spectrum/runs/1tfzh49r
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20221023_170722-1tfzh49r/logs
--- Logging error ---
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router_sock.py", line 27, in _read_message
    resp = self._sock_client.read_server_response(timeout=1)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 283, in read_server_response
    data = self._read_packet_bytes(timeout=timeout)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 269, in _read_packet_bytes
    raise SockClientClosedError()
wandb.sdk.lib.sock_client.SockClientClosedError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router.py", line 70, in message_loop
    msg = self._read_message()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router_sock.py", line 29, in _read_message
    raise MessageRouterClosedError
wandb.sdk.interface.router.MessageRouterClosedError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1087, in emit
    self.flush()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1067, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle
Call stack:
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router.py", line 77, in message_loop
    logger.warning("message_loop has been closed")
Message: 'message_loop has been closed'
Arguments: ()
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmpmvf78q6owandb'>
  _warnings.warn(warn_message, ResourceWarning)
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmpt5etqpw_wandb-artifacts'>
  _warnings.warn(warn_message, ResourceWarning)
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmp55lzwviywandb-media'>
  _warnings.warn(warn_message, ResourceWarning)
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmprmk7lnx4wandb-media'>
  _warnings.warn(warn_message, ResourceWarning)

错误:

代码语言:javascript
复制
====> about to start train loop
Starting training!
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)'))': /api/5288891/envelope/
--- Logging error ---
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1086, in emit
    stream.write(msg + self.terminator)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/redirect.py", line 640, in write
    self._old_write(data)
OSError: [Errno 116] Stale file handle
Call stack:
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/worker.py", line 128, in _target
    callback()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/transport.py", line 467, in send_envelope_wrapper
    self._send_envelope(envelope)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/transport.py", line 384, in _send_envelope
    self._send_request(
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/transport.py", line 230, in _send_request
    response = self._pool.request(
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/request.py", line 78, in request
    return self.request_encode_body(
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/request.py", line 170, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/poolmanager.py", line 375, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/connectionpool.py", line 780, in urlopen
    log.warning(
Message: "Retrying (%r) after connection broken by '%r': %s"
Arguments: (Retry(total=2, connect=None, read=None, redirect=None, status=None), SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)')), '/api/5288891/envelope/')

赏金

我对如何解决这一问题的建议如下:

  1. 找出一种方法来在本地停止wandb日志记录,或者最小化日志wandb在本地日志记录的数量。
  2. 找出确切的日志记录,并尽量减少空间。
  3. 让日志工作,即使所有的文件夹都是符号链接。(国际水文学组织,这应该是开箱即用的)
  4. 找出一种系统而简单的方法来查找陈旧的文件句柄来自何处。

我感到惊讶的是,将everything迁移到/shared/rsaas/miranda9/,并从那里运行实验并没有解决这个问题。

十字:

EN

回答 2

Stack Overflow用户

发布于 2022-10-24 00:03:20

解决方案似乎是不使用符号链接登录到奇怪的地方,而是登录到真正的路径,而不是经常清理wandb本地路径,以避免HPC中的磁盘配额错误。不是我的fav解决方案,而是完成它:)。

万达应该解决这个问题,Wandb的全部意义是,它是开箱即用的,我不需要做MLOps,我可以专注于研究。

最好在这里看到讨论:https://github.com/wandb/wandb/issues/4409

票数 2
EN

Stack Overflow用户

发布于 2022-11-02 03:33:58

我如何让wandb只在网上登录,而不是在本地登录?(例如,停止尝试将任何内容记录到./wandb或它可能登录到的任何秘密位置,因为这会造成问题)

您可以尝试使用由wandb offline提供的wandb文档关闭日志:

命令wandb offline设置一个环境变量WANDB_MODE=offline。这将停止从计算机同步到远程wandb服务器的任何数据。如果您有多个项目,它们都将停止将记录的数据同步到W&B服务器。

您还可以查看关于"wandb同步在运行wandb本地时不登录“的讨论,其中一些人设法发现它与--network host标志有关。

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/74175401

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档