
Code: Unsupervised Visual Dynamics Simulation with Object-Centric Models


Unsupervised Visual Dynamics Simulation with Object-Centric Models

https://github.com/pairlab/SlotFormer

https://slotformer.github.io

Abstract


Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively still remains a challenge. We address this problem by introducing SlotFormer -- a Transformer-based autoregressive model operating on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions. Moreover, the unsupervised SlotFormer's dynamics model can be used to improve the performance on supervised downstream tasks, such as Visual Question Answering (VQA), and goal-conditioned planning. Compared to past works on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics, while retaining high quality visual generation. Besides, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.
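For intuition, here is a minimal sketch of what a SlotFormer-style rollout could look like: a Transformer encoder consumes a short history of per-frame object slots (assumed to come from a pretrained object-centric encoder such as Slot Attention / SAVi) and autoregressively predicts the next frame's slots. All class and parameter names below are illustrative, not the repository's actual API.

```python
import torch
import torch.nn as nn

class SlotDynamics(nn.Module):
    """Transformer encoder over a flattened (time x slots) token sequence;
    the tokens of the latest frame are read out as next-frame slot predictions."""

    def __init__(self, slot_dim=128, history_len=6, num_layers=4, num_heads=8):
        super().__init__()
        self.history_len = history_len
        # Learned temporal embedding, shared by all slots of the same frame.
        self.time_emb = nn.Parameter(torch.zeros(history_len, 1, slot_dim))
        layer = nn.TransformerEncoderLayer(d_model=slot_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(slot_dim, slot_dim)

    def forward(self, slots):                  # slots: (B, T, N, D)
        b, t, n, d = slots.shape
        x = slots + self.time_emb[:t]          # broadcasts over batch and slots
        x = self.encoder(x.reshape(b, t * n, d))
        return self.head(x[:, -n:])            # predicted next-frame slots: (B, N, D)

@torch.no_grad()
def rollout(model, burn_in_slots, num_future):
    """Autoregressive rollout: each prediction is fed back into the history."""
    history = list(burn_in_slots.unbind(dim=1))    # per-frame slots from the pretrained encoder
    preds = []
    for _ in range(num_future):
        past = torch.stack(history[-model.history_len:], dim=1)
        nxt = model(past)
        preds.append(nxt)
        history.append(nxt)
    return torch.stack(preds, dim=1)               # (B, num_future, N, D)
```

Feeding each predicted slot set back into the history is what makes the rollout autoregressive; in the paper, predicted slots are decoded back to frames with the pretrained slot decoder for video prediction, and the same unrolled slots feed the downstream VQA and planning models.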


This paper from Google points out a key shortcoming of language models:

Mind's Eye: Grounded Language Model Reasoning Through Simulation

Abstract


Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real-world -- their failure to relate language to the physical world causes knowledge to be misrepresented and obvious mistakes in their reasoning. We present Mind's Eye, a paradigm to ground language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind's MuJoCo) to simulate the possible outcomes, and then use the simulation results as part of the input, which enables language models to perform reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot, and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain similar performance to models that are 100x larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.
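Concretely, the pipeline is: render the question into a simulatable scene, run the physics engine, and prepend the simulation outcome to the LM prompt. Below is a minimal sketch using the official `mujoco` Python bindings on a toy "which ball lands first" question; the scene XML and prompt template are illustrative assumptions, not the paper's actual setup, and the LM call itself is elided.

```python
import mujoco

# Toy scene: two balls of different mass dropped from the same height.
# (Gravitational acceleration is mass-independent, so they fall together.)
SCENE_XML = """
<mujoco>
  <worldbody>
    <body name="light_ball" pos="0 0 2">
      <freejoint/>
      <geom type="sphere" size="0.1" mass="0.1"/>
    </body>
    <body name="heavy_ball" pos="1 0 2">
      <freejoint/>
      <geom type="sphere" size="0.1" mass="10.0"/>
    </body>
  </worldbody>
</mujoco>
"""

def simulate(xml, duration=0.5):
    """Step the engine for `duration` seconds; return the final ball heights."""
    model = mujoco.MjModel.from_xml_string(xml)
    data = mujoco.MjData(model)
    while data.time < duration:
        mujoco.mj_step(model, data)
    # Each free body contributes 7 qpos entries (xyz + quaternion),
    # so z of the first ball is qpos[2] and z of the second is qpos[9].
    return data.qpos[2], data.qpos[9]

question = ("If a heavy ball and a light ball are dropped from the same "
            "height, which hits the ground first?")
z_light, z_heavy = simulate(SCENE_XML)
evidence = (f"Simulation: after 0.5 s of free fall, the light ball is at "
            f"height {z_light:.2f} m and the heavy ball at {z_heavy:.2f} m.")

# Mind's Eye prepends the simulation outcome to the question, and an
# ordinary LM completes the answer from the grounded prompt.
prompt = f"{evidence}\nQuestion: {question}\nAnswer:"
print(prompt)
```

In the paper itself, the question-to-scene step is automated with a text-to-code model, and it is this injected simulation evidence that lets much smaller LMs match the reasoning accuracy of far larger ones.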

https://arxiv.org/abs/2210.05359

This article is shared from the CreateAMind WeChat public account; originally published 2023-02-26.
