SAM2Act:

Integrating Visual Foundation Model with
A Memory Architecture for Robotic Manipulation

1University of Washington 2Universidad Católica San Pablo
3NVIDIA 4Allen Institute for Artificial Intelligence
Equal Advising


Abstract

Robotic manipulation systems operating in diverse, dynamic environments must exhibit three critical abilities: multitask interaction, generalization to unseen scenarios, and spatial memory. While significant progress has been made in robotic manipulation, existing approaches often fall short in generalizing to complex environmental variations and in addressing memory-dependent tasks. To bridge this gap, we introduce SAM2Act, a multi-view robotic transformer-based policy that leverages multi-resolution upsampling with visual representations from a large-scale foundation model. SAM2Act achieves a state-of-the-art average success rate of 86.8% across 18 tasks in the RLBench benchmark, and demonstrates robust generalization on The Colosseum benchmark, with only a 4.3% performance gap under diverse environmental perturbations. Building on this foundation, we propose SAM2Act+, a memory-based architecture inspired by SAM2, which incorporates a memory bank, an encoder, and an attention mechanism to enhance spatial memory. To address the need for evaluating memory-dependent tasks, we introduce MemoryBench, a novel benchmark designed to assess spatial memory and action recall in robotic manipulation. SAM2Act+ achieves competitive performance on MemoryBench, significantly outperforming existing approaches and pushing the boundaries of memory-based robotic systems.

Real World Results



Memory Tasks

Summary


We introduce SAM2Act, a multi-view robotic transformer-based policy that enhances feature representation by integrating multi-resolution upsampling with visual embeddings from a large-scale foundation model. Built on the RVT-2 multi-view transformer, SAM2Act achieves strong multitask success and generalization. Building on this foundation, we introduce SAM2Act+, which incorporates a memory-based architecture inspired by SAM2. Using a memory bank, an encoder, and an attention mechanism, SAM2Act+ enables episodic recall to solve spatial memory-dependent manipulation tasks.

Overview of SAM2Act and SAM2Act+


Our method, SAM2Act, enables precise 3D manipulation with strong generalization across environmental and object-level variations. Building upon the RVT-2 framework, SAM2Act introduces key architectural innovations that enhance visual feature representation and task-specific reasoning. The architecture reconstructs a point cloud of the scene, renders it from virtual cameras at orthogonal views, and employs a two-stage multi-view transformer (coarse-to-fine) to predict action heatmaps. The coarse branch generates zoom-in heatmaps to localize regions of interest, while the fine branch refines these into precise action heatmaps.
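To make the coarse-to-fine data flow concrete, below is a minimal, self-contained PyTorch sketch of the virtual-view idea. This is not the released SAM2Act code: the orthographic splatting, resolution, zoom factor, and helper names (render_orthographic, heatmap_argmax_to_3d) are illustrative assumptions, and the language-conditioned multi-view transformer that actually produces the heatmaps is omitted, with rendered occupancy maps standing in for its output.

import torch

def render_orthographic(points, center, half_extent, res=64):
    """Splat an XYZ point cloud onto three axis-aligned virtual image planes.

    points:      (N, 3) scene points reconstructed from the RGB-D cameras
    center:      (3,) center of the cubic region to render
    half_extent: half-width of that region (controls the zoom level)
    Returns a (3, res, res) occupancy image, one per orthogonal view.
    """
    local = (points - center) / half_extent                # map crop to [-1, 1]^3
    local = local[(local.abs() <= 1.0).all(dim=-1)]        # keep points inside the crop
    views = torch.zeros(3, res, res)
    for v, (a, b) in enumerate([(1, 2), (0, 2), (0, 1)]):  # drop x, y, z in turn
        u = ((local[:, a] + 1) / 2 * (res - 1)).long()
        w = ((local[:, b] + 1) / 2 * (res - 1)).long()
        views[v, u, w] = 1.0
    return views

def heatmap_argmax_to_3d(heatmaps, center, half_extent):
    """Back-project per-view heatmap peaks into a single 3D point (toy version)."""
    res = heatmaps.shape[-1]
    peaks = []
    for v in range(3):
        idx = heatmaps[v].flatten().argmax()
        uv = torch.stack([idx // res, idx % res]).float()
        peaks.append(uv / (res - 1) * 2 - 1)               # pixel coords back to [-1, 1]
    x = (peaks[1][0] + peaks[2][0]) / 2                    # each axis is seen by two views
    y = (peaks[0][0] + peaks[2][1]) / 2
    z = (peaks[0][1] + peaks[1][1]) / 2
    return center + half_extent * torch.stack([x, y, z])

# Coarse pass over the whole workspace, then a fine pass on a zoomed-in crop
# around the coarse estimate; occupancy maps stand in for predicted heatmaps.
points = torch.rand(2048, 3) * 2 - 1
coarse_views = render_orthographic(points, torch.zeros(3), 1.0)
coarse_center = heatmap_argmax_to_3d(coarse_views, torch.zeros(3), 1.0)
fine_views = render_orthographic(points, coarse_center, 0.25)
fine_point = heatmap_argmax_to_3d(fine_views, coarse_center, 0.25)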

SAM2Act leverages the pre-trained SAM2 encoder to extract multi-resolution image embeddings, which are further refined through the multi-resolution upsampling technique to predict accurate translation heatmaps with minimal information loss. To address tasks requiring spatial memory, SAM2Act+ extends the SAM2Act architecture by incorporating memory-based components: a Memory Bank, a Memory Encoder, and Memory Attention, enabling the model to encode historical actions and condition current observations on them. This memory-based policy enhances the agent's ability to predict actions based on past contextual information, significantly improving performance in tasks that require sequential decision-making.
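As a rough illustration of the multi-resolution upsampling described above, the sketch below upsamples coarse transformer features in stages and fuses each stage with a higher-resolution image embedding, such as the multi-scale features exposed by the SAM2 image encoder, before predicting heatmap logits. Channel widths, the number of levels, and the fusion operator are assumptions for illustration, not the released architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResUpsampler(nn.Module):
    """Illustrative sketch (not the released SAM2Act decoder): coarse features
    are upsampled in stages and fused with progressively higher-resolution
    encoder embeddings before a 1x1 conv predicts heatmap logits."""

    def __init__(self, feat_dim=256, skip_dims=(64, 32)):
        super().__init__()
        dims = [feat_dim] + [feat_dim // 2 ** (i + 1) for i in range(len(skip_dims))]
        self.up = nn.ModuleList([
            nn.ConvTranspose2d(dims[i], dims[i + 1], kernel_size=2, stride=2)
            for i in range(len(skip_dims))])
        self.fuse = nn.ModuleList([
            nn.Conv2d(dims[i + 1] + skip_dims[i], dims[i + 1], kernel_size=3, padding=1)
            for i in range(len(skip_dims))])
        self.head = nn.Conv2d(dims[-1], 1, kernel_size=1)   # heatmap logits

    def forward(self, feat, skips):
        # feat:  (B, feat_dim, H/4, W/4) coarse multi-view transformer output
        # skips: high-res embeddings, e.g. [(B, 64, H/2, W/2), (B, 32, H, W)]
        for up, fuse, skip in zip(self.up, self.fuse, skips):
            feat = F.gelu(up(feat))                          # 2x spatial upsampling
            feat = fuse(torch.cat([feat, skip], dim=1))      # inject fine detail
        return self.head(feat)                               # (B, 1, H, W)

# Toy shapes for a 128x128 virtual view with a 4x-downsampled feature map.
feat = torch.randn(2, 256, 32, 32)
skips = [torch.randn(2, 64, 64, 64), torch.randn(2, 32, 128, 128)]
heatmap_logits = MultiResUpsampler()(feat, skips)            # -> (2, 1, 128, 128)

In the same spirit, here is a minimal sketch (reusing the imports above) of the SAM2-inspired memory components: a memory encoder summarizes each processed observation, a FIFO memory bank keeps the most recent summaries, and memory attention cross-attends the current features to that bank before action prediction. Dimensions, bank length, and module choices are hypothetical.

class SAM2StyleMemory(nn.Module):
    """Hypothetical sketch of a memory encoder + FIFO memory bank + memory
    attention, in the spirit of SAM2; not the SAM2Act+ implementation."""

    def __init__(self, dim=256, bank_size=4):
        super().__init__()
        self.bank_size = bank_size
        self.bank = []                                       # FIFO memory bank
        self.memory_encoder = nn.Linear(dim, dim)            # stand-in memory encoder
        self.memory_attention = nn.MultiheadAttention(dim, num_heads=8,
                                                      batch_first=True)

    def forward(self, cur_tokens):
        # cur_tokens: (B, N, dim) flattened features of the current observation
        if self.bank:
            memory = torch.cat(self.bank, dim=1)             # (B, K*N, dim)
            attended, _ = self.memory_attention(cur_tokens, memory, memory)
            cur_tokens = cur_tokens + attended               # condition on history
        self.bank.append(self.memory_encoder(cur_tokens).detach())
        self.bank = self.bank[-self.bank_size:]              # keep the last K steps
        return cur_tokens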

MemoryBench


Unlike standard RLBench tasks, many of which involve long-horizon scenarios, our tasks are specifically designed to require spatial memory. Without such memory, the agent would be forced to rely on random actions. To create these tasks, we intentionally violate the Markov assumption, which states that in a Markov Decision Process (MDP), the next observation depends solely on the current observation and action:

$$ P\bigl(o_{t+1} \mid o_1, a_1, \dots, o_t, a_t\bigr) \;=\; P\bigl(o_{t+1} \mid o_t, a_t\bigr). $$

This assumption implies that knowing only \( o_t \) and \( a_t \) is sufficient to predict \( o_{t+1} \). However, in our tasks, we design scenarios where two distinct action histories lead to the same observation \( o_t \), but require different subsequent actions. This forces the agent to recall which action history led to \( o_t \) in order to perform the correct next action. Furthermore, we standardized the language instructions to prevent unintentional leakage of spatial information that could aid the model in memory-based tasks. These principles guided the development of our spatial memory-based tasks.
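As a concrete toy example of this principle, the two episodes below end in exactly the same observation yet demand different next actions; the observation strings and drawer names are illustrative, not MemoryBench's actual state encoding.

# Two toy episodes that violate the Markov assumption: the final observation
# is identical, but the correct next action depends on the earlier history.
episode_a = [
    {"obs": "all drawers closed", "action": "open top drawer"},
    {"obs": "top drawer open",    "action": "close top drawer"},
    {"obs": "all drawers closed", "action": "reopen top drawer"},     # needs memory
]
episode_b = [
    {"obs": "all drawers closed", "action": "open bottom drawer"},
    {"obs": "bottom drawer open", "action": "close bottom drawer"},
    {"obs": "all drawers closed", "action": "reopen bottom drawer"},  # needs memory
]

# A policy conditioned only on the final observation must output the same
# action in both episodes, so it can be correct in at most one of them.
assert episode_a[-1]["obs"] == episode_b[-1]["obs"]
assert episode_a[-1]["action"] != episode_b[-1]["action"]

Any agent without access to the earlier steps therefore performs at chance on such tasks, which is exactly the failure mode MemoryBench is built to expose.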

Experiments and Results

RLBench 18 Tasks


Overall, SAM2Act achieves an average success rate of 86.8% ± 0.5, surpassing the previous best (RVT-2) by 5.4%. A closer look at individual tasks reveals that SAM2Act ranks first in 9 out of 18 tasks and remains highly competitive in 7 others, coming within one successful attempt or 4% of the best performance. These tasks include Close Jar, Drag Stick, Meat Off Grill, Place Wine, Screw Bulb, Sweep to Dustpan, and Turn Tap. The largest margins of improvement occur in Insert Peg, where SAM2Act exceeds RVT-2 by 44% (approximately 2.1×), and in Sort Shape, where it outperforms RVT-2 by 29%. Both tasks require precise manipulation, underscoring the effectiveness of SAM2Act's multi-resolution upsampling strategy. These results establish SAM2Act as a leading policy for complex 3D tasks, highlighting its ability to handle high-precision manipulation, an area where prior methods have struggled.

The Colosseum


The results in the figure above were obtained by training and testing models within the same environment. However, to truly assess generalization performance, policies must remain robust against both environmental and object-level perturbations. We therefore trained SAM2Act and the baseline methods on 20 tasks from The Colosseum benchmark and tested them under 13 different perturbation categories over three runs. SAM2Act exhibits the smallest performance drop compared to the baselines, with an average decrease of 4.3% (standard deviation of 3.59%). Notably, it proves particularly robust to environmental perturbations – such as changes in lighting, table color/texture, the addition of distractors, and even camera pose – while also maintaining competitive performance under object-level perturbations.

MemoryBench


In the figure above, we evaluate SAM2Act+ against the state-of-the-art 3D behavior-cloning model, RVT-2, on MemoryBench, training all models in a single-task setting to isolate memory-related challenges (e.g., opening the wrong drawer rather than unrelated mid-task failures). This setup ensures that performance differences stem from memory capabilities. For a random agent, the expected success rates are determined by the number of possible choices per task: 33% for reopen_drawer (three drawers), 25% for put_block_back (four patches), and 50% for rearrange_block (two blocks). However, variations in task complexity, fixed training data, and imbalanced task distributions lead to slight deviations from these baselines. Our proposed memory-based model, SAM2Act+, demonstrates a strong understanding of spatial memory, achieving an average success rate of 94.3% across all tasks. It outperforms SAM2Act (without memory) by a large margin of 39.3% on MemoryBench, highlighting the significant impact of explicit memory modeling. Note that we made an update to the open_drawer task; see our paper's appendix for details.
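For reference, the chance-level rates quoted above follow directly from the number of interchangeable options in each task; a quick sketch:

# Expected success rate of a uniformly random agent: one correct option out of
# n equivalent choices per MemoryBench task.
choices = {"reopen_drawer": 3, "put_block_back": 4, "rearrange_block": 2}
chance = {task: 100.0 / n for task, n in choices.items()}
print(chance)  # {'reopen_drawer': 33.33..., 'put_block_back': 25.0, 'rearrange_block': 50.0}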

Real-robot


The table above presents our real-world experiment results, where our method achieves a 75% task success rate, compared to 43% for RVT-2. SAM2Act significantly outperforms the baseline on high-precision tasks (60% vs. 0%). SAM2Act+ (indicated with *) excels in memory-based tasks, such as (d) Push the same button, which requires recalling the button's previous location. Here, SAM2Act+ achieves 70% success, while RVT-2, relying on random guessing, scores 40%. We also test the models' generalization against perturbations such as lighting changes, distractors, and position variations.

More Video Results ⬇️

Results on RLBench 18 Tasks

Results on The Colosseum



Results on MemoryBench



In-distribution Real-world Results


Out-distribution Real-world Results


BibTeX

@misc{fang2025sam2act,
      title={SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation}, 
      author={Haoquan Fang and Markus Grotz and Wilbert Pumacay and Yi Ru Wang and Dieter Fox and Ranjay Krishna and Jiafei Duan},
      year={2025},
      eprint={2501.18564},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2501.18564}, 
}