LAV-ACT: Language-Augmented Visual Action Chunking with Transformers for Bimanual Robotic Manipulation

Dhurba Tripathi1*, Chunting Liu1*, Niraj Pudasaini1, Yu Hao1, Anthony Tzes1, Yi Fang1†
1 New York University Abu Dhabi

* Equal contribution

† Corresponding author

Embodied AI and Robotics (AIR) Lab, New York University Abu Dhabi, UAE

In the Proceedings of the 11th International Conference on Automation, Robotics, and Applications (ICARA 2025)

The overview of the proposed method.


Success rates for LAV-ACT.

Policy rollouts

Bimanual Pouring without Temporal Aggregation



Single Arm Pouring without Temporal Aggregation



Single Arm Pouring with Temporal Aggregation

Abstract

Bimanual robotic manipulation, involving the coordinated use of two robotic arms, is essential for tasks requiring complex, synchronous actions. Action Chunking with Transformers (ACT) is a representative framework that enables robots to break down complex tasks into manageable sequences, facilitating autonomous learning of multi-step actions. However, we observe two critical limitations in the ACT framework: it relies solely on visual observations as input, which restricts it to task-specific action predictions, and it uses a simple ResNet-based feature extractor for image processing, which is often insufficient for complex, multi-view observations of bimanual arms. In this paper, we introduce an enhanced language-driven version of ACT that leverages Voltron, a language-driven representation model, to encode both visual observations and language prompts into dense, multi-modal embeddings. These embeddings are used to condition the ResNet backbone feature maps through Feature-wise Linear Modulation (FiLM), allowing our model to integrate contextually relevant linguistic information with visual data for more adaptive action chunking. Extensive experiments show that our approach significantly improves the performance of bimanual robot arms in executing complex, multi-step tasks guided by language cues, outperforming the original ACT.



What has Changed?

We make the following technical contributions to enhance ACT, and demonstrate in both real-world and simulated environments that our method improves upon it:

1. Language-Driven Dense Vector Encoding

We bring language conditioning to bimanual robotic arm manipulation by integrating the Voltron vision-language encoder, which encodes visual observations and language prompts into dense embeddings.

2. Feature Map Conditioning with FiLM

We propose integrating Voltron to condition the ResNet backbone feature maps through Feature-wise Linear Modulation (FiLM), enabling the model to combine linguistic context with visual data for more adaptive action chunking.
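
As a rough illustration of the FiLM operation itself (not the authors' implementation), the sketch below scales and shifts each channel of a feature map using parameters projected from a conditioning vector. The function name `film`, the array sizes, and the single linear projections are assumptions made for this example; random arrays stand in for the ResNet feature map and the Voltron embedding.

```python
import numpy as np


def film(feature_map, embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM: scale and shift each channel of a feature map.

    feature_map: (C, H, W) visual features (stand-in for a ResNet feature map)
    embedding:   (D,) conditioning vector (stand-in for a Voltron embedding)
    w_gamma, w_beta: (D, C) projection weights; b_gamma, b_beta: (C,) biases
    """
    gamma = embedding @ w_gamma + b_gamma  # per-channel scale, shape (C,)
    beta = embedding @ w_beta + b_beta     # per-channel shift, shape (C,)
    # Broadcast the (C,) parameters over the spatial dimensions H and W.
    return gamma[:, None, None] * feature_map + beta[:, None, None]


# Hypothetical sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
C, H, W, D = 4, 3, 3, 8
features = rng.standard_normal((C, H, W))
embedding = rng.standard_normal(D)

# With zero projection weights, unit scale, and zero shift, FiLM is the identity.
identity_out = film(
    features, embedding,
    np.zeros((D, C)), np.ones(C),
    np.zeros((D, C)), np.zeros(C),
)

# With non-trivial weights, each channel is rescaled and shifted by the embedding.
w_gamma = rng.standard_normal((D, C))
b_gamma = np.ones(C)
w_beta = rng.standard_normal((D, C))
b_beta = np.zeros(C)
modulated = film(features, embedding, w_gamma, b_gamma, w_beta, b_beta)
```

In a trained model the projection weights would be learned jointly with the policy, so the language-conditioned embedding decides which visual channels to amplify or suppress; the identity case above is a common way to start such a layer without disturbing the pretrained backbone.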


Feature extractor pipeline


Real-world tasks tested


Data collection

We collected data from human expert demonstrations. In the first stage, operators manually control the robot to perform a specific task; the demonstration is then replayed while the cameras record.


Data Samples

All the training data will be available at this link: (coming soon)

BibTeX


          @INPROCEEDINGS{10977578,
              author={Tripathi, Dhurba and Liu, Chunting and Pudasaini, Niraj and Hao, Yu and Tzes, Anthony and Fang, Yi},
              booktitle={2025 11th International Conference on Automation, Robotics, and Applications (ICARA)}, 
              title={LAV-ACT: Language-Augmented Visual Action Chunking with Transformers for Bimanual Robotic Manipulation}, 
              year={2025},
              volume={},
              number={},
              pages={18-22},
              keywords={Visualization;Adaptation models;Robot kinematics;Modulation;Linguistics;Transformers;Manipulators;Feature extraction;Data models;Context modeling;Imitation Learning;Bimanual Manipulation;Behavior cloning},
              doi={10.1109/ICARA64554.2025.10977578}}