
LLaRA is a new robot learning framework that "supercharges" the data used for vision-language policies: it converts a pretrained vision-language model (VLM) into a robot action policy through curated instruction-tuning datasets. The framework formulates robot action policies as conversations, rewriting demonstration data as instruction-response pairs and fine-tuning the VLM on them. LLaRA demonstrates state-of-the-art (SotA) performance on robot manipulation tasks, outperforming approaches such as RT-2, and its experiments highlight the benefit of auxiliary data (e.g., spatial and temporal reasoning) for policy learning. The release includes a complete recipe for converting a VLM into a robot policy, from data curation and fine-tuning to real-robot execution, and the framework is now open-sourced.
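For illustration, here is a minimal sketch of what "formulating robot actions as conversations" could look like as instruction-tuning data. The prompt template, coordinate convention, and field names below are assumptions for the example, not LLaRA's actual format.

```python
# Hypothetical sketch: turning one recorded manipulation step into an
# instruction-response pair, in the spirit of the "policy as conversation"
# formulation. Templates and coordinates are illustrative assumptions.

def step_to_conversation(image_path, task, pick_xy, place_xy):
    """Convert a single manipulation step into a VLM instruction-tuning example."""
    instruction = (
        f"<image>\nTask: {task}\n"
        "What action should the robot take next? "
        "Answer with normalized 2D pick and place coordinates."
    )
    # The action is expressed as text so a VLM can be fine-tuned on it directly.
    response = (
        f"Pick at ({pick_xy[0]:.3f}, {pick_xy[1]:.3f}), "
        f"place at ({place_xy[0]:.3f}, {place_xy[1]:.3f})."
    )
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": instruction},
            {"from": "gpt", "value": response},
        ],
    }

example = step_to_conversation(
    "episode_0001/frame_012.png",          # hypothetical frame from a demo episode
    "put the red block on the blue plate",
    pick_xy=(0.42, 0.61),
    place_xy=(0.73, 0.35),
)
print(example)
```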
OpenVLA, a vision-language-action model for robotics, lets developers control robots with natural language and images, enabling affordable customization in multi-object, multi-task environments. https://t.co/Z9vuF7GYP0
[RO] LLaRA: Supercharging Robot Learning Data for Vision-Language Policy https://t.co/ebM3pszYAY - The paper proposes LLaRA, a framework to convert a pretrained vision language model (VLM) into a robot action policy using curated instruction tuning datasets. - LLaRA first… https://t.co/IAsCPF47n7
Introducing LLaRA ✨ A complete recipe for converting a VLM into a robot policy: from data curation and fine-tuning to real-robot execution, all open-sourced NOW! Our experiments show the benefits of auxiliary data (e.g. spatial/temporal reasoning) on learning policy. Have fun! https://t.co/PUJ9p9vrEz
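As a companion to the sketch above, here is a similarly hedged sketch of an auxiliary spatial-reasoning example of the kind the announcement mentions, generated from per-frame object detections. The question template and field names are illustrative assumptions, not the released dataset format.

```python
# Hypothetical sketch: building an auxiliary spatial-reasoning QA pair from the
# same episode data used for action prediction. Field names are assumptions.

def spatial_aux_example(image_path, objects):
    """Build a spatial-reasoning QA pair from per-frame object detections.

    `objects` maps an object name to its normalized (x, y) center in the image.
    """
    names = ", ".join(objects)
    question = (
        f"<image>\nObjects present: {names}.\n"
        "Report the normalized 2D center of each object."
    )
    answer = "; ".join(
        f"{name}: ({x:.3f}, {y:.3f})" for name, (x, y) in objects.items()
    )
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": answer},
        ],
    }

aux = spatial_aux_example(
    "episode_0001/frame_012.png",  # hypothetical frame path
    {"red block": (0.42, 0.61), "blue plate": (0.73, 0.35)},
)
print(aux)
```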
