
LLaRA is a new robot learning framework that "supercharges" the data used for vision-language policies: it converts a pretrained vision-language model (VLM) into a robot action policy through curated instruction-tuning datasets. The framework formulates robot action policies as conversations, rewriting demonstration data as instruction-response pairs and fine-tuning the VLM on them. LLaRA demonstrates state-of-the-art (SotA) performance on robot manipulation tasks, outperforming approaches such as RT-2, and its experiments highlight the benefit of auxiliary data (e.g., spatial and temporal reasoning) for policy learning. The release includes a complete recipe for converting a VLM into a robot policy, from data curation and fine-tuning to real-robot execution, and the framework is now open-sourced.
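For illustration, here is a minimal sketch of what "formulating robot actions as conversations" could look like as instruction-tuning data. The prompt template, coordinate convention, and field names below are assumptions for the example, not LLaRA's actual format.

```python
# Hypothetical sketch: turning one recorded manipulation step into an
# instruction-response pair, in the spirit of the "policy as conversation"
# formulation. Templates and coordinates are illustrative assumptions.

def step_to_conversation(image_path, task, pick_xy, place_xy):
    """Convert a single manipulation step into a VLM instruction-tuning example."""
    instruction = (
        f"<image>\nTask: {task}\n"
        "What action should the robot take next? "
        "Answer with normalized 2D pick and place coordinates."
    )
    # The action is expressed as text so a VLM can be fine-tuned on it directly.
    response = (
        f"Pick at ({pick_xy[0]:.3f}, {pick_xy[1]:.3f}), "
        f"place at ({place_xy[0]:.3f}, {place_xy[1]:.3f})."
    )
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": instruction},
            {"from": "gpt", "value": response},
        ],
    }

example = step_to_conversation(
    "episode_0001/frame_012.png",          # hypothetical frame from a demo episode
    "put the red block on the blue plate",
    pick_xy=(0.42, 0.61),
    place_xy=(0.73, 0.35),
)
print(example)
```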
OpenVLA, a vision-language-action model for robotics, lets developers control robots with natural language and images, enabling affordable customization in multi-object, multi-task environments. https://t.co/Z9vuF7GYP0
[RO] LLaRA: Supercharging Robot Learning Data for Vision-Language Policy https://t.co/ebM3pszYAY - The paper proposes LLaRA, a framework to convert a pretrained vision language model (VLM) into a robot action policy using curated instruction tuning datasets. - LLaRA first… https://t.co/IAsCPF47n7
Introducing LLaRA ✨ A complete recipe for converting a VLM into a robot policy: from data curation and fine-tuning to real-robot execution, all open-sourced NOW! Our experiments show the benefits of auxiliary data (e.g. spatial/temporal reasoning) on learning policy. Have fun! https://t.co/PUJ9p9vrEz
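As a companion to the sketch above, here is a similarly hedged sketch of an auxiliary spatial-reasoning example of the kind the announcement mentions, generated from per-frame object detections. The question template and field names are illustrative assumptions, not the released dataset format.

```python
# Hypothetical sketch: building an auxiliary spatial-reasoning QA pair from the
# same episode data used for action prediction. Field names are assumptions.

def spatial_aux_example(image_path, objects):
    """Build a spatial-reasoning QA pair from per-frame object detections.

    `objects` maps an object name to its normalized (x, y) center in the image.
    """
    names = ", ".join(objects)
    question = (
        f"<image>\nObjects present: {names}.\n"
        "Report the normalized 2D center of each object."
    )
    answer = "; ".join(
        f"{name}: ({x:.3f}, {y:.3f})" for name, (x, y) in objects.items()
    )
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": answer},
        ],
    }

aux = spatial_aux_example(
    "episode_0001/frame_012.png",  # hypothetical frame path
    {"red block": (0.42, 0.61), "blue plate": (0.73, 0.35)},
)
print(aux)
```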
