🏷️:FAST: Efficient Action Tokenization for Vision-Language-Action Models 🔗:https://t.co/Lhtm2WU4Qn https://t.co/cZVZ849j1m
🏷️:Learnings from Scaling Visual Tokenizers for Reconstruction and Generation 🔗:https://t.co/PVoO2xj5dp https://t.co/pvCdjL33qO
Recent developments in action tokenization for robotic models highlight new approaches to more efficient training. Physical Intelligence has introduced FAST (Frequency-space Action Sequence Tokenization), a compressed action representation inspired by JPEG-style compression that enables efficient autoregressive training on high-frequency, dexterous tasks, where traditional per-timestep tokenization suffers from redundancy and inefficiency. FAST reportedly speeds up training of Vision-Language-Action models by five times compared with diffusion-based methods while maintaining precision. Separately, research from Meta on scaling visual tokenizers examines how scaling the autoencoder bottleneck affects reconstruction and generation performance, suggesting that simply scaling the encoder does not guarantee improved outcomes.
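The core idea behind a frequency-space, JPEG-inspired action tokenizer can be sketched in a few lines: transform an action chunk with a discrete cosine transform (DCT) so energy concentrates in low frequencies, then quantize the coefficients to integers. This is only a minimal illustration of the general technique, not Physical Intelligence's implementation; the function names and the `scale` parameter are assumptions for this sketch, and the BPE step that FAST applies on top is omitted.

```python
import numpy as np
from scipy.fft import dct, idct

def dct_tokenize(actions, scale=100.0):
    """Compress a chunk of continuous actions, JPEG-style (illustrative sketch).

    actions: (T, D) array — T timesteps of D-dimensional actions.
    Returns integer coefficient tokens per dimension (a real tokenizer
    like FAST would further compress these, e.g. with BPE).
    """
    # DCT along the time axis concentrates energy in low frequencies,
    # removing the redundancy of high-frequency action sequences.
    coeffs = dct(actions, axis=0, norm='ortho')
    # Quantize: round scaled coefficients to integers (the lossy step).
    return np.round(coeffs * scale).astype(np.int64)

def dct_detokenize(tokens, scale=100.0):
    """Invert the quantization and the DCT to recover actions."""
    return idct(tokens.astype(float) / scale, axis=0, norm='ortho')

# Round trip: reconstruction error is bounded by the quantization step.
rng = np.random.default_rng(0)
actions = rng.normal(size=(8, 3))          # 8 timesteps, 3 action dims
tokens = dct_tokenize(actions)
recon = dct_detokenize(tokens)
```

A coarser `scale` yields shorter, more compressible token sequences at the cost of reconstruction precision, which is the same trade-off JPEG makes with its quantization tables.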