
Researchers have introduced CT-LLM, a 2B-parameter Chinese-centric large language model. The model was trained on a corpus of 1.2 trillion tokens, of which 800 billion are Chinese, marking a shift toward prioritizing the Chinese language in the development of large language models (LLMs). The team has also committed to open-sourcing the full training process, including a detailed data-processing procedure, to facilitate further research and development in the area.
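To give a sense of what a data-processing step in such a pipeline can look like, here is a minimal sketch of a heuristic that keeps documents that are predominantly Chinese. The character-ratio heuristic and the 0.3 threshold are illustrative assumptions, not the procedure published by the CT-LLM team.

import re

# Basic CJK Unified Ideographs block; an approximation of "Chinese characters".
CJK = re.compile(r"[\u4e00-\u9fff]")

def chinese_ratio(text: str) -> float:
    """Fraction of characters in the document that are Chinese ideographs."""
    if not text:
        return 0.0
    return len(CJK.findall(text)) / len(text)

def keep_document(text: str, threshold: float = 0.3) -> bool:
    # Illustrative filter: keep documents that are mostly Chinese.
    # The threshold is an assumption, not a value from the CT-LLM paper.
    return chinese_ratio(text) >= threshold

print(keep_document("这是一个中文句子。"))    # True
print(keep_document("mostly English text"))  # False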



MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. The burgeoning interest in developing Large Language Models (LLMs) with up to a trillion parameters has been met with concerns regarding resource efficiency and practical expense. https://t.co/Ci3iUn0pfE
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders. Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks. https://t.co/f9seabeGQq
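The general idea of repurposing a decoder-only LM as a text encoder can be sketched with HuggingFace transformers by mean-pooling the model's final hidden states over non-padding tokens. The sketch below uses GPT-2 as a stand-in model and simple mean pooling; this is a generic baseline, not the specific LLM2Vec recipe, which additionally enables bidirectional attention and applies further training.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM exposing hidden states works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    # Mean-pool over real (non-padding) tokens only.
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vecs = embed(["大规模语言模型", "large language models"])
print(vecs.shape)  # torch.Size([2, 768]) for GPT-2

In practice, cosine similarity between such pooled vectors gives a rough measure of semantic relatedness, which is the embedding use case the paper targets.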
The Case of Homegrown Large Language Models. Recent developments in building large language models (LLMs) to boost generative AI in local languages have caught everyone's attention. https://t.co/UJKos4sFzd