Sarvam AI has unveiled Sarvam-1, a 2-billion-parameter large language model designed specifically for Indian languages.
In a blog post, the startup stated that the model is optimized for 10 Indian languages, including Hindi, Bengali, Tamil, and Telugu, in addition to English.
The model seeks to address two major issues: token inefficiency and low data quality in Indic languages.
Token inefficiency refers to how many pieces (tokens) a language model must split a word into before processing it. In English, for example, the word “apple” may be processed as a single token, whereas in several Indian languages the same word can be broken into four to eight tokens. Longer token sequences make processing slower and less efficient.
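The gap is easy to observe with an off-the-shelf tokenizer. The sketch below uses the Hugging Face transformers library with the GPT-2 tokenizer purely as a stand-in for an English-centric model; Sarvam's own tokenizer is not detailed in the blog post.

```python
from transformers import AutoTokenizer

# gpt2 stands in here for a typical English-centric tokenizer; it is not
# Sarvam's tokenizer, which has not been published in detail.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

for word in ["apple", "सेब"]:  # "सेब" is Hindi for "apple"
    tokens = tokenizer.tokenize(word)
    print(f"{word} -> {len(tokens)} tokens: {tokens}")

# The English word maps to very few tokens, while the Devanagari word is
# shattered into several byte-level fragments, inflating sequence length.
```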
Sarvam AI claims the model achieves a token efficiency of 1.4 to 2.1 tokens per word, compared with 4 to 8 for existing models. The company said the LLM was trained on Sarvam-2T, a 2-trillion-token dataset curated specifically for Indian languages, which it credits for improved performance on tasks such as cross-lingual translation and question answering.
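The tokens-per-word figure (sometimes called tokenizer fertility) can be computed directly. A minimal sketch, again assuming the transformers library; the Hindi sentence is an illustrative sample, not drawn from Sarvam's evaluation data:

```python
from transformers import AutoTokenizer

def tokens_per_word(tokenizer, text: str) -> float:
    """Fertility: total tokens divided by whitespace-separated words."""
    words = text.split()
    return len(tokenizer.tokenize(text)) / len(words)

# gpt2 again stands in for an English-centric tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
sample = "मुझे सेब बहुत पसंद है"  # "I like apples very much"
print(f"{tokens_per_word(tokenizer, sample):.2f} tokens per word")
```

On Indic text, an Indic-optimized tokenizer should score near the 1.4-2.1 range Sarvam reports, while English-centric tokenizers typically land far higher.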