A Brief Introduction to Tokenizer Models

How to solve Chinese tokenization problems?

Learn how to handle Chinese text tokenization

2025-3-16

chromadb · embeddings · vector-database

Problem Description

In practical testing, when the all-MiniLM-L6-v2 model is used to process Chinese text, many Chinese characters are mapped to [UNK] (unknown token). For example, tokenizing the sentence "这是一个测试句子。" produces ['[UNK]', '[UNK]', '一', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '子', '。'], which shows that the model fails to recognize most Chinese characters and therefore tokenizes Chinese poorly.
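The issue can be reproduced directly with the Hugging Face tokenizer. The snippet below is a minimal sketch, assuming the sentence-transformers/all-MiniLM-L6-v2 checkpoint; the exact tokens may vary with the tokenizer version:

from transformers import AutoTokenizer

# Tokenizer shipped with all-MiniLM-L6-v2 (WordPiece, English-oriented vocabulary)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

text = "这是一个测试句子。"
tokens = tokenizer.tokenize(text)

print("Tokens:", tokens)  # Most characters come back as '[UNK]', as in the result above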


Problem Analysis

1. Vocabulary Limitations

all-MiniLM-L6-v2 uses a WordPiece tokenizer whose vocabulary is optimized primarily for English, with very limited coverage of Chinese. Many Chinese characters are simply not in the vocabulary and are therefore mapped to [UNK].
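This can be verified by checking the tokenizer's vocabulary directly; the following sketch assumes the same sentence-transformers/all-MiniLM-L6-v2 checkpoint as above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
vocab = tokenizer.get_vocab()  # token -> id mapping

# Characters missing from the vocabulary are the ones that become [UNK]
for ch in "这是一个测试句子。":
    print(ch, ch in vocab)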

2. Tokenization Method

  • The model's tokenizer splits Chinese text character by character, but due to vocabulary limitations, many characters cannot be recognized.
  • For example, only a few common characters (like "一", "子", "。") are recognized, while the remaining characters are marked as [UNK].

3. Model Design Purpose

all-MiniLM-L6-v2 is designed mainly for English tasks; although it can technically accept multilingual input, its support for Chinese is very limited.


Solutions

1. Use Specialized Chinese Models

bert-base-chinese

This is a BERT model pre-trained specifically on Chinese text and available on the Hugging Face Hub. Its vocabulary covers the common Chinese character set, so tokenization results are much better.

from transformers import AutoTokenizer

# Load the tokenizer of the Chinese BERT model
model_name = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "这是一个测试句子。"
tokens = tokenizer.tokenize(text)    # subword tokens only
token_ids = tokenizer.encode(text)   # maps tokens to IDs and adds special tokens ([CLS]/[SEP])

print("Tokens:", tokens)  # Example: ['这', '是', '一', '个', '测', '试', '句', '子', '。']
print("Token IDs:", token_ids)

2. Use Multilingual Models

paraphrase-multilingual-MiniLM-L12-v2

This is a multilingual model from Sentence Transformers that supports many languages, including Chinese, and tokenizes Chinese noticeably better. As the example below shows, it does not split the text strictly character by character but can produce word-level segments.

from transformers import AutoTokenizer

# Load the tokenizer of the multilingual Sentence Transformers model
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "这是一个测试句子。"
tokens = tokenizer.tokenize(text)    # subword tokens only
token_ids = tokenizer.encode(text)   # maps tokens to IDs and adds special tokens

print("Tokens:", tokens)  # Example: ['这', '是', '一个', '测试', '句子', '。']
print("Token IDs:", token_ids)
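In practice this model is usually loaded through the sentence-transformers library to produce embeddings directly; a brief usage sketch (the 384-dimension note in the comment reflects this model's output size):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Encode Chinese sentences directly; tokenization is handled internally
embeddings = model.encode(["这是一个测试句子。", "这是另一个句子。"])
print(embeddings.shape)  # (2, embedding_dim), 384-dimensional for this model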

xlm-roberta-base

This is a powerful multilingual model that supports over 100 languages, including Chinese.
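A tokenization sketch analogous to the examples above (xlm-roberta-base uses a SentencePiece tokenizer, so subwords carry a leading '▁'; the exact split may differ):

from transformers import AutoTokenizer

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "这是一个测试句子。"
tokens = tokenizer.tokenize(text)    # SentencePiece subwords
token_ids = tokenizer.encode(text)   # maps tokens to IDs and adds special tokens

print("Tokens:", tokens)  # e.g. subwords like '▁这是', '一个', ... (actual split may vary)
print("Token IDs:", token_ids)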