A Brief Introduction to Tokenizer Models

Learn how to handle Chinese text tokenization
2025-3-16
Problem Description
In practical testing, when the all-MiniLM-L6-v2 model was used to process Chinese text, many Chinese characters were mapped to [UNK] (the unknown token). For example, for the input sentence "这是一个测试句子。", the tokenization result was ['[UNK]', '[UNK]', '一', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '子', '。'], indicating that the tokenizer cannot recognize most Chinese characters and therefore handles Chinese text poorly.
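The behavior is easy to reproduce with the Hugging Face tokenizer. The snippet below is a minimal sketch, assuming transformers is installed and the tokenizer is loaded from the sentence-transformers/all-MiniLM-L6-v2 repository.
from transformers import AutoTokenizer

# Load the tokenizer that ships with all-MiniLM-L6-v2 (a BERT/WordPiece-style tokenizer)
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "这是一个测试句子。"
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)  # most Chinese characters come back as '[UNK]'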
Problem Analysis
1. Vocabulary Limitations
all-MiniLM-L6-v2 uses a WordPiece tokenizer whose vocabulary is optimized primarily for English and offers only very limited Chinese coverage. Many Chinese characters are simply not in the vocabulary and are therefore mapped to [UNK].
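You can confirm this by inspecting the vocabulary directly. The check below is a sketch, assuming the same all-MiniLM-L6-v2 tokenizer as above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
vocab = tokenizer.get_vocab()  # maps token string -> token id

for ch in "这是一个测试句子。":
    # Characters missing from the vocabulary are the ones that become [UNK]
    print(ch, "in vocab:", ch in vocab)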
2. Tokenization Method
- The model's tokenizer splits Chinese text character by character, but because of the limited vocabulary, many of those characters cannot be looked up (see the sketch after this list).
- For example, only a few common characters (such as "一", "子", "。") are recognized, while the rest are marked as [UNK].
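The character-by-character behavior comes from BERT's basic tokenizer, which isolates each CJK character before the WordPiece lookup. A rough illustration, assuming the slow (Python) BertTokenizer is loaded so that its internal basic_tokenizer is accessible:
from transformers import AutoTokenizer

# use_fast=False returns the Python BertTokenizer, which exposes basic_tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2", use_fast=False
)

text = "这是一个测试句子。"
# Step 1: the basic tokenizer splits every CJK character into its own token
print(tokenizer.basic_tokenizer.tokenize(text))
# Step 2: each character is then looked up in the WordPiece vocabulary;
# characters that are missing are replaced with [UNK]
print(tokenizer.tokenize(text))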
3. Model Design Purpose
all-MiniLM-L6-v2 is designed mainly for English tasks; while it can in principle process multilingual text, its support for Chinese is very limited.
Solutions
1. Use Specialized Chinese Models
bert-base-chinese
This is a BERT model pretrained specifically on Chinese text and available on Hugging Face. Its vocabulary covers the commonly used Chinese character set, so it tokenizes Chinese far more reliably.
from transformers import AutoTokenizer

# Load the tokenizer of the Chinese BERT model
model_name = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "这是一个测试句子。"
tokens = tokenizer.tokenize(text)    # character-level tokens
token_ids = tokenizer.encode(text)   # adds [CLS] and [SEP] ids
print("Tokens:", tokens)     # Example: ['这', '是', '一', '个', '测', '试', '句', '子', '。']
print("Token IDs:", token_ids)
2. Use Multilingual Models
paraphrase-multilingual-MiniLM-L12-v2
This is a multilingual model from Sentence Transformers that supports many languages, including Chinese, and tokenizes them much better. As the example output below shows, the text is not split strictly character by character; its subword tokenizer can group characters into word-level pieces such as "一个" and "测试".
from transformers import AutoTokenizer

# Load the multilingual tokenizer (XLM-R style, SentencePiece-based)
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "这是一个测试句子。"
tokens = tokenizer.tokenize(text)    # subword tokens; the first may carry a leading '▁' marker
token_ids = tokenizer.encode(text)   # adds <s> and </s> ids
print("Tokens:", tokens)   # Example: ['这', '是', '一个', '测试', '句子', '。']
print("Token IDs:", token_ids)
xlm-roberta-base
This is a powerful multilingual model that supports over 100 languages, including Chinese.
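For completeness, the same check can be run against xlm-roberta-base. This is a minimal sketch along the lines of the examples above; its tokenizer is also SentencePiece-based, so the exact subword boundaries may differ from the multilingual MiniLM output.
from transformers import AutoTokenizer

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "这是一个测试句子。"
tokens = tokenizer.tokenize(text)    # SentencePiece subwords; common Chinese characters are covered
token_ids = tokenizer.encode(text)
print("Tokens:", tokens)
print("Token IDs:", token_ids)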