Word Segmentation Algorithms (Training Tokenizer with SentencePiece)
10 Aug 2022

< Table of Contents >
- Tokenizer Overview
- SPM Example
- Basic end-to-end example
- User defined and control symbols (for BERT or something)
- Manipulating BOS/EOS/UNK/PAD symbols
- Changing the vocab id and surface representation of UNK/BOS/EOS/PAD symbols
- Sampling and nbest segmentation for subword regularization (can be used for Lexicon Generation)
- BPE (Byte pair encoding) model (BPE vs Unigram)
- Character and word model (not Subword)
- Text normalization
- Randomizing training data
- Put it All together
- References
## Tokenizer Overview
### Character (Letter) Tokenizer
THE CITY OF MONTREAL => ['T','H','E','▁','C','I','T','Y','▁','O','F','▁','M','O','N','T','R','E','A','L']
### Word Tokenizer
THE CITY OF MONTREAL => ['THE','CITY','OF','MONTREAL']
### Byte-Pair Encoding (BPE) Tokenizer
THE CITY OF MONTREAL => ['▁THE', '▁CITY', '▁OF', '▁MO', 'NT', 'RE', 'AL']
There are three widely used subword segmentation algorithms:
- Byte-Pair Encoding (BPE)
- WordPiece Model (WPM)
- Unigram language model
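Before turning to SentencePiece itself, the core of BPE training can be sketched in a few lines of plain Python. This is a toy illustration only: the word list and frequencies are made up, and real implementations merge pairs only at exact symbol boundaries (this sketch uses a naive string replace):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace the space-separated pair with its concatenation in every word."""
    src, dst = ' '.join(pair), ''.join(pair)
    return {word.replace(src, dst): freq for word, freq in vocab.items()}

# Words pre-split into characters, with illustrative frequencies.
vocab = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
for _ in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)  # → ('e', 's'), then ('es', 't'), then ('l', 'o')
```

Each printed pair becomes a new symbol in the vocabulary; running more iterations grows the subword inventory toward the requested vocab size.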
## SPM Example
mkdir -p /workspace/spm_tutorial
cd /workspace/spm_tutorial
wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
(py38) root@557bec2a5c9d:/workspace/spm_tutorial# tree .
.
`-- botchan.txt
## Basic end-to-end example
import sentencepiece as spm
# train sentencepiece model from `botchan.txt` and makes `m.model` and `m.vocab`
# `m.vocab` is just a reference. not used in the segmentation.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
# makes segmenter instance and loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# encode: text => id
print(sp.encode_as_pieces('This is a test'))
print(sp.encode_as_ids('This is a test'))
# decode: id => text
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))
print(sp.decode_ids([209, 31, 9, 375, 586]))
['▁This', '▁is', '▁a', '▁t', 'est']
[209, 31, 9, 375, 586]
This is a test
This is a test
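The decode step is simple enough to sketch in plain Python: concatenate the pieces and turn the '▁' whitespace marker back into spaces. This is a simplified illustration of what decode_pieces does, ignoring unknown-token handling and the embedded normalizer:

```python
def join_pieces(pieces):
    """Concatenate pieces and restore spaces from the '▁' marker."""
    return ''.join(pieces).replace('▁', ' ').strip()

print(join_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))  # → This is a test
```

Because '▁' records exactly where whitespace was, the encode/decode round trip is lossless for normalized text.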
(py38) root@557bec2a5c9d:/workspace/spm_tutorial# tree .
.
|-- botchan.txt
|-- m.model
`-- m.vocab
(py38) root@557bec2a5c9d:/workspace/spm_tutorial# head -20 m.vocab
<unk> 0
<s> 0
</s> 0
, -3.2299
. -3.36342
▁the -3.40218
▁I -3.74108
s -3.89451
▁to -3.91479
▁a -4.02289
▁and -4.09969
▁of -4.10888
▁ -4.38181
ing -4.44207
ed -4.44878
▁in -4.53349
▁was -4.61109
▁" -4.64108
▁it -4.78218
t -4.80042
If no special options are given, the three special tokens <unk>, <s>, and </s> are mapped to ids 0, 1, and 2 of the dictionary; in particular, <s> and </s> are defined as control symbols.
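Since m.vocab is a plain TSV of piece and emission score (a log probability), it is easy to load for inspection even though segmentation itself never reads it. The sample text below copies the head of the file shown above:

```python
# First lines of m.vocab: piece <tab> score, one piece per line, id = line number.
vocab_text = "<unk>\t0\n<s>\t0\n</s>\t0\n,\t-3.2299\n▁the\t-3.40218"

piece_to_id, piece_score = {}, {}
for i, line in enumerate(vocab_text.splitlines()):
    piece, score = line.split('\t')
    piece_to_id[piece] = i
    piece_score[piece] = float(score)

print(piece_to_id['▁the'])  # → 4
print(piece_score[','])     # → -3.2299
```

This piece → (id, score) table is handy when building a lexicon or auditing which subwords the trainer kept.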
# returns vocab size
print(sp.get_piece_size())
# id <=> piece conversion
print(sp.id_to_piece(209))
print(sp.piece_to_id('▁This'))
# returns 0 for unknown tokens (we can change the id for UNK)
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))
# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# <s> and </s> are defined as 'control' symbol.
for id in range(3):
    print(sp.id_to_piece(id), sp.is_control(id))
2000
▁This
209
0
<unk> False
<s> True
</s> True
## User defined and control symbols (for BERT or something)
# Example of user defined symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls> --vocab_size=2000')
sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')
# ids are reserved in both modes.
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# user-defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_user.piece_to_id('<sep>')) # 3
print(sp_user.piece_to_id('<cls>')) # 4
print('3=', sp_user.decode_ids([3])) # decoded to <sep>
print('4=', sp_user.decode_ids([4])) # decoded to <cls>
['▁this', '▁is', '▁a', '▁t', 'est', '<sep>', '▁he', 'll', 'o', '▁world', '<cls>']
3
4
3= <sep>
4= <cls>
# Example of control symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_ctrl --control_symbols=<sep>,<cls> --vocab_size=2000')
sp_ctrl = spm.SentencePieceProcessor()
sp_ctrl.load('m_ctrl.model')
# control symbols just reserve ids.
print(sp_ctrl.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_ctrl.piece_to_id('<sep>')) # 3
print(sp_ctrl.piece_to_id('<cls>')) # 4
print('3=', sp_ctrl.decode_ids([3])) # decoded to empty
print('4=', sp_ctrl.decode_ids([4])) # decoded to empty
['▁this', '▁is', '▁a', '▁t', 'est', '<', 'se', 'p', '>', '▁he', 'll', 'o', '▁world', '<', 'c', 'l', 's', '>']
3
4
3=
4=
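The difference between the two modes can be mimicked in plain Python: control ids are dropped on decode, while user-defined symbols survive as ordinary pieces. The id table below is hypothetical, just mirroring the reserved ids above:

```python
# Hypothetical id table mirroring the reserved ids above (not a real model).
id_to_piece = {1: '<s>', 2: '</s>', 3: '<sep>', 4: '<cls>', 5: '▁hello'}

def decode_ids(ids, control_ids):
    """Control ids are dropped on decode; other pieces are detokenized."""
    pieces = [id_to_piece[i] for i in ids if i not in control_ids]
    return ''.join(pieces).replace('▁', ' ').strip()

# Treating <sep>/<cls> as control symbols: they vanish from the output.
print(decode_ids([3, 5, 4], {1, 2, 3, 4}))  # → hello
# Treating them as user-defined symbols: they survive as normal pieces.
print(decode_ids([3, 5, 4], {1, 2}))        # → <sep> hello<cls>
```

In short: user-defined symbols are part of the text, control symbols are markers the model sees but the decoded text never shows.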
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bos_as_user --user_defined_symbols=<s>,</s> --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('<s> hello</s>')) # <s>,</s> are segmented. (default behavior)
sp = spm.SentencePieceProcessor()
sp.load('m_bos_as_user.model')
print(sp.encode_as_pieces('<s> hello</s>')) # <s>,</s> are handled as one token.
['▁', '<', 's', '>', '▁he', 'll', 'o', '</', 's', '>']
['▁', '<s>', '▁he', 'll', 'o', '</s>']
## Manipulating BOS/EOS/UNK/PAD symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print('bos=', sp.bos_id())
print('eos=', sp.eos_id())
print('unk=', sp.unk_id())
print('pad=', sp.pad_id()) # disabled by default
print(sp.encode_as_ids('Hello world'))
# Prepend or append bos/eos ids.
print([sp.bos_id()] + sp.encode_as_ids('Hello world') + [sp.eos_id()])
bos= 1
eos= 2
unk= 0
pad= -1
[12, 1828, 1038]
[1, 12, 1828, 1038, 2]
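Once the bos/eos ids are available, batching for a model typically wraps each sequence and right-pads to a common length. A sketch in plain Python; the pad id 3 here is hypothetical, since as shown above padding is disabled by default and must be enabled with --pad_id at training time:

```python
bos_id, eos_id, pad_id = 1, 2, 3  # hypothetical; pad requires e.g. --pad_id=3 when training

def pad_batch(id_seqs, bos_id, eos_id, pad_id):
    """Wrap each sequence with BOS/EOS, then right-pad to the longest length."""
    wrapped = [[bos_id] + seq + [eos_id] for seq in id_seqs]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in wrapped]

batch = pad_batch([[12, 1828, 1038], [12, 1828]], bos_id, eos_id, pad_id)
print(batch)  # → [[1, 12, 1828, 1038, 2], [1, 12, 1828, 2, 3]]
```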
## Changing the vocab id and surface representation of UNK/BOS/EOS/PAD symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS]')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
for id in range(4):
    print(sp.id_to_piece(id), sp.is_control(id))
[PAD] True
[UNK] False
[BOS] True
[EOS] True
# Disable BOS/EOS
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --bos_id=-1 --eos_id=-1')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# <s>, </s> are UNK.
print(sp.unk_id())
print(sp.piece_to_id('<s>'))
print(sp.piece_to_id('</s>'))
0
0
0
## Sampling and nbest segmentation for subword regularization (can be used for Lexicon Generation)
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# Can obtain different segmentations per request.
# There are two hyperparameters for sampling (nbest_size and inverse temperature). See the paper [kudo18] for details.
for n in range(10):
    print(sp.sample_encode_as_pieces('hello world', -1, 0.1))
for n in range(10):
    print(sp.sample_encode_as_ids('hello world', -1, 0.1))
['▁', 'h', 'e', 'll', 'o', '▁w', 'o', 'r', 'l', 'd']
['▁he', 'l', 'l', 'o', '▁world']
['▁he', 'l', 'l', 'o', '▁w', 'or', 'l', 'd']
['▁', 'he', 'l', 'l', 'o', '▁world']
['▁', 'he', 'll', 'o', '▁w', 'o', 'r', 'l', 'd']
['▁', 'he', 'll', 'o', '▁world']
['▁he', 'll', 'o', '▁world']
['▁', 'he', 'll', 'o', '▁world']
['▁he', 'll', 'o', '▁w', 'o', 'r', 'l', 'd']
['▁', 'h', 'e', 'l', 'l', 'o', '▁w', 'o', 'r', 'l', 'd']
[12, 489, 57, 57, 38, 1246, 57, 20]
[28, 98, 38, 1038]
[12, 489, 98, 38, 12, 151, 105, 57, 20]
[12, 489, 98, 38, 1038]
[28, 98, 38, 254, 105, 57, 20]
[12, 489, 98, 38, 12, 151, 38, 46, 57, 20]
[28, 57, 57, 38, 1038]
[28, 98, 38, 1038]
[12, 96, 351, 57, 38, 1038]
[28, 98, 38, 1038]
# get 10 best
print(sp.nbest_encode_as_pieces('hello world', 10))
print(sp.nbest_encode_as_ids('hello world', 10))
[['▁he', 'll', 'o', '▁world'], ['▁he', 'l', 'l', 'o', '▁world'], ['▁', 'he', 'll', 'o', '▁world'], ['▁', 'h', 'e', 'll', 'o', '▁world'], ['▁he', 'll', 'o', '▁wor', 'l', 'd'], ['▁', 'he', 'l', 'l', 'o', '▁world'], ['▁', 'h', 'el', 'l', 'o', '▁world'], ['▁he', 'll', 'o', '▁w', 'or', 'l', 'd'], ['▁', 'h', 'e', 'l', 'l', 'o', '▁world'], ['▁he', 'l', 'l', 'o', '▁wor', 'l', 'd']]
[[28, 98, 38, 1038], [28, 57, 57, 38, 1038], [12, 489, 98, 38, 1038], [12, 96, 25, 98, 38, 1038], [28, 98, 38, 1246, 57, 20], [12, 489, 57, 57, 38, 1038], [12, 96, 351, 57, 38, 1038], [28, 98, 38, 254, 105, 57, 20], [12, 96, 25, 57, 57, 38, 1038], [28, 57, 57, 38, 1246, 57, 20]]
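The unigram model's single best segmentation is the Viterbi path that maximizes the sum of piece log probabilities; nbest and sampling generalize the same lattice. A toy dynamic-programming sketch with made-up scores (real scores live in m.vocab), not SentencePiece's implementation:

```python
import math

# Made-up unigram log probabilities; real ones come from m.vocab.
logp = {'▁he': -3.0, 'he': -4.0, 'll': -3.5, 'l': -3.0, 'o': -2.5,
        '▁': -2.0, '▁hello': -10.0, '▁world': -4.0}

def viterbi_segment(text, logp):
    """Return the piece sequence maximizing the sum of log probabilities."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[j]: score of best segmentation of text[:j]
    back = [0] * (n + 1)             # back[j]: start index of the last piece
    for j in range(1, n + 1):
        for i in range(j):
            piece = text[i:j]
            if piece in logp and best[i] + logp[piece] > best[j]:
                best[j], back[j] = best[i] + logp[piece], i
    pieces, j = [], n
    while j > 0:                     # backtrace from the end of the string
        pieces.append(text[back[j]:j])
        j = back[j]
    return pieces[::-1]

print(viterbi_segment('▁hello▁world', logp))  # → ['▁he', 'll', 'o', '▁world']
```

With these toy scores the best path matches the top nbest candidate shown above; sampling perturbs which path is taken per request.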
## BPE (Byte pair encoding) model (BPE vs Unigram)
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bpe --vocab_size=2000 --model_type=bpe')
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('m_bpe.model')
print('*** BPE ***')
print(sp_bpe.encode_as_pieces('thisisatesthelloworld'))
print(sp_bpe.nbest_encode_as_pieces('hello world', 5)) # returns an empty list.
*** BPE ***
['▁this', 'is', 'at', 'est', 'he', 'llow', 'or', 'ld']
[]
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_unigram --vocab_size=2000 --model_type=unigram')
sp_unigram = spm.SentencePieceProcessor()
sp_unigram.load('m_unigram.model')
print('*** Unigram ***')
print(sp_unigram.encode_as_pieces('thisisatesthelloworld'))
print(sp_unigram.nbest_encode_as_pieces('thisisatesthelloworld', 5))
*** Unigram ***
['▁this', 'is', 'ate', 's', 'the', 'llow', 'or', 'l', 'd']
[['▁this', 'is', 'ate', 's', 'the', 'llow', 'or', 'l', 'd'], ['▁this', 'i', 's', 'ate', 's', 'the', 'llow', 'or', 'l', 'd'], ['▁this', 'is', 'ate', 'st', 'he', 'llow', 'or', 'l', 'd'], ['▁this', 'is', 'at', 'es', 'the', 'llow', 'or', 'l', 'd'], ['▁this', 'is', 'at', 'est', 'he', 'llow', 'or', 'l', 'd']]
## Character and word model (not Subword)
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_char --model_type=char --vocab_size=400')
sp_char = spm.SentencePieceProcessor()
sp_char.load('m_char.model')
print(sp_char.encode_as_pieces('this is a test.'))
print(sp_char.encode_as_ids('this is a test.'))
['▁', 't', 'h', 'i', 's', '▁', 'i', 's', '▁', 'a', '▁', 't', 'e', 's', 't', '.']
[3, 5, 10, 9, 11, 3, 9, 11, 3, 7, 3, 5, 4, 11, 5, 23]
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_word --model_type=word --vocab_size=2000')
sp_word = spm.SentencePieceProcessor()
sp_word.load('m_word.model')
print(sp_word.encode_as_pieces('this is a test.')) # '.' will not be one token.
print(sp_word.encode_as_ids('this is a test.'))
['▁this', '▁is', '▁a', '▁test.']
[31, 17, 8, 0]
## Text normalization
# NFKC normalization and lower casing.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_name=nfkc_cf')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('HELLO WORLD.')) # lower casing and normalization
['▁', 'hello', '▁world', '.']
def tocode(s):
    out = []
    for c in s:
        out.append(hex(ord(c)).replace('0x', 'U+'))
    return ' '.join(out)
# TSV format: source Unicode code points <tab> target code points
# normalize "don't => do not, I'm => I am"
with open('normalization_rule.tsv', 'w') as f:
    f.write(tocode("I'm") + '\t' + tocode("I am") + '\n')
    f.write(tocode("don't") + '\t' + tocode("do not") + '\n')
print(open('normalization_rule.tsv', 'r').read())
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_tsv=normalization_rule.tsv')
sp = spm.SentencePieceProcessor()
# m.model embeds the normalization rule compiled into an FST.
sp.load('m.model')
print(sp.encode_as_pieces("I'm busy")) # normalized to 'I am busy'
print(sp.encode_as_pieces("I don't know it.")) # normalized to 'I do not know it.'
U+49 U+27 U+6d U+49 U+20 U+61 U+6d
U+64 U+6f U+6e U+27 U+74 U+64 U+6f U+20 U+6e U+6f U+74
['▁I', '▁am', '▁bu', 's', 'y']
['▁I', '▁do', '▁not', '▁know', '▁it', '.']
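SentencePiece compiles the TSV into a character-level FST embedded in the model; purely for illustration, the effect of the two rules can be mimicked with a longest-match scan (a toy sketch, not the real normalizer):

```python
def apply_rules(text, rules):
    """Longest-match replacement, the spirit of the user-defined TSV rules."""
    out, i = [], 0
    ordered = sorted(rules.items(), key=lambda kv: -len(kv[0]))  # longest source first
    while i < len(text):
        for src, tgt in ordered:
            if text.startswith(src, i):
                out.append(tgt)
                i += len(src)
                break
        else:
            out.append(text[i])  # no rule matched: copy the character
            i += 1
    return ''.join(out)

rules = {"I'm": "I am", "don't": "do not"}
print(apply_rules("I'm busy", rules))          # → I am busy
print(apply_rules("I don't know it.", rules))  # → I do not know it.
```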
## Randomizing training data
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --input_sentence_size=1000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
sp.encode_as_pieces('this is a test.')
['▁this', '▁is', '▁a', '▁t', 'est', '.']
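Internally this corresponds to shuffling the corpus and keeping a subsample of sentences for training; the idea can be sketched with the random module (a toy stand-in for --input_sentence_size with shuffle_input_sentence, not SentencePiece's sampler):

```python
import random

def sample_sentences(sentences, n, seed=0):
    """Keep a random subset of the corpus, sampled without replacement."""
    rng = random.Random(seed)          # fixed seed so the subset is reproducible
    return rng.sample(sentences, min(n, len(sentences)))

corpus = ['sentence %d' % i for i in range(10000)]
subset = sample_sentences(corpus, 1000)
print(len(subset))  # → 1000
```

Subsampling trades a little segmentation quality for much faster training on large corpora.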
## Put it All together
from __future__ import absolute_import, division, print_function, unicode_literals
import os
import sentencepiece as spm
input = "/workspace/spm_tutorial/botchan.txt"
model_type = "unigram"
vocab_size = 2000
path, fname = os.path.split(input)
prefix = os.path.join(path, fname + "." + str(vocab_size) + "." + model_type + ".wp")
# train
print("Computing word pieces...\n", flush=True)
train_cmd = (
    "--input={input} --model_prefix={prefix} --vocab_size={vocab_size}"
    " --character_coverage=1.0 --model_type={model_type}"
    " --split_by_unicode_script=false".format(
        input=input,
        prefix=prefix,
        vocab_size=vocab_size,
        model_type=model_type,  # ["unigram", "bpe", "char"]
    )
)
spm.SentencePieceTrainer.Train(train_cmd)
(py38) root@557bec2a5c9d:/workspace/spm_tutorial# python spm.py
Computing word pieces...
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=/workspace/spm_tutorial/botchan.txt --model_prefix=/workspace/spm_tutorial/botchan.txt.2000.unigram.wp --vocab_size=2000 --character_coverage=1.0 --model_type=unigram --split_by_unicode_script=false
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with :
trainer_spec {
input: /workspace/spm_tutorial/botchan.txt
input_format:
model_prefix: /workspace/spm_tutorial/botchan.txt.2000.unigram.wp
model_type: UNIGRAM
vocab_size: 2000
self_test_sample_size: 0
character_coverage: 1
input_sentence_size: 0
shuffle_input_sentence: 1
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
max_sentence_length: 4192
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: 0
split_by_number: 1
split_by_whitespace: 1
split_digits: 0
treat_whitespace_as_suffix: 0
allow_whitespace_only_pieces: 0
required_chars:
byte_fallback: 0
vocabulary_output_piece_score: 1
train_extremely_large_corpus: 0
hard_vocab_limit: 1
use_all_vocab: 0
unk_id: 0
bos_id: 1
eos_id: 2
pad_id: -1
unk_piece: <unk>
bos_piece: <s>
eos_piece: </s>
pad_piece: <pad>
unk_surface: ⁇
}
normalizer_spec {
name: nmt_nfkc
add_dummy_prefix: 1
remove_extra_whitespaces: 1
escape_whitespaces: 1
normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(178) LOG(INFO) Loading corpus: /workspace/spm_tutorial/botchan.txt
trainer_interface.cc(385) LOG(INFO) Loaded all 4288 sentences
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(405) LOG(INFO) Normalizing sentences...
trainer_interface.cc(466) LOG(INFO) all chars count=274252
trainer_interface.cc(487) LOG(INFO) Alphabet size=83
trainer_interface.cc(488) LOG(INFO) Final character coverage=1
trainer_interface.cc(520) LOG(INFO) Done! preprocessed 4288 sentences.
unigram_model_trainer.cc(139) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(143) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(194) LOG(INFO) Initialized 20470 seed sentencepieces
trainer_interface.cc(526) LOG(INFO) Tokenizing input sentences with whitespace: 4288
trainer_interface.cc(537) LOG(INFO) Done! 9183
unigram_model_trainer.cc(489) LOG(INFO) Using 9183 sentences for EM training
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=7200 obj=9.97176 num_tokens=16901 num_tokens/piece=2.34736
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=6197 obj=8.34115 num_tokens=17017 num_tokens/piece=2.74601
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=4644 obj=8.37819 num_tokens=18438 num_tokens/piece=3.97028
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=4640 obj=8.31629 num_tokens=18495 num_tokens/piece=3.98599
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=3480 obj=8.59811 num_tokens=20668 num_tokens/piece=5.93908
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=3480 obj=8.51923 num_tokens=20669 num_tokens/piece=5.93937
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=2610 obj=8.91866 num_tokens=23381 num_tokens/piece=8.95824
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=2610 obj=8.83068 num_tokens=23383 num_tokens/piece=8.959
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=2200 obj=9.07924 num_tokens=25089 num_tokens/piece=11.4041
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=2200 obj=9.03234 num_tokens=25090 num_tokens/piece=11.4045
trainer_interface.cc(615) LOG(INFO) Saving model: /workspace/spm_tutorial/botchan.txt.2000.unigram.wp.model
trainer_interface.cc(626) LOG(INFO) Saving vocabs: /workspace/spm_tutorial/botchan.txt.2000.unigram.wp.vocab
import sentencepiece as spm
prefix = "botchan.txt.2000.unigram.wp"
input = "/workspace/spm_tutorial/botchan.txt"
sp = spm.SentencePieceProcessor()
sp.Load(prefix + ".model")
with open(input, 'rt') as f:
    lines = f.readlines()

for i in range(len(lines)):
    if i == 5:
        break
    print('original line : {}'.format(lines[i]))
    print('processed line : {}'.format(sp.EncodeAsPieces(lines[i])))
    print()
original line : Project Gutenberg's Botchan (Master Darling), by Kin-nosuke Natsume
processed line : ['▁Project', '▁Gutenberg', "'s", '▁', 'Botchan', '▁(M', 'aster', '▁Darling', ')', ',', '▁by', '▁K', 'in', '-', 'nosu', 'ke', '▁Natsume']
original line : This eBook is for the use of anyone anywhere at no cost and with
processed line : ['▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁cost', '▁and', '▁with']
original line : almost no restrictions whatsoever. You may copy it, give it away or
processed line : ['▁almost', '▁no', '▁re', 'strict', 'ion', 's', '▁what', 's', 'o', 'ever', '.', '▁You', '▁may', '▁copy', '▁it', ',', '▁give', '▁it', '▁away', '▁or']
original line : re-use it under the terms of the Project Gutenberg License included
processed line : ['▁re', '-', 'us', 'e', '▁it', '▁under', '▁the', '▁terms', '▁of', '▁the', '▁Project', '▁Gutenberg', '▁License', '▁includ', 'ed']
original line : with this eBook or online at www.gutenberg.org
processed line : ['▁with', '▁this', '▁eBook', '▁or', '▁on', 'line', '▁at', '▁w', 'ww.gutenberg.org']
## References
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
- Neural Machine Translation of Rare Words with Subword Units
- Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
- github.com/google/sentencepiece
- SentencePiece Algorithm (Korean blog post)
- ASRfromScratch.ipynb with Speechbrain
- sentencepiece_python_module_example
- Using Sentencepiece, by Inhyeok Yoo (Korean)
- (issue) My training crashes with large corpus.
- (issue) Vocab size to train LM and ASR
- espnet/egs/librispeech/asr1/run.sh
- 13. Subword Tokenizer, from Introduction to Natural Language Processing with Deep Learning (Korean)
- wav2letter/recipes/utilities/prepare_librispeech_wp_and_official_lexicon.py