Word Segmentation Algorithms (Training a Tokenizer with SentencePiece)


< Table of Contents >



## Character (Letter) Tokenizer
THE CITY OF MONTREAL => ['T','H','E','_','C','I','T','Y','_','O','F','_','M','O','N','T','R','E','A','L']

## Word Tokenizer
THE CITY OF MONTREAL => ['THE','CITY','OF','MONTREAL']
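
For reference, the character and word tokenizations above can be reproduced with a few lines of plain Python (no SentencePiece involved); this is just a sketch of the idea:

sentence = 'THE CITY OF MONTREAL'

# character tokenizer: every letter is a token; spaces are kept as '_'
char_tokens = ['_' if c == ' ' else c for c in sentence]

# word tokenizer: split on whitespace
word_tokens = sentence.split()

print(char_tokens)  # ['T', 'H', 'E', '_', 'C', 'I', 'T', 'Y', '_', 'O', 'F', '_', 'M', ...]
print(word_tokens)  # ['THE', 'CITY', 'OF', 'MONTREAL']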

## Byte-Pair Encoding (BPE) Tokenizer
THE CITY OF MONTREAL => ['THE', 'CITY', 'OF', 'MO', 'NT', 'RE', 'AL']

### Byte-Pair Encoding (BPE)
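
BPE (Sennrich et al., 2016) builds a subword vocabulary bottom-up: start from single characters, repeatedly count adjacent symbol pairs over the corpus, and merge the most frequent pair into a new symbol until a merge budget (roughly the target vocabulary size) is exhausted. Below is a minimal sketch of that merge loop on a toy word-frequency table; the toy corpus and helper names are illustrative, not SentencePiece's internal implementation:

import re, collections

def get_pair_counts(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # merge "a b" into "ab", but only where "a" and "b" are whole symbols
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# toy corpus: each word is pre-split into characters; '</w>' marks the end of a word
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for _ in range(10):               # merge budget; roughly controls vocabulary size
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best)                   # the learned merge rules, most frequent first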

### WordPiece Model (WPM)
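
WordPiece builds its vocabulary similarly to BPE, but at each step it picks the merge that most increases the training-data likelihood under the model, rather than the most frequent pair. At segmentation time, BERT-style WordPiece uses greedy longest-match-first against a fixed vocabulary. A minimal sketch of that matching rule (the toy vocabulary is hypothetical; this only illustrates the lookup, not SentencePiece's API):

def wordpiece_tokenize(word, vocab, unk='[UNK]'):
    # greedy longest-match-first segmentation against a fixed vocabulary
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = '##' + sub          # non-initial pieces carry the '##' prefix, as in BERT
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]                  # the whole word becomes UNK if any part is unknown
        tokens.append(piece)
        start = end
    return tokens

vocab = {'un', '##aff', '##able'}
print(wordpiece_tokenize('unaffable', vocab))  # ['un', '##aff', '##able']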

### Unigram Language Model
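
The unigram language model (SentencePiece's default model type) assigns each piece a log-probability and segments text into the piece sequence with the highest total log-probability; training alternates EM estimation of those probabilities with pruning of low-utility pieces. The decoding step itself is a simple Viterbi pass over the string; a minimal sketch with a hypothetical toy vocabulary of log-probabilities (a real model learns these with EM):

import math

def viterbi_segment(text, logprob):
    # best[i] = (score, segmentation) of the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logprob and best[start][1] is not None:
                score = best[start][0] + logprob[piece]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

# hypothetical piece log-probabilities
logprob = {'▁hello': -5.0, '▁he': -6.0, 'll': -4.0, 'o': -3.0, '▁world': -5.5, '▁': -2.0}
print(viterbi_segment('▁hello▁world', logprob))  # ['▁hello', '▁world']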

## SPM Example

mkdir -p /workspace/spm_tutorial
cd /workspace/spm_tutorial
wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
(py38) root@557bec2a5c9d:/workspace/spm_tutorial# tree .
.
`-- botchan.txt
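
If the sentencepiece Python package is not installed in the environment yet, it is available on PyPI (assuming a standard pip setup):

pip install sentencepiece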

### Basic end-to-end example

import sentencepiece as spm

# train a sentencepiece model from `botchan.txt`; this produces `m.model` and `m.vocab`
# `m.vocab` is just a human-readable reference; it is not used for segmentation
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')

# makes segmenter instance and loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# encode: text => id
print(sp.encode_as_pieces('This is a test'))
print(sp.encode_as_ids('This is a test'))

# decode: id => text
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))
print(sp.decode_ids([209, 31, 9, 375, 586]))
['▁This', '▁is', '▁a', '▁t', 'est']
[209, 31, 9, 375, 586]
This is a test
This is a test
(py38) root@557bec2a5c9d:/workspace/spm_tutorial# tree .
.
|-- botchan.txt
|-- m.model
`-- m.vocab
(py38) root@557bec2a5c9d:/workspace/spm_tutorial# head -20 m.vocab 
<unk>   0
<s>     0
</s>    0
,       -3.2299
.       -3.36342
▁the    -3.40218
▁I      -3.74108
s       -3.89451
▁to     -3.91479
▁a      -4.02289
▁and    -4.09969
▁of     -4.10888
▁       -4.38181
ing     -4.44207
ed      -4.44878
▁in     -4.53349
▁was    -4.61109
▁"      -4.64108
▁it     -4.78218
t       -4.80042

If no special options are given here, the three symbols <unk>, <s>, and </s> are reserved as special tokens and mapped to ids 0, 1, and 2 in the dictionary; in particular, <s> and </s> are defined as control symbols.

# returns vocab size
print(sp.get_piece_size())

# id <=> piece conversion
print(sp.id_to_piece(209))
print(sp.piece_to_id('▁This'))

# returns 0 for unknown tokens (we can change the id for UNK)
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))

# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# <s> and </s> are defined as 'control' symbol.
for id in range(3):
  print(sp.id_to_piece(id), sp.is_control(id))
2000
▁This
209
0
<unk> False
<s> True
</s> True

### User-defined and control symbols (e.g., for BERT)

# Example of user defined symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls> --vocab_size=2000')

sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')

# ids are reserved in both modes.
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# user-defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_user.piece_to_id('<sep>'))  # 3
print(sp_user.piece_to_id('<cls>'))  # 4
print('3=', sp_user.decode_ids([3]))  # decoded to <sep>
print('4=', sp_user.decode_ids([4]))  # decoded to <cls>
['▁this', '▁is', '▁a', '▁t', 'est', '<sep>', '▁he', 'll', 'o', '▁world', '<cls>']
3
4
3= <sep>
4= <cls>
# Example of control symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_ctrl --control_symbols=<sep>,<cls> --vocab_size=2000')

sp_ctrl = spm.SentencePieceProcessor()
sp_ctrl.load('m_ctrl.model')

# control symbols just reserve ids.
print(sp_ctrl.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_ctrl.piece_to_id('<sep>'))  # 3
print(sp_ctrl.piece_to_id('<cls>'))  # 4
print('3=', sp_ctrl.decode_ids([3]))  # decoded to empty
print('4=', sp_ctrl.decode_ids([4]))  # decoded to empty
['▁this', '▁is', '▁a', '▁t', 'est', '<', 'se', 'p', '>', '▁he', 'll', 'o', '▁world', '<', 'c', 'l', 's', '>']
3
4
3= 
4= 
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bos_as_user --user_defined_symbols=<s>,</s> --vocab_size=2000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('<s> hello</s>'))   # <s>,</s> are segmented. (default behavior)

sp = spm.SentencePieceProcessor()
sp.load('m_bos_as_user.model')
print(sp.encode_as_pieces('<s> hello</s>'))   # <s>,</s> are handled as one token.
['▁', '<', 's', '>', '▁he', 'll', 'o', '</', 's', '>']
['▁', '<s>', '▁he', 'll', 'o', '</s>']

### Manipulating BOS/EOS/UNK/PAD symbols

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')

print('bos=', sp.bos_id())
print('eos=', sp.eos_id())
print('unk=', sp.unk_id())
print('pad=', sp.pad_id())  # disabled by default


print(sp.encode_as_ids('Hello world'))

# Prepend or append bos/eos ids.
print([sp.bos_id()] + sp.encode_as_ids('Hello world') + [sp.eos_id()])
bos= 1
eos= 2
unk= 0
pad= -1
[12, 1828, 1038]
[1, 12, 1828, 1038, 2]

### Changing the vocab id and surface representation of UNK/BOS/EOS/PAD symbols

spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS]')
sp = spm.SentencePieceProcessor()
sp.load('m.model')


for id in range(4):
    print(sp.id_to_piece(id), sp.is_control(id))
[PAD] True
[UNK] False
[BOS] True
[EOS] True
# Disable BOS/EOS
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --bos_id=-1 --eos_id=-1')
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# <s>, </s> are UNK.
print(sp.unk_id())
print(sp.piece_to_id('<s>'))
print(sp.piece_to_id('</s>'))
0
0
0

### Sampling and nbest segmentation for subword regularization (can be used for lexicon generation)

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')

# Can obtain different segmentations per request.
# There are two hyperparameters for sampling (nbest_size and inverse temperature); see the paper [kudo18] for details.
for n in range(10):
  print(sp.sample_encode_as_pieces('hello world', -1, 0.1))

for n in range(10):
  print(sp.sample_encode_as_ids('hello world', -1, 0.1))
['▁', 'h', 'e', 'll', 'o', '▁w', 'o', 'r', 'l', 'd']
['▁he', 'l', 'l', 'o', '▁world']
['▁he', 'l', 'l', 'o', '▁w', 'or', 'l', 'd']
['▁', 'he', 'l', 'l', 'o', '▁world']
['▁', 'he', 'll', 'o', '▁w', 'o', 'r', 'l', 'd']
['▁', 'he', 'll', 'o', '▁world']
['▁he', 'll', 'o', '▁world']
['▁', 'he', 'll', 'o', '▁world']
['▁he', 'll', 'o', '▁w', 'o', 'r', 'l', 'd']
['▁', 'h', 'e', 'l', 'l', 'o', '▁w', 'o', 'r', 'l', 'd']
[12, 489, 57, 57, 38, 1246, 57, 20]
[28, 98, 38, 1038]
[12, 489, 98, 38, 12, 151, 105, 57, 20]
[12, 489, 98, 38, 1038]
[28, 98, 38, 254, 105, 57, 20]
[12, 489, 98, 38, 12, 151, 38, 46, 57, 20]
[28, 57, 57, 38, 1038]
[28, 98, 38, 1038]
[12, 96, 351, 57, 38, 1038]
[28, 98, 38, 1038]
# get 10 best
print(sp.nbest_encode_as_pieces('hello world', 10))
print(sp.nbest_encode_as_ids('hello world', 10))
[['▁he', 'll', 'o', '▁world'], ['▁he', 'l', 'l', 'o', '▁world'], ['▁', 'he', 'll', 'o', '▁world'], ['▁', 'h', 'e', 'll', 'o', '▁world'], ['▁he', 'll', 'o', '▁wor', 'l', 'd'], ['▁', 'he', 'l', 'l', 'o', '▁world'], ['▁', 'h', 'el', 'l', 'o', '▁world'], ['▁he', 'll', 'o', '▁w', 'or', 'l', 'd'], ['▁', 'h', 'e', 'l', 'l', 'o', '▁world'], ['▁he', 'l', 'l', 'o', '▁wor', 'l', 'd']]
[[28, 98, 38, 1038], [28, 57, 57, 38, 1038], [12, 489, 98, 38, 1038], [12, 96, 25, 98, 38, 1038], [28, 98, 38, 1246, 57, 20], [12, 489, 57, 57, 38, 1038], [12, 96, 351, 57, 38, 1038], [28, 98, 38, 254, 105, 57, 20], [12, 96, 25, 57, 57, 38, 1038], [28, 57, 57, 38, 1246, 57, 20]]
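
In a real training pipeline, subword regularization means calling the sampling API on the fly for every example, so the same sentence is segmented differently across epochs; a minimal sketch of that usage (the toy sentence list and the training-step placeholder are illustrative):

train_sentences = ['hello world', 'this is a test']   # illustrative corpus

for epoch in range(3):
    for sent in train_sentences:
        # nbest_size=-1 samples from all segmentation hypotheses; 0.1 is the inverse-temperature (alpha) parameter
        ids = sp.sample_encode_as_ids(sent, -1, 0.1)
        # feed `ids` into the model's training step here
        print(epoch, ids)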

### BPE (byte-pair encoding) model (BPE vs Unigram)

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bpe --vocab_size=2000 --model_type=bpe')
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('m_bpe.model')

print('*** BPE ***')
print(sp_bpe.encode_as_pieces('thisisatesthelloworld'))
print(sp_bpe.nbest_encode_as_pieces('hello world', 5))  # returns an empty list.
*** BPE ***
['▁this', 'is', 'at', 'est', 'he', 'llow', 'or', 'ld']
[]
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_unigram --vocab_size=2000 --model_type=unigram')
sp_unigram = spm.SentencePieceProcessor()
sp_unigram.load('m_unigram.model')

print('*** Unigram ***')
print(sp_unigram.encode_as_pieces('thisisatesthelloworld'))
print(sp_unigram.nbest_encode_as_pieces('thisisatesthelloworld', 5))
*** Unigram ***
['▁this', 'is', 'ate', 's', 'the', 'llow', 'or', 'l', 'd']
[['▁this', 'is', 'ate', 's', 'the', 'llow', 'or', 'l', 'd'], ['▁this', 'i', 's', 'ate', 's', 'the', 'llow', 'or', 'l', 'd'], ['▁this', 'is', 'ate', 'st', 'he', 'llow', 'or', 'l', 'd'], ['▁this', 'is', 'at', 'es', 'the', 'llow', 'or', 'l', 'd'], ['▁this', 'is', 'at', 'est', 'he', 'llow', 'or', 'l', 'd']]

### Character and word model (not subword)

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_char --model_type=char --vocab_size=400')

sp_char = spm.SentencePieceProcessor()
sp_char.load('m_char.model')

print(sp_char.encode_as_pieces('this is a test.'))
print(sp_char.encode_as_ids('this is a test.'))
['▁', 't', 'h', 'i', 's', '▁', 'i', 's', '▁', 'a', '▁', 't', 'e', 's', 't', '.']
[3, 5, 10, 9, 11, 3, 9, 11, 3, 7, 3, 5, 4, 11, 5, 23]
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_word --model_type=word --vocab_size=2000')

sp_word = spm.SentencePieceProcessor()
sp_word.load('m_word.model')

print(sp_word.encode_as_pieces('this is a test.'))  # '.' will not be one token.
print(sp_word.encode_as_ids('this is a test.'))
['▁this', '▁is', '▁a', '▁test.']
[31, 17, 8, 0]

### Text normalization

# NFKC normalization and lower casing.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_name=nfkc_cf')

sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('HELLO WORLD.'))  # lower casing and normalization
['▁', 'hello', '▁world', '.']
def tocode(s):
    # convert a string into space-separated Unicode code points, e.g. "I'm" -> "U+49 U+27 U+6d"
    out = []
    for c in s:
        out.append(hex(ord(c)).replace('0x', 'U+'))
    return ' '.join(out)


# TSV format:  source Unicode code points <tab> target code points
# normalize "don't => do not,  I'm => I am"
with open('normalization_rule.tsv', 'w') as f:
  f.write(tocode("I'm") + '\t' + tocode("I am") + '\n')
  f.write(tocode("don't") + '\t' + tocode("do not") + '\n')

print(open('normalization_rule.tsv', 'r').read())

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_tsv=normalization_rule.tsv')

sp = spm.SentencePieceProcessor()
# m.model embeds the normalization rule compiled into an FST.
sp.load('m.model')
print(sp.encode_as_pieces("I'm busy"))  # normalzied to `I am busy'
print(sp.encode_as_pieces("I don't know it."))  # normalized to 'I do not know it.'
U+49 U+27 U+6d	U+49 U+20 U+61 U+6d
U+64 U+6f U+6e U+27 U+74	U+64 U+6f U+20 U+6e U+6f U+74

['▁I', '▁am', '▁bu', 's', 'y']
['▁I', '▁do', '▁not', '▁know', '▁it', '.']

### Randomizing training data

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --input_sentence_size=1000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')

sp.encode_as_pieces('this is a test.')
['▁this', '▁is', '▁a', '▁t', 'est', '.']

### Put it all together

from __future__ import absolute_import, division, print_function, unicode_literals
import os
import sentencepiece as spm

input = "/workspace/spm_tutorial/botchan.txt"
model_type = "unigram"
vocab_size = 2000

path, fname = os.path.split(input)
prefix = os.path.join(path, fname + "." + str(vocab_size) + "." + model_type + ".wp")

# train
print("Computing word pieces...\n", flush=True)
train_cmd = (
    "--input={input} --model_prefix={prefix} --vocab_size={vocab_size}"
    " --character_coverage=1.0 --model_type={model_type}"
    " --split_by_unicode_script=false".format(
        input=input, 
        prefix=prefix, 
        vocab_size=vocab_size,
        model_type=model_type # ["unigram", "bpe", "char"]
    )
)
spm.SentencePieceTrainer.Train(train_cmd)
(py38) root@557bec2a5c9d:/workspace/spm_tutorial# python spm.py      
Computing word pieces...

sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=/workspace/spm_tutorial/botchan.txt --model_prefix=/workspace/spm_tutorial/botchan.txt.2000.unigram.wp --vocab_size=2000 --character_coverage=1.0 --model_type=unigram --split_by_unicode_script=false
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: /workspace/spm_tutorial/botchan.txt
  input_format: 
  model_prefix: /workspace/spm_tutorial/botchan.txt.2000.unigram.wp
  model_type: UNIGRAM
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 0
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv: 
}
denormalizer_spec {}
trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(178) LOG(INFO) Loading corpus: /workspace/spm_tutorial/botchan.txt
trainer_interface.cc(385) LOG(INFO) Loaded all 4288 sentences
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(405) LOG(INFO) Normalizing sentences...
trainer_interface.cc(466) LOG(INFO) all chars count=274252
trainer_interface.cc(487) LOG(INFO) Alphabet size=83
trainer_interface.cc(488) LOG(INFO) Final character coverage=1
trainer_interface.cc(520) LOG(INFO) Done! preprocessed 4288 sentences.
unigram_model_trainer.cc(139) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(143) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(194) LOG(INFO) Initialized 20470 seed sentencepieces
trainer_interface.cc(526) LOG(INFO) Tokenizing input sentences with whitespace: 4288
trainer_interface.cc(537) LOG(INFO) Done! 9183
unigram_model_trainer.cc(489) LOG(INFO) Using 9183 sentences for EM training
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=7200 obj=9.97176 num_tokens=16901 num_tokens/piece=2.34736
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=6197 obj=8.34115 num_tokens=17017 num_tokens/piece=2.74601
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=4644 obj=8.37819 num_tokens=18438 num_tokens/piece=3.97028
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=4640 obj=8.31629 num_tokens=18495 num_tokens/piece=3.98599
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=3480 obj=8.59811 num_tokens=20668 num_tokens/piece=5.93908
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=3480 obj=8.51923 num_tokens=20669 num_tokens/piece=5.93937
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=2610 obj=8.91866 num_tokens=23381 num_tokens/piece=8.95824
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=2610 obj=8.83068 num_tokens=23383 num_tokens/piece=8.959
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=2200 obj=9.07924 num_tokens=25089 num_tokens/piece=11.4041
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=2200 obj=9.03234 num_tokens=25090 num_tokens/piece=11.4045
trainer_interface.cc(615) LOG(INFO) Saving model: /workspace/spm_tutorial/botchan.txt.2000.unigram.wp.model
trainer_interface.cc(626) LOG(INFO) Saving vocabs: /workspace/spm_tutorial/botchan.txt.2000.unigram.wp.vocab
import sentencepiece as spm
prefix = "botchan.txt.2000.unigram.wp"
input = "/workspace/spm_tutorial/botchan.txt"

sp = spm.SentencePieceProcessor()
sp.Load(prefix + ".model")

with open(input, 'rt') as f:
    lines = f.readlines()

for i in range(len(lines)):
    if i == 5:
        break
    print('original line : {}'.format(lines[i]))
    print('processed line : {}'.format(sp.EncodeAsPieces(lines[i])))
    print()
original line : Project Gutenberg's Botchan (Master Darling), by Kin-nosuke Natsume

processed line : ['▁Project', '▁Gutenberg', "'s", '▁', 'Botchan', '▁(M', 'aster', '▁Darling', ')', ',', '▁by', '▁K', 'in', '-', 'nosu', 'ke', '▁Natsume']

original line : This eBook is for the use of anyone anywhere at no cost and with

processed line : ['▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁cost', '▁and', '▁with']

original line : almost no restrictions whatsoever.  You may copy it, give it away or

processed line : ['▁almost', '▁no', '▁re', 'strict', 'ion', 's', '▁what', 's', 'o', 'ever', '.', '▁You', '▁may', '▁copy', '▁it', ',', '▁give', '▁it', '▁away', '▁or']

original line : re-use it under the terms of the Project Gutenberg License included

processed line : ['▁re', '-', 'us', 'e', '▁it', '▁under', '▁the', '▁terms', '▁of', '▁the', '▁Project', '▁Gutenberg', '▁License', '▁includ', 'ed']

original line : with this eBook or online at www.gutenberg.org

processed line : ['▁with', '▁this', '▁eBook', '▁or', '▁on', 'line', '▁at', '▁w', 'ww.gutenberg.org']

## References

[kudo18] Taku Kudo, "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates," ACL 2018.