Word Segmentation Algorithms (Training Tokenizer with SentencePiece)
10 Aug 2022

< Table of Contents >
- Tokenizer Overview
- SPM Example
- Basic end-to-end example
- User defined and control symbols (for BERT or something)
- Manipulating BOS/EOS/UNK/PAD symbols
- Changing the vocab id and surface representation of UNK/BOS/EOS/PAD symbols
- Sampling and nbest segmentation for subword regularization (can be used for Lexicon Generation)
- BPE (Byte pair encoding) model (BPE vs Unigram)
- Character and word model (not Subword)
- Text normalization
- Randomizing training data
- Put it All together
- References
## Tokenizer Overview
### Character (Letter) Tokenizer
THE CITY OF MONTREAL => ['T','H','E','▁','C','I','T','Y','▁','O','F','▁','M','O','N','T','R','E','A','L']
### Word Tokenizer
THE CITY OF MONTREAL => ['THE','CITY','OF','MONTREAL']
### Byte-Pair Encoding (BPE) Tokenizer
THE CITY OF MONTREAL => ['▁THE', '▁CITY', '▁OF', '▁MO', 'NT', 'RE', 'AL']
There are three widely used subword segmentation algorithms:
- Byte-Pair Encoding (BPE)
- WordPiece Model (WPM)
- Unigram language model
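Before turning to SentencePiece itself, the core of BPE training can be sketched in a few lines of plain Python. This is a toy illustration only: the word list and frequencies are made up, and real implementations merge pairs only at exact symbol boundaries (this sketch uses a naive string replace):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace the space-separated pair with its concatenation in every word."""
    src, dst = ' '.join(pair), ''.join(pair)
    return {word.replace(src, dst): freq for word, freq in vocab.items()}

# Words pre-split into characters, with illustrative frequencies.
vocab = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
for _ in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)  # → ('e', 's'), then ('es', 't'), then ('l', 'o')
```

Each printed pair becomes a new symbol in the vocabulary; running more iterations grows the subword inventory toward the requested vocab size.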
## SPM Example
mkdir -p /workspace/spm_tutorial
cd /workspace/spm_tutorial
wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
(py38) root@557bec2a5c9d:/workspace/spm_tutorial# tree .
.
`-- botchan.txt
## Basic end-to-end example
import sentencepiece as spm
# train sentencepiece model from `botchan.txt` and makes `m.model` and `m.vocab`
# `m.vocab` is just a reference. not used in the segmentation.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
# makes segmenter instance and loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# encode: text => id
print(sp.encode_as_pieces('This is a test'))
print(sp.encode_as_ids('This is a test'))
# decode: id => text
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))
print(sp.decode_ids([209, 31, 9, 375, 586]))
['▁This', '▁is', '▁a', '▁t', 'est']
[209, 31, 9, 375, 586]
This is a test
This is a test
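The decode step is simple enough to sketch in plain Python: concatenate the pieces and turn the '▁' whitespace marker back into spaces. This is a simplified illustration of what decode_pieces does, ignoring unknown-token handling and the embedded normalizer:

```python
def join_pieces(pieces):
    """Concatenate pieces and restore spaces from the '▁' marker."""
    return ''.join(pieces).replace('▁', ' ').strip()

print(join_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))  # → This is a test
```

Because '▁' records exactly where whitespace was, the encode/decode round trip is lossless for normalized text.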
(py38) root@557bec2a5c9d:/workspace/spm_tutorial# tree .
.
|-- botchan.txt
|-- m.model
`-- m.vocab
(py38) root@557bec2a5c9d:/workspace/spm_tutorial# head -20 m.vocab
<unk> 0
<s> 0
</s> 0
, -3.2299
. -3.36342
▁the -3.40218
▁I -3.74108
s -3.89451
▁to -3.91479
▁a -4.02289
▁and -4.09969
▁of -4.10888
▁ -4.38181
ing -4.44207
ed -4.44878
▁in -4.53349
▁was -4.61109
▁" -4.64108
▁it -4.78218
t -4.80042
If no special options are given, the three special tokens <unk>, <s>, and </s> are mapped to ids 0, 1, and 2 of the dictionary; in particular, <s> and </s> are defined as control symbols.
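Since m.vocab is a plain TSV of piece and emission score (a log probability), it is easy to load for inspection even though segmentation itself never reads it. The sample text below copies the head of the file shown above:

```python
# First lines of m.vocab: piece <tab> score, one piece per line, id = line number.
vocab_text = "<unk>\t0\n<s>\t0\n</s>\t0\n,\t-3.2299\n▁the\t-3.40218"

piece_to_id, piece_score = {}, {}
for i, line in enumerate(vocab_text.splitlines()):
    piece, score = line.split('\t')
    piece_to_id[piece] = i
    piece_score[piece] = float(score)

print(piece_to_id['▁the'])  # → 4
print(piece_score[','])     # → -3.2299
```

This piece → (id, score) table is handy when building a lexicon or auditing which subwords the trainer kept.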
# returns vocab size
print(sp.get_piece_size())
# id <=> piece conversion
print(sp.id_to_piece(209))
print(sp.piece_to_id('▁This'))
# returns 0 for unknown tokens (we can change the id for UNK)
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))
# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# <s> and </s> are defined as 'control' symbol.
for id in range(3):
    print(sp.id_to_piece(id), sp.is_control(id))
2000
▁This
209
0
<unk> False
<s> True
</s> True
## User defined and control symbols (for BERT or something)
# Example of user defined symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls> --vocab_size=2000')
sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')
# ids are reserved in both modes.
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# user-defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_user.piece_to_id('<sep>')) # 3
print(sp_user.piece_to_id('<cls>')) # 4
print('3=', sp_user.decode_ids([3])) # decoded to <sep>
print('4=', sp_user.decode_ids([4])) # decoded to <cls>
['▁this', '▁is', '▁a', '▁t', 'est', '<sep>', '▁he', 'll', 'o', '▁world', '<cls>']
3
4
3= <sep>
4= <cls>
# Example of control symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_ctrl --control_symbols=<sep>,<cls> --vocab_size=2000')
sp_ctrl = spm.SentencePieceProcessor()
sp_ctrl.load('m_ctrl.model')
# control symbols just reserve ids.
print(sp_ctrl.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_ctrl.piece_to_id('<sep>')) # 3
print(sp_ctrl.piece_to_id('<cls>')) # 4
print('3=', sp_ctrl.decode_ids([3])) # decoded to empty
print('4=', sp_ctrl.decode_ids([4])) # decoded to empty
['▁this', '▁is', '▁a', '▁t', 'est', '<', 'se', 'p', '>', '▁he', 'll', 'o', '▁world', '<', 'c', 'l', 's', '>']
3
4
3=
4=
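The difference between the two modes can be mimicked in plain Python: control ids are dropped on decode, while user-defined symbols survive as ordinary pieces. The id table below is hypothetical, just mirroring the reserved ids above:

```python
# Hypothetical id table mirroring the reserved ids above (not a real model).
id_to_piece = {1: '<s>', 2: '</s>', 3: '<sep>', 4: '<cls>', 5: '▁hello'}

def decode_ids(ids, control_ids):
    """Control ids are dropped on decode; other pieces are detokenized."""
    pieces = [id_to_piece[i] for i in ids if i not in control_ids]
    return ''.join(pieces).replace('▁', ' ').strip()

# Treating <sep>/<cls> as control symbols: they vanish from the output.
print(decode_ids([3, 5, 4], {1, 2, 3, 4}))  # → hello
# Treating them as user-defined symbols: they survive as normal pieces.
print(decode_ids([3, 5, 4], {1, 2}))        # → <sep> hello<cls>
```

In short: user-defined symbols are part of the text, control symbols are markers the model sees but the decoded text never shows.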
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bos_as_user --user_defined_symbols=<s>,</s> --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('<s> hello</s>')) # <s>,</s> are segmented. (default behavior)
sp = spm.SentencePieceProcessor()
sp.load('m_bos_as_user.model')
print(sp.encode_as_pieces('<s> hello</s>')) # <s>,</s> are handled as one token.
['▁', '<', 's', '>', '▁he', 'll', 'o', '</', 's', '>']
['▁', '<s>', '▁he', 'll', 'o', '</s>']
## Manipulating BOS/EOS/UNK/PAD symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print('bos=', sp.bos_id())
print('eos=', sp.eos_id())
print('unk=', sp.unk_id())
print('pad=', sp.pad_id()) # disabled by default
print(sp.encode_as_ids('Hello world'))
# Prepend or append bos/eos ids.
print([sp.bos_id()] + sp.encode_as_ids('Hello world') + [sp.eos_id()])
bos= 1
eos= 2
unk= 0
pad= -1
[12, 1828, 1038]
[1, 12, 1828, 1038, 2]
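Once the bos/eos ids are available, batching for a model typically wraps each sequence and right-pads to a common length. A sketch in plain Python; the pad id 3 here is hypothetical, since as shown above padding is disabled by default and must be enabled with --pad_id at training time:

```python
bos_id, eos_id, pad_id = 1, 2, 3  # hypothetical; pad requires e.g. --pad_id=3 when training

def pad_batch(id_seqs, bos_id, eos_id, pad_id):
    """Wrap each sequence with BOS/EOS, then right-pad to the longest length."""
    wrapped = [[bos_id] + seq + [eos_id] for seq in id_seqs]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in wrapped]

batch = pad_batch([[12, 1828, 1038], [12, 1828]], bos_id, eos_id, pad_id)
print(batch)  # → [[1, 12, 1828, 1038, 2], [1, 12, 1828, 2, 3]]
```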
## Changing the vocab id and surface representation of UNK/BOS/EOS/PAD symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS]')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
for id in range(4):
    print(sp.id_to_piece(id), sp.is_control(id))
[PAD] True
[UNK] False
[BOS] True
[EOS] True
# Disable BOS/EOS
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --bos_id=-1 --eos_id=-1')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# <s>, </s> are UNK.
print(sp.unk_id())
print(sp.piece_to_id('<s>'))
print(sp.piece_to_id('</s>'))
0
0
0
## Sampling and nbest segmentation for subword regularization (can be used for Lexicon Generation)
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# Can obtain different segmentations per request.
# There are two hyperparameters for sampling (nbest_size and inverse temperature). See the paper [kudo18] for details.
for n in range(10):
    print(sp.sample_encode_as_pieces('hello world', -1, 0.1))
for n in range(10):
    print(sp.sample_encode_as_ids('hello world', -1, 0.1))
['▁', 'h', 'e', 'll', 'o', '▁w', 'o', 'r', 'l', 'd']
['▁he', 'l', 'l', 'o', '▁world']
['▁he', 'l', 'l', 'o', '▁w', 'or', 'l', 'd']
['▁', 'he', 'l', 'l', 'o', '▁world']
['▁', 'he', 'll', 'o', '▁w', 'o', 'r', 'l', 'd']
['▁', 'he', 'll', 'o', '▁world']
['▁he', 'll', 'o', '▁world']
['▁', 'he', 'll', 'o', '▁world']
['▁he', 'll', 'o', '▁w', 'o', 'r', 'l', 'd']
['▁', 'h', 'e', 'l', 'l', 'o', '▁w', 'o', 'r', 'l', 'd']
[12, 489, 57, 57, 38, 1246, 57, 20]
[28, 98, 38, 1038]
[12, 489, 98, 38, 12, 151, 105, 57, 20]
[12, 489, 98, 38, 1038]
[28, 98, 38, 254, 105, 57, 20]
[12, 489, 98, 38, 12, 151, 38, 46, 57, 20]
[28, 57, 57, 38, 1038]
[28, 98, 38, 1038]
[12, 96, 351, 57, 38, 1038]
[28, 98, 38, 1038]
# get 10 best
print(sp.nbest_encode_as_pieces('hello world', 10))
print(sp.nbest_encode_as_ids('hello world', 10))
[['▁he', 'll', 'o', '▁world'], ['▁he', 'l', 'l', 'o', '▁world'], ['▁', 'he', 'll', 'o', '▁world'], ['▁', 'h', 'e', 'll', 'o', '▁world'], ['▁he', 'll', 'o', '▁wor', 'l', 'd'], ['▁', 'he', 'l', 'l', 'o', '▁world'], ['▁', 'h', 'el', 'l', 'o', '▁world'], ['▁he', 'll', 'o', '▁w', 'or', 'l', 'd'], ['▁', 'h', 'e', 'l', 'l', 'o', '▁world'], ['▁he', 'l', 'l', 'o', '▁wor', 'l', 'd']]
[[28, 98, 38, 1038], [28, 57, 57, 38, 1038], [12, 489, 98, 38, 1038], [12, 96, 25, 98, 38, 1038], [28, 98, 38, 1246, 57, 20], [12, 489, 57, 57, 38, 1038], [12, 96, 351, 57, 38, 1038], [28, 98, 38, 254, 105, 57, 20], [12, 96, 25, 57, 57, 38, 1038], [28, 57, 57, 38, 1246, 57, 20]]
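The unigram model's single best segmentation is the Viterbi path that maximizes the sum of piece log probabilities; nbest and sampling generalize the same lattice. A toy dynamic-programming sketch with made-up scores (real scores live in m.vocab), not SentencePiece's implementation:

```python
import math

# Made-up unigram log probabilities; real ones come from m.vocab.
logp = {'▁he': -3.0, 'he': -4.0, 'll': -3.5, 'l': -3.0, 'o': -2.5,
        '▁': -2.0, '▁hello': -10.0, '▁world': -4.0}

def viterbi_segment(text, logp):
    """Return the piece sequence maximizing the sum of log probabilities."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[j]: score of best segmentation of text[:j]
    back = [0] * (n + 1)             # back[j]: start index of the last piece
    for j in range(1, n + 1):
        for i in range(j):
            piece = text[i:j]
            if piece in logp and best[i] + logp[piece] > best[j]:
                best[j], back[j] = best[i] + logp[piece], i
    pieces, j = [], n
    while j > 0:                     # backtrace from the end of the string
        pieces.append(text[back[j]:j])
        j = back[j]
    return pieces[::-1]

print(viterbi_segment('▁hello▁world', logp))  # → ['▁he', 'll', 'o', '▁world']
```

With these toy scores the best path matches the top nbest candidate shown above; sampling perturbs which path is taken per request.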
## BPE (Byte pair encoding) model (BPE vs Unigram)
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bpe --vocab_size=2000 --model_type=bpe')
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('m_bpe.model')
print('*** BPE ***')
print(sp_bpe.encode_as_pieces('thisisatesthelloworld'))
print(sp_bpe.nbest_encode_as_pieces('hello world', 5)) # returns an empty list.
*** BPE ***
['▁this', 'is', 'at', 'est', 'he', 'llow', 'or', 'ld']
[]
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_unigram --vocab_size=2000 --model_type=unigram')
sp_unigram = spm.SentencePieceProcessor()
sp_unigram.load('m_unigram.model')
print('*** Unigram ***')
print(sp_unigram.encode_as_pieces('thisisatesthelloworld'))
print(sp_unigram.nbest_encode_as_pieces('thisisatesthelloworld', 5))
*** Unigram ***
['▁this', 'is', 'ate', 's', 'the', 'llow', 'or', 'l', 'd']
[['▁this', 'is', 'ate', 's', 'the', 'llow', 'or', 'l', 'd'], ['▁this', 'i', 's', 'ate', 's', 'the', 'llow', 'or', 'l', 'd'], ['▁this', 'is', 'ate', 'st', 'he', 'llow', 'or', 'l', 'd'], ['▁this', 'is', 'at', 'es', 'the', 'llow', 'or', 'l', 'd'], ['▁this', 'is', 'at', 'est', 'he', 'llow', 'or', 'l', 'd']]
## Character and word model (not Subword)
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_char --model_type=char --vocab_size=400')
sp_char = spm.SentencePieceProcessor()
sp_char.load('m_char.model')
print(sp_char.encode_as_pieces('this is a test.'))
print(sp_char.encode_as_ids('this is a test.'))
['▁', 't', 'h', 'i', 's', '▁', 'i', 's', '▁', 'a', '▁', 't', 'e', 's', 't', '.']
[3, 5, 10, 9, 11, 3, 9, 11, 3, 7, 3, 5, 4, 11, 5, 23]
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_word --model_type=word --vocab_size=2000')
sp_word = spm.SentencePieceProcessor()
sp_word.load('m_word.model')
print(sp_word.encode_as_pieces('this is a test.')) # '.' will not be one token.
print(sp_word.encode_as_ids('this is a test.'))
['▁this', '▁is', '▁a', '▁test.']
[31, 17, 8, 0]
## Text normalization
# NFKC normalization and lower casing.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_name=nfkc_cf')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('HELLO WORLD.')) # lower casing and normalization
['▁', 'hello', '▁world', '.']
def tocode(s):
    out = []
    for c in s:
        out.append(hex(ord(c)).replace('0x', 'U+'))
    return ' '.join(out)
# TSV format: source Unicode code points <tab> target code points
# normalize "don't => do not, I'm => I am"
with open('normalization_rule.tsv', 'w') as f:
    f.write(tocode("I'm") + '\t' + tocode("I am") + '\n')
    f.write(tocode("don't") + '\t' + tocode("do not") + '\n')
print(open('normalization_rule.tsv', 'r').read())
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_tsv=normalization_rule.tsv')
sp = spm.SentencePieceProcessor()
# m.model embeds the normalization rule compiled into an FST.
sp.load('m.model')
print(sp.encode_as_pieces("I'm busy")) # normalized to 'I am busy'
print(sp.encode_as_pieces("I don't know it.")) # normalized to 'I do not know it.'
U+49 U+27 U+6d U+49 U+20 U+61 U+6d
U+64 U+6f U+6e U+27 U+74 U+64 U+6f U+20 U+6e U+6f U+74
['▁I', '▁am', '▁bu', 's', 'y']
['▁I', '▁do', '▁not', '▁know', '▁it', '.']
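SentencePiece compiles the TSV into a character-level FST embedded in the model; purely for illustration, the effect of the two rules can be mimicked with a longest-match scan (a toy sketch, not the real normalizer):

```python
def apply_rules(text, rules):
    """Longest-match replacement, the spirit of the user-defined TSV rules."""
    out, i = [], 0
    ordered = sorted(rules.items(), key=lambda kv: -len(kv[0]))  # longest source first
    while i < len(text):
        for src, tgt in ordered:
            if text.startswith(src, i):
                out.append(tgt)
                i += len(src)
                break
        else:
            out.append(text[i])  # no rule matched: copy the character
            i += 1
    return ''.join(out)

rules = {"I'm": "I am", "don't": "do not"}
print(apply_rules("I'm busy", rules))          # → I am busy
print(apply_rules("I don't know it.", rules))  # → I do not know it.
```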
## Randomizing training data
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --input_sentence_size=1000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
sp.encode_as_pieces('this is a test.')
['▁this', '▁is', '▁a', '▁t', 'est', '.']
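Internally this corresponds to shuffling the corpus and keeping a subsample of sentences for training; the idea can be sketched with the random module (a toy stand-in for --input_sentence_size with shuffle_input_sentence, not SentencePiece's sampler):

```python
import random

def sample_sentences(sentences, n, seed=0):
    """Keep a random subset of the corpus, sampled without replacement."""
    rng = random.Random(seed)          # fixed seed so the subset is reproducible
    return rng.sample(sentences, min(n, len(sentences)))

corpus = ['sentence %d' % i for i in range(10000)]
subset = sample_sentences(corpus, 1000)
print(len(subset))  # → 1000
```

Subsampling trades a little segmentation quality for much faster training on large corpora.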
## Put it All together
from __future__ import absolute_import, division, print_function, unicode_literals
import os
import sentencepiece as spm
input = "/workspace/spm_tutorial/botchan.txt"
model_type = "unigram"
vocab_size = 2000
path, fname = os.path.split(input)
prefix = os.path.join(path, fname + "." + str(vocab_size) + "." + model_type + ".wp")
# train
print("Computing word pieces...\n", flush=True)
train_cmd = (
    "--input={input} --model_prefix={prefix} --vocab_size={vocab_size}"
    " --character_coverage=1.0 --model_type={model_type}"
    " --split_by_unicode_script=false".format(
        input=input,
        prefix=prefix,
        vocab_size=vocab_size,
        model_type=model_type,  # ["unigram", "bpe", "char"]
    )
)
spm.SentencePieceTrainer.Train(train_cmd)
(py38) root@557bec2a5c9d:/workspace/spm_tutorial# python spm.py
Computing word pieces...
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=/workspace/spm_tutorial/botchan.txt --model_prefix=/workspace/spm_tutorial/botchan.txt.2000.unigram.wp --vocab_size=2000 --character_coverage=1.0 --model_type=unigram --split_by_unicode_script=false
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with :
trainer_spec {
input: /workspace/spm_tutorial/botchan.txt
input_format:
model_prefix: /workspace/spm_tutorial/botchan.txt.2000.unigram.wp
model_type: UNIGRAM
vocab_size: 2000
self_test_sample_size: 0
character_coverage: 1
input_sentence_size: 0
shuffle_input_sentence: 1
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
max_sentence_length: 4192
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: 0
split_by_number: 1
split_by_whitespace: 1
split_digits: 0
treat_whitespace_as_suffix: 0
allow_whitespace_only_pieces: 0
required_chars:
byte_fallback: 0
vocabulary_output_piece_score: 1
train_extremely_large_corpus: 0
hard_vocab_limit: 1
use_all_vocab: 0
unk_id: 0
bos_id: 1
eos_id: 2
pad_id: -1
unk_piece: <unk>
bos_piece: <s>
eos_piece: </s>
pad_piece: <pad>
unk_surface: ⁇
}
normalizer_spec {
name: nmt_nfkc
add_dummy_prefix: 1
remove_extra_whitespaces: 1
escape_whitespaces: 1
normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(178) LOG(INFO) Loading corpus: /workspace/spm_tutorial/botchan.txt
trainer_interface.cc(385) LOG(INFO) Loaded all 4288 sentences
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(405) LOG(INFO) Normalizing sentences...
trainer_interface.cc(466) LOG(INFO) all chars count=274252
trainer_interface.cc(487) LOG(INFO) Alphabet size=83
trainer_interface.cc(488) LOG(INFO) Final character coverage=1
trainer_interface.cc(520) LOG(INFO) Done! preprocessed 4288 sentences.
unigram_model_trainer.cc(139) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(143) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(194) LOG(INFO) Initialized 20470 seed sentencepieces
trainer_interface.cc(526) LOG(INFO) Tokenizing input sentences with whitespace: 4288
trainer_interface.cc(537) LOG(INFO) Done! 9183
unigram_model_trainer.cc(489) LOG(INFO) Using 9183 sentences for EM training
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=7200 obj=9.97176 num_tokens=16901 num_tokens/piece=2.34736
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=6197 obj=8.34115 num_tokens=17017 num_tokens/piece=2.74601
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=4644 obj=8.37819 num_tokens=18438 num_tokens/piece=3.97028
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=4640 obj=8.31629 num_tokens=18495 num_tokens/piece=3.98599
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=3480 obj=8.59811 num_tokens=20668 num_tokens/piece=5.93908
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=3480 obj=8.51923 num_tokens=20669 num_tokens/piece=5.93937
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=2610 obj=8.91866 num_tokens=23381 num_tokens/piece=8.95824
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=2610 obj=8.83068 num_tokens=23383 num_tokens/piece=8.959
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=2200 obj=9.07924 num_tokens=25089 num_tokens/piece=11.4041
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=2200 obj=9.03234 num_tokens=25090 num_tokens/piece=11.4045
trainer_interface.cc(615) LOG(INFO) Saving model: /workspace/spm_tutorial/botchan.txt.2000.unigram.wp.model
trainer_interface.cc(626) LOG(INFO) Saving vocabs: /workspace/spm_tutorial/botchan.txt.2000.unigram.wp.vocab
import sentencepiece as spm
prefix = "botchan.txt.2000.unigram.wp"
input = "/workspace/spm_tutorial/botchan.txt"
sp = spm.SentencePieceProcessor()
sp.Load(prefix + ".model")
with open(input, 'rt') as f:
    lines = f.readlines()

for i in range(len(lines)):
    if i == 5:
        break
    print('original line : {}'.format(lines[i]))
    print('processed line : {}'.format(sp.EncodeAsPieces(lines[i])))
    print()
original line : Project Gutenberg's Botchan (Master Darling), by Kin-nosuke Natsume
processed line : ['▁Project', '▁Gutenberg', "'s", '▁', 'Botchan', '▁(M', 'aster', '▁Darling', ')', ',', '▁by', '▁K', 'in', '-', 'nosu', 'ke', '▁Natsume']
original line : This eBook is for the use of anyone anywhere at no cost and with
processed line : ['▁This', '▁eBook', '▁is', '▁for', '▁the', '▁use', '▁of', '▁anyone', '▁any', 'where', '▁at', '▁no', '▁cost', '▁and', '▁with']
original line : almost no restrictions whatsoever. You may copy it, give it away or
processed line : ['▁almost', '▁no', '▁re', 'strict', 'ion', 's', '▁what', 's', 'o', 'ever', '.', '▁You', '▁may', '▁copy', '▁it', ',', '▁give', '▁it', '▁away', '▁or']
original line : re-use it under the terms of the Project Gutenberg License included
processed line : ['▁re', '-', 'us', 'e', '▁it', '▁under', '▁the', '▁terms', '▁of', '▁the', '▁Project', '▁Gutenberg', '▁License', '▁includ', 'ed']
original line : with this eBook or online at www.gutenberg.org
processed line : ['▁with', '▁this', '▁eBook', '▁or', '▁on', 'line', '▁at', '▁w', 'ww.gutenberg.org']
## References
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
- Neural Machine Translation of Rare Words with Subword Units
- Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
- github.com/google/sentencepiece
- SentencePiece Algorithm (Korean blog post)
- ASRfromScratch.ipynb with Speechbrain
- sentencepiece_python_module_example
- Using Sentencepiece, by Inhyeok Yoo (Korean)
- (issue) My training crashes with large corpus.
- (issue) Vocab size to train LM and ASR
- espnet/egs/librispeech/asr1/run.sh
- 13. Subword Tokenizer, from Introduction to Natural Language Processing with Deep Learning (Korean)
- wav2letter/recipes/utilities/prepare_librispeech_wp_and_official_lexicon.py