This is a minimal, clean implementation of unigram tokenization in C++ (written for fun). Note that it does not treat whitespace as a delimiter between text segments.
More details can be found in the original paper:
- Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
- Store frequent substrings with a trie.
- Prune the search space with a maximum substring length during Viterbi decoding.
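As a concrete illustration of these two points, below is a minimal sketch of Viterbi decoding over a trie of vocabulary pieces, where candidate substrings longer than `max_len` are never considered. The `Trie`/`TrieNode` layout, the function name, and the `max_len` parameter are illustrative assumptions, not the repository's actual code.

```cpp
#include <algorithm>
#include <array>
#include <limits>
#include <memory>
#include <string>
#include <vector>

// A byte-level trie storing each vocabulary piece with its log probability.
struct TrieNode {
  std::array<std::unique_ptr<TrieNode>, 256> child{};
  double log_prob = -std::numeric_limits<double>::infinity();  // -inf: prefix only, not a piece
};

struct Trie {
  TrieNode root;
  void insert(const std::string& piece, double log_prob) {
    TrieNode* node = &root;
    for (unsigned char c : piece) {
      if (!node->child[c]) node->child[c] = std::make_unique<TrieNode>();
      node = node->child[c].get();
    }
    node->log_prob = log_prob;
  }
};

// Viterbi decoding: find the most probable segmentation of `text` into
// vocabulary pieces, considering only pieces of length <= max_len.
std::vector<std::string> viterbi_segment(const std::string& text,
                                         const Trie& vocab, size_t max_len) {
  const double kNegInf = -std::numeric_limits<double>::infinity();
  const size_t n = text.size();
  std::vector<double> best(n + 1, kNegInf);  // best score of any segmentation of text[0, i)
  std::vector<size_t> prev(n + 1, 0);        // start index of the last piece in that segmentation
  best[0] = 0.0;
  for (size_t start = 0; start < n; ++start) {
    if (best[start] == kNegInf) continue;  // prefix text[0, start) is unreachable
    const TrieNode* node = &vocab.root;
    const size_t limit = std::min(n, start + max_len);  // prune long substrings
    for (size_t end = start; end < limit; ++end) {
      node = node->child[static_cast<unsigned char>(text[end])].get();
      if (!node) break;  // no vocabulary piece starts with text[start, end]
      const double score = best[start] + node->log_prob;
      if (score > best[end + 1]) {
        best[end + 1] = score;
        prev[end + 1] = start;
      }
    }
  }
  std::vector<std::string> pieces;
  if (best[n] == kNegInf) return pieces;  // text cannot be segmented with this vocabulary
  for (size_t end = n; end > 0; end = prev[end])
    pieces.push_back(text.substr(prev[end], end - prev[end]));
  return {pieces.rbegin(), pieces.rend()};
}
```

Walking the trie from each start position abandons prefixes that match no vocabulary piece early, and the `max_len` cap keeps the inner loop bounded.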
mkdir bin
g++ src/bpe.cpp src/unigram.cpp -o bin/bpe.out
g++ src/decode.cpp src/unigram.cpp -o bin/unigram_decode.out
g++ src/unigram_aggr.cpp src/unigram.cpp -o bin/unigram_aggr.out
The training consists of two steps; both are sketched after this list.
- Collect frequent substrings with BPE.
bin/bpe.out input_text_file output init_vocab_size target_vocab_size
- Initialize the unigram distribution with the frequent substrings, and update it with the EM algorithm and iterative vocab refinements.
mkdir out_dir
./train_parallel.sh input_text_file out_dir target_vocab_size min_vocab_size output.freq
- If your data is not too large, you can use the non-parallel script instead:
./train.sh input_text_file out_dir target_vocab_size min_vocab_size output.freq
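A rough sketch of the first step (an assumption about the approach, not the actual `src/bpe.cpp`): starting from single characters, BPE repeatedly merges the most frequent adjacent pair of symbols, and the merged symbols become the frequent substrings that seed the unigram model.

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Collect frequent substrings: start from single characters and repeatedly
// merge the most frequent adjacent pair of symbols until the vocabulary
// reaches target_vocab_size.
std::set<std::string> bpe_substrings(const std::vector<std::string>& corpus,
                                     size_t target_vocab_size) {
  std::vector<std::vector<std::string>> seqs;  // each line as a sequence of symbols
  std::set<std::string> vocab;
  for (const auto& line : corpus) {
    std::vector<std::string> seq;
    for (char c : line) {
      seq.emplace_back(1, c);
      vocab.insert(seq.back());
    }
    seqs.push_back(std::move(seq));
  }
  while (vocab.size() < target_vocab_size) {
    // Count adjacent symbol pairs across the whole corpus.
    std::map<std::pair<std::string, std::string>, size_t> pair_count;
    for (const auto& seq : seqs)
      for (size_t i = 0; i + 1 < seq.size(); ++i)
        ++pair_count[{seq[i], seq[i + 1]}];
    if (pair_count.empty()) break;  // nothing left to merge
    const auto best = std::max_element(
        pair_count.begin(), pair_count.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });
    const std::string merged = best->first.first + best->first.second;
    vocab.insert(merged);
    // Replace every occurrence of the best pair with the merged symbol.
    for (auto& seq : seqs) {
      std::vector<std::string> out;
      for (size_t i = 0; i < seq.size(); ++i) {
        if (i + 1 < seq.size() && seq[i] == best->first.first &&
            seq[i + 1] == best->first.second) {
          out.push_back(merged);
          ++i;  // skip the second half of the merged pair
        } else {
          out.push_back(seq[i]);
        }
      }
      seq = std::move(out);
    }
  }
  return vocab;
}
```

The second step can be sketched as a deliberately simplified hard-EM round: re-estimate piece probabilities from 1-best segmentations, then prune the vocabulary by probability. The actual training computes expected counts over all segmentations and uses a likelihood-based pruning criterion, so the helper `segment()` and the other details below are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: best segmentation of `line` under the current model,
// e.g. the Viterbi decoder sketched earlier (with the map loaded into a trie).
std::vector<std::string> segment(const std::string& line,
                                 const std::unordered_map<std::string, double>& log_prob,
                                 size_t max_len);

// One training round: re-estimate piece probabilities from segmentation
// counts, then shrink the vocabulary toward target_vocab_size.
void em_round(const std::vector<std::string>& corpus,
              std::unordered_map<std::string, double>& log_prob,
              size_t max_len, size_t target_vocab_size) {
  // E-step (hard variant): count how often each piece is used.
  std::unordered_map<std::string, double> count;
  double total = 0.0;
  for (const auto& line : corpus)
    for (const auto& piece : segment(line, log_prob, max_len)) {
      count[piece] += 1.0;
      total += 1.0;
    }
  if (total == 0.0) return;  // nothing was segmented
  // M-step: re-estimate piece log probabilities (with a crude floor for
  // unused pieces so their log probability stays finite).
  for (auto& [piece, lp] : log_prob) {
    auto it = count.find(piece);
    lp = std::log(((it != count.end()) ? it->second : 0.5) / total);
  }
  // Vocabulary refinement, simplified: drop the lowest-probability pieces.
  // (The real criterion is the loss in corpus likelihood, and single
  // characters are kept so that any text remains segmentable.)
  if (log_prob.size() > target_vocab_size) {
    std::vector<std::pair<std::string, double>> items(log_prob.begin(), log_prob.end());
    std::nth_element(items.begin(), items.begin() + target_vocab_size, items.end(),
                     [](const auto& a, const auto& b) { return a.second > b.second; });
    items.resize(target_vocab_size);
    log_prob = std::unordered_map<std::string, double>(items.begin(), items.end());
  }
}
```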
For example:
mkdir exp
bin/bpe.out test.txt test 39 256
./train.sh test.txt exp 128 39 test.freq