[Proposal] Claude Tokenizer #15

@BarberAlec

The tiktoken_ruby gem currently supports four encoders:

  • r50k_base
  • p50k_base
  • p50k_edit
  • cl100k_base

Claude appears to use tiktoken parameters outlined here and implemented here.

The BPE rankings are stored in an alternate format, but by reverse engineering the JavaScript tiktoken implementation here, I was able to create a tiktoken encoder for Claude in Python with the following code. Note that claude.json was sourced from the referenced JavaScript tiktoken library, which is part of the official Anthropic account.

import tiktoken
import json
import base64


def decode_claude_bpe(claude_configs):
    # bpe_ranks is one space-separated string: a leading field (ignored here),
    # the rank offset, then the base64-encoded tokens in rank order.
    _, offset, *tokens = claude_configs['bpe_ranks'].split(" ")
    offset = int(offset)

    # Ranks start at the offset (5) rather than 0, mirroring the original JS code.
    rank_map = {base64.b64decode(token): offset + idx for idx, token in enumerate(tokens)}

    return rank_map

if __name__ == "__main__":
    with open("claude.json") as f:
        claude_configs = json.load(f)
        bpe_ranks = decode_claude_bpe(claude_configs)

    enc = tiktoken.Encoding(
        name="claude_tokenizer",
        pat_str=claude_configs['pat_str'],
        mergeable_ranks=bpe_ranks,
        special_tokens=claude_configs['special_tokens'],
    )
    print(enc.encode("hello world"))
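To make the packed format easier to follow, here is a minimal sketch that runs the same decoding step on a toy payload instead of claude.json. The helper name decode_bpe_ranks, the three toy tokens, and the "hdr" leading field are all made up for illustration; only the "leading field, offset, base64 tokens" layout is taken from the code above.

```python
import base64


def decode_bpe_ranks(packed: str) -> dict[bytes, int]:
    """Decode a '<leading field> <offset> <b64 token> ...' string into token -> rank."""
    _, offset, *tokens = packed.split(" ")
    offset = int(offset)
    return {base64.b64decode(tok): offset + i for i, tok in enumerate(tokens)}


# Hypothetical payload: "hdr" is a placeholder for the ignored leading field,
# and 5 is the offset mentioned in the snippet above.
toy_tokens = [b"a", b"b", b"ab"]
packed = " ".join(["hdr", "5"] + [base64.b64encode(t).decode("ascii") for t in toy_tokens])

ranks = decode_bpe_ranks(packed)
print(ranks)  # {b'a': 5, b'b': 6, b'ab': 7}
```

The first token gets rank 5 rather than 0, which presumably leaves the lower IDs free for other uses such as special tokens; that reading is a guess and is not confirmed by the source.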

Alternatively, an option to create a tiktoken encoder from custom BPE ranks, pattern string, and special tokens, as the Python library allows, would be a more general solution.
