= "Hi, I am new to deep learning and machine learning."
text print(text)
Hi, I am new to deep learning and machine learning.
Vidyasagar Bhargava
March 1, 2025
Training large language models such as GPT or LLaMA requires input data, typically in the form of raw text. Since neural networks cannot process text directly, the text is first tokenized into discrete units—usually subword tokens—using a tokenizer. These tokens are then mapped to corresponding embeddings, i.e., high-dimensional vector representations, which serve as the actual inputs to the model.
Next, we will convert tokens into vector representations, also known as embeddings, for LLM training. The pipeline looks like this:
Step 1: Input Text -> Tokenized Text
Step 2: Tokenized Text -> Token IDs
Step 3: Token IDs -> Embedding Vectors
The tokenization process covers steps 1 and 2 above, i.e. converting the input text into individual words (tokenized text) and then into token IDs.
Let’s first develop a simple tokenizer that converts words into tokens; later on we will use a more sophisticated tokenizer, Byte-Pair Encoding (BPE), from the tiktoken library.
Let’s say we have some text:
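```python
text = "Hi, I am new to deep learning and machine learning."
print(text)
```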
Hi, I am new to deep learning and machine learning.
Now we create a simple tokenizer using a regular expression. You don’t need to memorize this expression, since we will later switch to a more sophisticated tokenizer. Let’s see how it works on our text:
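```python
import re

# split on punctuation, double dashes, and whitespace, keeping the delimiters
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
# drop empty strings and whitespace-only items
preprocessed = [item.strip() for item in result if item.strip()]
print(preprocessed)
```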
['Hi', ',', 'I', 'am', 'new', 'to', 'deep', 'learning', 'and', 'machine', 'learning', '.']
It works well. We can also check the total number of tokens using the len() function:
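```python
print(len(preprocessed))   # 12 tokens for our example text
```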
In this step we convert the tokenized text into integers. For that, we first need to build a vocabulary: we remove duplicate tokens and sort the remaining ones alphabetically. These unique tokens are then collected in a vocabulary that maps each token to a unique integer value.
After determining the vocab size, we create the vocabulary.
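A minimal sketch of this step (the variable name all_words is the one we extend later when adding special tokens):

```python
all_words = sorted(set(preprocessed))   # unique tokens, sorted alphabetically
vocab_size = len(all_words)             # 11 unique tokens in our example
vocab = {token: integer for integer, token in enumerate(all_words)}
print(vocab)
```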
{',': 0, '.': 1, 'Hi': 2, 'I': 3, 'am': 4, 'and': 5, 'deep': 6, 'learning': 7, 'machine': 8, 'new': 9, 'to': 10}
Now that our vocab is prepared, the next question is how to use it to convert text into token IDs and vice versa. We also need the inverse mapping for the LLM’s output: converting token IDs back into tokenized text and then joining the tokens into natural text.
To get both conversion functionalities, let’s create a class SimpleTokenizer with two methods: encode, which converts text to token IDs, and decode, which converts token IDs back to text.
```python
class SimpleTokenizer:
    def __init__(self, vocab) -> None:
        self.str_to_int = vocab                              # token -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}   # ID -> token

    def encode(self, text):
        result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in result if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # remove the space that join() adds before punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
```
Let’s use our new class on a new example:
```python
tokenizer = SimpleTokenizer(vocab)

text = """I am learning deep learning"""
ids = tokenizer.encode(text)
print(ids)
```
[3, 4, 7, 6, 7]
Let’s decode the token IDs back into text as well:
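```python
print(tokenizer.decode(ids))   # -> I am learning deep learning
```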
Now, what if some text contains a word that is not part of our vocab? Then our SimpleTokenizer class gets into trouble. For example:
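```python
text = "Hello, I am learning deep learning"
print(tokenizer.encode(text))   # "Hello" is not in the vocab, so this raises a KeyError
```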
```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[10], line 2
      1 text = "Hello, I am learning deep learning"
----> 2 print(tokenizer.encode(text))

Cell In[7], line 9, in SimpleTokenizer.encode(self, text)
      7 result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
      8 preprocessed = [item.strip() for item in result if item.strip()]
----> 9 ids = [self.str_to_int[s] for s in preprocessed]
     10 return ids

Cell In[7], line 9, in <listcomp>(.0)
      7 result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
      8 preprocessed = [item.strip() for item in result if item.strip()]
----> 9 ids = [self.str_to_int[s] for s in preprocessed]
     10 return ids

KeyError: 'Hello'
```
The word “Hello” is not contained in the vocabulary, so we need a mechanism to handle unknown words.
Now we will implement an improved version of our previous SimpleTokenizer class. It will map unknown words, i.e. words that were not part of the training data, to an <|unk|> token, and it will use an <|endoftext|> token to separate two unrelated text sources.
Let’s modify our vocab:

```python
# let's modify the vocab
all_words.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: integer for integer, token in enumerate(all_words)}
print(len(vocab.items()))
```
13
Let’s create the new class:
```python
class SimpleTokenizerV2:
    def __init__(self, vocab) -> None:
        self.str_to_int = vocab                              # token -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}   # ID -> token

    def encode(self, text):
        result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in result if item.strip()]
        # replace tokens that are not in the vocabulary with <|unk|>
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # remove the space that join() adds before punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
```
text1 = "Hello, I am learning deep learning"
text2 = "I am learning machine learning"
text = " <|endoftext|> ".join((text1, text2))
print(text)
Hello, I am learning deep learning <|endoftext|> I am learning machine learning
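Encoding this combined text with the new tokenizer maps the unseen word “Hello” to <|unk|> (ID 12) and the separator to <|endoftext|> (ID 11):

```python
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))
```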
[12, 0, 3, 4, 7, 6, 7, 11, 3, 4, 7, 8, 7]
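Decoding these IDs confirms that “Hello” was replaced by the <|unk|> token:

```python
print(tokenizer.decode(tokenizer.encode(text)))
# -> <|unk|>, I am learning deep learning <|endoftext|> I am learning machine learning
```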
As mentioned earlier, more sophisticated tokenizers use Byte-Pair Encoding (BPE). The GPT-2 BPE tokenizer available in the tiktoken library has a vocabulary of 50,257 tokens, with <|endoftext|> assigned the largest token ID, 50256. Let’s instantiate this tokenizer and apply it to our text.
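A minimal sketch, reusing the combined example text from above:

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.n_vocab)   # 50257

text = (
    "Hello, I am learning deep learning <|endoftext|> "
    "I am learning machine learning"
)
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)
print(tokenizer.decode(ids))
```

Note that BPE handles “Hello” without needing an <|unk|> token, because unknown words are broken down into subword units.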
The next goal is to supply these token IDs to the LLM efficiently, in chunks.
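One common way to do this (an assumption here; the post only states the goal) is a sliding window over the token IDs that pairs each fixed-length input chunk with a target chunk shifted one position to the right:

```python
# Hypothetical sliding-window chunking; context_length is an illustrative value.
context_length = 4

for i in range(0, len(ids) - context_length, context_length):
    input_chunk = ids[i : i + context_length]
    target_chunk = ids[i + 1 : i + context_length + 1]
    print(input_chunk, "->", target_chunk)
```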
Neural networks are trained using mathematical operations, so if we have text data we need to somehow convert that text into numerical form. One such way is to represent the text as continuous-valued vectors.
The concept of converting data into a vector format is called an embedding. There are different ways to convert text into embeddings.
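As a minimal sketch, assuming PyTorch (the post doesn’t specify a framework), an embedding layer is a trainable lookup table that maps each token ID to a vector:

```python
import torch

torch.manual_seed(123)

vocab_size = 50257      # GPT-2 BPE vocabulary size
embedding_dim = 256     # illustrative embedding dimension, not from the post

embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([12, 0, 3, 4, 7])        # example token IDs
token_embeddings = embedding_layer(token_ids)     # one 256-dimensional vector per token
print(token_embeddings.shape)                     # torch.Size([5, 256])
```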