Vectorizer
In this section, we will fit and implement a custom text vectorizer based on the IMDB sentiment analysis dataset.
This will be an opportunity to go through the basics of modelkit's assets management, learn how to fetch artifacts, and get deeper into its API.
Installation
First, let's download the IMDB reviews dataset:
# download the remote archive
curl https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz --output imdb.tar.gz
# untar it
tar -xvf imdb.tar.gz
# remove the unsupervised directory we will not be using
rm -rf aclImdb/train/unsup
Each file contains a single text review left by an IMDB user.
The dataset is organized into train and test directories, each containing positive and negative examples in pos and neg subfolders corresponding to the target classes.
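As a quick sanity check, we can count the files in each split (a small sketch; the IMDB dataset ships 12,500 reviews per class in each split):
import glob
import os

# count the reviews in each split/class pair
for split in ("train", "test"):
    for label in ("pos", "neg"):
        pattern = os.path.join("aclImdb", split, label, "*.txt")
        print(split, label, len(glob.glob(pattern)))  # expected: 12500 each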
Creating an asset file
Fitting
First things first, let's define a helper function to read through the training set.
We will only use generators, to avoid loading the entire dataset in memory (and will later use Model.predict_gen to process them).
import glob
import os
from typing import Generator

def read_dataset(path: str) -> Generator[str, None, None]:
    # yield the reviews one by one, without loading the whole dataset
    for review in glob.glob(os.path.join(path, "*.txt")):
        with open(review, "r", encoding="utf-8") as f:
            yield f.read()
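For instance, here is how one could peek at the first positive review (a quick illustrative snippet):
reviews = read_dataset(os.path.join("aclImdb", "train", "pos"))
print(next(reviews)[:100])  # first 100 characters of the first review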
We need to tokenize our dataset before fitting a scikit-learn TfidfVectorizer.
Although sklearn includes a tokenizer, we will use the one we implemented in the first section.
import itertools
import os

# chain the positive and negative training reviews into a single generator
training_set = itertools.chain(
    read_dataset(os.path.join("aclImdb", "train", "pos")),
    read_dataset(os.path.join("aclImdb", "train", "neg")),
)
# Tokenizer is the modelkit Model implemented in the first section
tokenized_set = Tokenizer().predict_gen(training_set)
By using generators and the predict_gen method, we can read huge text corpora without filling up memory.
Each review will be tokenized one by one but, as we discussed, we can also speed up execution by setting a batch_size greater than one, to process the reviews batch by batch.
# here, an example with a batch size of 64
tokenized_set = Tokenizer().predict_gen(training_set, batch_size=64)
We are all set! Let's fit our TfidfVectorizer using the tokenized_set, disabling the embedded tokenizer:
from sklearn.feature_extraction.text import TfidfVectorizer

# the identity tokenizer and lowercase=False make the vectorizer
# accept our pre-tokenized input as-is
vectorizer = TfidfVectorizer(
    tokenizer=lambda x: x, lowercase=False, max_df=0.95, min_df=0.01
).fit(tokenized_set)
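At this point, we can inspect the fitted vocabulary; its exact size depends on the min_df/max_df thresholds above:
# number of terms retained by the document-frequency thresholds
print(len(vectorizer.vocabulary_))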
We now need to save the vocabulary we just fitted to disk, before writing our own vectorizer Model.
Of course, one could serialize the TfidfVectorizer itself, but there are many reasons why we would not want this in production code:
- the TfidfVectorizer is only used to build the vocabulary, using neat features such as min/max document frequencies, which are no longer useful during inference
- it requires a heavy code dependency, scikit-learn, only for vectorization
- pickling scikit-learn models comes with tricky issues relative to dependencies, versions and security
It is also common practice to separate the research / training phase from the production phase.
As a result, we will be implementing our own vectorizer for production using modelkit, based on the vocabulary created with scikit-learn's TfidfVectorizer.
We just need to write a list of strings to disk:
# we only keep the strings from the vocabulary,
# since we will be using our own str -> int mapping
vocabulary = next(zip(*sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1])))
with open("vocabulary.txt", "w", encoding="utf-8") as f:
    for row in vocabulary:
        f.write(row + "\n")
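We can quickly check the file's contents (a small sketch; the exact terms depend on the fitted vocabulary):
with open("vocabulary.txt", encoding="utf-8") as f:
    for _ in range(5):
        print(next(f).strip())  # first five vocabulary terms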
Using the file in a Model
In this subsection, you will learn how to leverage the basics of modelkit's assets management.
Note
Most of the features (assets remote storage, updates, versioning, etc.) will not be addressed in this tutorial, but are the way to go when things get serious.
First, let's implement our Vectorizer as a modelkit.Model which loads the vocabulary.txt file we just created:
import modelkit
from typing import List

class Vectorizer(modelkit.Model):
    CONFIGURATIONS = {"imdb_vectorizer": {"asset": "vocabulary.txt"}}

    def _load(self):
        # map each vocabulary term to its line number in the asset file
        self.vocabulary = {}
        with open(self.asset_path, "r", encoding="utf-8") as f:
            for i, k in enumerate(f):
                self.vocabulary[k.strip()] = i
Here, we define a CONFIGURATIONS class attribute, which lets modelkit know how to find the vocabulary: vocabulary.txt will be found in the current working directory, and its absolute path written to the Vectorizer.asset_path attribute before _load is called.
Remote assets
When assets are stored remotely, the remote versioned asset is retrieved and cached on the local disk before the _load method is called, and its local path is set in the self.asset_path attribute.
modelkit guarantees that whenever _load is called, the file is present and its absolute path is written to Model.asset_path.
What can be an asset
An asset can be anything (a file, a directory), and the user is responsible for defining the loading logic in _load.
You can refer to it with a relative path (which will be relative to the current working directory), an absolute path, or a remote asset specification.
In this case, it is rather straightforward: self.asset_path directly points to vocabulary.txt.
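For illustration, switching to a remote asset would only change the CONFIGURATIONS entry; the asset name and version below are purely hypothetical:
class Vectorizer(modelkit.Model):
    # hypothetical remote asset spec (name:version); modelkit would
    # download and cache it locally before calling _load
    CONFIGURATIONS = {"imdb_vectorizer": {"asset": "imdb/vocabulary:0.1"}}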
In addition, the imdb_vectorizer configuration key can now be used to refer to the Vectorizer in a ModelLibrary object. This allows you to define multiple Vectorizer objects with different vocabularies, without rewriting the prediction logic.
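For example, here is a minimal sketch of instantiating the model through a ModelLibrary:
import modelkit

# the library resolves the configured asset and loads the model
library = modelkit.ModelLibrary(models=[Vectorizer])
vectorizer = library.get("imdb_vectorizer")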
Prediction
Now let us add more complex prediction logic and an input specification:
import numpy as np
import modelkit
from typing import List

class Vectorizer(modelkit.Model[List[str], List[int]]):
    CONFIGURATIONS = {"imdb_vectorizer": {"asset": "vocabulary.txt"}}

    def _load(self):
        self.vocabulary = {}
        with open(self.asset_path, "r", encoding="utf-8") as f:
            for i, k in enumerate(f):
                # 0 is reserved for padding, 1 for out-of-vocabulary tokens
                self.vocabulary[k.strip()] = i + 2
        self._vectorizer = np.vectorize(lambda x: self.vocabulary.get(x, 1))

    def _predict(self, tokens, length=None, drop_oov=True):
        vectorized = (
            np.array(self._vectorizer(tokens), dtype=int)
            if tokens
            else np.array([], dtype=int)
        )
        if drop_oov and len(vectorized):
            # remove the out-of-vocabulary markers
            vectorized = np.delete(vectorized, vectorized == 1)
        if not length:
            return vectorized.tolist()
        # pad (or truncate) to a fixed length
        result = np.zeros(length, dtype=int)
        vectorized = vectorized[:length]
        result[: len(vectorized)] = vectorized
        return result.tolist()
We add several advanced features. First, to deal with out-of-vocabulary and padding tokens, we reserve the following integers (illustrated in the sketch after this list):
- 0 for padding
- 1 for the unknown token (out-of-vocabulary words)
- 2 + i for known vocabulary words
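Here is a tiny illustration of that mapping (the indices match the test cases shown later in this section):
# "movie" and "scenes" are at 0-based lines 886 and 1154 of vocabulary.txt,
# so they map to 886 + 2 = 888 and 1154 + 2 = 1156
vocabulary = {"movie": 888, "scenes": 1156}
print(vocabulary.get("movie", 1))          # 888: a known token
print(vocabulary.get("unknown_token", 1))  # 1: out-of-vocabulary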
We use np.vectorize to map a list of tokens to a list of indices; the resulting callable is stored in the _vectorizer attribute in _load.
We also add keyword arguments to _predict: length and drop_oov. These can be used during prediction as well, and are passed through to _predict or _predict_batch:
# item is a list of tokens; items is a list of such lists
vectorizer = modelkit.load_model("imdb_vectorizer", models=Vectorizer)
vectorizer.predict(item, length=10, drop_oov=False)
vectorizer.predict_batch(items, length=10, drop_oov=False)
vectorizer.predict_gen(items, length=10, drop_oov=False)
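To make this concrete, here is a small sketch of expected outputs (888 is the fitted index of "movie", per the test cases below):
vectorizer = modelkit.load_model("imdb_vectorizer", models=Vectorizer)
print(vectorizer.predict(["movie", "unknown_token"], drop_oov=False))  # [888, 1]
print(vectorizer.predict(["movie"], length=3))                         # [888, 0, 0]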
Test cases
Now let us add test cases too. The only trick here is that we have to pass our new keyword arguments explicitly when we want to test different values. To do so, we use the keyword_args field in the test cases:
class Vectorizer(modelkit.Model[List[str], List[int]]):
    ...
    TEST_CASES = [
        {"item": [], "result": []},
        {"item": [], "keyword_args": {"length": 10}, "result": [0] * 10},
        {"item": ["movie"], "result": [888]},
        {"item": ["unknown_token"], "result": []},
        {
            "item": ["unknown_token"],
            "keyword_args": {"drop_oov": False},
            "result": [1],
        },
        {"item": ["movie", "unknown_token", "scenes"], "result": [888, 1156]},
        {
            "item": ["movie", "unknown_token", "scenes"],
            "keyword_args": {"drop_oov": False},
            "result": [888, 1, 1156],
        },
        {
            "item": ["movie", "unknown_token", "scenes"],
            "keyword_args": {"length": 10},
            "result": [888, 1156, 0, 0, 0, 0, 0, 0, 0, 0],
        },
        {
            "item": ["movie", "unknown_token", "scenes"],
            "keyword_args": {"length": 10, "drop_oov": False},
            "result": [888, 1, 1156, 0, 0, 0, 0, 0, 0, 0],
        },
    ]
    ...
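These cases can then be run automatically; for instance (a sketch, assuming a recent modelkit version in which models expose a test() helper):
vectorizer = modelkit.load_model("imdb_vectorizer", models=Vectorizer)
vectorizer.test()  # runs every TEST_CASES entry and reports failures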
Final vectorizer
Putting it all together, we obtain:
import modelkit
import numpy as np
from typing import List

class Vectorizer(modelkit.Model[List[str], List[int]]):
    CONFIGURATIONS = {"imdb_vectorizer": {"asset": "vocabulary.txt"}}
    TEST_CASES = [
        {"item": [], "result": []},
        {"item": [], "keyword_args": {"length": 10}, "result": [0] * 10},
        {"item": ["movie"], "result": [888]},
        {"item": ["unknown_token"], "result": []},
        {
            "item": ["unknown_token"],
            "keyword_args": {"drop_oov": False},
            "result": [1],
        },
        {"item": ["movie", "unknown_token", "scenes"], "result": [888, 1156]},
        {
            "item": ["movie", "unknown_token", "scenes"],
            "keyword_args": {"drop_oov": False},
            "result": [888, 1, 1156],
        },
        {
            "item": ["movie", "unknown_token", "scenes"],
            "keyword_args": {"length": 10},
            "result": [888, 1156, 0, 0, 0, 0, 0, 0, 0, 0],
        },
        {
            "item": ["movie", "unknown_token", "scenes"],
            "keyword_args": {"length": 10, "drop_oov": False},
            "result": [888, 1, 1156, 0, 0, 0, 0, 0, 0, 0],
        },
    ]

    def _load(self):
        self.vocabulary = {}
        with open(self.asset_path, "r", encoding="utf-8") as f:
            for i, k in enumerate(f):
                # 0 is reserved for padding, 1 for out-of-vocabulary tokens
                self.vocabulary[k.strip()] = i + 2
        self._vectorizer = np.vectorize(lambda x: self.vocabulary.get(x, 1))

    def _predict(self, tokens, length=None, drop_oov=True):
        vectorized = (
            np.array(self._vectorizer(tokens), dtype=int)
            if tokens
            else np.array([], dtype=int)
        )
        if drop_oov and len(vectorized):
            # remove the out-of-vocabulary markers
            vectorized = np.delete(vectorized, vectorized == 1)
        if not length:
            return vectorized.tolist()
        # pad (or truncate) to a fixed length
        result = np.zeros(length, dtype=int)
        vectorized = vectorized[:length]
        result[: len(vectorized)] = vectorized
        return result.tolist()
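As a closing sketch, the Tokenizer and Vectorizer can already be chained on raw reviews (the downstream classifier comes in the next section):
# tokenize raw test reviews, then vectorize them to fixed-length vectors
vectorizer = modelkit.load_model("imdb_vectorizer", models=Vectorizer)
tokenized = Tokenizer().predict_gen(
    read_dataset(os.path.join("aclImdb", "test", "pos"))
)
for vector in vectorizer.predict_gen(tokenized, length=64):
    ...  # fixed-length integer vectors, ready for a downstream classifier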
In the next section, we will train a classifier using the Tokenizer and Vectorizer models we just created. This will show us how to compose and store models in modelkit.