Train your own model with TensorFlow

Download this notebook

This section is a supplement to Train your own model with PhotonAI, since PhotonAI is not suitable for many more specialized use cases. If your own problem has already been solved satisfactorily with a PhotonAI model, you can skip this chapter. At the same time, the creation of a model is covered even more briefly in this chapter, so it may well make sense to read the previous chapter first. Apart from that, this is also not an introduction to TensorFlow or to creating machine learning models in general. Rather, the aim is merely to create a simple model that will serve as an example in the deployment and publication process in the following parts. The TensorFlow Documentation, however, offers numerous tutorials and explanations for learning more about creating machine learning models with TensorFlow.

What is TensorFlow?

TensorFlow is a framework that enables the processing of multidimensional data and, in particular, the training of deep neural networks. It offers interfaces to numerous programming languages, so that the finished models can also be used on mobile devices, for example. Thanks to the large number of available functions and layers, even very individual and specific requirements can be met. In direct comparison to PhotonAI, however, it is also more complex to use and takes longer to learn. Since version 2.0, though, Keras has been the standard API, which has simplified its use considerably.

For Python, TensorFlow can be installed directly in the terminal via pip:

pip install tensorflow
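
To check that the installation worked, we can, for example, print the installed version (a quick sanity check; the exact version number will of course depend on your environment):

# Quick check of the installation; the printed version will vary
import tensorflow as tf
print(tf.__version__)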

import pickle
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
from keras import layers
from keras.models import Sequential

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

Preparing the data

Firstly, we define our data and project structure again. In this case, we are in the /incubaitor/1_2_Tensorflow/notebooks/ folder with the notebook. Relative to this, we have now loaded the data into the folder /incubaitor/1_2_Tensorflow/data/ and will save our models in /incubaitor/1_2_Tensorflow/models/. This allows us to use the relative paths ../data/ and ../models/. If there are problems here, explicit paths can also be used as in the PhotonAI example.

As with PhotonAI, we use the read_csv() method from the Python package Pandas to read in our data. Since this time we also want to use columns that contain non-numerical values, we prepare the corresponding values while reading in the data. The converters parameter can be used to specify functions that are then applied to every single value of a column. Here we have defined our own function str_to_category(), which strips whitespace from the beginning and end of each string, converts it to lower case, replaces spaces with underscores and removes hyphens. We then remove the price column from our data and save it separately as the label.

# Our function, which we call on the strings, is very simple, 
# but can of course be arbitrarily extended:
def str_to_category(string):
    """
    Converts a string to a category.
    """
    return string.strip(" \t\n").lower().replace(" ", "_").replace("-", "")


# Load data and split into labels and features
data = pd.read_csv(
    "../data/vw.csv",
    converters={
        "model": str_to_category,
        "transmission": str_to_category,
        "fuelType": str_to_category,
    },
)
label = data.pop("price")

As our model can only process numbers, we also need to convert our categorical variables into numbers. To do this, we can use the OrdinalEncoder() from the Python package sklearn. It numbers all the categories it finds and then replaces each category with its unique number. Even though this is a simple way of converting categorical data into numbers, whether such a conversion makes sense should be checked for each use case, as it implicitly suggests similarities between categories that do not actually exist. In this example, too, the assigned numbers are not entirely appropriate, because there is no natural ordering of the different car models. Alternatively, a OneHotEncoder() can be used, which assigns each category a vector with a single non-zero entry and thus ensures that the distance between any two categories is always the same. For our simple example with few categories, however, we will stick with the OrdinalEncoder().

# Encode categorical data
model_enc = preprocessing.OrdinalEncoder()
data.loc[:, "model"] = model_enc.fit_transform(data.loc[:, ["model"]])
transmission_enc = preprocessing.OrdinalEncoder()
data.loc[:, "transmission"] = transmission_enc.fit_transform(
    data.loc[:, ["transmission"]]
)
fuelType_enc = preprocessing.OrdinalEncoder()
data.loc[:, "fuelType"] = fuelType_enc.fit_transform(
    data.loc[:, ["fuelType"]]
    )
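
For comparison, the one-hot encoding mentioned above could look roughly like the following. This is only a sketch and assumes it is applied to the raw string column, i.e. before the ordinal encoding above; the rest of this chapter sticks with the OrdinalEncoder():

# Alternative (not used below): one-hot encode fuelType instead of ordinal encoding.
# Assumes the column still contains the raw category strings.
onehot_enc = preprocessing.OneHotEncoder()
fuelType_onehot = onehot_enc.fit_transform(data.loc[:, ["fuelType"]]).toarray()
print(onehot_enc.categories_)  # categories found in the data
print(fuelType_onehot.shape)   # one column per category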

In order to evaluate our model at the end, we have to set aside part of our data in advance. We therefore hold back 20% of our data from training so that we can use it for the evaluation afterwards. When splitting the data, it is important to keep features and labels aligned so that they can still be matched correctly. Fortunately, the train_test_split() function from sklearn helps us here: it randomly assigns the individual data points to the training or test set so that the desired size ratio results, and it ensures that the pairs of features and labels stay together. To make sure that we always use the same test set across several runs, for example after adapting our model, so that the results remain comparable, a seed can be set via the random_state parameter. The prerequisite is, of course, that the input to the function does not change.

# Split data into training and test sets
train_data, test_data, train_label, test_label = train_test_split(
    data, label, test_size=0.2, random_state=42
)

Finally, we need to scale our data and labels. To do this, we use the StandardScaler(), also from sklearn. Scaling ensures that individual features do not dominate the model simply because of their magnitude. We fit the scaler exclusively on the training data (fit_transform() on the training data, but only transform() on the test data). This is important for a correct evaluation. Later, in productive use, it is also not possible to fit a scaler on a single data point, and a scaler fitted to a different value range would distort the results.

# Normalize data
data_scaler = preprocessing.StandardScaler()
train_data = data_scaler.fit_transform(train_data)
test_data = data_scaler.transform(test_data)

# Normalize labels
label_scaler = preprocessing.StandardScaler()
train_label = label_scaler.fit_transform(train_label.values.reshape(-1, 1))
test_label = label_scaler.transform(test_label.values.reshape(-1, 1))

Creation and training of the model

Unlike the PhotonAI example, we will not perform a hyperparameter search or compare multiple models. Instead, we will only define one specific model. Since we only have eight features, we opt for a simple multilayer perceptron (MLP) with two hidden layers of size 64 and with ReLU activation:

# Create the model
model = Sequential(
    [
        layers.InputLayer(shape=(train_data.shape[1],)),
        layers.Dense(64, activation=tf.nn.relu),
        layers.Dense(64, activation=tf.nn.relu),
        layers.Dense(1),
    ]
)

Before we can train the model, we have to compile it. In doing so, we specify an optimizer and a loss function. With the metrics parameter, we can define further metrics that are then used to evaluate our model during training. The summary() method prints a summary of the model to the console, which lets us check, for example, the number of trainable parameters.

# Compile the model
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.summary()

The model is then ready for training. In addition to the batch size and the number of epochs, we can also specify the size of a validation set here. After each epoch, the loss and the other metrics are also calculated on this data in order to be able to recognize overfitting in good time, for example.

# Train the model and keep the returned History object for later inspection
history = model.fit(
    train_data,
    train_label,
    batch_size=64,
    epochs=50,
    validation_split=0.2,
)

During training, we are regularly informed about the training progress in the console. This allows us to estimate how much longer the training will take and check that the loss is actually decreasing and our model is converging.
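Since fit() returns a History object, which we stored in history above, we can also plot the loss curves per epoch afterwards. A small sketch; the exact curves will of course depend on the run:

# Plot training and validation loss per epoch from the History object
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss (MSE)")
plt.legend()
plt.show()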

Evaluation and storage

Once the training is completed, we can use the previously set aside test data to evaluate the quality of our model. In order to obtain independent and comparable results, it is important that we have not used this data either directly or indirectly (such as the validation data as a termination condition) during training. TensorFlow calculates the scores for us and only requires the test data together with the ground truth labels.

# Evaluate the model
model.evaluate(test_data, test_label)

We receive our loss and the results of the other specified metrics as output. Although these values give us a good indication of the quality of our model, we can also plot the results using the Python package matplotlib in order to visualize them more clearly. Unlike with PhotonAI, we have to take care of meaningful representations ourselves. Since we have trained a regression model in this example, the representation as a scatterplot is helpful. For a classification model, on the other hand, a confusion matrix would be even easier to read.
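Because the labels were normalized with the StandardScaler, the reported MAE refers to the scaled prices. Since a StandardScaler only shifts and rescales values, a minimal sketch for converting the MAE back into the original price units could look like this:

# Convert the MAE from the normalized label space back into original price units
loss, mae = model.evaluate(test_data, test_label, verbose=0)
print(mae * label_scaler.scale_[0])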

During preprocessing, we normalized our data and labels. So that we can better classify the results in our plot, we should now undo the normalization. The scalers conveniently have the inverse_transform() method, which we can simply call for our predictions and ground truth labels. In addition, we plot a line as orientation, which indicates the optimal predictions.

# Plot some predictions
predictions = model.predict(test_data)
predictions = label_scaler.inverse_transform(predictions)
test_label = label_scaler.inverse_transform(test_label)
plt.scatter(test_label, predictions, s=0.1)
plt.plot([0, test_label.max()], [0, test_label.max()], "--", color="red")
plt.xlabel("True Values")
plt.ylabel("Predictions")
plt.show()

Based on the plot created, we can easily see that the predicted prices of our model roughly correspond to the actual prices. For higher-priced vehicles, the deviations increase due to fewer data points, but the quality is sufficient for our use case.

So that we can use the model at a later point without having to retrain it, we need to save it manually. However, it is not only the trained model that has to be saved: to prepare the input data correctly, we also need all encoders (which convert the categorical variables into numbers) and all scalers (which normalize our features and labels). To save all these objects, we can either construct a single object that contains everything we need, or save them as individual files. We have chosen the latter option.

# Save the model, encoder and scaler
Path("../models/").mkdir(parents=True, exist_ok=True)
model.save("../models/model.keras")
pickle.dump(model_enc, open("../models/model_enc.pkl", "wb"))
pickle.dump(transmission_enc, open("../models/transmission_enc.pkl", "wb"))
pickle.dump(fuelType_enc, open("../models/fuelType_enc.pkl", "wb"))
pickle.dump(data_scaler, open("../models/data_scaler.pkl", "wb"))
pickle.dump(label_scaler, open("../models/label_scaler.pkl", "wb"))
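
The alternative mentioned above, bundling everything into a single object, could look roughly like this. It is only a sketch and is not used in the rest of this chapter, where we continue with the individual files; the Keras model itself would still be saved separately via model.save():

# Alternative (not used below): bundle all preprocessing objects into one pickle file
preprocessing_objects = {
    "model_enc": model_enc,
    "transmission_enc": transmission_enc,
    "fuelType_enc": fuelType_enc,
    "data_scaler": data_scaler,
    "label_scaler": label_scaler,
}
with open("../models/preprocessing.pkl", "wb") as f:
    pickle.dump(preprocessing_objects, f)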

Using the model

Analogous to saving our objects, we must now first load them from the individual files:

# Load model, encoder and scaler
model = tf.keras.models.load_model("../models/model.keras")
model_enc = pickle.load(open("../models/model_enc.pkl", "rb"))
transmission_enc = pickle.load(open("../models/transmission_enc.pkl", "rb"))
fuelType_enc = pickle.load(open("../models/fuelType_enc.pkl", "rb"))
data_scaler = pickle.load(open("../models/data_scaler.pkl", "rb"))
label_scaler = pickle.load(open("../models/label_scaler.pkl", "rb"))

We then define our own input and prepare it for our trained MLP using the encoders and scalers. To avoid mixing up the different features, we again create a Pandas dataframe and name all the columns. Alternatively, we could just create a Numpy array with the values. However, apart from the order, the model has no way of checking whether our values actually match the respective features, so care must be taken to keep the feature order of the training data set.

# Define and prepare test data
dummy_data = pd.DataFrame(
    {
        "model": [str_to_category("T-Roc")],
        "year": [2019],
        "transmission": [str_to_category("Manual")],
        "mileage": [12132],
        "fuelType": [str_to_category("Petrol")],
        "tax": [145],
        "mpg": [42.7],
        "engineSize": [2.0],
    }
)
dummy_data.loc[:, "model"] = model_enc.transform(
    dummy_data.loc[:, ["model"]]
    )
dummy_data.loc[:, "transmission"] = transmission_enc.transform(
    dummy_data.loc[:, ["transmission"]]
)
dummy_data.loc[:, "fuelType"] = fuelType_enc.transform(
    dummy_data.loc[:, ["fuelType"]]
    )
dummy_data = data_scaler.transform(dummy_data)

Three of our features contain categorical data instead of numbers, which we first converted into numbers using the encoders. When the encoders were fitted, they saw all categories contained in the training data, and those are the only categories they can translate into numbers now. This should be kept in mind when feeding new data into the model: a different spelling or stray whitespace at the beginning or end is already enough for an encoder to no longer recognize a category. However, the function str_to_category() that we already defined for reading in the training data removes many of these stumbling blocks for us. If we are still unsure which categories are available, the categories_ attribute lists all categories that an encoder knows:

import pickle

fuelType_enc = pickle.load(open("../models/fuelType_enc.pkl", "rb"))
print(fuelType_enc.categories_)
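
If unknown categories cannot be ruled out in later inputs, sklearn's OrdinalEncoder also offers the handle_unknown option, which maps them to a fixed value instead of raising an error. A sketch of how the encoder could have been created with it (not used in this chapter; note that the model would then still receive a value it never saw during training, so such predictions should be treated with caution):

from sklearn import preprocessing

# Sketch: map unknown categories to -1 instead of raising an error when encoding
fuelType_enc = preprocessing.OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)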

Finally, we can pass our coded and normalized data to our model, which calculates a prediction based on it. We only need to scale this prediction back with the label scaler to get the price:

# Predict
result = model(dummy_data)
result = label_scaler.inverse_transform(result)[0, 0]
print(result)

The result is around 27516.51, which is close to the result of the PhotonAI model and also corresponds to a realistic price.

We have thus been able to design, train and now even use our own model with TensorFlow. Although this required significantly more manual steps, which the PhotonAI Hyperpipe previously took care of for us, we were also able to use categorical features for our prediction.

Folder structure

The code that we have built and explained in this notebook can also be found in the folder incubaitor/1_Frameworks/1_2_Tensorflow/app. There we find three files: train.py, test.py and utils.py.

In train.py the data is prepared, the model is created and trained, evaluated and saved.

In test.py we then find the code that enables us to use and test the model.

utils.py only contains the function str_to_category(). Auxiliary functions like this are usually moved to another file to avoid cluttering the script.

The folder incubaitor/1_Frameworks/1_2_Tensorflow/app allows us to deploy the model later via Flask and Docker.