# Machine Learning again again

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf

Today we will be doing... machine learning.  Specifically, we'll build a model similar to the one you just saw in lecture, where the features are a mix of two different types.  We'll use the same two types as in lecture: text and scalars.  However, this one will be more complicated because we have many different text features.

Our dataset will be the [TMDB 5000 Movie Dataset from Kaggle](https://www.kaggle.com/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv).  As always, I've put it on CCLE.  Our ultimate goal for this dataset is to predict the score of the movie from other features.  But first, let's take a look at the dataset!  (We'll only be checking the top row of the dataframe since each row takes up quite a bit of vertical space.)

In [2]:
df = pd.read_csv('tmdb_5000_movies.csv')
df.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


There are a lot of features here!  As always, some of them are pretty useless.  Let's remove those.  We will also remove the "original_language" feature because it's hard to work with, even though it might be useful.

Some of the features are stored as, of all things, a list of Python dictionaries (actually, a string containing a list of Python dictionaries!).  These represent features that can take multiple values at once.  The dictionary part is a little extraneous; it just provides different ways of looking at each value (for example, an id and a name).  Let's unpack the dictionaries so that each feature becomes a string containing the simplest representation of all its values.  Then we can handle these features with the normal text processing layers.
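To make the unpacking concrete, here is a minimal sketch on a single hand-written cell. It assumes the cell holds JSON text (the double-quoted keys in the output above suggest this) and parses it with `json.loads`. The helper name `join_values` and the sample string are illustrative, not taken from the dataset file.

```python
import json

def join_values(cell, key):
    """Parse a JSON list of dicts and join one field into a space-separated string."""
    records = json.loads(cell)  # e.g. [{"id": 28, "name": "Action"}, ...]
    return ' '.join(str(record[key]) for record in records)

# Hypothetical cell, shaped like an entry of the 'genres' column.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

print(join_values(sample, 'id'))      # ids only
print(join_values(sample, 'name'))    # names instead of ids
print(repr(join_values('[]', 'id')))  # an empty list becomes an empty string
```

Keeping only the ids gives short, unambiguous tokens, which is convenient when the string is later split on whitespace.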

Also, might as well drop NaNs now!

In [3]:
import json

# The list-valued columns are JSON strings, so parse them with json.loads (safer than eval).
df.drop(columns=['homepage', 'id', 'original_title', 'title', 'vote_count', 'original_language'], inplace=True)
df['genres'] = df['genres'].apply(lambda x : ' '.join([str(y['id']) for y in json.loads(x)]))
df['keywords'] = df['keywords'].apply(lambda x : ' '.join([str(y['id']) for y in json.loads(x)]))
df['production_companies'] = df['production_companies'].apply(lambda x : ' '.join([str(y['id']) for y in json.loads(x)]))
df['production_countries'] = df['production_countries'].apply(lambda x : ' '.join([str(y['iso_3166_1']) for y in json.loads(x)]))
df['spoken_languages'] = df['spoken_languages'].apply(lambda x : ' '.join([str(y['iso_639_1']) for y in json.loads(x)]))
df.dropna(inplace=True)
display(df.head(1))
df.shape

Unnamed: 0,budget,genres,keywords,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average
0,237000000,28 12 14 878,1463 2964 3386 3388 3679 3801 9685 9840 9882 9...,"In the 22nd century, a paraplegic Marine is di...",150.437577,289 306 444 574,US GB,2009-12-10,2787965087,162.0,en es,Released,Enter the World of Pandora.,7.2


(3959, 14)

As in the previous example, budget and revenue span several orders of magnitude, so they will likely behave better on a log scale.  Let's take the log of them.  The +1 handles the values that are zero (whose log would be -infinity) without changing the other values much.
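A quick check of the +1 trick, using only the standard library. `math.log1p(x)` computes `log(1 + x)`, and the budget figure is the one from the first row shown earlier. This is only a sanity check, not part of the pipeline.

```python
import math

budget = 237000000  # budget from the first row of the dataframe

# math.log(0) raises ValueError (NumPy returns -inf with a warning),
# but log(0 + 1) is exactly 0.
print(math.log1p(0))

# For large values the +1 barely matters.
with_shift = math.log(budget + 1)
without_shift = math.log(budget)
print(with_shift, without_shift, with_shift - without_shift)
```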

In [4]:
df['budget'] = np.log(df['budget'] + 1) #some are zero
df['revenue'] = np.log(df['revenue'] + 1)
df.rename(columns = {'budget': 'log(budget)', 'revenue':'log(revenue)'}, inplace=True)
df.head(1)

Unnamed: 0,log(budget),genres,keywords,overview,popularity,production_companies,production_countries,release_date,log(revenue),runtime,spoken_languages,status,tagline,vote_average
0,19.283571,28 12 14 878,1463 2964 3386 3388 3679 3801 9685 9840 9882 9...,"In the 22nd century, a paraplegic Marine is di...",150.437577,289 306 444 574,US GB,2009-12-10,21.748578,162.0,en es,Released,Enter the World of Pandora.,7.2


We need to one-hot encode `status`.  It can only take three predefined values, so it's safe to do this before splitting.  That means we can use Pandas `get_dummies`, which is easier.
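As a small illustration of what `get_dummies` produces, here is a sketch on a made-up column. The four-row Series is hypothetical; only the three status names come from this dataset. Passing `dtype=int` is a choice made here so the result shows 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical mini-column using the three values that `status` takes here.
status = pd.Series(['Released', 'Rumored', 'Post Production', 'Released'],
                   name='status')

# One column per distinct value; exactly one 1 in every row.
dummies = pd.get_dummies(status, dtype=int)
print(dummies)
```

Because the set of possible values is fixed in advance, encoding before the split cannot leak anything from the test rows into training.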

In [5]:
df = df.join(pd.get_dummies(df['status']))
df.drop(columns=['status'], inplace=True)
df.head(1)

Unnamed: 0,log(budget),genres,keywords,overview,popularity,production_companies,production_countries,release_date,log(revenue),runtime,spoken_languages,tagline,vote_average,Post Production,Released,Rumored
0,19.283571,28 12 14 878,1463 2964 3386 3388 3679 3801 9685 9840 9882 9...,"In the 22nd century, a paraplegic Marine is di...",150.437577,289 306 444 574,US GB,2009-12-10,21.748578,162.0,en es,Enter the World of Pandora.,7.2,0,1,0


The date feature is a little weird.  It'll probably be easier to work with month and year, so let's do that.  We can also adjust year so that it starts at the lowest year on the list, since only the relative difference matters.
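The same idea on a few made-up dates, sketched with the standard library's `datetime.date.fromisoformat` instead of string slicing. The dates listed are hypothetical apart from the first, which is the release date from the first row.

```python
from datetime import date

# Dates in the same YYYY-MM-DD format as the 'release_date' column.
release_dates = ['2009-12-10', '1995-06-30', '2015-01-02']

parsed = [date.fromisoformat(d) for d in release_dates]
first_year = min(d.year for d in parsed)

# (years since the earliest film in the list, month) for each date
features = [(d.year - first_year, d.month) for d in parsed]
print(features)
```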

In [6]:
df = df.join(df[['release_date']].apply(
    axis=1, result_type='expand',
    func=(lambda x : [int(x['release_date'][:4]), int(x['release_date'][5:7])])))
df.drop(columns=['release_date'], inplace=True)
df.rename(columns={0: 'Year', 1: 'Month'}, inplace=True)
df['Year'] = df['Year'] - min(df['Year'])
df.head(1)

Unnamed: 0,log(budget),genres,keywords,overview,popularity,production_companies,production_countries,log(revenue),runtime,spoken_languages,tagline,vote_average,Post Production,Released,Rumored,Year,Month
0,19.283571,28 12 14 878,1463 2964 3386 3388 3679 3801 9685 9840 9882 9...,"In the 22nd century, a paraplegic Marine is di...",150.437577,289 306 444 574,US GB,21.748578,162.0,en es,Enter the World of Pandora.,7.2,0,1,0,93,12


Then we split!  Both into labels/features and into train/test/validation.  Also, our label `vote_average` takes values between 0 and 10.  For reasons that will be clear later, it's nicer if it takes values between 0 and 1.  Let's change that.

In [7]:
from sklearn.model_selection import train_test_split

x = df.drop(columns = ['vote_average'])
y = df['vote_average']/10

x_trv, x_test, y_trv, y_test = train_test_split(x, y, random_state=209)
x_train, x_val, y_train, y_val = train_test_split(x_trv, y_trv)

display(x_train.head(1))
x_train.shape

Unnamed: 0,log(budget),genres,keywords,overview,popularity,production_companies,production_countries,log(revenue),runtime,spoken_languages,tagline,Post Production,Released,Rumored,Year,Month
3097,16.118096,35 10749,596 2041 2580 157524,Stranded and alone on a desert island during a...,4.570043,3287 13419 57736,IT GB,13.302426,89.0,it el en,A snooty socialite is stranded on a Mediterran...,0,1,0,86,10


(2226, 16)
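The row count above can be checked by hand. This sketch assumes scikit-learn's documented default of holding out 25% of the rows, with the held-out count rounded up; that default is not stated in the notebook itself, and `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_rows, test_fraction=0.25):
    """Return (n_kept, n_held_out), rounding the held-out count up."""
    n_held_out = math.ceil(n_rows * test_fraction)
    return n_rows - n_held_out, n_held_out

n_trv, n_test = split_sizes(3959)    # first split: train+validation vs. test
n_train, n_val = split_sizes(n_trv)  # second split: train vs. validation
print(n_train, n_val, n_test)
```

So of the 3959 rows left after `dropna`, a bit more than half end up in the training set.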

As in lecture, we'll be using TensorFlow Datasets (`tf.data.Dataset`).  This is very similar to what we did before, except we need to split our features into more parts.  Each "text" column needs to be its own group.

In [8]:
def make_data(x, y):
    return tf.data.Dataset.from_tensor_slices(
        (
            {
                "genres": x['genres'],
                "keywords": x['keywords'],
                "overview": x['overview'],
                "production_companies": x['production_companies'],
                "production_countries": x['production_countries'],
                "spoken_languages": x['spoken_languages'],
                "tagline": x['tagline'],
                "scalars": x[['log(budget)', 'popularity', 'log(revenue)',
                              'runtime', 'Post Production', 'Released',
                              'Rumored', 'Year', 'Month']]
            },
            {
                'vote_average': y
            }
        )
    )

train = make_data(x_train, y_train).batch(20)
val = make_data(x_val, y_val).batch(20)
test = make_data(x_test, y_test).batch(20)

Now we do the vectorization layer.  This is the same as before except we need to do it A LOT.  Let's create a function for it so we don't have to keep rewriting code.
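Before the real layer, a rough pure-Python sketch of what integer vectorization does: standardize the text, build a vocabulary from the training strings, then map tokens to integers and pad to a fixed length. This is an illustration under assumptions, not the layer's actual implementation. It follows the convention that index 0 is padding and index 1 is "unknown token", and all function names and sample strings here are hypothetical.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip ASCII punctuation."""
    return re.sub('[%s]' % re.escape(string.punctuation), '', text.lower())

def build_vocab(texts, max_tokens):
    """Most frequent tokens first; indices 0 and 1 are reserved."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    tokens = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(tokens)}

def vectorize(text, vocab, length):
    """Map tokens to ids (1 if unknown), then truncate or zero-pad to `length`."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()][:length]
    return ids + [0] * (length - len(ids))

train_texts = ['US GB', 'US', 'IT GB']         # hypothetical country strings
vocab = build_vocab(train_texts, max_tokens=4)  # room for only two real tokens
print(vocab)
print(vectorize('IT GB!', vocab, length=4))     # 'it' is out of vocabulary
```

The vocabulary is built from the training strings only, which is why the notebook adapts each layer on `train`.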

In [9]:
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow import keras
import re
import string

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.layers.experimental.preprocessing import StringLookup

size_vocabulary = 2000

def standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    no_punctuation = tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation),'')
    return no_punctuation 

def create_vectorize_layer(train, feature):
    vectorize_layer = TextVectorization(
        standardize=standardization,
        max_tokens=size_vocabulary,
        output_mode='int',
        output_sequence_length=500) 

    vectorize_layer.adapt(train.map(lambda x, y: x[feature]))
    return vectorize_layer

vectorize_genres    = create_vectorize_layer(train, 'genres')
vectorize_keywords  = create_vectorize_layer(train, 'keywords')
vectorize_overview  = create_vectorize_layer(train, 'overview')
vectorize_companies = create_vectorize_layer(train, 'production_companies')
vectorize_countries = create_vectorize_layer(train, 'production_countries')
vectorize_languages = create_vectorize_layer(train, 'spoken_languages')
vectorize_tagline   = create_vectorize_layer(train, 'tagline')

Same with the inputs!

In [10]:
def create_string_input(name):
    return keras.Input(
        shape = (1,), 
        name = name,
        dtype = "string"
    )

genres_input    = create_string_input('genres')
keywords_input  = create_string_input('keywords')
overview_input  = create_string_input('overview')
companies_input = create_string_input('production_companies')
countries_input = create_string_input('production_countries')
languages_input = create_string_input('spoken_languages')
tagline_input   = create_string_input('tagline')

scalars_input = keras.Input(
    shape = (9,), 
    name = "scalars",
    dtype = "float64"
)

Now we actually build the structure of our neural network.  It is the same for all the text features right now, but we might want to change them independently, so here I did copy and paste it.  (Also, note the activation functions!  Those are important.)

At the end, we have one output node.  We're doing regression, so this is fine.  We put a sigmoid activation function on this layer, which forces all the output values to be between 0 and 1.  This is why we rescaled the label to lie between 0 and 1.  It means the predicted scores will never be out of range, which makes it easier for our model to learn to predict scores!  (Sneaky, right?)
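The squashing effect of the sigmoid can be seen with a few numbers. This is a minimal standard-library sketch; the input values are arbitrary and stand in for whatever the last layer computes before the activation.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in [-5.0, 0.0, 5.0]:
    scaled = sigmoid(z)
    # Multiplying by 10 converts back to the original 0-10 vote scale.
    print(z, scaled, 10 * scaled)
```

However large or small the pre-activation value gets, the prediction stays inside the valid score range.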

In [11]:
genres_features = vectorize_genres(genres_input)
genres_features = layers.Embedding(size_vocabulary, 3, name = "embedding_genres")(genres_features)
genres_features = layers.Dropout(0.2)(genres_features)
genres_features = layers.GlobalAveragePooling1D()(genres_features)
genres_features = layers.Dropout(0.2)(genres_features)
genres_features = layers.Dense(32, activation='sigmoid')(genres_features)

keywords_features = vectorize_keywords(keywords_input)
keywords_features = layers.Embedding(size_vocabulary, 3, name = "embedding_keywords")(keywords_features)
keywords_features = layers.Dropout(0.2)(keywords_features)
keywords_features = layers.GlobalAveragePooling1D()(keywords_features)
keywords_features = layers.Dropout(0.2)(keywords_features)
keywords_features = layers.Dense(32, activation='sigmoid')(keywords_features)

overview_features = vectorize_overview(overview_input)
overview_features = layers.Embedding(size_vocabulary, 3, name = "embedding_overview")(overview_features)
overview_features = layers.Dropout(0.2)(overview_features)
overview_features = layers.GlobalAveragePooling1D()(overview_features)
overview_features = layers.Dropout(0.2)(overview_features)
overview_features = layers.Dense(32, activation='sigmoid')(overview_features)

companies_features = vectorize_companies(companies_input)
companies_features = layers.Embedding(size_vocabulary, 3, name = "embedding_companies")(companies_features)
companies_features = layers.Dropout(0.2)(companies_features)
companies_features = layers.GlobalAveragePooling1D()(companies_features)
companies_features = layers.Dropout(0.2)(companies_features)
companies_features = layers.Dense(32, activation='sigmoid')(companies_features)

countries_features = vectorize_countries(countries_input)
countries_features = layers.Embedding(size_vocabulary, 3, name = "embedding_countries")(countries_features)
countries_features = layers.Dropout(0.2)(countries_features)
countries_features = layers.GlobalAveragePooling1D()(countries_features)
countries_features = layers.Dropout(0.2)(countries_features)
countries_features = layers.Dense(32, activation='sigmoid')(countries_features)

languages_features = vectorize_languages(languages_input)
languages_features = layers.Embedding(size_vocabulary, 3, name = "embedding_languages")(languages_features)
languages_features = layers.Dropout(0.2)(languages_features)
languages_features = layers.GlobalAveragePooling1D()(languages_features)
languages_features = layers.Dropout(0.2)(languages_features)
languages_features = layers.Dense(32, activation='sigmoid')(languages_features)

tagline_features = vectorize_tagline(tagline_input)
tagline_features = layers.Embedding(size_vocabulary, 3, name = "embedding_tagline")(tagline_features)
tagline_features = layers.Dropout(0.2)(tagline_features)
tagline_features = layers.GlobalAveragePooling1D()(tagline_features)
tagline_features = layers.Dropout(0.2)(tagline_features)
tagline_features = layers.Dense(32, activation='sigmoid')(tagline_features)

scalar_features = layers.Dense(32, activation='sigmoid')(scalars_input)

main = layers.concatenate([genres_features, keywords_features, overview_features,
                           companies_features, countries_features, languages_features,
                           tagline_features, scalar_features], axis = 1)

main = layers.Dense(32)(main)
output = layers.Dense(1, name = "vote_average", activation='sigmoid')(main)

model = keras.Model(
    inputs = [genres_input, keywords_input, overview_input,
              companies_input, countries_input, languages_input,
              tagline_input, scalars_input],
    outputs = output
)

model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
genres (InputLayer)             [(None, 1)]          0                                            
__________________________________________________________________________________________________
keywords (InputLayer)           [(None, 1)]          0                                            
__________________________________________________________________________________________________
overview (InputLayer)           [(None, 1)]          0                                            
__________________________________________________________________________________________________
production_companies (InputLaye [(None, 1)]          0                                            
______________________________________________________________________________________________

Then we train!

In [12]:
model.compile(optimizer = "adam",
              loss = 'mse'
)
history = model.fit(train, 
                    validation_data=val,
                    epochs = 50)

Epoch 1/50
...
Epoch 50/50


And lastly, we evaluate.

In [14]:
train_mse = model.evaluate(train)
train_var = y_train.var()

val_mse = model.evaluate(val)
val_var = y_val.var()

print(f'Train MSE: {train_mse}')
print(f'Train r^2: {(train_var - train_mse)/train_var}')

print(f'Validation MSE: {val_mse}')
print(f'Validation r^2: {(val_var - val_mse)/val_var}')

Train MSE: 0.006774891167879105
Train r^2: 0.33446840790368193
Validation MSE: 0.007823353633284569
Validation r^2: 0.25895581959762376
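To read these numbers on the original 0-10 vote scale, the MSE can be turned into a root-mean-square error and multiplied by 10. The sketch below uses the MSE values printed above; the helper names are hypothetical. It also restates the r^2 formula from the cell in its more common form, `1 - MSE / Var(y)`, which is algebraically the same as `(var - mse) / var`.

```python
import math

def r_squared(mse, variance):
    """Fraction of label variance explained: 1 - MSE / Var(y)."""
    return 1.0 - mse / variance

def rmse_in_points(mse, scale=10.0):
    """RMSE on the original 0-10 vote scale, given MSE on the 0-1 scale."""
    return scale * math.sqrt(mse)

train_mse = 0.006774891167879105  # from the output above
val_mse = 0.007823353633284569    # from the output above

print(rmse_in_points(train_mse))
print(rmse_in_points(val_mse))
print(r_squared(0.5, 1.0))  # toy check: MSE at half the variance gives r^2 = 0.5
```

So a validation MSE of about 0.0078 corresponds to a typical error of roughly 0.9 points out of 10.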
