Cooking Recipes: An Analysis of Ratings, Nutrition, and Tags

William Siew

Introduction

Cooking recipes are a fascinating type of data. A single recipe is rich in data: it has a list of ingredients, nutrition information (calories, fat, protein, sodium, etc.), ratings, and in some cases a site may even use tags to categorize its recipes. Ratings of food and their recipes seem almost completely subjective and up to the individual. Perhaps they depend simply on one's enjoyment of the dish, or maybe on the simplicity of a recipe. Either way, I'd like to make an attempt to look at recipes not from a subjective human standpoint but from a data-driven one. Other than being an avid cook, this topic piqued my interest because of how unique it is. From the structure of the data to the unpredictability of the relationships between variables, I'm excited to see what I can find.

Data Acquisition

This dataset was acquired from Kaggle.com, a popular data science site where users can share datasets and their explorations of them. This dataset was made by Kaggle user Hugo Darwood, who scraped Epicurious, a cooking website/blog.

We should keep in mind that this dataset is from 2016, so some things may be outdated, and a lot of our findings will likely be specific to Epicurious.

Packages/Libraries

Before we get started, let's establish the packages/libraries we will be using. The reason for using each package is listed next to it.

In [1]:
import pandas as pd # For general data representation
import numpy as np # For general data manipulation
import seaborn as sns # For graphing
import matplotlib.pyplot as plt # For graphing/combining different plots
from collections import defaultdict # For counting
from sklearn import model_selection # For splitting data into train/test sets for regression
from sklearn import linear_model # For regressions
import statsmodels.api as sm # For regressions
import statsmodels.formula.api as smf # For regressions
import sklearn.metrics as metrics # For regression metrics
import operator # for backward elimination efficiency

The Structure of the Data

We have 680 columns/variables. Title, rating, calories, protein, fat, and sodium are self-explanatory. All the other variables are tags, which the website uses to categorize dishes. About 80% of them are ingredients; the rest are things like events, seasons, dietary restrictions, countries/states, etc. Some examples of these tags are sugar, #cakeweek, christmas, vegan, and china.

In [2]:
# Read CSV file we got from Kaggle
food_data = pd.read_csv("food/food_data.csv", sep=',')
food_data.head()
Out[2]:
title rating calories protein fat sodium #cakeweek #wasteless 22-minute meals 3-ingredient recipes ... yellow squash yogurt yonkers yuca zucchini cookbooks leftovers snack snack week turkey
0 Lentil, Apple, and Turkey Wrap 2.500 426.0 30.0 7.0 559.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1 Boudin Blanc Terrine with Red Onion Confit 4.375 403.0 18.0 23.0 1439.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 Potato and Fennel Soup Hodge 3.750 165.0 6.0 7.0 165.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 Mahi-Mahi in Tomato Olive Sauce 5.000 NaN NaN NaN NaN 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 Spinach Noodle Casserole 3.125 547.0 20.0 32.0 452.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 680 columns

Data Cleaning and Tidying

Here is the list of things we need to do with this dataset:

  • Convert all the tag columns to be integers instead of floats for better performance and to better represent them as binary variables.
  • Remove rows with unrealistic values for nutrition; the thresholds for "unrealistic" are listed as comments in the code.
  • Normally, we shouldn't just remove rows with missing data (NaN), but it's hard to classify them as MNAR, MCAR, or MAR since we didn't handle data acquisition, so we will remove the NaN rows (a quick look at how much data this costs us follows this list).
  • It might be worth exploring the number of tags present in a recipe, so we will add that as a column.
  • Remove title as it has no use in the data set.
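
Before the cleaning cell below, here's a quick, illustrative check of how much missing data we're actually giving up. This is a minimal sketch, not part of the original analysis; it assumes food_data has already been loaded as above.

# Count missing values in the nutrition columns and the total number of rows,
# so we know roughly how many recipes dropna() will cost us.
print(food_data[['calories', 'protein', 'fat', 'sodium']].isna().sum())
print("total rows:", len(food_data))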
In [3]:
# make all tags integers
integer_columns = food_data.columns[6:]

# Convert all the tag columns to integers instead of floats for better performance and to better represent them as binary variables
temp = {}
for column in integer_columns:
    temp[column] = int
food_data = food_data.astype(temp)


# Remove rows with out-of-range or unrealistic values
food_data = food_data[food_data['rating'] >= 0] # Ratings should be between 0 and 5
food_data = food_data[food_data['rating'] <= 5]
food_data = food_data[food_data['calories'] < 10000] # anything above 10000, we'll consider an error.
food_data = food_data[food_data['protein'] < 5000] # anything above 5000, we'll consider an error. 
food_data = food_data[food_data['fat'] < 5000] # anything above 5000, we'll consider an error. 
food_data = food_data[food_data['sodium'] < 5000] # anything above 5000, we'll consider an error. 

# Remove NaNs
food_data.dropna(inplace=True)

# Compile number of tags used in each row
tags = food_data.columns[6:]
row_counts = []
for index, row in food_data.iterrows():
    count = 0
    for tag in tags:
        if row[tag] == 1:
            count += 1
    row_counts.append(count)
    
# insert the new column at index 5
food_data.insert(5, 'tags_count', row_counts)

# Remove title as it has no use in the data set
food_data.drop(columns=['title'], axis = 1, inplace = True)
print(food_data.head())
   rating  calories  protein   fat  tags_count  sodium  #cakeweek  #wasteless  \
0   2.500     426.0     30.0   7.0          11   559.0          0           0   
1   4.375     403.0     18.0  23.0          11  1439.0          0           0   
2   3.750     165.0      6.0   7.0           7   165.0          0           0   
4   3.125     547.0     20.0  32.0          11   452.0          0           0   
5   4.375     948.0     19.0  79.0          10  1042.0          0           0   

   22-minute meals  3-ingredient recipes  ...  yellow squash  yogurt  yonkers  \
0                0                     0  ...              0       0        0   
1                0                     0  ...              0       0        0   
2                0                     0  ...              0       0        0   
4                0                     0  ...              0       0        0   
5                0                     0  ...              0       0        0   

   yuca  zucchini  cookbooks  leftovers  snack  snack week  turkey  
0     0         0          0          0      0           0       1  
1     0         0          0          0      0           0       0  
2     0         0          0          0      0           0       0  
4     0         0          0          0      0           0       0  
5     0         0          0          0      0           0       0  

[5 rows x 680 columns]
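
As an aside, the row-by-row loop above that counts tags works but is slow on a dataset of this size. Since the tag columns are already 0/1 integers, the same counts can be computed in one vectorized call. A minimal equivalent sketch (not used in the original cell):

# Vectorized equivalent of the tag-counting loop: summing each row across the
# 0/1 tag columns gives the number of tags set for that recipe.
row_counts = food_data[tags].sum(axis=1)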

Exploratory Data Analysis

Some Summary Statistics

First let's look at the distributions of all our variables:

  • Rating: seems to be roughly normally distributed, centered in the range 4 to 4.5. There are also some outliers near 0, but not enough that I'd consider it bimodal.

    • Additionally, 1 to 3-star ratings are virtually non-existent.
  • Calories, Protein, Fat, Sodium: all look roughly bell-shaped but are skewed right.

    • Sodium has a very wide range of values.
      • Recipes tend to be very diverse in sodium levels.
    • Protein, fat, and sodium tend to be closer to 0.
      • Most recipes have very low amounts of protein, fat, and sodium.
    • Calories has the most "normal"-looking distribution and is centered around 400.
      • A typical recipe has roughly 400 calories.
  • Tag Count: Looks almost bimodal with the larger peak centered at 8 and the smaller peak at 15.

    • This just means most recipes have either around 8 or around 15 tags.
In [4]:
fig = plt.figure()
fig.subplots_adjust(hspace=0.4, wspace=0.4)
variables = ["rating","calories","protein","fat","sodium", "tags_count"]

fig.set_size_inches(18.5, 15.5)
for i in range(len(variables)): 
    ax = fig.add_subplot(6, 1, i + 1)
    if variables[i] == 'rating':
        sns.distplot(food_data[variables[i]],bins=[0,0.5,1,1.5,2,2.5,3,3.5,4,4.5,5], ax=ax)
    else:
        sns.distplot(food_data[variables[i]], ax=ax)
    #ax.set_title(variables[i])
plt.show()
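
A small note if you rerun this with a current seaborn: distplot has since been deprecated, and histplot is the closest replacement. A one-line sketch of the equivalent call (assuming seaborn >= 0.11):

# histplot with kde=True approximates the old distplot look for one variable.
sns.histplot(food_data['calories'], kde=True)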

For our tag variables, the only summary statistics I could think of are the top 20 most popular tags.

In [29]:
tags_count = defaultdict(int) # map to count tag occurrences

# count occurrences of each tag
for index, row in food_data.iterrows():
    for tag in tags:
        if row[tag] == 1:
            tags_count[tag] += 1
tags_count = dict(tags_count)
total_rows = len(food_data.index)

# sort them and change count to percentage
tags_count = {k: v/total_rows for k, v in sorted(tags_count.items(), key=lambda item: item[1], reverse = True)}# sort by count

# put the keys and values into parallel lists so the plotting order matches the sort
plot_tags = []
plot_count = []
for k,v in tags_count.items():
    plot_tags.append(k)
    plot_count.append(v)

# set plot size and color scheme
sns.set(rc={'figure.figsize':(11.7,10.27),"axes.labelsize":2})
sns.color_palette("mako")
# Use seaborn to plot bar plot
plot = sns.barplot(x=plot_count[:20], y=plot_tags[:20])
plot.set_xlabel("Frequency in recipes",fontsize=20)
plot.set_ylabel("Tags",fontsize=20)
plot.set_title("Most Popular Tags",fontsize = 25)
Out[29]:
Text(0.5, 1.0, 'Most Popular Tags')

Observations: It appears that the most popular tags tend to be related to dietary restrictions, and each of them appears in anywhere from 10 to 40% of recipes. In case you are wondering why Bon Appétit is the number one result, Bon Appétit is the popular food magazine that owns Epicurious, so its high frequency is expected.

Let's start exploring their relationships!

First, let's explore the relationship between Ratings and Calories/Protein/Fat/Sodium/Count of Tags. We'll make scatterplots using seaborn, which come with automatic regression lines. Something important to note is that these visuals are not completely accurate: I've added jitter to the points, which randomly spreads them within a small range so we can see which parts of the scatterplot are more concentrated.

In [6]:
# We add the jitter to make the distribution of data points clearer, it does not influence the regression at all
fig = plt.figure()
fig.set_size_inches(18.5, 15.5)
fig.subplots_adjust(hspace=0.4, wspace=0.4)
predictors = ["calories","protein","fat","sodium","tags_count"]

# For each predictor, make a subplot and add jitters to it
for i in range(len(predictors)): 
    ax = fig.add_subplot(3, 2, i + 1)
    ax.set(ylim=(0, 8))
    if predictors[i] == "tags_count":
        sns.regplot(x="tags_count", y="rating", data=food_data, x_jitter = .45, y_jitter = .45, scatter_kws={"color": "blue"}, line_kws={"color": "red"}, ax=ax)
    else:           
        sns.regplot(x=predictors[i], y="rating", data=food_data, y_jitter=.2, scatter_kws={"color": "blue"}, line_kws={"color": "red"}, ax=ax)
    ax.set_title(predictors[i], fontsize = 20)
    ax.set_ylabel('Ratings', fontsize = 15)

plt.show()

Observations:

  • It's hard to see how much each predictor affects ratings simply from the slope of the lines as they all operate on different scales with different units.
    • E.g., an increase of 1 tag is not comparable to an increase of 1 unit of sodium when talking about their impact on rating.
  • It looks like these relationships are linear and we will not need to use polynomials for regression.
  • Calories, Protein, and Fat seem to have a positive correlation with Ratings but very large spreads, so they are likely to be poor predictors.
  • Sodium and Tags Count seem to have a positive correlation with Ratings and a small spread. These are likely to be better predictors.

Multiple Linear Regression

The graphs before were simple linear regressions: they were plotted with only 1 predictor and 1 response. Therefore, our previous results may not be that accurate, so let's fit a multiple linear regression instead. I like to structure all my regression models in the form of a question for easier interpretation, so let's jump right in!

Question: Can we predict ratings using just nutrition information(calories, protein, fat, sodium)?

  1. First, we need to split the data into a training set and a testing set. We'll use the training set to build the model and the testing set to evaluate its predictive performance. We are using 80% of our data for training and 20% for testing.
  2. Next, we'll be using the statsmodel package to run a regression. We are using Rating as our dependent variable and Calories, Protein, Fat, Sodium as our predictors.
In [7]:
# separate tag columns for future use
tags = list(food_data.columns[6:])
# use separate data table instead of main one
X = food_data[['calories','protein','fat','sodium','tags_count'] + tags]
Y = food_data['rating']
# split data into 80% and 20%
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=.2)
    
X_temp = X_train[['calories','protein','fat','sodium']] 
Y_temp = Y_train

# Fit the Linear Regression on Train split
model = sm.OLS(Y_temp, X_temp).fit()
model.summary()
Out[7]:
                            OLS Regression Results
===============================================================================
Dep. Variable:                 rating   R-squared (uncentered):          0.475
Model:                            OLS   Adj. R-squared (uncentered):     0.475
Method:                 Least Squares   F-statistic:                     2836.
Date:                Mon, 21 Dec 2020   Prob (F-statistic):               0.00
Time:                        15:03:16   Log-Likelihood:                -31022.
No. Observations:               12535   AIC:                         6.205e+04
Df Residuals:                   12531   BIC:                         6.208e+04
Df Model:                           4
Covariance Type:            nonrobust
===============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
calories       0.0045      0.000     35.189      0.000       0.004       0.005
protein       -0.0104      0.001     -9.490      0.000      -0.013      -0.008
fat           -0.0269      0.002    -16.638      0.000      -0.030      -0.024
sodium         0.0014   4.32e-05     33.559      0.000       0.001       0.002
===============================================================================
Omnibus:                     8789.879   Durbin-Watson:                   1.196
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           247094.714
Skew:                          -3.013   Prob(JB):                         0.00
Kurtosis:                      23.900   Cond. No.                         64.8
===============================================================================


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Observations:
  1. From the t-tests of our predictors, it seems all of our variables are significant predictors.
  2. From the coefficients, calories and sodium contribute to a higher rating while protein and fat contribute to a lower rating. Note that this is different from our previous observation.
  3. The R-squared is about 48%, meaning our model explains only about half of the variance in ratings, which is a sign that our model's predictive power is not very good. (See the note after this list about why this R-squared is the "uncentered" version.)
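
One caveat worth flagging (an aside, not part of the original model): sm.OLS was fit without an intercept term, which is why the summary reports the "uncentered" R-squared, a version that tends to look higher than the usual centered one. Adding an intercept would be a one-line change; a minimal sketch using the same training columns as above:

# Hypothetical variant with an intercept, so the summary reports the usual
# centered R-squared instead of the uncentered version.
X_with_const = sm.add_constant(X_temp)
model_with_const = sm.OLS(Y_temp, X_with_const).fit()
print(model_with_const.rsquared)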

So our model isn't the best, but let's try using it to predict anyway. Two of the most popular metrics used to evaluate the predictive performance of a regression are MAE and RMSE.

Mean Absolute Error (MAE) is the average amount by which our predictions differ from the actual results. (We'll use this mainly, as it is easier to interpret.)

Root Mean Square Error (RMSE) is the square root of the average squared error. Compared to MAE, RMSE penalizes large errors far more.
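
As a tiny worked example of the two formulas (the numbers here are made up and are not from our data):

# Toy illustration of MAE vs. RMSE on three made-up predictions.
actual = np.array([4.0, 3.5, 5.0])
predicted = np.array([3.0, 4.0, 4.5])
errors = actual - predicted                  # [1.0, -0.5, 0.5]
mae = np.mean(np.abs(errors))                # (1.0 + 0.5 + 0.5) / 3 ≈ 0.67
rmse = np.sqrt(np.mean(errors ** 2))         # sqrt((1.0 + 0.25 + 0.25) / 3) ≈ 0.71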

In [8]:
# only pick nutrition information for this regression
X_temp = X_test[['calories','protein','fat','sodium']] 
predicted = model.predict(X_temp)
actual = Y_test

# Get MAE and RMSE
mae = metrics.mean_absolute_error(actual, predicted)
rmse = np.sqrt(metrics.mean_squared_error(actual, predicted)) 
print("MAE:",mae,"RMSE:",rmse)
MAE: 2.469128356274956 RMSE: 2.921623776555863

From our MAE result, our model on average predicts ratings with an error of about 2.47 (and our ratings only go from 0 to 5, so that's around half of the entire range!). Our model is predicting very inaccurately.

Decision: No, nutrition information alone is not enough to predict Rating but let's try again with more variables.

New Question: Can we predict ratings using nutrition information and the count of tags?

In [9]:
# only pick non-tag predictors for this regression
X_temp = X_train[['calories','protein','fat','sodium','tags_count']] 
Y_temp = Y_train

# Fit the Linear Regression on Train split
model = sm.OLS(Y_temp, X_temp).fit()
model.summary()
Out[9]:
                            OLS Regression Results
===============================================================================
Dep. Variable:                 rating   R-squared (uncentered):          0.814
Model:                            OLS   Adj. R-squared (uncentered):     0.814
Method:                 Least Squares   F-statistic:                 1.100e+04
Date:                Mon, 21 Dec 2020   Prob (F-statistic):               0.00
Time:                        15:03:16   Log-Likelihood:                -24507.
No. Observations:               12535   AIC:                         4.902e+04
Df Residuals:                   12530   BIC:                         4.906e+04
Df Model:                           5
Covariance Type:            nonrobust
===============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
calories       0.0004   8.06e-05      5.378      0.000       0.000       0.001
protein        0.0035      0.001      5.262      0.000       0.002       0.005
fat            0.0005      0.001      0.494      0.621      -0.001       0.002
sodium         0.0005   2.64e-05     19.198      0.000       0.000       0.001
tags_count     0.2311      0.002    151.338      0.000       0.228       0.234
===============================================================================
Omnibus:                     1156.087   Durbin-Watson:                   1.890
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1628.237
Skew:                          -0.742   Prob(JB):                         0.00
Kurtosis:                       3.958   Cond. No.                         104.
===============================================================================


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Observations:
  1. From the t-tests of our predictors, the number of tags is also a significant predictor (interestingly, fat is no longer significant once tags_count is included).
  2. The R-squared is now about 81%. This is a big improvement over the previous model.
In [10]:
# get the testing data for the variables we used
X_temp = X_test[['calories','protein','fat','sodium','tags_count']] 
predicted = model.predict(X_temp)
actual = Y_test

# Get MAE and RMSE
mae = metrics.mean_absolute_error(actual, predicted)
rmse = np.sqrt(metrics.mean_squared_error(actual, predicted)) 
print("MAE:",mae,"RMSE:",rmse)
MAE: 1.330207610015838 RMSE: 1.6499202446204

This time our MAE is a lot lower, but it is still fairly high: predictions are off by about 1.3 on average.

Decision: Technically, yes we can, but the predictive accuracy is not very high. We can do better; let's now try including the tags themselves as categorical predictors.

New Question: Can we predict ratings using nutrition information, the number of tags, and specific tags?

  1. We are using Rating as our dependent variable and Calories, Protein, Fat, Sodium, the number of tags and almost 700 tags as our predictors.
  2. Odds are, a lot of those tags will not be significant predictors. To make the best model, we will remove the variables that are not significant. There are many methods for variable selection, but I will be using Backward Elimination: repeatedly drop the predictor with the highest p-value until everything left is significant at the 5% level. The model needs to be refitted after every removal, so expect the code to take a while (around 10 minutes) to finish.

For more info on Backward Elimination, I recommend checking out this Towards Data Science blog which explains it pretty well. Variable selection methods are extremely useful especially when building regression models.

In [11]:
# Made a copy because we are going to be removing variables as we go
X_temp = X_test.copy(deep = True)
Y_temp = Y_test.copy(deep = True)

# Helper method for Backward Elimination
def remove_most_insignificant(df, results):
    # use operator to find the predictor with the largest p-value
    max_p_value = max(results.pvalues.items(), key=operator.itemgetter(1))[0]
    # drop that predictor from the design matrix
    df.drop(columns=max_p_value, inplace=True)
    return df

insignificant_feature = True

# Backward Elimination Code
with np.errstate(divide='ignore', invalid='ignore'):
    while insignificant_feature:
        # fit regression
        model = sm.OLS(Y_temp, X_temp)
        results = model.fit()
        # find which predictors are significant at the 5% level
        significant = [p_value < 0.05 for p_value in results.pvalues]
        if all(significant):
            insignificant_feature = False
        else:
            if X_temp.shape[1] == 1:  # if there's only one insignificant variable left
                print('No significant features found')
                results = None
                insignificant_feature = False
            else:
                X_temp = remove_most_insignificant(X_temp, results)
    print(results.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                 rating   R-squared (uncentered):                   0.905
Model:                            OLS   Adj. R-squared (uncentered):              0.902
Method:                 Least Squares   F-statistic:                              333.6
Date:                Mon, 21 Dec 2020   Prob (F-statistic):                        0.00
Time:                        15:10:33   Log-Likelihood:                         -5097.7
No. Observations:                3134   AIC:                                  1.037e+04
Df Residuals:                    3047   BIC:                                  1.090e+04
Df Model:                          87                                                  
Covariance Type:            nonrobust                                                  
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
protein                   0.0027      0.001      2.551      0.011       0.001       0.005
sodium                    0.0003   4.14e-05      7.943      0.000       0.000       0.000
tags_count                0.2187      0.007     29.439      0.000       0.204       0.233
22-minute meals           1.6039      0.628      2.554      0.011       0.372       2.836
advance prep required     0.6704      0.298      2.249      0.025       0.086       1.255
aperitif                 -2.5641      1.262     -2.031      0.042      -5.039      -0.089
artichoke                -0.5513      0.227     -2.425      0.015      -0.997      -0.106
bake                      0.2076      0.060      3.454      0.001       0.090       0.325
beef                      0.2384      0.110      2.173      0.030       0.023       0.454
blender                  -0.5071      0.134     -3.774      0.000      -0.770      -0.244
bon appétit               1.3975      0.067     20.892      0.000       1.266       1.529
buffet                   -0.7224      0.273     -2.648      0.008      -1.257      -0.188
burrito                   1.6039      0.628      2.554      0.011       0.372       2.836
butter                    0.4572      0.210      2.180      0.029       0.046       0.868
buttermilk               -0.9937      0.478     -2.077      0.038      -1.932      -0.056
campari                  -3.0246      1.254     -2.411      0.016      -5.484      -0.565
cardamom                 -0.9569      0.385     -2.487      0.013      -1.711      -0.202
chartreuse               -3.2733      0.952     -3.439      0.001      -5.140      -1.407
chicken                   0.2622      0.110      2.380      0.017       0.046       0.478
chocolate                 0.4151      0.113      3.683      0.000       0.194       0.636
christmas eve            -0.7559      0.191     -3.960      0.000      -1.130      -0.382
cocktail party            0.3301      0.119      2.771      0.006       0.097       0.564
cookie                    1.0386      0.403      2.578      0.010       0.249       1.828
cumin                     0.7370      0.341      2.164      0.031       0.069       1.405
custard                  -3.2024      1.249     -2.563      0.010      -5.652      -0.753
dairy free               -0.2757      0.090     -3.060      0.002      -0.452      -0.099
dinner                    0.1812      0.084      2.163      0.031       0.017       0.345
diwali                    1.7205      0.568      3.029      0.002       0.607       2.834
double boiler            -1.2076      0.456     -2.645      0.008      -2.103      -0.313
edible gift               0.8535      0.276      3.095      0.002       0.313       1.394
fortified wine            1.1239      0.504      2.229      0.026       0.135       2.113
fry                       0.5058      0.156      3.244      0.001       0.200       0.812
gin                      -0.7784      0.251     -3.101      0.002      -1.271      -0.286
goat cheese               0.5345      0.202      2.642      0.008       0.138       0.931
gourmet                   1.3806      0.067     20.641      0.000       1.249       1.512
green onion/scallion     -0.7389      0.331     -2.236      0.025      -1.387      -0.091
halloween                 1.3038      0.447      2.916      0.004       0.427       2.180
harpercollins            -1.1545      0.281     -4.113      0.000      -1.705      -0.604
high fiber               -0.3155      0.118     -2.665      0.008      -0.548      -0.083
house & garden            0.4932      0.178      2.766      0.006       0.144       0.843
hummus                    2.5270      1.251      2.020      0.043       0.074       4.979
jícama                   -0.8953      0.381     -2.353      0.019      -1.641      -0.149
kosher                   -0.3042      0.124     -2.452      0.014      -0.548      -0.061
lamb                      0.3522      0.168      2.092      0.036       0.022       0.682
lemon juice               0.5558      0.266      2.091      0.037       0.035       1.077
liqueur                  -0.8035      0.184     -4.377      0.000      -1.163      -0.444
lunch                    -0.4341      0.103     -4.210      0.000      -0.636      -0.232
mardi gras                1.2528      0.425      2.949      0.003       0.420       2.086
margarita                 1.4399      0.635      2.269      0.023       0.196       2.684
meatball                 -5.5963      1.307     -4.280      0.000      -8.160      -3.033
mixer                    -0.2903      0.129     -2.258      0.024      -0.542      -0.038
mother's day             -0.7876      0.234     -3.368      0.001      -1.246      -0.329
paleo                    -0.4027      0.135     -2.988      0.003      -0.667      -0.138
party                    -0.3396      0.130     -2.615      0.009      -0.594      -0.085
pastry                    1.2100      0.263      4.600      0.000       0.694       1.726
peanut                    0.5167      0.247      2.094      0.036       0.033       1.000
peanut free              -0.6703      0.091     -7.389      0.000      -0.848      -0.492
pescatarian              -0.3369      0.128     -2.625      0.009      -0.589      -0.085
philippines              -3.3267      1.249     -2.663      0.008      -5.776      -0.877
pickles                  -3.9709      0.910     -4.362      0.000      -5.756      -2.186
potluck                  -0.4748      0.162     -2.937      0.003      -0.792      -0.158
poultry                  -0.3854      0.159     -2.421      0.016      -0.698      -0.073
punch                    -1.5120      0.458     -3.302      0.001      -2.410      -0.614
radicchio                 2.9737      1.248      2.382      0.017       0.526       5.422
radish                   -0.4566      0.195     -2.345      0.019      -0.838      -0.075
roast                     0.2011      0.098      2.062      0.039       0.010       0.392
rum                      -0.5195      0.208     -2.503      0.012      -0.926      -0.113
salad dressing            0.8864      0.228      3.890      0.000       0.440       1.333
salmon                    0.5797      0.166      3.485      0.000       0.254       0.906
sauce                     0.3486      0.097      3.585      0.000       0.158       0.539
self                      2.1426      0.635      3.372      0.001       0.897       3.388
semolina                  1.4104      0.633      2.229      0.026       0.170       2.651
shower                   -0.4490      0.204     -2.196      0.028      -0.850      -0.048
soup/stew                 0.2987      0.102      2.929      0.003       0.099       0.499
spirit                   -0.8157      0.272     -3.003      0.003      -1.348      -0.283
spritzer                 -3.6976      1.254     -2.948      0.003      -6.157      -1.238
st. patrick's day        -1.9395      0.893     -2.172      0.030      -3.690      -0.189
stir-fry                  0.4436      0.190      2.331      0.020       0.070       0.817
tart                      2.4742      1.252      1.977      0.048       0.020       4.928
tomato                    0.2213      0.078      2.848      0.004       0.069       0.374
tree nut                 -0.6678      0.272     -2.457      0.014      -1.201      -0.135
tuna                      0.7590      0.267      2.845      0.004       0.236       1.282
veal                      0.6153      0.245      2.511      0.012       0.135       1.096
vegan                    -0.3365      0.103     -3.265      0.001      -0.539      -0.134
vegetable                -0.1696      0.076     -2.217      0.027      -0.320      -0.020
vegetarian                0.2536      0.084      3.002      0.003       0.088       0.419
wasabi                    2.6588      0.895      2.972      0.003       0.905       4.413
whiskey                   1.3453      0.498      2.701      0.007       0.369       2.322
==============================================================================
Omnibus:                      166.933   Durbin-Watson:                   1.956
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              341.799
Skew:                          -0.366   Prob(JB):                     6.02e-75
Kurtosis:                       4.443   Cond. No.                     1.29e+19
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.27e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Observations:

  1. Only 88 of our nearly 680 original predictors (about 13%) turned out to be significant.
  2. Protein, Sodium, and the number of tags remain significant predictors in our final model (Calories and Fat were eliminated).
  3. The R-squared is now about 90%, meaning our model explains most of the variance. This is a good sign.
  4. The AIC and BIC scores (which can be used to compare models) are significantly lower than in our previous two models, so this model can be considered far better.
  5. In the footnotes, statsmodels mentions we potentially have multicollinearity problems. For our data, this is likely true, but for the simplicity of this project we won't try to address it because it would take a whole blog post of its own. A quick way to at least measure it is sketched right after this list. If you are interested in what multicollinearity is and how to fix it, check out this Towards Data Science blog.
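
If you do want a quick read on how strong the multicollinearity is, variance inflation factors (VIFs) are a common check. This is a minimal sketch, not run as part of this analysis; X_temp is assumed to still hold the design matrix left over after Backward Elimination.

# Compute one VIF per surviving predictor; values well above ~10 are usually
# taken as a sign of strong multicollinearity.
from statsmodels.stats.outliers_influence import variance_inflation_factor

vifs = pd.Series(
    [variance_inflation_factor(X_temp.values, i) for i in range(X_temp.shape[1])],
    index=X_temp.columns,
)
print(vifs.sort_values(ascending=False).head(10))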
In [12]:
# Get predictors that weren't removed by Backward Elimination
filtered_predictors = list(results.params.index)
X_temp = X_test[filtered_predictors] 
# Predict using test data
predicted = results.predict(X_temp)
actual = Y_test

# Get MAE and RMSE
mae = metrics.mean_absolute_error(actual, predicted)
rmse = np.sqrt(metrics.mean_squared_error(actual, predicted)) 
print("MAE:",mae,"RMSE:",rmse)
MAE: 0.916703536203574 RMSE: 1.230756552649044

Our MAE is a lot lower now. Our predictions of rating are only off by about 0.9, which is a lot better compared to what we started with! There are probably more things we could try to improve the model, but I'll stop here.

Decision: We originally asked if we could use ALL of our predictors, so the answer is no. But we can use protein, sodium, the number of tags, and 85 of the tags to predict ratings with a relatively high accuracy.

So what tags contribute to higher Ratings for cooking recipes?

In [28]:
d = {'predictor': results.params.index, 'coefficient': results.params.values}
df = pd.DataFrame(data=d)

df.sort_values(by = 'coefficient', ascending = False, inplace = True)

sns.set(rc={'figure.figsize':(11.7,10.27),"axes.labelsize":2})
sns.color_palette("mako")

top_bottom_coefficients = list(df['coefficient'][:20]) + list(df['coefficient'][-20:])
top_bottom_tags = list(df['predictor'][:20]) + list(df['predictor'][-20:])

plot = sns.barplot(x=top_bottom_coefficients, y=top_bottom_tags)
plot.set_xlabel("Effect on Rating when Present",fontsize=20)
plot.set_ylabel("Tags",fontsize=20)
plot.set_title("Top and Bottom 20 Most Influential tags",fontsize = 25)
Out[28]:
Text(0.5, 1.0, 'Top and Bottom 20 Most Influential tags')

The Big Winners

Radicchio, Wasabi, Hummus

The Biggest Loser: Meatballs?

Interesting Observations:

  1. When Meatball is included as a tag in a recipe, its rating is expected to go down by 5.59! This number is surprising, especially when you consider that Rating only goes from 0 to 5. The only explanation I can think of for how a drop larger than 5 is possible is that the other tags present in meatball recipes tend to boost their ratings back up.

  2. The top 3 tags (radicchio, wasabi, and hummus) could be categorized as healthy and trendy ingredients (this is just speculation, though). While there isn't enough data to make any conclusions, it's possible that a preference for healthy, trendy cooking reflects the Epicurious userbase.

  3. Users also seem to rate 22-minute recipes higher.

  4. Alcohol seems to be an influential tag category in general. 7 of the bottom 20 are alcohol related: gin, liqueur, spirit, aperitif, campari, chartreuse, spritzer. 3 of the top 20 are whiskey, fortified wine, and margarita.

  5. Jicama and Cardamom are associated with lower ratings and from a quick Google search both are somewhat "acquired tastes".

  6. There is a lot of weird randomness in the data. I'll mention this in my reflection, but I really wish there were an easy way to categorize the tags, as I feel it would help me make sense of these odd observations.

    • Events/Occasions like Diwali, Christmas Eve, St Patricks Day, Mother's Day, Mardi Gras, and Halloween have a surprisingly large impact on ratings.
    • Philippines is the only location tag considered significant and it's associated with lower ratings.

Conclusion and Reflection

I said at the start that cooking recipe ratings were subjective. From our analysis, I think you'd agree that predicting recipe ratings with high accuracy is hard. When working with a very human topic like food, as opposed to something very numerically driven, a data-driven solution may not be the best approach. Still, we've uncovered some interesting patterns that would not have been easily discovered otherwise.

We also got to go through the entire data science process. From cleaning the data to some initial analysis, we dived headfirst into regression and steadily built regression models, each one better than the last.

This project's greatest challenge was cleaning the data and handling the large number of predictors. I also first did the project with only the ingredient tags, but that introduced a lot of bias into my statistics, so I decided to keep all of the tags.

I like to end all my projects with things I'd do if I had more time. Here is a list of things I'd like to do:

  1. Making my own scraper would have been ideal. Understanding, formatting, and cleaning the data would have been easier and I also would have gotten more up-to-date data.
  2. I should have gotten more sources of data if my theme was cooking recipes. Without a doubt, a lot of the data and results were biased towards Epicurious.
  3. Finding a more efficient yet effective variable selection method. Forward Selection or Stepwise Regression would be interesting to test out.
  4. Categorizing the tags. Unfortunately, there is no easy way to categorize the tags and there are too many of them to do it by hand. I'm confident that if we could categorize them, we would see even more interesting patterns.

If you've made it to the end, thanks for reading everything. As a token of my appreciation, here's a video of a pumpkin guy eating stuff.