Cooking recipes are a fascinating type of data. A single recipe is rich in data; it has a list of ingredients, nutrition information(calories,fat,protein,sodium,etc), ratings, and in some cases, a site may even use tags to categorize their recipes. Ratings of food and their recipes seems almost completely subjective and up to the individual. Perhaps they depend simply on one's enjoyment of the dish or maybe the simplicity of a recipe. Either way, I'd like to make an attempt to look at recipes not from a subjective human standpoint but from a data-driven one. Other than being an avid cook, this topic piqued my interest because of how unique it was. From the structure of the data to the unpredictability of the relationships between variables, I'm excited to see what I can find.
This dataset was acquired from Kaggle.com, a popular data science site where users can share data sets and their explorations with them. This dataset was made by Kaggle user Hugo Darwood who scraped Epicurious, a cooking website/blog.
We should keep in mind that this dataset is from 2016 so some things may be outdated and alot of our findings will likely be catered specifically to Epicurious.
Before we get started, let's establish the packages/libraries we will be using. Reasons for using each package listed next to it.
import pandas as pd # For general data representation
import numpy as np # For general data manipulation
import seaborn as sns # For graphing
import matplotlib.pyplot as plt # For graphing/combining different plots
from collections import defaultdict # For counting
from sklearn import model_selection # For split datasets for regression
from sklearn import linear_model # For regressions
import statsmodels.api as sm # For regressions
import statsmodels.formula.api as smf # For regressions
import sklearn.metrics as metrics # For regression metrics
import operator # for backward elimination efficiency
We have 680 columns/variables. Title, rating, calories, protein, fat, and sodium are self-explanatory. All the other variables are tags. Tags are used by the website to categorize dishes. About 80% of them are ingredients. The rest are things like events, seasons, dietary restrictions, countries/states, etc. Some examples of these tags are sugar, #cakeweek, christmas, vegan, china.
# Read CSV file we got from Kaggle
food_data = pd.read_csv("food/food_data.csv", sep=',')
food_data.head()
# make all tags integers
integer_columns = food_data.columns[6:]
# Convert all the ingredients columns to be integers no floats for better performance and to better represent them as binary variables
temp = {}
for column in integer_columns:
temp[column] = int
food_data = food_data.astype(temp)
# remove all rows with more than 10000 calories
food_data = food_data[food_data['rating'] >= 0] # Ratings should be between 0 and 5
food_data = food_data[food_data['rating'] <= 5]
food_data = food_data[food_data['calories'] < 10000] # anything above 10000, we'll consider an error.
food_data = food_data[food_data['protein'] < 5000] # anything above 5000, we'll consider an error.
food_data = food_data[food_data['fat'] < 5000] # anything above 5000, we'll consider an error.
food_data = food_data[food_data['sodium'] < 5000] # anything above 5000, we'll consider an error.
# Remove NaNs
food_data.dropna(inplace=True)
# Compile number of tags used in each row
tags = food_data.columns[6:]
row_counts = []
for index, row in food_data.iterrows():
count = 0
for tag in tags:
if row[tag] == 1:
count += 1
row_counts.append(count)
# insert the new column at index 5
food_data.insert (5, 'tags_count', row_counts)
# Remove title as it has no use in the data set
food_data.drop(columns=['title'], axis = 1, inplace = True)
print(food_data.head())
First let's look at the distributions of all our variables:
Rating: seems to be normally distributed centered in the range 4 to 4.5. There are also some outliers near 0 but it's not enough that I'd consider it bimodal.
Calories, Protein, Fat, Sodium: All seem to be normally distributed and skewed right.
Tag Count: Looks almost bimodal with the larger peak centered at 8 and the smaller peak at 15.
fig = plt.figure()
fig.subplots_adjust(hspace=0.4, wspace=0.4)
variables = ["rating","calories","protein","fat","sodium", "tags_count"]
fig.set_size_inches(18.5, 15.5)
for i in range(len(variables)):
ax = fig.add_subplot(6, 1, i + 1)
if variables[i] == 'rating':
sns.distplot(food_data[variables[i]],bins=[0,0.5,1,1.5,2,2.5,3,3.5,4,4.5,5], ax=ax)
else:
sns.distplot(food_data[variables[i]], ax=ax)
#ax.set_title(variables[i])
plt.show()
For our tag variables, the only summary statistics I could think of are the top 20 most popular tags.
tags_count = defaultdict(lambda: 0) # map to count ingrdients
# count each ingredient
for index, row in food_data.iterrows():
for tag in tags:
if row[tag] == 1:
tags_count[tag] += 1
tags_count = dict(tags_count)
total_rows = len(food_data.index)
# sort them and change count to percentage
tags_count = {k: v/total_rows for k, v in sorted(tags_count.items(), key=lambda item: item[1], reverse = True)}# sort by count
# putting it into a list to ensure correct ordering(calling .keys()/.values can cause reordering sometimes)
plot_tags = []
plot_count = []
for k,v in tags_count.items():
plot_tags.append(k)
plot_count.append(v)
# set plot size and color scheme
sns.set(rc={'figure.figsize':(11.7,10.27),"axes.labelsize":2})
sns.color_palette("mako")
# Use seaborn to plot bar plot
plot = sns.barplot(x=plot_count[:20], y=plot_tags[:20])
plot.set_xlabel("Frequency in recipes",fontsize=20)
plot.set_ylabel("Tags",fontsize=20)
plot.set_title("Most Popular Tags",fontsize = 25)
Observations: It appears that the most popular tags tend to be related to dietary restrictions and can make up anywhere from 10 to 40% of most recipes. In case you are wondering why Bon Appetit is the number one result, Bon Appetit is the name of the popular food magazine that owns Epicurious so it's high frequency is expected.
First, let's explore the relationship between Ratings and Calories/Protein/Fat/Sodium/Count of Tags. We'll make scatterplots using seaborn and they come with automatic regression lines. Something important to note is these visuals are not completely accurate. I've added jitters to the points which randomly spreads them by a certain range so we can see which parts of the scatterplot are more concentrated.
# We add the jitter to make the distribution of data points clearer, it does not influence the regression at all
fig = plt.figure()
fig.set_size_inches(18.5, 15.5)
fig.subplots_adjust(hspace=0.4, wspace=0.4)
predictors = ["calories","protein","fat","sodium","tags_count"]
# For each predictor, make a subplot and add jitters to it
for i in range(len(predictors)):
ax = fig.add_subplot(3, 2, i + 1)
ax.set(ylim=(0, 8))
if predictors[i] == "tags_count":
sns.regplot(x="tags_count", y="rating", data=food_data, x_jitter = .45, y_jitter = .45, scatter_kws={"color": "blue"}, line_kws={"color": "red"}, ax=ax)
else:
sns.regplot(x=predictors[i], y="rating", data=food_data, y_jitter=.2, scatter_kws={"color": "blue"}, line_kws={"color": "red"}, ax=ax)
ax.set_title(predictors[i], fontsize = 20)
ax.set_ylabel('Ratings', fontsize = 15)
plt.show()
Observations:
The graphs before were simple linear regressions. They were plotted with only 1 predictor and 1 response. Therefore, our previous results may not be that accurate, so let's plot a multiple linear regression. I like to structure all my regression models in the form of a question for easier interpretation, so let's jump right in!
# separate tag columns for future use
tags = list(food_data.columns[6:])
# use separate data table instead of main one
X = food_data[['calories','protein','fat','sodium','tags_count'] + tags]
Y = food_data['rating']
# split data into 80% and 20%
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=.2)
X_temp = X_train[['calories','protein','fat','sodium']]
Y_temp = Y_train
# Fit the Linear Regression on Train split
model = sm.OLS(Y_temp, X_temp).fit()
model.summary()
So our model isn't the best but let's try using it to predict anyways. Two of the most popular metrics used to evaluate the predictive performance of a regression are MAE and RMSE.
Mean Absolute Error(MAE) is the average amount our prediction differs from the actual result.(We'll use this mainly as it is easier to interpret)
Root Mean Square Error(RMSE) is the average of the squared error. RMSE compared to MAE penalizes large errors far more.
# only pick nutrition information for this regression
X_temp = X_test[['calories','protein','fat','sodium']]
predicted = model.predict(X_temp)
actual = Y_test
# Get MAE and RMSE
mae = metrics.mean_absolute_error(actual, predicted)
rmse = np.sqrt(metrics.mean_squared_error(actual, predicted))
print("MAE:",mae,"RMSE:",rmse)
From our MAE result, our model on average predicts ratings with an error of 2.47(and our Rating's only go from 0 to 5 so it's around half of our entire range!). Our model is predicting things very inaccurately.
# only pick non-tag predictors for this regression
X_temp = X_train[['calories','protein','fat','sodium','tags_count']]
Y_temp = Y_train
# Fit the Linear Regression on Train split
model = sm.OLS(Y_temp, X_temp).fit()
model.summary()
# get the testing data for the variables we used
X_temp = X_test[['calories','protein','fat','sodium','tags_count']]
predicted = model.predict(X_temp)
actual = Y_test
# Get MAE and RMSE
mae = metrics.mean_absolute_error(actual, predicted)
rmse = np.sqrt(metrics.mean_squared_error(actual, predicted))
print("MAE:",mae,"RMSE:",rmse)
This time our MAE is a lot lower but it is still fairly high. It is predicting ratings and being off by 1.3.
For more info on Backward Elimination, I recommend checking out this Towards Data Science blog which explains it pretty well. Variable selection methods are extremely useful especially when building regression models.
# Made a copy because we are going to be removing variables as we go
X_temp = X_test.copy(deep = True)
Y_temp = Y_test.copy(deep = True)
# Helper method for Backward Elimination
def remove_most_insignificant(df, results):
# use operator to find the key which belongs to the maximum value in the dictionary:
max_p_value = max(results.pvalues.iteritems(), key=operator.itemgetter(1))[0]
# this is the feature you want to drop:
df.drop(columns = max_p_value, inplace = True)
return df
insignificant_feature = True
# Backward Elimination Code
with np.errstate(divide='ignore',invalid='ignore'):
while insignificant_feature:
# fit regression
model = sm.OLS(Y_temp, X_temp)
results = model.fit()
# find significant variables
significant = [p_value < 0.05 for p_value in results.pvalues]
if all(significant):
insignificant_feature = False
else:
if X_temp.shape[1] == 1: # if there's only one insignificant variable left
print('No significant features found')
results = None
insignificant_feature = False
else:
X_temp = remove_most_insignificant(X_temp, results)
print(results.summary())
# Get predictors that weren't remove by Backward Elimination
filtered_predictors = list(results.params.index)
X_temp = X_test[filtered_predictors]
# Predict using test data
predicted = results.predict(X_temp)
actual = Y_test
# Get MAE and RMSE
mae = metrics.mean_absolute_error(actual, predicted)
rmse = np.sqrt(metrics.mean_squared_error(actual, predicted))
print("MAE:",mae,"RMSE:",rmse)
Our MAE is a lot lower now. Our predictions of rating are only off by 0.9 which is a lot better compare to what we started with! There are probably more things we could try to improve the model, but I'll stop here.
d = {'predictor': results.params.index, 'coefficient': results.params.values}
df = pd.DataFrame(data=d)
df.sort_values(by = 'coefficient', ascending = False, inplace = True)
sns.set(rc={'figure.figsize':(11.7,10.27),"axes.labelsize":2})
sns.color_palette("mako")
top_10_bottom_10_predictors = list(df['coefficient'][:20]) + list(df['coefficient'][-20:])
top_10_bottom_10_coefficient = list(df['predictor'][:20]) + list(df['predictor'][-20:])
plot = sns.barplot(x=top_10_bottom_10_predictors, y=top_10_bottom_10_coefficient)
plot.set_xlabel("Effect on Rating when Present",fontsize=20)
plot.set_ylabel("Tags",fontsize=20)
plot.set_title("Top and Bottom 20 Most Influential tags",fontsize = 25)
When Meatball is included as a tag in a recipe, it's rating is expected to go down by -5.59! This number is surprising especially when you consider Rating is only between 0 and 5. The only explanation for why it's greater than 5 that I can think of is that other tags present for meatball recipes tend to boost it's rating up.
The top 3 tags(radicchio,wasabi, and hummus) could be categorized as healthy and trendy ingredients(this is just speculation though). While there isn't enough data to make any conclusions, it's possible the healthy and trendy cooking reflects the Epicurious userbase.
Users also seem to rate 22-minute recipes higher.
Alcohol seems to be an influential tag in general. 7 of the bottom 20 are alcohol related: gin,liquer,spirit,apeitif,campari,chartreuse,spritzer. 3 of the top 20 include whiskey, fortified wine, and margharitas.
Jicama and Cardamom are associated with lower ratings and from a quick Google search both are somewhat "acquired tastes".
There is a lot of weird randomness in the data. I'll mention this in my reflection but I really wish there was an easy way to categorize the tags as I feel it will help me make sense of these odd observations.
I said at the start that cooking recipe ratings were subjective. From our analysis, I think you could agree that predicting recipe ratings with high accuracy is hard. Compared to something very numerically driven, when working with a very human topic like food, a data-driven solution may not be the best approach. However, I'm sure we've gained some interesting patterns that would not be easily discovered.
We also got to go through the entire data science process. From cleaning the data to some initial analysis, we dived headfirst into regression and steadily build regression models, each one better than the last.
This project's greatest challenge was cleaning the data and handling the large number of predictors. I also first did the project with only the ingredient tags but it introduced a lot of bias in my statistics so I decided to keep them.
I like to end all my projects with things I'd do if I had more time. Here are a list of things I'd like to do:
If you've made it to the end, thanks for reading everything. As a token of my appreciation, here's a video of a pumpkin guy eating stuff.