My project to analyze ingredients, recipes and cuisines. I have four objectives:

  1. Can we classify a cuisine if we know the ingredients?
  2. What are the most popular words in different cuisines?
  3. How the cuisines are correlated each other?
  4. What is the most popular ingredient in the world?

Data analysis and results

I used text data from allrecipes for this analysis. I have scraped food recipes of different cuisines from the website. Apart from the text I used a few different numerical parameters. After cleaning up the text I used a tokenizer to find the most used words in different cuisines. I found that pepper is the most popular ingredient in all the cuisines. I used TFIDF procedure to classify the recipes to different cuisines based on its ingredients. I found that the correlation between cuisine can be found from confusion matrix based on their ingredients. You can view my analysis in this Jupyter notebook

Future direction

This project is equally fun and interesting. Currently, chefs are classifying the recipes based on their own cuisine. Even though there are closer ingredients found in recipes of different cuisines. There is also an increasing availability of fusion food. It will be interesting if we can classify the recipe to any cuisine based on their ingredients.

Detailed figures and their explaination

Most used words in different cuisines

  • Words in these figures are appropriate for these cuisines. Here I show the number of words in Italian cuisins which have a lot of cheese, olive oil, black pepper etc. If you are interested to such figures from different cuisins please go to this link

Italian

  • I count the words in these cuisines and the distribition is shown. It was found that the pepper is the most used ingredients. However, most of the dishes don’t use pepper or garlic as main ingredients but many use it for flavoring the recipe which makes pepper or garlic as their unavoidable ingredient. This result made me happier as I am from a part of India where the first ever pepper was traded in the world!

Word count

Please see in the bottom of this page to find out more count separated by other food related parameters such as calories, ratings etc

Modeling cuisins from ingredients

  • I use these word count to check whether I can classify their respective cuisines. I used scikit-learn TFIDF to find features from the ingredient. Then I found a model which classify cuisins from the ingredients. I have used a few different model on the TFIDF such as logistic regression, MultinomialNB based Bayasian and XGBoost. For these algorithms I obtained 0.56, 0.55 and 0.6 scores. I should remind you that the correlation matrix, as shown below, have identical ingredients which make the classification difficult. Since there are not much data I didn’t want to go for cross validation.

Confusion matrix

  • Confusion matrix as correlation of ingredients between different cuisines

Confusion matrix

After the classification I generated this interesting plot. I looked at the data and found that there are many sweet recipes in Danish cuisine. There are mostly similar ingredients in several sweet recipe. This confusion shows that for sweet recipes there are much more similarity in different cuisines. One of my future analysis is to remove sweet dishes.

More figures on word count

  • There are many recipes are in the high calorie area. I took recipies which gives calorie greater than 800 per person and found the ingredients. You can find some of the ingredients from all recipes but many product from milk, oil are in top list.

High Calorie Word count

  • Found words from recipes which are highly rated

High rating word count

  • Found words from recipes which are highly made (> 5000). It is found that the words are mostly matching the sweet recipes.

High made word count

  • Found words from recipes which are highly rated (>= 4).

Highly rated word count