Kaggle (acquired by Google in 2017) is an incredible resource for all data scientists. The company promotes itself as “the home of data science.” I advise my Intro to Data Science students at UCLA to take advantage of Kaggle by first completing the venerable Titanic Getting Started Prediction Challenge, and then moving on to active challenges. Kaggle is a great way to gain valuable experience with data science and machine learning. Now there are two excellent books to lead you through the Kaggle process: The Kaggle Book by Konrad Banachewicz and Luca Massaron, published in 2022, and The Kaggle Workbook by the same authors, published in 2023, both from UK-based Packt Publishing.
Let’s start with The Kaggle Book. The book is an invaluable learning resource for anyone participating in a Kaggle competition, as well as pretty much any data scientist wishing to sharpen their skills. Reading the book is like doing a Vulcan mind-meld with Kaggle Masters and Grandmasters; you get an instant appreciation for how these experts have done so well in the Kaggle ecosystem. This is achieved in a number of ways: through their winning Python code, through their detailed interview sidebars spread throughout the book, and through selected links pointing to important Kaggle discussions. This last feature of the book may be the most useful, as some of the discussions offer insights that you won’t find anywhere else. For example, Grandmaster Michael Jahrer’s famous post on denoising autoencoders is featured in Chapter 7. Reading his detailed explanation of how he won 1st place in the Porto Seguro’s Safe Driver Prediction competition is an excellent way to add to your own data science toolbox. Chapter 7 also includes an insightful interview with well-known Kaggler Bojan Tunguz, who is a huge proponent of XGBoost on Twitter.
The book also offers strategic references to many Kaggle competitions that illustrate critical methods for ensuring machine learning success. For example, Chapter 5 includes references to a number of competitions in which AUC was used to evaluate classification performance. You can have fun bouncing from project to project to better understand important principles in machine learning. The book serves as a guide map for such explorations. The result is a much better understanding of how to approach projects moving forward.
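To make the AUC metric concrete, here is a minimal sketch of my own (not code from the book) showing how ROC AUC is computed with scikit-learn on a synthetic binary classification problem; note that AUC is calculated from predicted probabilities rather than hard class labels:

```python
# Hypothetical illustration (not from the book): computing ROC AUC
# with scikit-learn on a toy binary classification problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic tabular data standing in for a competition dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# AUC is computed from predicted probabilities, not hard labels
probs = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, probs)
print(f"ROC AUC: {auc:.3f}")
```

A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives over negatives, which is why AUC is such a common leaderboard metric for binary classification competitions.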
One of my favorite chapters is Chapter 5 on Metrics since, when it comes right down to it, you need a solid stable of techniques with which to judge the performance of your ML solutions. Another favorite is Chapter 8 on Hyperparameter Optimization. Using the best and most powerful algorithms is one thing, but knowing how to optimize a model’s many hyperparameters is quite another. Although the book doesn’t address the mathematical foundations for the algorithms and their hyperparameters, it does provide insights into finding the best hyperparameters for your models. Seeing how Grandmasters address the hyperparameter problem is quite valuable. I also enjoyed Chapter 7 on modeling tabular data, i.e., business data. Here are discussions of important topics like dimensionality reduction, feature engineering, and using neural networks for tabular data.
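To give a flavor of what hyperparameter optimization looks like in practice, here is a hedged sketch of my own (not the book’s code) using scikit-learn’s cross-validated grid search over two random forest hyperparameters; the parameter names and grid values are illustrative choices, not recommendations from the authors:

```python
# Hypothetical sketch (my own, not code from the book): a simple
# cross-validated grid search over two random forest hyperparameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    "n_estimators": [50, 100],  # number of trees in the forest
    "max_depth": [3, None],     # tree depth limit (None = grow fully)
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation
    scoring="roc_auc",   # optimize the leaderboard-style AUC metric
)
search.fit(X, y)
print("Best params:", search.best_params_)
print(f"Best CV AUC: {search.best_score_:.3f}")
```

Exhaustive grid search is the simplest approach; the book goes further into smarter strategies, and seeing which search strategies Grandmasters actually reach for is part of its value.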
The balance of the book includes useful topics like an introduction to Kaggle datasets, working with Kaggle notebooks, leveraging Kaggle discussion forums, as well as popular topics like computer vision and NLP. This book is a great way to tame the complex Kaggle infrastructure, and I can’t imagine proceeding with a Kaggle competition without this book at hand.
A very nice adjunct to The Kaggle Book is The Kaggle Workbook, which contains just four chapters, each offering a thorough review of a past Kaggle challenge that can be viewed as a self-learning exercise containing valuable insights for Kaggle data science competitions. Each of the four chapters includes Python source code for the solution. The code is designed to run on a Kaggle notebook. Here is a list of the projects:
- Porto Seguro’s Safe Driver Prediction – Predict if a driver will file an insurance claim next year. The project includes using the LightGBM model, building a denoising autoencoder and using it to feed a neural network, and blending models.
- M5 on Kaggle for Accuracy and Uncertainty – Based on Walmart’s daily sales time series of items hierarchically arranged into departments, categories, and stores spread across three U.S. states, the solution demonstrates how to use LightGBM for this time series problem.
- Cassava Leaf Disease Classification – Classify crowdsourced photos of cassava plants. This multiclass problem demonstrates how to build a complete pipeline for image classification.
- Google Quest Q&A Labeling – Predict human responders’ evaluations of subjective aspects of a question/answer pair where an understanding of context was crucial. Cast as a multiclass classification problem, the solution explores the semantic characteristics of a corpus.
If you’re thinking of competing in a Kaggle challenge, or if you just want to push forward with your data science skills, I would highly recommend this Kaggle book tandem. I can’t see investing in one book and not the other. You need them both. They represent an excellent one-two punch for gaining valuable experience with solving machine learning problems.
Contributed by Daniel D. Gutierrez, Editor-in-Chief and Resident Data Scientist for insideBIGDATA. In addition to being a tech journalist, Daniel is also a consultant in data science, an author, an educator, and sits on a number of advisory boards for various start-up companies.