What is Boosting in Machine Learning? (2023)

The Superheroes of Machine Learning models


In this post, we will see a simple and intuitive explanation of Boosting algorithms: what they are, why they are so powerful, some of the different types, and how they are trained and used to make predictions.

We will avoid all the heavy maths and go for a clear, simple, but in-depth explanation that can be easily understood. However, additional material and resources will be left at the end of the post, in case you want to dive deeper into the topic.

Before we start, here are some additional resources to skyrocket your Machine Learning career:

  • For learning resources, go to How to Learn Machine Learning!
  • For professional resources (jobs, events, skill tests), go to AIgents.co, a career community for Data Scientists & Machine Learning Engineers.

Traditionally, building a Machine Learning application consisted of taking a single learner, such as a Logistic Regressor, a Decision Tree, a Support Vector Machine, or an Artificial Neural Network, feeding it data, and teaching it to perform a certain task with that data.


Then ensemble methods were born, which use many learners together to enhance the performance of any single one of them individually. These methods can be described as techniques that combine a group of weak learners (models that on average achieve only slightly better results than a random model) in order to create a stronger, aggregated one.

Generally, ensemble methods are built by grouping variants of individual Decision Trees, as we will see later.

Boosting models fall inside this family of ensemble methods.

Boosting, initially named Hypothesis Boosting, is based on the idea of filtering or weighting the data used to train our team of weak learners, so that each new learner either gives more weight to, or is trained only on, observations that were poorly classified by the previous learners.

By doing this, our team of models learns to make accurate predictions on all kinds of data, not just on the most common or easy observations. Also, if one of the individual models is very bad at making predictions on some kind of observation, it does not matter, as the other N-1 models will most likely make up for it.

Boosting should not be confused with Bagging, which is the other main family of ensemble methods: while in bagging the weak learners are trained in parallel using randomness, in boosting the learners are trained sequentially, in order to be able to perform the task of data weighting/filtering described in the previous paragraph.

[Image: boosting vs. bagging, sequential weighted learners versus parallel, equally weighted learners]

As we can see from the previous image, in boosting the models can have different importance or weights (represented in the different sizes of the learners), while in bagging all learners have the same weight in the final decision.

Also, in boosting, the data set is weighted (represented by the different sizes of the data points), so that observations that were incorrectly classified by classifier n are given more importance in the training of model n + 1, while in bagging the training samples are taken randomly from the whole population.

Now that we have seen what boosting is, and its differences with bagging, let's see why it works so well!

In general, ensemble methods reduce the bias and variance of our Machine Learning models. If you don't know what bias and variance are, don't worry: I've got you covered in this article.

Ensemble methods help increase the stability and performance of machine learning models by removing the dependence on any single estimator. This can be clearly seen with a Bagging example: the Random Forest.


Random Forests are, as their name suggests, a group of individual Decision Trees that make up a forest. These individual trees are quite prone to overfitting the data, and despite being very simple and intuitive models, they are not so great at making predictions.

However, if we use many trees these problems disappear, as each tree is trained with different samples of data and different features, resulting in a stronger and more robust model overall.
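As a quick illustration (a hedged sketch using scikit-learn; the dataset and hyperparameters are arbitrary), we can compare a single Decision Tree against a Random Forest on the same data:

```python
# Sketch: a single decision tree vs. a bagged forest (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A toy dataset; the exact numbers here are arbitrary.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

tree = DecisionTreeClassifier(random_state=0)        # single tree: prone to overfitting
forest = RandomForestClassifier(n_estimators=100,    # 100 trees, each trained on
                                random_state=0)      # different samples and features

tree_score = cross_val_score(tree, X, y, cv=5).mean()
forest_score = cross_val_score(forest, X, y, cv=5).mean()
print(f"single tree: {tree_score:.3f}, forest: {forest_score:.3f}")
```

On a dataset like this, the forest's cross-validated accuracy is typically noticeably higher than the single tree's.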

Boosting works in a similar way, but while in bagging each model is trained independently, in boosting the N models are trained sequentially. Each model takes into account the success of the previous one and increases the weights of the data on which that model had the highest error, so that subsequent models focus on the most difficult observations.

Also, the individual models that perform best on the weighted training samples become stronger (get a higher weight), and therefore have a higher impact on the final prediction.

Alright, this sounds lovely, but how are these models actually trained?

The training process depends on the Boosting algorithm being used (AdaBoost vs LightGBM vs XGBoost…), but it generally follows this pattern:

  1. All the data samples start with the same weight. These samples are used to train an individual model (a Decision Tree, let's say).
  2. The prediction error for each sample is calculated, and the weights of the samples with a greater error are increased, making them more important for the training of the following individual model.
  3. Depending on how well this individual model did on its predictions, it is assigned an importance/weight, or amount of say. A model that outputs very good predictions will have a high amount of say in the final decision.
  4. The weighted data is passed on to the next model, and steps 2) and 3) are repeated.
  5. Step 4) is repeated until we have reached a certain number of models or until the error is below a certain threshold.
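The loop above can be sketched in code as an AdaBoost-style procedure (scikit-learn stumps as the weak learners; all variable names and numbers here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
y_signed = np.where(y == 1, 1, -1)           # AdaBoost works with {-1, +1} labels

n_models = 10
weights = np.full(len(X), 1 / len(X))        # 1. all samples start with equal weight
models, says = [], []

for _ in range(n_models):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y_signed, sample_weight=weights)
    pred = stump.predict(X)

    err = weights[pred != y_signed].sum()    # 2. weighted error of this learner
    say = 0.5 * np.log((1 - err) / (err + 1e-10))  # 3. its "amount of say"

    # increase the weight of misclassified samples, decrease the rest
    weights *= np.exp(-say * y_signed * pred)
    weights /= weights.sum()                 # renormalise so the weights sum to 1

    models.append(stump)                     # 4./5. repeat with the reweighted data
    says.append(say)
```

The final ensemble prediction is then the sign of the say-weighted sum of the individual stump predictions.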

In some cases, boosting models are trained with a specific fixed weight for each learner (called the learning rate), and instead of giving each sample an individual weight, the models are trained to predict the differences between the previous predictions on the samples and the real values of the target variable. These differences are what we call residuals.
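A minimal sketch of this residual-fitting idea, for a regression task (the learning rate of 0.1, the tree depth, and the toy dataset are all arbitrary choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)   # noisy sine wave

learning_rate = 0.1
prediction = np.full_like(y, y.mean())       # start from the mean of the target
trees = []

for _ in range(50):
    residuals = y - prediction               # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)                   # each new tree predicts the residuals
    prediction += learning_rate * tree.predict(X)   # take a small step
    trees.append(tree)

print("final training MSE:", np.mean((y - prediction) ** 2))
```

Each iteration shrinks the residuals a little, which is why the training error keeps dropping as trees are added.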

We will discuss this in more detail later, when we look at the different kinds of boosting models; however, the main characteristic of the family remains: the sequential training of many individual learners to create a more powerful, aggregated model.

Awesome! Now that we know how Boosting models are trained, let's see how they are used to make predictions on new data.

The way a boosting model makes predictions on new data is very simple. When we get a new observation with its features, it is passed through each of the individual models, and each model makes its own prediction.

Then, taking into account the weight of each one of these models, all these predictions are scaled and combined, and a final global prediction is given.
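Sketched in code, that final combination could look like this (hypothetical `models` and `says` lists, coming from a training loop like the one described earlier):

```python
import numpy as np

def boosted_predict(models, says, x_new):
    """Combine the weak learners' predictions, scaled by their 'amount of say'.

    models: list of fitted weak learners, each with a .predict method
    says:   list of per-model weights (higher = more trustworthy model)
    Returns +1 or -1 per row for binary classification with {-1, +1} labels.
    """
    total = sum(say * model.predict(x_new) for say, model in zip(says, models))
    return np.sign(total)
```

Note that a model with a large amount of say can outvote several weaker models, which is exactly the weighted-decision behaviour described above.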



To end, let's explore the characteristics of the most common Boosting models out there.

AdaBoost

Short for Adaptive Boosting, AdaBoost works by the exact process described before: training sequentially, predicting, and updating the weights of the misclassified samples and of the corresponding weak models.

It is mostly used with Decision Tree Stumps: decision trees with just a root node and two leaf nodes, where only one feature of the data is evaluated. As we can see, by taking only one feature of our data into account to make predictions, each stump is a very weak model. However, by combining many of them, a very robust and accurate ensemble model can be built.

If you want to know more about AdaBoost, check out the following video by StatQuest.

Gradient Boosting Machines

Very similar to AdaBoost, Gradient Boosting Machines train weak learners sequentially, adding more and more estimators, but instead of adapting the weights of the data, they try to predict the residual errors made by the previous estimators.

Because of this, we no longer have sample weights, and all the weak models have the same amount of say, or importance. Again, Decision Trees are most often used as the base predictors; however, they are not stumps, but bigger, fixed-size trees. GBMs use a learning rate and take small steps towards better results, conceptually similar to what is done in Gradient Descent.
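A usage sketch with scikit-learn's implementation (the dataset and hyperparameters here are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# a small learning_rate plus many fixed-size trees (max_depth=3 here)
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=0)
gbm.fit(X_train, y_train)
print("R^2 on test set:", gbm.score(X_test, y_test))
```

Lowering the learning rate generally requires more estimators but tends to generalise better, which mirrors the small-steps intuition above.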

Again, if you want to dive deeper, check out the video by StatQuest.

XGBoost

Short for eXtreme Gradient Boosting, XGBoost, like Gradient Boosting, fits its trees to the residuals of the previous trees' predictions; however, instead of using conventional, fixed-size decision trees, it uses a different kind of tree: XGBoost trees, we could call them.

It builds these trees by calculating similarity scores between the observations that end up in a leaf node. Also, XGBoost allows for regularisation, reducing the possible overfitting of the individual trees and therefore of the overall ensemble model.

Lastly, XGBoost is optimised to push the limit of the computational resources of boosted tree algorithms, making it a very high performance and fast algorithm in terms of time and computation.

You can watch the video XGBoost Part 1: Regression to get a deeper view of what XGBoost is all about.

LightGBM

Light Gradient Boosting Machines, known by the short name LightGBM, are yet another round of improvements to Gradient Boosting algorithms. Instead of using a level-wise growing strategy for the decision trees like XGBoost, LightGBM uses a leaf-wise growth strategy, giving it the chance to achieve a higher error reduction per split than other tree-based algorithms. Also, compared to XGBoost, LightGBM is generally faster, especially on large data sets.

You can learn more about it on its official documentation page.

That is it! As always, I hope you enjoyed the post, and that I managed to help you understand what boosting is, how it works, and why it is so powerful.


Here you can find some additional resources in case you want to learn more about the topic:

If you liked this post then feel free to follow me on Twitter at @jaimezorno. Also, you can take a look at my other posts on Data Science and Machine Learning here. Have a good read!

If you want to learn more about Machine Learning and Artificial Intelligence follow me on Medium, and stay tuned for my next posts! Also, you can check out this repository for more resources on Machine Learning and AI!

  • Cover Image from Unsplash.
  • All other images are self made.


  1. Boosting Machine Learning Tutorial | Adaptive Boosting, Gradient Boosting, XGBoost (Edureka)
  2. What is Boosting: Machine Learning & Data Science Terminologies (DataMites)
  3. Boosting Explained: AdaBoost | Bagging vs Boosting | How Boosting and AdaBoost Work (Unfold Data Science)
  4. AdaBoost, Clearly Explained (StatQuest with Josh Starmer)
  5. Gradient Boost Machine Learning | How Gradient Boost Works in Machine Learning (Unfold Data Science)
  6. Tutorial 42 - Ensemble: What is Bagging (Bootstrap Aggregation)? (Krish Naik)