How it works (in a nutshell)
One of a family of ensemble methods that combine predictions from a series of models to create a single robust estimator, boosted trees all under the larger umbrella of gradient descent algorithms which start off with naive predictions, and iterate by fitting to residuals between the target and the last prediction. The goal of the method is to minimize the average value of the loss function by honing in on a function from the weighted sums from a class of "weak" learners. Trees are built one at a time, where each new tree helps to correct errors made by previously trained tree .
History of boosting 
Adaptive Boosting (AdaBoost) was the first boosting algorithm to gain some notoriety. "AdaBoost works by weighting the observations, putting more weight on difficult to classify instances and less on those already handled well. New weak learners are added sequentially that focus their training on the more difficult patterns."
AdaBoost and its neighboring algorithms were recast as ARCing algorithms (Adaptive Reweighting and Combining), before being developed to Gradient Boosting Machines. The statistical objective of these machines is to minimize the loss of the model by adding weak learners using a gradient descent like procedure. This method is known as a stage-wise additive model because "one new weak learner is added at a time and existing weak learners in the model are frozen and left unchanged."
- Can deliver better accuracy with fewer trees than Random Forests
- Training can be time-consuming, as trees are built one at a time
- Are prone to overfitting
 A Gentle Introduction to the Gradient Boosting Algorithm
 When would you use Random Forests over Gradient Boosted Machines?