After learning regression analysis, I was impatient to apply it to some real-world data.
Fortunately, Kaggle is hosting a competition called ‘History of Baseball’ and they have provided a data set for it.
I analyzed Paul dePodesta’s predictions and statistical findings, using linear regression throughout.
For the full analysis, see the Moneyball Predictions Kaggle Script.
The following is the summary of the analysis:
–> There is a linear relationship between the number of wins and the run difference.
–> W = number of wins needed by the Oakland Athletics to qualify for the playoffs.
Paul dePodesta predicted 95, our model predicted 93, and the actual cutoff was 92 wins.
–> R = runs scored by the Oakland Athletics in 2002.
We predicted 800 runs and the team actually scored 800 runs.
–> RA = runs allowed by the Oakland Athletics in 2002.
We predicted 671 runs and the team allowed 654 runs. Still a pretty good prediction.
–> Confirmed Paul dePodesta’s assumption that OBP (on-base percentage) and SLG (slugging percentage) are far more important than any other baseball statistic.
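
To make the first point concrete, here is a minimal sketch of fitting wins against run difference with ordinary least squares. It is not the Kaggle script itself. The seasons are synthetic, generated from an assumed relationship (81 wins at zero run difference, 0.1 extra wins per run), and the names run_diff, wins and predict_wins are made up for illustration. Only the run totals at the end (800, 671 and 654) come from the summary above.

```python
import numpy as np

# Synthetic, illustrative seasons only; this is NOT the Kaggle data set.
rng = np.random.default_rng(0)
run_diff = rng.integers(-200, 201, size=200).astype(float)

# Assumed relationship for the toy data: 81 wins at zero run difference
# (half of a 162-game season) plus 0.1 wins per run, with random noise.
wins = 81.0 + 0.1 * run_diff + rng.normal(0.0, 4.0, size=200)

# Least-squares fit of wins = intercept + slope * run_diff.
# np.polyfit returns coefficients from the highest power down.
slope, intercept = np.polyfit(run_diff, wins, deg=1)


def predict_wins(rd):
    """Predict season wins from a run difference using the fitted line."""
    return intercept + slope * rd


# Run differences implied by the numbers in the summary above.
predicted_rd = 800 - 671  # predicted runs scored minus predicted runs allowed
actual_rd = 800 - 654     # actual runs scored minus actual runs allowed

print(f"fitted slope: {slope:.3f}, fitted intercept: {intercept:.1f}")
print(f"wins at predicted run difference {predicted_rd}: {predict_wins(predicted_rd):.1f}")
print(f"wins at actual run difference {actual_rd}: {predict_wins(actual_rd):.1f}")
```

With the real data, run_diff and wins would come from the team table's runs scored, runs allowed and wins columns instead of the random generator, and the fitted coefficients would differ from the toy values used here.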