---
title: "INFO 3200 Project Dashboard"
output:
  flexdashboard::flex_dashboard:
    theme: flatly
    source_code: embed
---
```{r include=FALSE}
# packages for wrangling, plotting, and modeling
library(tidyverse)     # includes ggplot2 and dplyr
library(flexdashboard)
library(lubridate)
library(broom)         # tidy() model summaries
library(rpart)
library(rpart.plot)
library(caret)         # R2() and RMSE()
library(MASS)          # stepAIC(); masks dplyr::select(), hence the dplyr:: prefixes below
```
Introduction {data-orientation=columns}
=======================================================================
Row {data-height=650}
-----------------------------------------------------------------------
### The Problem & Data Collection
#### The Problem
A Major League Baseball team wants to determine which factors are relevant in predicting its success, measured both by the number of games it wins and by whether it makes the playoffs.
#### Questions
1. Which model best predicts the number of games a team will win in a given season, and which variables are significant in this prediction?
2. Which model best predicts whether a team makes the playoffs, and which variables are significant in this prediction?
#### Data Source
https://www.kaggle.com/wduckett/moneyball-mlb-stats-19622012
Row {data-height=650}
-----------------------------------------------------------------------
### The Data
#### Variables
* **Team**: the MLB team (three-letter abbreviation)
* **League**: which of MLB's leagues the team belongs to, AL (American League) or NL (National League)
* **Year**: the year/MLB season
* **RS**: runs scored by the team in the given year
* **RA**: runs allowed by the team in the given year
* **W**: number of wins the team had (quantitative response variable)
* **OBP**: on-base percentage
* **SLG**: slugging percentage
* **BA**: batting average
* **Playoffs**: whether the team made the playoffs (1) or not (0) (qualitative response variable)
* **RankSeason**: regular-season ranking of teams that made the playoffs
* **RankPlayoffs**: ranking of teams after the postseason
* **G**: number of games played
* **OOBP**: opposing teams' on-base percentage
* **OSLG**: opposing teams' slugging percentage
#### Variables Added
* **RunDiff**: Run Differential, or Runs Scored - Runs Allowed.
* **OPS**: OBP + SLG, often regarded as a better single measure of offensive production.
#### Variables Ignored
The two Rank variables, RankSeason and RankPlayoffs, are defined only for teams that made the playoffs and are therefore largely missing, so they are excluded from the models.
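A quick way to verify this (a sketch, not evaluated in the dashboard; the file itself is read in on the Visualizations page):

```{r rank missingness sketch, eval=FALSE}
# count missing values in the Rank variables: they are populated only for
# the seasons in which a team reached the playoffs
raw <- readr::read_csv("baseball.csv")
colSums(is.na(raw[, c("RankSeason", "RankPlayoffs")]))
```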
Visualizations {data-orientation=rows}
===============================================================
Row {.tabset}
---------------------------------------------------------------
```{r}
# read in and prep data: convert Playoffs to a factor, derive OPS and the
# run differential, and flag seasons through 2010 as training data
df <- read_csv("baseball.csv")
df <- dplyr::mutate(df,
                    Playoffs = as.factor(Playoffs),
                    OPS = OBP + SLG,
                    RunDiff = RS - RA,
                    validation = if_else(Year <= 2010, "Training", "Validation"))
write.csv(df, "INFO 3200 Project Data.csv", row.names = FALSE)
# split into training (1962-2010) and validation (2011-2012) sets
dftrain <- df %>% dplyr::filter(validation == "Training")
dfvalid <- df %>% dplyr::filter(validation == "Validation")
```
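A quick sanity check on the chronological split (a sketch, not evaluated here):

```{r split sanity check, eval=FALSE}
# how many team-seasons land in each set
dplyr::count(df, validation)
```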
### Average Wins by Playoff Status
```{r Average Wins by Playoff Status}
# average wins per year, split by playoff status
avgyr <- df %>%
  dplyr::select(Year, W, Playoffs) %>%
  dplyr::group_by(Year, Playoffs) %>%
  dplyr::summarize(avgwins = mean(W))
# line plot of the two averages over time
p <- ggplot(data = avgyr, aes(x = Year, y = avgwins, group = Playoffs, col = Playoffs))
p + geom_line() + geom_point() +
  scale_y_continuous(breaks = seq(0, 100, by = 5)) +
  scale_x_continuous(breaks = seq(1960, 2015, by = 10)) +
  labs(x = "Year",
       y = "Average Wins",
       title = "Average Wins per Year for Playoff & Non-Playoff Teams") +
  scale_color_brewer(palette = "Set1")
```
### Run Differential
```{r Run Differential}
# histogram of run differential, colored by playoff status
p <- ggplot(data = df, aes(x = RunDiff, fill = Playoffs))
p + geom_histogram(binwidth = 20, color = "#bdbdbd", size = 0.5, alpha = 0.95) +
  scale_fill_brewer(palette = "Set1") +
  geom_vline(aes(xintercept = mean(RunDiff)), linetype = "dashed") +
  annotate("text", x = -12, y = 5, label = "average", angle = 90) +
  labs(title = "Distribution of Run Differential by Playoffs") +
  scale_x_continuous(breaks = seq(-400, 400, by = 100))
```
### Runs over Time
```{r Runs over time}
# total runs scored per league per year
rsyr <- df %>%
  dplyr::select(Year, RS, League) %>%
  dplyr::group_by(Year, League) %>%
  dplyr::summarize(totalRS = sum(RS))
p <- ggplot(data = rsyr, aes(x = Year, y = totalRS, col = League, group = League))
p + geom_line(alpha = 0.7) +
  scale_color_brewer(palette = "Set1") +
  # mark rule/structure changes that shifted league-wide scoring
  geom_vline(xintercept = 1973, linetype = "dashed", alpha = 0.7) +
  geom_vline(xintercept = 1998, linetype = "dashed", alpha = 0.7) +
  annotate("text", x = 1972, y = 6000, label = "DH in AL", angle = 90) +
  annotate("text", x = 1997, y = 6625, label = "More NL than AL", angle = 90) +
  scale_y_continuous(breaks = seq(0, 20000, by = 1000)) +
  scale_x_continuous(breaks = seq(1960, 2015, by = 5)) +
  labs(x = "Year",
       y = "Total Runs Scored",
       title = "Runs Scored by League Over Time")
```
Prediction {data-orientation=rows}
=============================================================
Row {.sidebar}
-------------------------------------------------------------
### **Analysis**
Both models perform similarly and predict the number of wins well, with validation R-square values above 0.90. The boosted tree performs slightly better on the validation set, with the higher R-square and the lower RMSE, so it is the preferred model.
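For reference, the comparison metrics are

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(W_i-\hat{W}_i\right)^2}$$

and caret's `R2()` defaults to the squared correlation between observed and predicted wins, $R^2=\mathrm{cor}(W,\hat{W})^2$.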
Row {.tabset}
------------------------------------------------------------
### Linear Regression
```{r include=FALSE}
# full regression model for wins on the training data
# (note: RunDiff and OPS are exact linear combinations of RS/RA and OBP/SLG)
lm_W <- lm(W ~ RS + RA + SLG + OBP + BA + Playoffs + G + OOBP + OSLG +
             Team + League + RunDiff + OPS, data = dftrain)
# mixed (both-direction) stepwise selection by AIC
stepW <- stepAIC(lm_W, direction = "both", trace = FALSE)
summary(stepW)
tidy(stepW)
# evaluate the selected model on the held-out 2011-2012 seasons
valid <- predict(stepW, newdata = dfvalid)
summary(valid)
stepW_valid_comp <- data.frame(RSq = R2(valid, dfvalid$W),
                               RMSE = RMSE(valid, dfvalid$W))
```
#### Regression Model Summary
```{r}
# coefficient table for the stepwise-selected model
knitr::kable(tidy(stepW))
# training-set fit statistics of the same model, for the comparison table below
ARSq <- round(summary(stepW)$adj.r.squared, 2)
sig <- round(summary(stepW)$sigma, 2)
```
### Boosted Tree
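The boosted tree itself was fit outside this document; only its metrics appear in the comparison table below. A minimal sketch of a comparable model, assuming the `gbm` package and illustrative hyperparameters (not evaluated here):

```{r boosted tree sketch, eval=FALSE}
library(gbm)
set.seed(3200)  # boosting subsamples rows, so fix the seed
# gradient-boosted regression trees on the same training split
gbm_W <- gbm(W ~ RS + RA + OBP + SLG + BA + RunDiff + OPS,
             distribution = "gaussian", data = dftrain,
             n.trees = 500, interaction.depth = 3, shrinkage = 0.05)
gbm_pred <- predict(gbm_W, newdata = dfvalid, n.trees = 500)
data.frame(RSq = R2(gbm_pred, dfvalid$W), RMSE = RMSE(gbm_pred, dfvalid$W))
```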

Row {data-height=200}
--------------------------------------------------------------
### Model Comparison
```{r}
# assemble training/validation metrics; the boosted tree values are entered
# manually (that model was fit outside this document)
model_comp <- tribble(~Model, ~Set, ~RSq, ~RMSE,
                      "Regression", "Training", ARSq, sig,
                      "Regression", "Validation", stepW_valid_comp$RSq, stepW_valid_comp$RMSE,
                      "Boosted Tree", "Training", 0.903, 3.5610843,
                      "Boosted Tree", "Validation", 0.913, 3.3927529)
knitr::kable(model_comp)
```
Classification {data-orientation=rows}
=============================================================
Row {.sidebar}
-------------------------------------------------------------
### **Analysis**
While the Naive Bayes model produces a low error rate on the training set, it performs much worse on the validation set and has a significantly lower R-square than the other two models. The bootstrap forest performs exceptionally well on the training set but worse on the validation set, which suggests it is overfitting. The logistic regression has a validation error rate similar to the bootstrap forest's, and its validation R-square is higher. Although its training R-square is lower and its training error rate slightly higher, it is the best-performing model overall because of its performance on the validation set.
Row {.tabset}
-------------------------------------------------------------
### Naive Bayes
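The Naive Bayes output shown in the original dashboard was produced outside this document and embedded as images. A minimal sketch of a comparable fit, assuming the `e1071` package and an illustrative predictor set (not evaluated here):

```{r naive bayes sketch, eval=FALSE}
library(e1071)
# Naive Bayes on the training seasons; error rate on the held-out seasons
nb_fit <- naiveBayes(Playoffs ~ RS + RA + OBP + SLG + BA, data = dftrain)
nb_pred <- predict(nb_fit, newdata = dfvalid)
mean(nb_pred != dfvalid$Playoffs)  # validation error rate
```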
### Logistic Regression
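As above, the fitted model lived outside this document; a sketch using base R's `glm()` with the same illustrative predictors (not evaluated here):

```{r logistic regression sketch, eval=FALSE}
# logistic regression for playoff status, classified at a 0.5 cutoff
logit_fit <- glm(Playoffs ~ RS + RA + OBP + SLG + BA,
                 family = binomial, data = dftrain)
logit_prob <- predict(logit_fit, newdata = dfvalid, type = "response")
logit_pred <- if_else(logit_prob > 0.5, "1", "0")
mean(logit_pred != dfvalid$Playoffs)  # validation error rate
```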

### Bootstrap Forest
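A bootstrap forest is an ensemble of bagged classification trees, i.e. a random forest; a sketch assuming the `randomForest` package (not evaluated here):

```{r bootstrap forest sketch, eval=FALSE}
library(randomForest)
set.seed(3200)  # bagging is random, so fix the seed
# bagged classification trees with random feature subsetting
rf_fit <- randomForest(Playoffs ~ RS + RA + OBP + SLG + BA, data = dftrain)
rf_pred <- predict(rf_fit, newdata = dfvalid)
mean(rf_pred != dfvalid$Playoffs)  # validation error rate
```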

Row {data-height=200}
--------------------------------------------------------------
### Model Comparison
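The comparison table shown here was also produced externally. Given the three sketch fits above, it could be assembled with the same tribble/kable pattern as the Prediction page:

```{r classification comparison sketch, eval=FALSE}
# validation error rates from the sketches above
class_comp <- tribble(~Model, ~`Validation Error Rate`,
                      "Naive Bayes", mean(nb_pred != dfvalid$Playoffs),
                      "Logistic Regression", mean(logit_pred != dfvalid$Playoffs),
                      "Bootstrap Forest", mean(rf_pred != dfvalid$Playoffs))
knitr::kable(class_comp)
```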
