---
title: "INFO 3200 Project Dashboard"
output: 
  flexdashboard::flex_dashboard:
    theme: flatly
    source_code: embed
---
```{r include=FALSE}
# MASS masks dplyr::select(), so dplyr verbs are called with an explicit
# dplyr:: prefix throughout. ggplot2 and dplyr are attached by tidyverse.
library(tidyverse)
library(flexdashboard)
library(lubridate)
library(broom)
library(rpart)
library(rpart.plot)
library(caret)
library(MASS)
```

Introduction {data-orientation=columns}
=======================================================================

Row {data-height=650}
-----------------------------------------------------------------------

### The Problem & Data Collection

#### The Problem
A Major League Baseball team wants to determine which factors are relevant in predicting its success, both in the number of wins and in whether it makes the playoffs.


#### Questions
1. Which model best predicts the number of games a team will win in a given season, and which variables are significant in that prediction?
2. Which model best predicts whether a team makes the playoffs, and which variables are significant in that prediction?


#### Data Source
<https://www.kaggle.com/wduckett/moneyball-mlb-stats-19622012>

Row {data-height=650}
-----------------------------------------------------------------------
### The Data
#### Variables

* **Team**: the MLB team (three-letter abbreviation)
* **League**: which of MLB's leagues the team belongs to, AL (American League) or NL (National League)
* **Year**: the year/MLB season
* **RS**: number of runs scored by the team in the given year
* **RA**: number of runs allowed by the team in the given year
* **W**: number of wins the team had (quantitative response variable)
* **OBP**: on-base percentage
* **SLG**: slugging percentage
* **BA**: batting average
* **Playoffs**: whether the team made the playoffs (1) or not (0) (qualitative response variable)
* **RankSeason**: regular season ranking of teams that made the playoffs
* **RankPlayoffs**: ranking of teams after the postseason
* **G**: number of games played
* **OOBP**: opposing teams' on-base percentage
* **OSLG**: opposing teams' slugging percentage

#### Variables Added

* **RunDiff**: run differential, or runs scored minus runs allowed (RS - RA)
* **OPS**: on-base plus slugging (OBP + SLG), often regarded as a better single measure of offensive production
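
As a minimal sketch of these two derived columns (the numbers here are invented, not rows from the dataset):

```r
# Toy one-row example with made-up stats -- not taken from the Moneyball data
team <- data.frame(RS = 750, RA = 700, OBP = 0.330, SLG = 0.420)

team$RunDiff <- team$RS - team$RA    # run differential
team$OPS     <- team$OBP + team$SLG  # on-base plus slugging

team$RunDiff  # 50
team$OPS      # 0.75
```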

#### Variables Ignored

The two rank variables, RankSeason and RankPlayoffs, are defined only for teams that made the playoffs and are therefore largely missing, so they are excluded from the models.

Visualizations {data-orientation=rows}
===============================================================

Row {.tabset}
---------------------------------------------------------------
```{r}
#read in and prep data
df <- read_csv("baseball.csv")
df <- dplyr::mutate(df, Playoffs = as.factor(Playoffs), OPS = OBP + SLG, RunDiff = RS - RA,
                    validation = if_else(Year <= 2010, "Training","Validation"))
write.csv(df, 'INFO 3200 Project Data.csv', row.names = FALSE)

dftrain <- df %>%
  dplyr::filter(validation == "Training")

dfvalid <- df %>%
  dplyr::filter(validation == "Validation")
```
### Average Wins by Playoff Status
```{r Average Wins by Playoff Status}
# df to calculate average wins for each year based on playoff status 
avgyr <- df %>%
  dplyr::select(Year, W, Playoffs) %>%
  dplyr::group_by(Year, Playoffs) %>%
  dplyr::summarize(avgwins = mean(W))

# make plot 
p <- ggplot(data = avgyr, aes(x=Year, y = avgwins, group = Playoffs, col = Playoffs))

p + geom_line() + geom_point() +
  scale_y_continuous(breaks = seq(0,100, by=5)) +
  scale_x_continuous(breaks = seq(1960,2015, by = 10)) +
  labs(x = "Year",
       y = "Average Wins",
       title = "Average Wins per Year for Playoff & Non-Playoff Teams") +
  scale_color_brewer(palette = "Set1") 

```

### Run Differential
```{r Run Differential}
# Set alpha as a constant outside aes() so the bars are actually drawn at 0.95
# (a constant inside aes() gets rescaled by scale_alpha rather than used directly)
p <- ggplot(data = df, aes(x = RunDiff, fill = Playoffs))

p + geom_histogram(binwidth = 20, color = "#bdbdbd", size = 0.5, alpha = 0.95) +
  scale_fill_brewer(palette = "Set1") +
  geom_vline(aes(xintercept = mean(RunDiff)), linetype = "dashed") +
  annotate("text", x = -12, y = 5, label = "average", angle = 90) +
  labs(title = "Distribution of Run Differential by Playoffs") +
  scale_x_continuous(breaks = seq(-400, 400, by = 100))
```

### Runs over Time
```{r Runs over time}
rsyr <- df %>%
  dplyr::select(Year, RS, League) %>%
  dplyr::group_by(Year, League) %>%
  dplyr::summarize(totalRS = sum(RS))

p <- ggplot(data = rsyr, aes(x = Year, y = totalRS, col = League, fill = League, group = League))

p + #geom_col(position = position_dodge2(width = 2.5), width = 1) + 
  geom_line(alpha = 0.7) +
  scale_color_brewer(palette = "Set1") +
  geom_vline(xintercept = 1973, linetype = "dashed", alpha = 0.7) +
  geom_vline(xintercept = 1998, linetype = "dashed", alpha = 0.7) +
  annotate("text", x = 1972, y = 6000, label = "DH in AL", angle = 90) + 
  annotate("text", x = 1997, y = 6625, label = "More NL than AL", angle = 90) + 
  #facet_wrap(~League, ncol = 1) +
  scale_y_continuous(breaks = seq(0,20000, by = 1000)) +
  scale_x_continuous(breaks = seq(1960,2015, by = 5)) +
  guides(alpha = "none") +
  labs(x = "Year", 
       y = "Total Runs Scored", 
       title = "Runs Scored by League Over Time")
```

Prediction {data-orientation=rows}
=============================================================

Row {.sidebar}
-------------------------------------------------------------
### **Analysis**
Both models predict the number of wins well and perform similarly. The boosted tree performs slightly better on the validation set, with the higher R-square and the lower RMSE. Because validation performance is the better indicator of how a model generalizes to new seasons, the boosted tree is the best model.

Row {.tabset}
------------------------------------------------------------
### Linear Regression
```{r include=FALSE}
# Full regression model. RunDiff (RS - RA) and OPS (OBP + SLG) are exact linear
# combinations of other predictors, so lm() reports their coefficients as NA (aliased).
lm_W <- lm(W ~ RS + RA + SLG + OBP + BA + Playoffs + G + OOBP + OSLG +
             Team + League + RunDiff + OPS, data = dftrain)

# Mixed (both-direction) stepwise selection by AIC
stepW <- stepAIC(lm_W, direction = "both", trace = FALSE)
summary(stepW)
tidy(stepW)

# Predict on the validation set and compute validation metrics
valid <- predict(stepW, newdata = dfvalid)
stepW_valid_comp <- data.frame(RSq = R2(valid, dfvalid$W),
                               RMSE = RMSE(valid, dfvalid$W))
```

#### Regression Model Summary
```{r}
knitr::kable(tidy(stepW))
# Training-set metrics for the stepwise-selected model, so the comparison table
# reports the same model on both the training and validation sets
ARSq <- round(summary(stepW)$adj.r.squared, 2)
sig <- round(summary(stepW)$sigma, 2)
```

### Boosted Tree
![](Boosted%20Tree%20W.png)

Row {data-height=200}
--------------------------------------------------------------
### Model Comparison
```{r}
model_comp <- tribble(~Model, ~Set, ~RSq, ~RMSE,
                      "Regression", "Training", ARSq, sig,
                      "Regression", "Validation", stepW_valid_comp$RSq, stepW_valid_comp$RMSE,
                      # boosted tree metrics are hard-coded from the external
                      # model output shown on the Boosted Tree tab
                      "Boosted Tree", "Training", 0.903, 3.5610843,
                      "Boosted Tree", "Validation", 0.913, 3.3927529)
knitr::kable(model_comp)
```
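
For reference, the regression rows above use caret's `R2()` and `RMSE()`. A base-R sketch of what those two metrics compute (caret's `R2()` defaults to the squared correlation between predicted and observed values; the vectors below are invented, not model output):

```r
# Hand-rolled equivalents of the two comparison metrics
rsq  <- function(pred, obs) cor(pred, obs)^2            # squared correlation
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))  # root mean squared error

obs  <- c(90, 85, 100, 70)  # made-up observed win totals
pred <- c(88, 87, 98, 72)   # made-up predictions

rmse(pred, obs)  # every error is +/-2, so RMSE = 2
rsq(pred, obs)
```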

Classification {data-orientation=rows}
=============================================================

Row {.sidebar}
-------------------------------------------------------------
### **Analysis**
The Naive Bayes model produces a low error rate on the training set, but it performs far worse on the validation set, and its validation R-square is significantly lower than that of the other two models. The bootstrap forest performs exceptionally well on the training set but noticeably worse on the validation set, which suggests it is overfitting. The logistic regression has a validation error rate similar to the bootstrap forest's and a higher validation R-square; although its training R-square is lower and its training error rate slightly higher, its validation performance makes it the best model overall.
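
The error rates being compared are plain misclassification rates. A minimal sketch with an invented confusion matrix (the counts are made up, not the actual model output):

```r
# Invented 2x2 confusion matrix: rows = actual playoff status, cols = predicted
conf <- matrix(c(850,  50,
                  60, 240),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Misclassification rate = off-diagonal counts / total
error_rate <- 1 - sum(diag(conf)) / sum(conf)
error_rate  # (50 + 60) / 1200
```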

Row {.tabset}
-------------------------------------------------------------

### Naive Bayes 
![](NB%20Playoffs%20Fit.png){width=50%} ![](NB%20Playoffs%20Details.png){width=50%}

### Logistic Regression
![](Logistic%20Playoffs.png)

### Bootstrap Forest
![](Bootstrap%20Playoffs.png)

Row {data-height=200}
--------------------------------------------------------------

### Model Comparison
![](Classification%20Model%20Comparison.png)