class: center, middle, inverse, title-slide

# Intro to Machine Learning with R
## From a tidyverse perspective
### Ian Flores Siaca
### 2018/03/20

---

class: center, middle

# What is Machine Learning?

---

# Types of Machine Learning

.large[- Unsupervised Learning]
    - We don't have labels
.large[- Supervised Learning]
    - We have labels
.large[- Reinforcement Learning]
    - A different kind of problem: an agent learns from rewards instead of labels
        - Goal-oriented learning
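
To make the first two categories concrete, here is a minimal sketch. It assumes R's built-in `iris` data as a stand-in (it is not the dataset used in this talk): `kmeans()` only sees the measurements, while `rpart()` is also given the labels.

```r
library(rpart)  # recommended package that ships with R

set.seed(42)
features <- iris[, 1:4]

# Unsupervised: no labels are passed, the algorithm only groups similar rows.
clusters <- kmeans(features, centers = 3)

# Supervised: the Species label is the target the model learns to predict.
tree <- rpart(Species ~ ., data = iris, method = "class")
predicted <- predict(tree, newdata = iris, type = "class")

table(cluster = clusters$cluster)                   # group sizes, no class names
table(truth = iris$Species, predicted = predicted)  # labels allow a direct check
```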

---
# Unsupervised Learning Example

![](imgs/unsupervised.png)
Source: [IotForAll](https://www.iotforall.com/machine-learning-crash-course-unsupervised-learning/)

---
# Supervised Learning Example

![](imgs/supervised.png)
Source: [StatsandBots](https://blog.statsbot.co/machine-learning-algorithms-183cc73197c)

---
# Machine Learning & Ethics

- https://www.montrealdeclaration-responsibleai.com/

![](imgs/montreal.png)

---
# ML Ecosystem in R

![](imgs/caret.png)

![](imgs/keras.png)

![](imgs/tensorflow.png)

---
# Parsnip package

.large[- Separate the definition of a model from its evaluation]
.large[- Use different packages as engines to train models]
.large[- Harmonize argument names for the same algorithms]
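
A small sketch of these three points, assuming the `parsnip` package is installed. Nothing is trained here; the number of trees (500) is an arbitrary value chosen for illustration.

```r
library(parsnip)

# One model definition, independent of any engine.
# `trees` is parsnip's harmonized name for the number of trees.
rf_spec <- rand_forest(mode = "classification", trees = 500)

# The same definition handed to two different engines.
rf_ranger <- set_engine(rf_spec, "ranger")
rf_random_forest <- set_engine(rf_spec, "randomForest")

# translate() prints the engine-specific call parsnip would build,
# including each package's own name for the `trees` argument.
translate(rf_ranger)
translate(rf_random_forest)
```

Only the `set_engine()` line changes between the two; the model definition stays the same.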

---
class: center, middle, inverse

# Canada - Soccer Crisis

---
# Canada - Soccer Crisis

.large[- Has only qualified for 1 World Cup in the last 33 years]
    - Lost all 3 games
.large[- In 2014 & 2018 we saw some great improvements on the team]
    - 10 games won in the 2014 Qualifiers
    - 12 games won in the 2018 Qualifiers
.large[- World Cup in 2022 & 2026]
    
    
### What is the quality level of a given player?

- This helps us: 
    - Improve the quality level of teams
    - Predict how we can improve
    
---
# How are we going to win the next World Cup?

- Database of players from the FIFA 19 game

```r
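library(tidyverse)  # provides read_csv(), mutate(), ggplot() and %>%

# Read the player data and treat the target as a categorical variable
fifa_data <- read_csv('data/fifa_players.csv', col_types = cols()) %>%
    mutate(quality_level = as.factor(quality_level))

head(fifa_data)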
```

```
## # A tibble: 6 x 16
##   Name    Age Nationality Position Finishing ShortPassing LongPassing
##   <chr> <dbl> <chr>       <chr>        <dbl>        <dbl>       <dbl>
## 1 E. H…    30 Mexico      LM              75           79          75
## 2 H. S…    28 Japan       RB              33           71          64
## 3 C. B…    26 Germany     CM              57           76          78
## 4 Régis    25 Brazil      CAM             70           74          71
## 5 M. H…    25 Austria     LCB             48           63          81
## 6 R. S…    27 Italy       RCM             72           78          77
## # … with 9 more variables: Acceleration <dbl>, SprintSpeed <dbl>,
## #   Agility <dbl>, Balance <dbl>, ShotPower <dbl>, Jumping <dbl>,
## #   Stamina <dbl>, Strength <dbl>, quality_level <fct>
```

---
# First thing to do? Explore the data

```r
fifa_data %>%
    ggplot(aes(Finishing, fill = quality_level)) +
        geom_density(alpha = 0.4, colour = NA)
```

![](presentation_files/figure-html/unnamed-chunk-3-1.png)

---

```r
fifa_data %>%
    ggplot(aes(x = ShortPassing, 
               y = Finishing, 
               colour = quality_level)) +
        geom_point(alpha = 0.7, size = 3)
```

![](presentation_files/figure-html/unnamed-chunk-4-1.png)

---

```r
library(GGally)  # provides ggpairs()

fifa_data %>%
    select(ShortPassing, Finishing, Acceleration, quality_level) %>%
    ggpairs(mapping = aes(colour = quality_level))
```

![](presentation_files/figure-html/unnamed-chunk-5-1.png)

---
# Golden Rule of Machine Learning

```r
library(rsample)

# Hold out a test set before any modelling; it is only used for the final evaluation
data_split <- fifa_data %>%
    select(-Name, -Nationality, -Position) %>%
    initial_split(prop = 0.75, strata = 'quality_level')

train_data <- training(data_split)
test_data <- testing(data_split)
```
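
To show what a stratified 75/25 split means, here is a base R sketch on hypothetical toy data. It is an illustration of the idea only, not `rsample`'s actual implementation: rows are sampled within each quality level, so both sets keep the same class balance.

```r
set.seed(2019)

# Hypothetical toy data: 100 players in each of four quality levels.
toy <- data.frame(
    Finishing = round(runif(400, min = 20, max = 90)),
    quality_level = factor(rep(c("A", "B", "C", "D"), each = 100))
)

# Sample 75% of the row indices *within* each quality level.
rows_by_level <- split(seq_len(nrow(toy)), toy$quality_level)
train_rows <- unlist(lapply(rows_by_level, function(rows) {
    sample(rows, size = floor(0.75 * length(rows)))
}))

toy_train <- toy[train_rows, ]
toy_test <- toy[-train_rows, ]

table(toy_train$quality_level)  # 75 rows per level
table(toy_test$quality_level)   # 25 rows per level
```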

---

```r
head(train_data)
```

```
## # A tibble: 6 x 13
##     Age Finishing ShortPassing LongPassing Acceleration SprintSpeed Agility
##   <dbl>     <dbl>        <dbl>       <dbl>        <dbl>       <dbl>   <dbl>
## 1    30        75           79          75           74          77      80
## 2    26        57           76          78           75          66      80
## 3    25        70           74          71           79          75      83
## 4    27        72           78          77           75          75      79
## 5    28        26           76          71           86          88      76
## 6    28        46           81          80           76          76      72
## # … with 6 more variables: Balance <dbl>, ShotPower <dbl>, Jumping <dbl>,
## #   Stamina <dbl>, Strength <dbl>, quality_level <fct>
```

---
# Decision Tree (Theory)

![](imgs/decision_tree.jpg)

Source: [HackerNoon](https://hackernoon.com/what-is-a-decision-tree-in-machine-learning-15ce51dc445d)
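
One common way to choose the question at each node, used by CART-style trees such as those built by `rpart`, is to pick the split that leaves the purest groups. A minimal sketch with Gini impurity follows; the six players, their labels, and the thresholds are made up for illustration.

```r
# Gini impurity: 0 for a pure group, larger when classes are mixed.
gini <- function(labels) {
    proportions <- table(labels) / length(labels)
    1 - sum(proportions^2)
}

# Hypothetical mini data set: six players, one feature.
finishing <- c(30, 35, 40, 70, 75, 80)
level <- c("D", "D", "D", "A", "A", "D")

# Weighted impurity of the two groups produced by a split at `threshold`.
split_impurity <- function(threshold) {
    left <- level[finishing < threshold]
    right <- level[finishing >= threshold]
    (length(left) * gini(left) + length(right) * gini(right)) / length(level)
}

gini(level)         # impurity before splitting
split_impurity(50)  # lower is better
split_impurity(37)  # a worse split for these labels
```
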
---
# Decision Tree (Parsnip Interface)

```r
library(parsnip)
model <- decision_tree(mode = 'classification') %>%
    set_engine('rpart') %>%
    fit(quality_level ~ ., data = train_data)
```

---
# How do we know how we are doing?

```r
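# Compare the true quality level with the model's prediction on unseen data.
# predict() returns a tibble; its .pred_class column holds the predicted class.
test_results <- test_data %>%
    select(quality_level) %>%
    as_tibble() %>%
    mutate(predicted = predict(model, new_data = test_data)$.pred_class)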

head(test_results)
```

```
## # A tibble: 6 x 2
##   quality_level predicted
##   <fct>         <fct>    
## 1 A             B        
## 2 A             D        
## 3 A             A        
## 4 A             A        
## 5 A             A        
## 6 A             A
```

---
# Accuracy for Decision Tree

```r
library(yardstick)

# Share of test players whose predicted level matches the true level
test_results %>%
    accuracy(truth = quality_level, estimate = predicted)
```

```
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.513
```

---
# Confusion Matrix for Decision Tree

```r
# Cross-tabulate predictions (rows) against true levels (columns)
test_results %>%
    conf_mat(truth = quality_level, estimate = predicted)
```

```
##           Truth
## Prediction   A   B   C   D
##          A 188  55  12   6
##          B  32  78  41  12
##          C  17  65  64  49
##          D  13  52 133 183
```
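
The accuracy reported earlier can be recomputed by hand from this matrix. The sketch below copies the counts from the output above into a base R matrix and also derives the recall per class, which the slides do not report.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = truth).
quality_levels <- c("A", "B", "C", "D")
tree_cm <- matrix(
    c(188, 55,  12,   6,
       32, 78,  41,  12,
       17, 65,  64,  49,
       13, 52, 133, 183),
    nrow = 4, byrow = TRUE,
    dimnames = list(Prediction = quality_levels, Truth = quality_levels)
)

# Accuracy: correct predictions (the diagonal) over all predictions.
sum(diag(tree_cm)) / sum(tree_cm)
# 0.513

# Recall per class: correct predictions over the true members of each class.
diag(tree_cm) / colSums(tree_cm)
# A = 0.752, B = 0.312, C = 0.256, D = 0.732
```

The tree separates the extremes (A and D) fairly well but confuses the middle levels B and C.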

---
# Random Forest (Theory)

![](imgs/random_forest.jpg)

Source: [Dimitriadis, et al.](http://www.nrronline.org/article.asp?issn=1673-5374;year=2018;volume=13;issue=6;spage=962;epage=970;aulast=Dimitriadis)
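
For classification, a random forest combines its trees by majority vote. The sketch below shows only that aggregation step, with hypothetical votes from five trees for three players; a real forest also trains each tree on a bootstrap sample and considers a random subset of features at each split.

```r
# Hypothetical votes: each row is one tree, each column one player.
votes <- matrix(
    c("A", "C", "D",
      "A", "B", "D",
      "B", "C", "D",
      "A", "C", "C",
      "A", "B", "D"),
    ncol = 3, byrow = TRUE,
    dimnames = list(paste0("tree_", 1:5), paste0("player_", 1:3))
)

# The forest's prediction is the most frequent class among the trees' votes.
majority_vote <- function(x) names(which.max(table(x)))
forest_prediction <- apply(votes, 2, majority_vote)

forest_prediction
# player_1 = "A", player_2 = "C", player_3 = "D"
```
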
---
# Random Forest (Parsnip Interface)

```r
model <- rand_forest(mode = 'classification') %>%
    set_engine('ranger') %>%
    fit(quality_level ~ ., data = train_data)
```

---
# Accuracy for Random Forest

```r
# Same evaluation as for the decision tree, now with the random forest model
test_results <- test_data %>%
    select(quality_level) %>%
    as_tibble() %>%
    mutate(predicted = predict(model, new_data = test_data)$.pred_class)

test_results %>%
    accuracy(truth = quality_level, estimate = predicted)
```

```
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.621
```

---
# Confusion Matrix for Random Forest

```r
# Cross-tabulate predictions (rows) against true levels (columns)
test_results %>%
    conf_mat(truth = quality_level, estimate = predicted)
```

```
##           Truth
## Prediction   A   B   C   D
##          A 200  32   5   2
##          B  39 132  47  12
##          C   6  59 113  60
##          D   5  27  85 176
```
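
To see where the random forest gains over the decision tree, the counts from the two confusion matrices in these slides can be compared class by class. This is a base R sketch; the helper `as_cm()` is introduced here only for illustration.

```r
quality_levels <- c("A", "B", "C", "D")

# Hypothetical helper: build a labelled 4 x 4 matrix from row-wise counts.
as_cm <- function(counts) {
    matrix(counts, nrow = 4, byrow = TRUE,
           dimnames = list(Prediction = quality_levels, Truth = quality_levels))
}

# Counts copied from the two confusion matrices in these slides.
tree_cm <- as_cm(c(188, 55, 12, 6,  32, 78, 41, 12,
                   17, 65, 64, 49,  13, 52, 133, 183))
forest_cm <- as_cm(c(200, 32, 5, 2,  39, 132, 47, 12,
                     6, 59, 113, 60,  5, 27, 85, 176))

# Change in correctly classified players per true class (forest minus tree).
diag(forest_cm) - diag(tree_cm)
# A = 12, B = 54, C = 49, D = -7

# Recall per class for each model, side by side.
rbind(tree = diag(tree_cm) / colSums(tree_cm),
      forest = diag(forest_cm) / colSums(forest_cm))
```

Most of the improvement comes from the middle levels B and C, while level D is classified slightly worse.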

---
# But, what about Canadian players?

```r
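# Caveat: fifa_data also contains the rows used for training, so agreement
# here can look better than it would on players the model has never seen.
canadian_players <- fifa_data %>%
    filter(Nationality == 'Canada')

canada_predictions <- canadian_players %>%
    select(-Name, -Nationality, -Position) %>%
    predict(model, new_data = .) %>%
    pull(.pred_class)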

canadian_players %>%
    select(Name, Age, Nationality, Position, quality_level) %>%
    mutate(predictions = canada_predictions) %>%
    head()
```

```
## # A tibble: 6 x 6
##   Name           Age Nationality Position quality_level predictions
##   <chr>        <dbl> <chr>       <chr>    <fct>         <fct>      
## 1 L. Cavallini    25 Canada      RS       A             A          
## 2 A. Davies       17 Canada      RM       B             B          
## 3 W. Johnson      31 Canada      CDM      B             B          
## 4 M. de Jong      31 Canada      LB       C             C          
## 5 A. Hainault     32 Canada      LCB      C             D          
## 6 Pacheco         34 Canada      RCM      C             C
```

---
# Things we didn't cover today

.large[- Cross-Validation]
.large[- Deep Learning]
.large[- Regression Tasks]
.large[- Unsupervised Learning Algorithms]

---
class: center, middle, inverse

# Thank You!

Email: iflores.siaca@gmail.com

GitHub: @ian-flores