class: center, middle, inverse, title-slide

# Intro to Machine Learning with R
## From a tidyverse perspective
### Ian Flores Siaca
### 2018/03/20

---
class: center, middle

# What is Machine Learning?

---

# Types of Machine Learning

.large[- Unsupervised Learning]

- We don't have labels

.large[- Supervised Learning]

- We have labels

.large[- Reinforcement Learning]

- Totally different problem
- Goal-oriented learning

---

# Unsupervised Learning Example

![](imgs/unsupervised.png)

Source: [IotForAll](https://www.iotforall.com/machine-learning-crash-course-unsupervised-learning/)

---

# Supervised Learning Example

![](imgs/supervised.png)

Source: [StatsandBots](https://blog.statsbot.co/machine-learning-algorithms-183cc73197c)

---

# Machine Learning & Ethics

- https://www.montrealdeclaration-responsibleai.com/

![](imgs/montreal.png)

---

# ML Ecosystem in R

![](imgs/caret.png)

![](imgs/keras.png)

![](imgs/tensorflow.png)

---

# Parsnip package

.large[- Separate the definition of a model from its evaluation]

.large[- Use different packages as engines to train models]

.large[- Harmonize argument names for the same algorithms]

<img src='imgs/parsnip.jpg' height=300 align='center'>

---
class: center, middle, inverse

# Canada - Soccer Crisis

---

# Canada - Soccer Crisis

.large[- Has only qualified for 1 World Cup in the last 33 years]

- Lost all 3 games

.large[- In 2014 & 2018 we saw some great improvements on the team]

- 10 games won in the 2014 Qualifiers
- 12 games won in the 2018 Qualifiers

.large[- World Cup in 2022 & 2026]

### What is the quality level of a given player?

- This helps us:
  - Improve the quality level of teams
  - Predict how we can improve

---

# How are we going to win the next WC?
- Database of players from the FIFA 19 game

```r
library(tidyverse)

# Read the player data and treat the target as a categorical variable
fifa_data <- read_csv('data/fifa_players.csv', col_types = cols()) %>%
  mutate(quality_level = as.factor(quality_level))

head(fifa_data)
```

```
## # A tibble: 6 x 16
##   Name    Age Nationality Position Finishing ShortPassing LongPassing
##   <chr> <dbl> <chr>       <chr>        <dbl>        <dbl>       <dbl>
## 1 E. H…    30 Mexico      LM              75           79          75
## 2 H. S…    28 Japan       RB              33           71          64
## 3 C. B…    26 Germany     CM              57           76          78
## 4 Régis    25 Brazil      CAM             70           74          71
## 5 M. H…    25 Austria     LCB             48           63          81
## 6 R. S…    27 Italy       RCM             72           78          77
## # … with 9 more variables: Acceleration <dbl>, SprintSpeed <dbl>,
## #   Agility <dbl>, Balance <dbl>, ShotPower <dbl>, Jumping <dbl>,
## #   Stamina <dbl>, Strength <dbl>, quality_level <fct>
```

---

# First thing to do? Explore the data

```r
# Distribution of one skill, split by quality level
fifa_data %>%
  ggplot(aes(Finishing, fill = quality_level)) +
  geom_density(alpha = 0.4, colour = NA)
```

![](presentation_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

```r
# Two skills against each other, coloured by quality level
fifa_data %>%
  ggplot(aes(x = ShortPassing, y = Finishing, colour = quality_level)) +
  geom_point(alpha = 0.7, size = 3)
```

![](presentation_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

```r
library(GGally)

# Pairwise plots for a handful of skills
fifa_data %>%
  select(ShortPassing, Finishing, Acceleration, quality_level) %>%
  ggpairs(., mapping = aes(colour = quality_level))
```

![](presentation_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

# Golden Rule of Machine Learning

```r
library(rsample)

# Hold out 25% of the players as a test set that the model never sees
# during training; stratify so each quality level is equally represented.
data_split <- fifa_data %>%
  select(-Name, -Nationality, -Position) %>%
  initial_split(., strata = 'quality_level', prop = 0.75)

train_data <- training(data_split)
test_data <- testing(data_split)
```

---

```r
head(train_data)
```

```
## # A tibble: 6 x 13
##     Age Finishing ShortPassing LongPassing Acceleration SprintSpeed Agility
##   <dbl>     <dbl>        <dbl>       <dbl>        <dbl>       <dbl>   <dbl>
## 1    30        75           79          75           74          77      80
## 2    26        57           76          78           75          66      80
## 3    25        70           74          71           79          75      83
## 4    27        72           78          77           75          75      79
## 5    28        26           76          71           86          88      76
## 6    28        46           81          80           76          76      72
## # … with 6 more variables: Balance <dbl>, ShotPower <dbl>, Jumping <dbl>,
## #   Stamina <dbl>, Strength <dbl>, quality_level <fct>
```

---

# Decision Tree (Theory)

![](imgs/decision_tree.jpg)

Source: [HackerNoon](https://hackernoon.com/what-is-a-decision-tree-in-machine-learning-15ce51dc445d)

---

# Decision Tree (Parsnip Interface)

```r
library(parsnip)

# Define the model, pick the engine, then fit on the training data only
model <- decision_tree(mode = 'classification') %>%
  set_engine('rpart') %>%
  fit(quality_level ~ ., data = train_data)
```

---

# How do we know how we are doing?

```r
# Compare the true labels of the held-out players with the predictions
test_results <- test_data %>%
  select(quality_level) %>%
  as_tibble() %>%
  mutate(predicted = predict_class(model, new_data = test_data))

head(test_results)
```

```
## # A tibble: 6 x 2
##   quality_level predicted
##   <fct>         <fct>    
## 1 A             B        
## 2 A             D        
## 3 A             A        
## 4 A             A        
## 5 A             A        
## 6 A             A        
```

---

# Accuracy for Decision Tree

```r
library(yardstick)

test_results %>%
  accuracy(truth = quality_level, estimate = 'predicted')
```

```
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.513
```

---

# Confusion Matrix for Decision Tree

```r
test_results %>%
  conf_mat(truth = quality_level, estimate = 'predicted')
```

```
##           Truth
## Prediction   A   B   C   D
##          A 188  55  12   6
##          B  32  78  41  12
##          C  17  65  64  49
##          D  13  52 133 183
```

---

# Random Forest (Theory)

![](imgs/random_forest.jpg)

Source: [Dimitriadis, et al.](http://www.nrronline.org/article.asp?issn=1673-5374;year=2018;volume=13;issue=6;spage=962;epage=970;aulast=Dimitriadis)

---

# Random Forest (Parsnip Interface)

```r
# Same interface as before; only the model type and engine change
model <- rand_forest(mode = 'classification') %>%
  set_engine('ranger') %>%
  fit(quality_level ~ ., data = train_data)
```

---

# Accuracy for Random Forest

```r
test_results <- test_data %>%
  select(quality_level) %>%
  as_tibble() %>%
  mutate(predicted = predict_class(model, new_data = test_data))

test_results %>%
  accuracy(truth = quality_level, estimate = 'predicted')
```

```
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.621
```

---

# Confusion Matrix for Random Forest

```r
test_results %>%
  conf_mat(truth = quality_level, estimate = 'predicted')
```

```
##           Truth
## Prediction   A   B   C   D
##          A 200  32   5   2
##          B  39 132  47  12
##          C   6  59 113  60
##          D   5  27  85 176
```

---

# But, what about Canadian players?

```r
canadian_players <- fifa_data %>%
  filter(Nationality == 'Canada')

# Drop the same identifier columns that were removed before training
canada_predictions <- canadian_players %>%
  select(-Name, -Nationality, -Position) %>%
  predict_class(model, new_data = .)

canadian_players %>%
  select(Name, Age, Nationality, Position, quality_level) %>%
  mutate(predictions = canada_predictions) %>%
  head()
```

```
## # A tibble: 6 x 6
##   Name           Age Nationality Position quality_level predictions
##   <chr>        <dbl> <chr>       <chr>    <fct>         <fct>      
## 1 L. Cavallini    25 Canada      RS       A             A          
## 2 A. Davies       17 Canada      RM       B             B          
## 3 W. Johnson      31 Canada      CDM      B             B          
## 4 M. de Jong      31 Canada      LB       C             C          
## 5 A. Hainault     32 Canada      LCB      C             D          
## 6 Pacheco         34 Canada      RCM      C             C          
```

---

# Things we didn't cover today

.large[- Cross-Validation]

.large[- Deep Learning]

.large[- Regression Tasks]

.large[- Unsupervised Learning Algorithms]

---
class: center, middle, inverse

# Thank You!

Email: iflores.siaca@gmail.com

GitHub: @ian-flores
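
---

# Appendix: Accuracy from the Confusion Matrix

A minimal base-R sketch, not part of the original talk code. The accuracy values reported by `yardstick` on the earlier slides can be reproduced by hand from the two confusion matrices: accuracy is the sum of the diagonal divided by the total number of test rows. The matrices are copied from the slide output; `cm_accuracy()` and `cm_recall()` are hypothetical helper names introduced here for illustration.

```r
# Confusion matrices copied from the slides (rows = prediction, columns = truth)
classes <- c("A", "B", "C", "D")

tree_cm <- matrix(
  c(188, 55,  12,   6,
     32, 78,  41,  12,
     17, 65,  64,  49,
     13, 52, 133, 183),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = classes, Truth = classes)
)

forest_cm <- matrix(
  c(200,  32,   5,   2,
     39, 132,  47,  12,
      6,  59, 113,  60,
      5,  27,  85, 176),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = classes, Truth = classes)
)

# Hypothetical helper: share of test rows on the diagonal (correct predictions)
cm_accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Hypothetical helper: per-class recall, i.e. correct predictions for a class
# divided by the number of test rows that truly belong to that class
cm_recall <- function(cm) diag(cm) / colSums(cm)

cm_accuracy(tree_cm)    # 0.513, matches the decision tree slide
cm_accuracy(forest_cm)  # 0.621, matches the random forest slide

round(cm_recall(tree_cm), 2)
round(cm_recall(forest_cm), 2)
```

The per-class view adds detail that a single accuracy number hides: in both matrices the middle levels B and C are confused with their neighbours far more often than A and D are.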