Workshop "Introduction to machine learning - principles, models, and R"
Date: Friday, 22nd November 2019 (9.30 to 17.30 hrs.)
Venue: Schulungsraum of the Starterzentrum, basement of building A1 1
Machine learning is gaining increasing attention as a branch of artificial intelligence (e.g., self-driving cars, robotics), often accompanied by sensationalized media reports about the capabilities of these models. In reality, machine learning can be considered another branch of statistics/data analysis, concerned specifically with nonlinear modelling. In the social sciences, data analysis often involves modelling a certain outcome/dependent variable, given a number of features/independent variables (e.g., multiple regression). Machine learning shares this goal but extends basic linear regression with more complex options. As such, it can be applied to the same data problems faced by social scientists, e.g., predicting personality, modelling mental states and behavior, or grouping patterns of responses. Machine learning is thus applicable and potentially helpful to research from a wide range of subdisciplines of psychology. It allows researchers to gain deeper insights into (existing and future) data that cannot easily be gained with more traditional statistical methods. Contrary to common misperceptions, machine learning does not depend on big data, but can readily be applied to many data sets that are regularly collected in psychological (laboratory and field) research. It is thus a valuable addition to many psychologists' methodological toolbox.
In this workshop, I will attempt to demystify machine learning. In the morning session, I will first draw connections between machine learning and regression (e.g., ANOVA). Second, I will explain the core principles behind machine learning, such as overfitting, generalization, and cross-validation. In the afternoon session, I will introduce you to specific models, which roughly fall into three categories: prototype methods, flexible regression, and tree-based methods. Some of these models differ radically from linear regression, yet they are easy to understand and can adapt to complex data patterns. For each model, I will explain how it works, how to run it in R, and how it can be useful for your own data. Throughout the workshop, I will illustrate the applicability of the models with publications from the social sciences. You are encouraged to bring your laptop and run the R code along with the lectures, although this is not required to follow them.
Audience and format
This workshop is aimed at researchers in the social sciences who wish to understand machine learning and its value for their own field. Prior experience with machine learning is not needed, but an understanding of linear regression (e.g., ANOVA, multiple regression) is required. Familiarity with R is strongly recommended but not required to follow the lectures. This workshop is not an introduction to R, nor will there be a separate practical session with exercises. All scripts are integrated directly into the lectures.
Material and R
Participants are encouraged to bring their laptops to execute the R code along with the instructor, but the workshop can be followed just as well without a computer. All workshop material (slides, data, scripts) will be shared in downloadable format in advance of the workshop. The R scripts load data sets directly from the internet, so no data needs to be saved on your computer. If you wish to run the R scripts, you will need to have installed a recent version of R (https://www.r-project.org/), as well as the following packages: AdaptiveSparsity, car, DAAG, earth, foreach, ipred, glmnet, kknn, klaR, ks, maptree, MASS, mclust, mda, randomForest, rpart, rpart.plot, visreg. The fastest way to do this is by first selecting a mirror/location to download packages from, followed by directly running this line of code:
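A minimal sketch of such a call is given here; it is not necessarily the instructor's exact line. It assumes the standard install.packages() function and that every listed package is still available from the chosen mirror. The variable names pkgs and missing_pkgs are illustrative only.

```r
# Packages named in the list above; `pkgs` is an illustrative variable name.
pkgs <- c("AdaptiveSparsity", "car", "DAAG", "earth", "foreach", "ipred",
          "glmnet", "kknn", "klaR", "ks", "maptree", "MASS", "mclust", "mda",
          "randomForest", "rpart", "rpart.plot", "visreg")

# Keep only the packages that are not yet present in the local library.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Download and install the missing packages from the selected mirror.
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

If a package cannot be found on the mirror, R reports a warning for that package and continues with the others, so it is worth checking the output once the call has finished.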
Mac users should remember to check the option "install dependencies" when installing a package, as this will automatically install any secondary packages that the package depends upon. On PCs, this option is switched on by default.
Schedule
09:30 – 10:30: Machine learning and social sciences
10:30 – 10:40: Break
10:40 – 11:40: Principles of machine learning
11:40 – 12:00: R primer and data problem
14:00 – 15:00: K-nearest neighbors and discriminants
15:00 – 15:15: Break
15:15 – 16:15: Regularized and flexible regression
16:15 – 16:30: Break
16:30 – 17:30: Trees and random forests