Typically, I tell students that the two primary categories of “basic” statistics are those that (a) determine the relationship between things and (b) determine the differences between groups. Sometimes, however, you want to do both. To do this, dummy-coded regression can help out. This page is a brief lesson on how to perform a dummy-coded regression in R. As always, if you have any questions, please email me at MHoward@SouthAlabama.edu!
The typical type of regression is a linear regression, which identifies a linear relationship between predictor(s) and an outcome. Believe it or not, a linear regression can also identify the differences between groups pretty well – as long as we know how to code our predictors correctly. This is where dummy coding comes into play. It can be used to answer the following questions and others like them:
- How does people’s training group relate to their job performance while accounting for their job satisfaction?
- How does people’s county of residence relate to their life satisfaction while accounting for their income?
- How does a widget’s manufacturing process relate to its assessed quality while accounting for the machine operator’s tenure?
Of course, there is more nuance to dummy-coded regression, but we will keep it simple. To answer these questions, we can use R to calculate a regression equation. If you don’t have a dataset, you can download the example dataset here. This dataset does not include any missing data. So, if you are dealing with missing data, you may have to add an extra command or two to your syntax.
Using this dataset, we are going to see whether Group 2 has a higher Var1 score than Group 1, whether Group 3 has a higher Var1 score than Group 1, and whether Var2 predicts Var1. And we are going to do this all at the same time!
First, you need to open your data in R. If you do not know how to do this, please refer to my page on opening .csv files in R. Your syntax should now look something like this:
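As a minimal sketch of this step, the syntax might look like the following. It assumes the data are stored in a .csv file and that the object is named MyData (the name used in the rest of this page). The file location and the six rows of data are made up so that the sketch runs on its own; they are not the example dataset.

```r
# Illustration only: build a small made-up dataset so the sketch runs on its own.
# These values are invented and are NOT the example dataset from this page.
example_file <- tempfile(fileext = ".csv")   # hypothetical file location
write.csv(
  data.frame(
    Group = c(1, 1, 2, 2, 3, 3),
    Var1  = c(3, 4, 5, 6, 7, 9),
    Var2  = c(2, 1, 2, 3, 1, 2)
  ),
  example_file,
  row.names = FALSE
)

# Read the .csv file into R; replace example_file with the path to your own file.
MyData <- read.csv(example_file)

# Look at the first few rows to check that the variables were read in as expected.
head(MyData)
```

With your own data, only the read.csv() line is needed, pointed at wherever your file is saved.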
Next, we need to create dummy variables. These variables represent group membership and can be used in a regression analysis. To create these dummy variables, we are going to use Group 1 as our reference group. For this reason, we do not create a dummy variable for Group 1. Instead, we are going to create dummy variables for Groups 2 and 3, such that the dummy variables will have a “1” for everyone in Group 2 or 3 (separately) and a “0” for everyone else. In other words, the first dummy variable will have a “1” for everyone in Group 2 and a “0” for everyone else. The second dummy variable will have a “1” for everyone in Group 3 and a “0” for everyone else.
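To see why this coding works, it may help to write out the regression equation we are building toward. The notation here is my own shorthand and is an assumption, not something R requires: D2 is the dummy variable for Group 2, D3 is the dummy variable for Group 3, and the b terms are the regression coefficients.

```latex
\mathrm{Var1}_i = b_0 + b_1 D_{2,i} + b_2 D_{3,i} + b_3 \mathrm{Var2}_i + e_i
```

For someone in Group 1, both dummy variables are 0, so the predicted score is b_0 + b_3 Var2. For someone in Group 2, it is b_0 + b_1 + b_3 Var2. So b_1 is the difference between Group 2 and Group 1 at the same value of Var2, and b_2 is the same comparison for Group 3 versus Group 1.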
To start this process, we will need to give our dummy variables labels. Let’s call them Dum1 and Dum2, so our two lines of syntax will begin with Dum1 <- and Dum2 <- .
Next, we are going to use the as.numeric() command to tell R to code everyone in Group 2 as “1” for the first dummy variable and everyone in Group 3 as “1” for the second dummy variable. To do this, we would type as.numeric(MyData$Group == 2) for the first dummy variable (Dum1), and we would type as.numeric(MyData$Group == 3) for the second dummy variable (Dum2). The comparison inside the parentheses is TRUE or FALSE for each person, and as.numeric() turns TRUE into 1 and FALSE into 0. This automatically codes the correct values for our dummy variables, because R is reading the data from the Group variable in MyData.
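Put together, the two lines described above might look like this. The small MyData object is a made-up stand-in so the sketch runs on its own; it is not the example dataset, and the table() checks at the end are an optional extra.

```r
# Made-up stand-in for the page's dataset so the sketch runs on its own.
MyData <- data.frame(
  Group = c(1, 1, 2, 2, 3, 3),
  Var1  = c(3, 4, 5, 6, 7, 9),
  Var2  = c(2, 1, 2, 3, 1, 2)
)

# Dum1 is 1 for everyone in Group 2 and 0 for everyone else.
Dum1 <- as.numeric(MyData$Group == 2)

# Dum2 is 1 for everyone in Group 3 and 0 for everyone else.
Dum2 <- as.numeric(MyData$Group == 3)

# Optional check: cross-tabulate each dummy variable against Group.
# Group 1 (the reference group) should be 0 on both dummy variables.
table(MyData$Group, Dum1)
table(MyData$Group, Dum2)
```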
We then need to add our new dummy variables to our dataset. To do so, we start by typing MyData <- . Then, we’ll use the cbind() command, which can add columns (i.e., variables) to a dataset, so we type cbind( next. This should be followed by our dataset and the two dummy variables, separated by commas, and then a closing parenthesis: MyData, Dum1, Dum2) . The full line is therefore MyData <- cbind(MyData, Dum1, Dum2) .
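Here is a short sketch of that step. As before, the small MyData object is made up so that the code runs on its own.

```r
# Made-up stand-in for the page's dataset so the sketch runs on its own.
MyData <- data.frame(
  Group = c(1, 1, 2, 2, 3, 3),
  Var1  = c(3, 4, 5, 6, 7, 9),
  Var2  = c(2, 1, 2, 3, 1, 2)
)
Dum1 <- as.numeric(MyData$Group == 2)
Dum2 <- as.numeric(MyData$Group == 3)

# Add the two dummy variables to the dataset as new columns.
MyData <- cbind(MyData, Dum1, Dum2)

# The dataset should now list Dum1 and Dum2 after the original variables.
names(MyData)
```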
We now have the correct dummy codes in our dataset. Great! Now we just need to run a regression. Let’s use Var1 as our outcome. And let’s use Dum1, Dum2, and Var2 as our predictors. If you don’t know how to run a regression in R, please refer to my page that teaches this skill. Either way, the correct syntax is provided below.
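A minimal sketch of the regression syntax, using the lm() and summary() commands, is shown here. The object name Model is my own choice and is only an assumption, and the small MyData object is made up so the code runs on its own, so the numbers it produces will not match the results from the example dataset.

```r
# Made-up stand-in for the page's dataset so the sketch runs on its own.
MyData <- data.frame(
  Group = c(1, 1, 2, 2, 3, 3),
  Var1  = c(3, 4, 5, 6, 7, 9),
  Var2  = c(2, 1, 2, 3, 1, 2)
)
Dum1 <- as.numeric(MyData$Group == 2)
Dum2 <- as.numeric(MyData$Group == 3)
MyData <- cbind(MyData, Dum1, Dum2)

# Regress Var1 (outcome) on the two dummy variables and Var2 (predictors).
# "Model" is a hypothetical object name.
Model <- lm(Var1 ~ Dum1 + Dum2 + Var2, data = MyData)

# Print the coefficients, standard errors, t values, and p values.
summary(Model)
```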
We should get results that look like the following:
From this result, we can infer that the effects of both dummy variables were positive and statistically significant. This means that, while accounting for Var2, Group 2 had greater outcome values than Group 1, and Group 3 also had greater outcome values than Group 1. Neat! Again, for more information about interpreting regression results, please refer to my page on regression.
That’s all for dummy-coded regression in R. If you have any other questions or comments, please contact me at MHoward@SouthAlabama.edu!