Dummy-Coded Regression in Python

Typically, I tell students that “basic” statistics fall into two primary categories: those that (a) determine the relationship between things and those that (b) determine the differences between groups.  Sometimes, however, you want to do both.  Dummy-coded regression can help with this.  This page is a brief lesson on how to perform a dummy-coded regression in Python.  As always, if you have any questions, please email me at MHoward@SouthAlabama.edu!

The typical type of regression is a linear regression, which identifies a linear relationship between predictor(s) and an outcome.  Believe it or not, a linear regression can also identify the differences between groups pretty well – as long as we know how to code our predictors correctly.  This is where dummy coding comes into play.  It can be used to answer the following questions and others like them:

  • What is the relationship between people’s training groups and their job performance while accounting for their job satisfaction?
  • What is the relationship between people’s county of residence and their life satisfaction while accounting for their income?
  • What is the relationship between a widget’s manufacturing process and its assessed quality while accounting for the machine operator’s tenure?

Of course, there is more nuance to dummy-coded regression, but we will keep it simple.  To answer these questions, we can use Python to calculate a regression equation.  If you don’t have a dataset, you can download the example dataset here.

Using this dataset, we are going to see whether Group 2 has a higher Var1 score than Group 1, whether Group 3 has a higher Var1 score than Group 1, and whether Var2 predicts Var1.  And we are going to do this all at the same time!

We are going to be using the pingouin and pandas modules. If you don’t know how to install modules, you can look at my guide for installing Python modules here. Likewise, you need to open your .csv file with Python. If you don’t know how to do so, you can look at my guide for opening .csv files in Python. Lastly, the example dataset is a .xlsx file, so it must be converted first. You can look at my guide on how to convert .xlsx files to .csv files here.

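With that noted, your syntax should begin by importing the pandas module as pd and the pingouin module as pg, and then opening your .csv file as a variable named MyData.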

Next, we need to create dummy variables. These variables represent group membership and can be used in a regression analysis. Fortunately, the pandas module has an extremely easy way to get dummy codes. We first need to assign these codes to a variable by typing: dums = . We then need to specify the command by typing: pd.get_dummies( . Lastly, we need to identify the variable that we want to create dummy codes for. In the current example, we type: MyData['Group']) . The full line is: dums = pd.get_dummies(MyData['Group']) . Once all of that has been typed, press enter.
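As a small illustration of what this command produces, here is a sketch on a made-up Group column with three groups coded 1, 2, and 3. The dtype=int argument is my addition: recent pandas versions return True/False by default, and 0/1 integers are easier to read and to pass to a regression.

```python
import pandas as pd

# Hypothetical grouping column with three groups coded 1, 2, and 3
MyData = pd.DataFrame({"Group": [1, 1, 2, 2, 3, 3]})

# One 0/1 column per group; each column is labelled with the group's value
dums = pd.get_dummies(MyData["Group"], dtype=int)

print(dums)
```

Each row has a 1 in exactly one column, the column belonging to that row's group.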

This will create dummy codes for each group of your grouping variable; however, we cannot use all the generated dummy codes in our regression, as the full set is redundant with the regression’s intercept and the regression cannot estimate all of them together. Instead, we want to include all the dummy codes except for one, and the excluded dummy code depends on our reference group. In the current example, Group 1 is our reference group, as we are comparing it to Group 2 and Group 3. This means we want Group 1 to be represented by 0s in all of our included dummy codes. In the dums variable we just created, the first dummy code has Group 1 as 1 and all others as 0; the second has Group 2 as 1 and all others as 0; and the third has Group 3 as 1 and all others as 0. Because we want Group 1 to be 0 in all included dummy codes, we would only use the second and third dummy codes, as the first has Group 1 as 1. If you are confused, just type in dums and press enter to see the dummy codes for yourself. This should help clear things up.
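To see why one dummy code has to be left out, here is a sketch on made-up data that uses numpy (not part of the original walkthrough) to count how many independent columns the predictors contain. A regression includes an intercept, which is a column of 1s, and the three dummy codes always add up to that same column.

```python
import numpy as np
import pandas as pd

groups = pd.Series([1, 1, 2, 2, 3, 3], name="Group")  # hypothetical data
dums = pd.get_dummies(groups, dtype=int)

intercept = np.ones(len(groups))

# Intercept plus all three dummy codes: 4 columns
full = np.column_stack([intercept, dums[1], dums[2], dums[3]])

# Intercept plus only the Group 2 and Group 3 codes: 3 columns
reduced = np.column_stack([intercept, dums[2], dums[3]])

# The rank is the number of columns that carry independent information
print(full.shape[1], np.linalg.matrix_rank(full))
print(reduced.shape[1], np.linalg.matrix_rank(reduced))
```

The four-column version has a rank of only 3, so one column is redundant. Once the reference group's code is dropped, the number of columns and the rank match.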

Once you understand the dummy codes, we want to add the two chosen dummy codes to the dataset. In this example, they are the 2nd and 3rd dummy codes of the dums variable we created. To add these to the dataset, we first enter what we want to name them. I named mine Dum1 and Dum2 by typing: MyData[['Dum1', 'Dum2']] = . Then, we want to identify the dummy codes that we will be using by typing: dums[[2, 3]] . Note that, from this point on, Dum1 is the dummy code for Group 2 and Dum2 is the dummy code for Group 3. Once all of that has been typed, press enter.
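Here is a short sketch of this step on a made-up Group column. It assumes the groups are coded as the numbers 1, 2, and 3, so that dums[[2, 3]] picks out the columns labelled 2 and 3.

```python
import pandas as pd

# Hypothetical grouping column with three groups coded 1, 2, and 3
MyData = pd.DataFrame({"Group": [1, 1, 2, 2, 3, 3]})
dums = pd.get_dummies(MyData["Group"], dtype=int)

# Copy the Group 2 and Group 3 codes into the dataset under new names
MyData[["Dum1", "Dum2"]] = dums[[2, 3]]

print(MyData)
```

Group 1 rows end up with 0 in both new columns, which is what makes Group 1 the reference group.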

Now we perform our regression that includes Var2, Dum1, and Dum2 predicting Var1. If you are unfamiliar with conducting regression in Python, be sure to refer to my guide on the topic. Otherwise, let’s assign our regression to the variable lm by typing: lm = pg.linear_regression(MyData[['Var2', 'Dum1', 'Dum2']], MyData['Var1']) .

Lastly, we want to print our output rounded to two decimal places by typing: lm.round(2). Press enter.

From this result, we can infer that the effects of both dummy variables were statistically significant.  This means that Group 2 had greater outcome values than Group 1, and Group 3 also had greater outcome values than Group 1.  Neat!  Again, for more information about interpreting regression results, please refer to my page on regression.
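If it helps to see why a dummy code's coefficient can be read as a difference from the reference group, here is a sketch that swaps pingouin for numpy's least-squares solver so the arithmetic is visible. The data are made up and contain no noise: Group 2 is built to sit exactly 3 points above Group 1, and Group 3 exactly 5 points above Group 1, at any value of Var2.

```python
import numpy as np

# Hypothetical, noise-free data
group = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
var2 = np.array([1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 1.0, 3.0, 5.0])

# Dummy codes with Group 1 as the reference group
dum1 = (group == 2).astype(float)  # 1 for Group 2, else 0
dum2 = (group == 3).astype(float)  # 1 for Group 3, else 0

# Outcome built from known values: intercept 2, Var2 slope 0.5,
# Group 2 offset 3, Group 3 offset 5
var1 = 2.0 + 0.5 * var2 + 3.0 * dum1 + 5.0 * dum2

# Regression of Var1 on an intercept, Var2, Dum1, and Dum2
X = np.column_stack([np.ones(len(group)), var2, dum1, dum2])
coefs, *_ = np.linalg.lstsq(X, var1, rcond=None)

# Order: intercept, Var2, Dum1, Dum2
print(coefs.round(2))
```

The Dum1 and Dum2 coefficients come back as the built-in offsets of 3 and 5. In other words, each dummy coefficient is that group's difference from Group 1 after accounting for Var2.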

That’s all for dummy-coded regression in Python.  If you have any other questions or comments, please contact me at MHoward@SouthAlabama.edu!