Chi-Square Test in Python

Sometimes we are interested in determining whether the number of people in specified groups significantly differs. In these cases, it would be most appropriate to apply the chi-square statistical test. The current page provides a step-by-step guide to calculating a chi-square test in Python. As always, if you have any questions, please email me at MHoward@SouthAlabama.edu!

A chi-square test is used to determine whether the number of people in specified groups significantly differs.  So, a chi-square test could be used to answer questions that are similar to the following:

• Does the number of males and females differ in Dr. Howard’s class?
• Does the number of people significantly differ in geographic regions?
• Does the number of people differ across four training programs, across the four factories where those programs were applied, and across the combinations of training program and factory?

Now that we know what a chi-square test is used for, we can calculate one in Python. To begin, open your data in Python. If you don’t have a dataset, download the example dataset here. In the example dataset, we are simply comparing the number of people in two different grouping variables, each with three different groups. You can imagine that the groups are anything that you want.

Also, this dataset is in the .xlsx format, and the current guide requires the file to be in .csv format.  For this reason, you must convert this file from .xlsx format to .csv format before you can follow along using this dataset.  If you do not know how to do this, please visit my page on converting a file to .csv format.  While this page was written for R, you can follow the initial steps to convert .xlsx to .csv by using Excel alone. After converting the file, you can continue with this guide.

We are going to be using the scipy.stats, pingouin, and pandas modules. If you don’t know how to install modules, you can look at my guide for installing Python modules here. Likewise, you need to open your .csv file with Python. If you don’t know how to do so, you can look at my guide for opening .csv files in Python. In the current example, I named my dataset: MyData . Your initial code should look like the following:
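As a minimal sketch of that initial code, the block below imports the modules and opens a .csv file as MyData. The file name ChiSquareExample.csv and the stand-in data (27 rows, groups coded 1 to 3) are assumptions made so that the sketch runs on its own; replace them with the path to your converted file.

```python
import os
import tempfile

import pandas as pd
from scipy import stats
# import pingouin as pg  # uncomment once pingouin is installed; it is used later in this guide

# Hypothetical stand-in for the example dataset: 27 rows and two grouping
# variables with three groups each (the codes 1-3 are an assumption).
stand_in = pd.DataFrame({
    "Grouping1": [g1 for g1 in (1, 2, 3) for _ in range(9)],
    "Grouping2": [g2 for _ in range(9) for g2 in (1, 2, 3)],
})

# Hypothetical file name; point this at your own .csv file instead.
csv_path = os.path.join(tempfile.mkdtemp(), "ChiSquareExample.csv")
stand_in.to_csv(csv_path, index=False)

# Open the .csv file and name the dataset MyData, as in this guide.
MyData = pd.read_csv(csv_path)
print(MyData.head())
```

If you already have your own .csv file, only the import lines and the pd.read_csv line are needed.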

For this guide, we are going to conduct a chi-square test on the first grouping variable alone (Grouping1), the second grouping variable alone (Grouping2), and both of them together (chi-square test of independence).

To begin with a chi-square test on the first grouping variable, we need to make a list of our observed frequencies for Grouping1. I named mine by typing: Observed = . To make the list, you then want to type: list(MyData['Grouping1'].value_counts()) . Note that Python is case-sensitive, so the dataset name must be typed exactly as MyData. Press enter. This counts how many times each value appears in your specified variable and makes a list of the frequencies. If you want to see what the result looks like, just type in Observed and press enter. Otherwise, your result should look like the syntax below.
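Here is a small sketch of this step. The data frame is a hypothetical stand-in with nine observations in each of three groups (the group codes are an assumption), so that the counts match the totals described in this guide.

```python
import pandas as pd

# Hypothetical stand-in data: three groups with nine observations each.
MyData = pd.DataFrame({"Grouping1": [1] * 9 + [2] * 9 + [3] * 9})

# Count how many times each group appears and keep only the counts.
Observed = list(MyData['Grouping1'].value_counts())
print(Observed)
```

Keep in mind that value_counts() sorts by frequency, so the list holds the counts from the largest group to the smallest rather than in the order of the group labels.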

Now, we need to make a list of our expected frequencies. For the basic chi-square test, we are going to assume a uniform distribution. This means that we need to divide the total number of observations by the number of groups, and then we need to make a list with the resultant value for each group. You can find the total number of observations by typing: sum(Observed) . You can find the number of groups by typing: len(Observed) . For Grouping1, you can see that the total number of observations is 27 and the number of groups is 3. This means that we need to make a list with the number nine listed three times (27/3 = 9).

To make this list, we should name it Expected by typing: Expected = . Then, we can list the number nine three times by typing: [sum(Observed) / len(Observed), sum(Observed) / len(Observed), sum(Observed) / len(Observed)] . Each entry simply divides the number of observations by the number of groups, and the entry is written three times so that the list has one expected value per group. Once you have entered the syntax, press enter.
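The same list can be built without typing the division three times. This is only a sketch of an alternative; the Observed values below are the counts described in this guide, entered by hand so the example runs on its own.

```python
# Observed counts for Grouping1 as described in this guide.
Observed = [9, 9, 9]

# Under a uniform distribution, every group expects total / number of groups.
expected_value = sum(Observed) / len(Observed)

# Repeat that value once per group, however many groups there are.
Expected = [expected_value] * len(Observed)
print(Expected)  # [9.0, 9.0, 9.0]
```

Writing it this way also works unchanged if your own variable has more or fewer than three groups.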

Now, we can perform a chi-square test on the difference between the observed and expected frequencies. Type the command that we are going to use: stats.chisquare( . Then, specify your lists of observed and expected frequencies by typing: f_obs=Observed, f_exp=Expected) . Press Enter.
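Put together, the command looks like the sketch below. The two lists are entered by hand here, using the values described in this guide, so that the example runs on its own.

```python
from scipy import stats

# Observed and expected frequencies for Grouping1, as described in this guide.
Observed = [9, 9, 9]
Expected = [9.0, 9.0, 9.0]

# One-variable chi-square test comparing observed to expected frequencies.
result = stats.chisquare(f_obs=Observed, f_exp=Expected)
print(result.statistic, result.pvalue)
```

The first value is the chi-square statistic and the second is the p-value. If you leave out f_exp, stats.chisquare assumes that all groups are equally likely, which is the same uniform distribution used here.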

From the results, we can see that the chi-square statistic is 0. We can also see that the p-value is not statistically significant (p > .05). So, we would say that there is not a significant difference in the distribution of people within the groups of this grouping variable. In other words, there is a roughly equal distribution of people across each of the groups.

Now, try to run the same analysis for the second grouping variable. You should follow the same instructions, just change your variables to different names and be sure to reference Grouping2 instead of Grouping1. Start by typing: Observed2 = list(MyData['Grouping2'].value_counts()) .

Do the same for Expected2. This can be done by typing: Expected2 = [sum(Observed2) / len(Observed2), sum(Observed2) / len(Observed2), sum(Observed2) / len(Observed2)] .

Complete the second one-variable chi-square test by typing: stats.chisquare(f_obs=Observed2, f_exp=Expected2) .
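Because the steps are identical for both grouping variables, they could also be wrapped in a small function. The helper name uniform_chisquare is hypothetical (it is not part of scipy or pandas), and the data frame is a stand-in with nine observations per group, so treat this as a sketch rather than required code.

```python
import pandas as pd
from scipy import stats


def uniform_chisquare(data, column):
    """Hypothetical helper: chi-square test of one variable against a uniform distribution."""
    observed = list(data[column].value_counts())
    # Every group expects the same share of the total.
    expected = [sum(observed) / len(observed)] * len(observed)
    return stats.chisquare(f_obs=observed, f_exp=expected)


# Hypothetical stand-in data: 27 rows, three groups of nine in each variable.
MyData = pd.DataFrame({
    "Grouping1": [g1 for g1 in (1, 2, 3) for _ in range(9)],
    "Grouping2": [g2 for _ in range(9) for g2 in (1, 2, 3)],
})

result1 = uniform_chisquare(MyData, "Grouping1")
result2 = uniform_chisquare(MyData, "Grouping2")
print(result1.statistic, result1.pvalue)
print(result2.statistic, result2.pvalue)
```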

Again, from this result, we can see that the chi-square statistic is 0. We can also see that the p-value is not statistically significant (p > .05). So, we would say that there is not a significant difference in the distribution of people within the groups of the second grouping variable. In other words, there is a roughly equal distribution of people across each of the groups within the second grouping variable.

But there is one last thing we need to test – the interaction! If an interaction exists, the distribution of people depends on both grouping variables together. Or, the effect of one grouping variable on the distribution depends on the other – and vice versa.

Fortunately, this is much easier. To conduct this analysis, we are going to use the command: pg.chi2_independence( . We then specify our data and relevant variables by typing: MyData, x='Grouping1', y='Grouping2')

As you can see, there is a warning message. This command gives a warning when the number of observations in a cell (a combination of one Grouping1 group and one Grouping2 group) is very small.

When looking at the results, we can see that the chi-square statistic for the interaction term is 0. We can also see that the p-value for the interaction term is not statistically significant (p > .05). Therefore, there is not a significant interaction between the two grouping variables, and the effect of one does not depend on the other.
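If you would like to cross-check this result, the sketch below runs a chi-square test of independence with pandas and scipy instead of pingouin. This is a swapped-in alternative, not the command used in this guide, and the data frame is a hypothetical stand-in with three observations in each of the nine cells.

```python
import pandas as pd
from scipy import stats

# Hypothetical stand-in data: 27 rows, three observations per cell.
MyData = pd.DataFrame({
    "Grouping1": [g1 for g1 in (1, 2, 3) for _ in range(9)],
    "Grouping2": [g2 for _ in range(9) for g2 in (1, 2, 3)],
})

# Cross-tabulate the two grouping variables into a 3 x 3 table of counts.
table = pd.crosstab(MyData["Grouping1"], MyData["Grouping2"])
print(table)

# Chi-square test of independence on the table of counts.
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p, dof)

# The command from this guide, for comparison (requires pingouin):
# expected, observed, results = pg.chi2_independence(MyData, x='Grouping1', y='Grouping2')
```

Printing the table first is a handy way to see how many observations fall in each cell, which is what the warning message is about.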

From all that work, nothing was statistically significant... but I hope you at least learned how to calculate a chi-square test in Python. If you have any questions or comments, please email me at MHoward@SouthAlabama.edu!