We use the chi-square test of independence to determine whether there is a relationship between two categorical variables. For example, we could use it to determine whether there is a relationship between students’ gender and their preferred reading format.
The chi-square test compares the observed distribution of frequencies for the two variables with the distribution that we would expect if there were no relationship between them. For example, if there were no relationship between gender and preferred reading format, we would expect the proportion of male students who preferred print books to be similar to that of female students.
In this tutorial we show you how to conduct and interpret a chi-square test of independence in R, as well as how to report the results in APA style. As usual, we will be working with RStudio, a program that makes it easier to work with R.
The Data
The data frame for a chi-square test of independence must include, at minimum, two categorical variables. We start from the assumption that you have already imported or created a data frame in R. However, if you need help with this step, please see our tutorials on importing Excel, CSV and SPSS files into R, and our tutorial on manually entering data in R.
Our example data frame includes the categorical variables gender and format for 40 fictitious students – 20 males and 20 females. The first 12 records are illustrated below. We want to determine whether there is a relationship between their gender and their preferred reading format (print, ebooks or audiobooks).
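If you would like to follow along without importing a file, a data frame with the same structure can be built directly in R. The sketch below is only an illustration: the counts are invented for this purpose and are not the values in our example data, so any results you obtain from it will differ from those reported in this tutorial.

```r
# Hypothetical data: 20 female and 20 male students with made-up preferences.
# These values are NOT the tutorial's data; they only mimic its structure.
reading_format <- data.frame(
  gender = rep(c("female", "male"), each = 20),
  format = c(rep(c("audio", "ebook", "print"), times = c(3, 6, 11)),
             rep(c("audio", "ebook", "print"), times = c(10, 5, 5)))
)

# Store both variables as factors so that R treats them as categorical
reading_format$gender <- factor(reading_format$gender)
reading_format$format <- factor(reading_format$format)

# Inspect the structure and the first 12 records
str(reading_format)
head(reading_format, 12)
```

Each row represents one student and each column represents one variable, which is the layout assumed in the rest of this tutorial.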
Creating a Contingency Table
If your data frame is formatted like ours, where each column represents one of your variables, and each row represents a subject, it is important to create a contingency table before conducting the chi-square test in R.
A contingency table displays the relationship between two or more categorical variables. Each row in the table represents one group of one of the variables (e.g., the male group of the gender variable). Each column represents one group of the other variable (e.g., the audio group of the format variable). The cells display the count for that combination of groups (e.g., males whose preferred reading format is audiobooks).
We can create a contingency table for our two variables in R as follows:
contingencytable <- xtabs(~ x + y, data = dataframe)
contingencytable
Replace the placeholder names in this command as outlined below:
- contingencytable: the name you give to your contingency table in R. We use this name throughout the tutorial, but you could use a different name.
- dataframe: the data frame that contains both of your categorical variables (our data frame is called reading_format)
- x and y: the names of the two categorical variables in your chi-square test (gender and format in our example). It doesn’t matter which variable is x and which is y.
The command for our example is:
contingencytable <- xtabs(~ gender + format, data = reading_format)
contingencytable
… and here is our table:
We have an equal number of male and female students in our fictitious study. So, if preferred reading formats did not differ based on gender, we would expect similar numbers of male and female students to prefer each book format. However, our contingency table shows that males are more likely than females to prefer audiobooks, while females are more likely than males to prefer print books. We need to conduct a chi-square test of independence to determine whether these differences are significant.
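Comparing groups is often easier with totals and proportions than with raw counts. The following sketch uses the base R functions addmargins() and prop.table() on a hypothetical contingency table. The counts are invented for illustration and are not the values from our example.

```r
# Hypothetical counts for illustration only (not the tutorial's data)
contingencytable <- as.table(matrix(
  c( 3, 6, 11,
    10, 5,  5),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"),
                  format = c("audio", "ebook", "print"))
))

# Add row and column totals to the table of counts
addmargins(contingencytable)

# Proportion of each gender that prefers each format (each row sums to 1)
row_props <- prop.table(contingencytable, margin = 1)
round(row_props, 2)
```

With margin = 1 the proportions are calculated within each row, so each gender's preferences can be compared directly even when the groups differ in size.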
Chi-Square Test of Independence Assumptions
The assumptions of the chi-square test of independence are:
- Both of your variables are categorical. Categorical variables include things like gender, race, and marital status.
- The groups in each of your categorical variables are mutually exclusive. For example, students’ preferred book format cannot be both print books and audiobooks.
- Independence of observations. None of the observations in your data set is influenced by any of the other observations.
- The expected frequency in each cell of your contingency table is five or more. That is, if there were no relationship between your two variables (the null hypothesis), you would expect a frequency of at least five for each combination of the two variables. We can generate this contingency table of expected frequencies as follows:
chisq.test(contingencytable)$expected
The expected frequencies for our example are all five or more. If your data violates this assumption, you should conduct a different test such as Fisher’s exact test.
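To make the expected frequencies less of a black box, the sketch below calculates them by hand as (row total × column total) / grand total and compares them with the values returned by chisq.test(). It then checks the "five or more" rule on a second table with small counts and falls back to Fisher's exact test with the base R function fisher.test(). Both tables contain invented counts, and the object names small_table, expected_by_hand and fisher_result are our own choices for this illustration.

```r
# Hypothetical counts for illustration only (not the tutorial's data)
contingencytable <- as.table(matrix(
  c( 3, 6, 11,
    10, 5,  5),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"),
                  format = c("audio", "ebook", "print"))
))

# Expected count for a cell = (row total * column total) / grand total
expected_by_hand <- outer(rowSums(contingencytable),
                          colSums(contingencytable)) / sum(contingencytable)
expected_by_hand

# The same values as reported by chisq.test()
expected <- chisq.test(contingencytable)$expected

# Rule of thumb: every expected count should be five or more
all(expected >= 5)

# A second hypothetical table with small counts, where the rule fails
small_table <- as.table(matrix(
  c(1, 4, 5,
    6, 2, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"),
                  format = c("audio", "ebook", "print"))
))

# chisq.test() warns when expected counts are small; we only need the counts
small_expected <- suppressWarnings(chisq.test(small_table))$expected

if (any(small_expected < 5)) {
  # Assumption violated: use Fisher's exact test instead
  fisher_result <- fisher.test(small_table)
  print(fisher_result)
}
```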
The Chi-Square Test of Independence in R
We conduct the chi-square test in R as follows:
chisq.test(contingencytable)
The results for our example are:
By default, R applies Yates’ continuity correction to 2 x 2 contingency tables (i.e., when both variables contain only two categories each). The aim of this correction is to adjust the chi-square statistic to prevent the overestimation of statistical significance. As Hitchcock (2009) explains, there are differing opinions about this correction, with some arguing that it is too conservative. If you do not want R to apply Yates’ continuity correction, add the argument correct = FALSE to the chisq.test function as follows:
chisq.test(contingencytable, correct = FALSE)
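To see what the correction does in practice, the sketch below runs the test with and without it on a hypothetical 2 x 2 table. The counts and the object names (two_by_two, with_correction, without_correction) are invented for this illustration and are unrelated to our example data.

```r
# Hypothetical 2 x 2 table for illustration only (not the tutorial's data)
two_by_two <- as.table(matrix(
  c(12,  8,
     5, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"),
                  format = c("print", "audio"))
))

# correct = TRUE is the default, so Yates' correction is applied here
with_correction <- chisq.test(two_by_two)

# Same test without the continuity correction
without_correction <- chisq.test(two_by_two, correct = FALSE)

# Compare the chi-square statistics and p values side by side
data.frame(
  test      = c("with Yates' correction", "without correction"),
  statistic = c(unname(with_correction$statistic),
                unname(without_correction$statistic)),
  p_value   = c(with_correction$p.value, without_correction$p.value)
)
```

The corrected test produces a smaller chi-square statistic and a larger p value, which is why some consider it too conservative.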
Interpreting Chi-Square Test of Independence Results in R
The most important part of the chi-square test results is the p value:
Your test is significant if the p value is less than or equal to the alpha level you have set for it. Researchers typically set alpha levels of .05 or .01. The p value for our example is 0.03063. Since this is less than our selected alpha level of .05, our test is significant. In other words, there does appear to be a relationship between students’ gender and their preferred reading format.
In contrast, you can’t conclude that there is a relationship between your two variables if your p value is greater than your selected alpha level. Since the p value for our example is 0.03063, we would not conclude that there was a relationship between students’ gender and their preferred reading format if we had set an alpha level of .01.
It is important to note that a significant chi-square test does not demonstrate that there is a causal relationship between two variables.
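Rather than reading the p value off the printed output, you can store the test result and extract its components. The sketch below does this for a hypothetical contingency table and compares the p value with a chosen alpha level. The counts are invented for illustration, so the values will not match those of our example. The names result, alpha and p_manual are our own.

```r
# Hypothetical counts for illustration only (not the tutorial's data)
contingencytable <- as.table(matrix(
  c( 3, 6, 11,
    10, 5,  5),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"),
                  format = c("audio", "ebook", "print"))
))

result <- chisq.test(contingencytable)
alpha <- 0.05  # the alpha level chosen for this test

result$statistic  # chi-square statistic (labelled X-squared by R)
result$parameter  # degrees of freedom
result$p.value    # p value

# The p value is the upper-tail probability of the chi-square distribution
p_manual <- pchisq(result$statistic, df = result$parameter, lower.tail = FALSE)

# Compare the p value with the alpha level
if (result$p.value <= alpha) {
  message("Significant at alpha = ", alpha)
} else {
  message("Not significant at alpha = ", alpha)
}
```

Changing alpha to 0.01 shows how the same p value can lead to a different conclusion under a stricter alpha level.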
Reporting a Chi-Square Test of Independence in R
We can report the results of our chi-square test of independence in APA Style as follows:
A chi-square test of independence was performed to evaluate the relationship between students’ gender and their preferred reading format. The relationship between these variables was significant, χ²(2, N = 40) = 6.97, p = .031.
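If you report many tests, you may prefer to assemble the statistics string in R instead of copying the numbers by hand. The sketch below is one possible approach rather than an official APA tool. It uses a hypothetical contingency table with invented counts, and the names result, p_text and apa_text are our own.

```r
# Hypothetical counts for illustration only (not the tutorial's data)
contingencytable <- as.table(matrix(
  c( 3, 6, 11,
    10, 5,  5),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"),
                  format = c("audio", "ebook", "print"))
))

result <- chisq.test(contingencytable)

# APA style drops the leading zero of p and reports very small values as < .001
p_text <- if (result$p.value < 0.001) {
  "< .001"
} else {
  paste("=", sub("^0", "", sprintf("%.3f", result$p.value)))
}

# Assemble: chi-square symbol, (df, N = sample size) = statistic, p value
# "\u03c7\u00b2" is the Unicode escape for the chi-square symbol
apa_text <- sprintf("\u03c7\u00b2(%d, N = %d) = %.2f, p %s",
                    as.integer(result$parameter),
                    as.integer(sum(contingencytable)),
                    result$statistic,
                    p_text)
apa_text
```

The sample size N is taken as the sum of all cells in the contingency table, and the degrees of freedom come from the parameter element of the test result.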
***************
That’s it for this tutorial. You should now be able to conduct and interpret a chi-square test of independence in R, and write up the results of your test in APA style.
***************