Linear regression with a categorical variable
When to use this? When you have a categorical independent variable with more than two categories/levels and you want to use linear regression.
Every categorical variable has to be entered into the regression as a set of dummy variables. Each dummy variable represents one category of the independent variable and is coded 1 if the case falls into that category and 0 if it does not. So, we first create one dummy variable for each of the categories.
- Step: Check whether there are any values to exclude from the analysis. Check the Frequency table; if there are any, set those values to System missing in the Variable View window.
- Step: Recode every single category into a dummy variable (1 if the case falls into the category, 0 if it does not).
- Step: Select one category as the baseline category, the category against which we will compare all the other categories. The baseline category is left out of the regression. Any one of the categories can be set as the base case / baseline / omitted category.
- Step: To perform the linear regression go to Analyze – Regression – Linear. Include the dependent variable and all the dummy variables except the one for the baseline/omitted category. So, the number of dummy variables included in the regression is always one less than the number of categories.
- Step: Click Paste, then run the command.
- Step: Interpret the tables in the Output.
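The recoding and baseline steps above can be sketched outside SPSS as follows. This is only a minimal illustration in Python: the function name make_dummies and the sample answers are invented for the example and do not come from the dataset.

```python
def make_dummies(values, baseline):
    """Return one 0/1 dummy column per category, leaving out the baseline."""
    categories = sorted(set(values))
    if baseline not in categories:
        raise ValueError(f"baseline {baseline!r} is not among the categories")
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != baseline
    }


# Hypothetical answers to a four-category question (not real survey data).
answers = ["very happy", "rather happy", "not very happy",
           "rather happy", "not at all happy"]
dummies = make_dummies(answers, baseline="very happy")

# One column per non-baseline category: four categories give three dummies.
for name, column in dummies.items():
    print(name, column)
```

A case from the baseline category is then recognisable as the one with 0 on every dummy variable.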
How to interpret the values?
First of all check the p value (Sig.). If it is above 0.05, we say that the result is not statistically significant. This means that the difference we see could plausibly have occurred by chance alone, so we do not interpret it further.
- b0 (the constant): gives the average level of the dependent variable for the omitted category, which is the category coded with 0 on all dummy variables. This becomes the reference point with which all other categories are compared.
- The other coefficients represent the difference in the average level of the dependent variable between the category coded with 1 on one particular dummy variable and the omitted category.
- b1: the difference between the mean for the category coded as 1 in this independent variable and the mean for the omitted category
- b2: the difference between the mean for the category coded as 1 in this independent variable and the mean for the omitted category
- b3: the difference between the mean for this category (coded with 1) and the mean for the omitted category.
- If there are b4, b5, b6, the same logic applies.
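To see why the coefficients can be read this way, here is a small sketch that fits the regression by ordinary least squares on made-up scores and compares the result with the group means. The data, the category names A, B, C and the helper functions solve and ols are all assumptions for illustration; SPSS does this computation for you.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    x = [[1.0] + list(r) for r in rows]  # the leading 1 is the constant term
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


def mean(values):
    return sum(values) / len(values)


# Hypothetical scores for three categories; category A is the omitted one.
groups = {"A": [6, 7, 8], "B": [5, 6, 4], "C": [3, 4, 2]}
rows, y = [], []
for cat, scores in groups.items():
    for s in scores:
        rows.append((1 if cat == "B" else 0, 1 if cat == "C" else 0))
        y.append(s)

b0, b1, b2 = ols(rows, y)
print(round(b0, 3), mean(groups["A"]))                      # constant vs. mean of A
print(round(b1, 3), mean(groups["B"]) - mean(groups["A"]))  # b1 vs. mean(B) - mean(A)
print(round(b2, 3), mean(groups["C"]) - mean(groups["A"]))  # b2 vs. mean(C) - mean(A)
```

In each printed pair the two numbers agree, which is the reading given in the list above: the constant is the mean of the omitted category and every other coefficient is a difference from that mean.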
Example 1: happiness
Research question: Does the feeling of happiness have an effect on satisfaction with the financial situation of a household?
Hypothesis: Those who consider themselves very happy are more satisfied with the financial situation of their household than those who consider themselves rather happy.
In this case the reference category is “very happy”, so the constant refers to the very happy category.
Those who feel very happy score 6.322 on average in terms of satisfaction with the financial situation of the household (p<0.001).
Those who feel rather happy score on average 0.858 less than those who feel very happy in terms of satisfaction with the financial situation of the household (p<0.001).
Those who do not feel very happy score on average 1.686 less than those who feel very happy in terms of satisfaction with the financial situation of the household (p<0.001).
Those who do not feel at all happy score on average 2.151 less than those who feel very happy in terms of satisfaction with the financial situation of the household (p<0.001).
In conclusion, the result supports my hypothesis stating that “Those who consider themselves very happy are more satisfied with the financial situation of their household than those who consider themselves rather happy.” (p<0.001)
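Because every coefficient is a difference from the constant, the average score of each category can be recovered by adding the coefficient to the constant. A short sketch using the values reported above (the variable names are my own):

```python
# Constant and coefficients as reported for Example 1.
constant = 6.322  # average for the omitted category, "very happy"
differences = {
    "rather happy": -0.858,
    "not very happy": -1.686,
    "not at all happy": -2.151,
}

# Average for each category = constant + its coefficient.
implied_means = {"very happy": constant}
for category, b in differences.items():
    implied_means[category] = round(constant + b, 3)

for category, value in implied_means.items():
    print(f"{category}: {value:.3f}")
```

So, for example, the rather happy group averages 6.322 - 0.858 = 5.464.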
Example 2: social class
Research question: Does social class have an effect on the financial situation of the household in India?
- Step: We have data from many countries in the database. We only want to analyze the respondents from India. Data – Select Cases – If condition is satisfied – V2=356
- We need to check our variables to see if there is anything that we should code as system missing. Analyze – Descriptive Statistics – Frequencies
- We create our dummy variables for every single category of the independent variable. Transform – Recode into Different Variables
- Check if the recoding is correct.
- Select the category you want to use as the reference category.
- Run the linear regression with the following variables: dependent variable and all dummy variables except the reference category
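The first two preparation steps (keeping only the respondents from India and setting invalid answers to missing) amount to a filter and a recode. A rough sketch, assuming invented variable names social_class and satisfaction and assuming that invalid answers carry negative codes; check the Frequency table of your own data for the actual codes:

```python
INDIA = 356  # country code from the Select Cases condition V2=356

# Hypothetical records; "social_class" and "satisfaction" are invented names,
# and the negative codes are an assumed convention for invalid answers.
records = [
    {"V2": 356, "social_class": 1, "satisfaction": 7},
    {"V2": 276, "social_class": 2, "satisfaction": 5},
    {"V2": 356, "social_class": -1, "satisfaction": 4},
    {"V2": 356, "social_class": 5, "satisfaction": 3},
]
INVALID_CODES = {-1, -2}


def clean(record):
    """Copy a record, replacing invalid codes with None (stands in for System missing)."""
    return {k: (None if v in INVALID_CODES else v) for k, v in record.items()}


# Keep only the Indian respondents, then mark invalid answers as missing.
india = [clean(r) for r in records if r["V2"] == INDIA]

# Only cases without missing values can enter the regression.
complete = [r for r in india if None not in r.values()]
print(len(india), len(complete))
```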
In this case the reference category is the upper class, so the constant refers to the upper class.
If the p value is less than 0.05, the relationship is statistically significant.
Constant: this refers to the omitted category. So, on average, respondents who believe that they are part of the upper class score 5.458 in terms of satisfaction with the financial situation of the household, on a scale from 1 (completely dissatisfied) to 10 (completely satisfied).
People who think they belong to the upper middle class score 0.639 points higher than those who think they belong to the upper class. Thus, people who think they belong to the upper middle class are on average more satisfied with the financial situation of their household than those of the upper class. (p<0.001)
Those who see themselves as part of the lower middle class (p=0.079) or of the working class (p=0.447) do not differ significantly from those who belong to the upper class in their satisfaction with the financial situation of the household.
Those who view themselves as part of the lower class score on average 0.504 less than those of the upper class. Thus, people from the lower class feel significantly less satisfied with the financial situation of their household than those of the upper class (p=0.002).
Example 3: health status
1. Step: Create a hypothesis: Those who have a very good health status are significantly more satisfied with their financial situation of household than those who have a good health status.
Note: When you are writing your own research, at this point you have to check the assumptions of the regression, and you might have to filter your data for a certain group of people. In this example, I did not filter the data.
2. Step: Create the dummy variables: Recode the categorical variable into dummy variables. Transform – Create Dummy Variables.
Slide the variable that you want to recode as a dummy variable into the “Create Dummy Variables for:” box and give a name to your new dummy variable in the “Root Names (One Per Selected Variable)” box. In this case: health_status
Note: It is important to mention that this method is not always a good choice. If your main goal is to recode the variable in a different way, use Transform – Recode into Different Variables. Example: if you have 5 categories and you want to recode 1-4 to 1 and 5 to 0, use Recode into Different Variables.
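The recoding mentioned in the note (1-4 to 1 and 5 to 0) is a simple mapping. A minimal sketch with made-up values; the function name recode is only illustrative:

```python
def recode(value):
    """Recode categories 1-4 to 1 and category 5 to 0; anything else becomes missing."""
    if value in (1, 2, 3, 4):
        return 1
    if value == 5:
        return 0
    return None  # stands in for System missing


original = [1, 3, 5, 4, 5, 2]  # hypothetical category codes
recoded = [recode(v) for v in original]
print(recoded)  # [1, 1, 0, 1, 0, 1]
```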
3. Step: Select the omitted category. I selected “very good” as the reference category because in my hypothesis I want to test if those who perceive their health status as very good are significantly more satisfied than those who perceive their health status as good.
4. Step: Run the regression: Analyze – Regression – Linear
Dependent: satisfaction
Independent: Good, Fair, Poor
You only want to analyze the valid answers, so the dummy variables for “don’t know” and similar answers should not be put in the regression. When you have a categorical variable with more than two categories, you have to leave one category out; that is the omitted category, in this case “very good”.
Note: The hypothesis has to be in line with the variables introduced in the regression. The omitted category is the one to which you want to compare the other categories. In this case we want to compare the good health status to the very good health status, so the very good health status is the omitted category.
5. Step: Interpret the Coefficients table: b0, b1, b2, … check the p values
To keep the example simple, in the following I will refer to the satisfaction with the financial situation of the household as satisfaction and to the subjective state of health as health status. (In social sciences and during the assignments you should always refer to the exact terms, since these terms do not measure the same things.)
At the end of every statement write down in parentheses the level at which your finding is significant. If the p value is shown as 0.000, we usually write p<0.001; if the p value is not 0.000 but is less than 0.05, write (p<0.05); and if the p value is above 0.05, write the exact value of p, for example (p=0.457).
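The reporting rule above can be written down as a small function. This is a sketch of the rule as stated here, not a general standard; the name format_p is my own, and the output uses a decimal point.

```python
def format_p(p):
    """Format a p value following the reporting rule described above."""
    if f"{p:.3f}" == "0.000":  # the Sig. column would show 0.000
        return "p<0.001"
    if p < 0.05:
        return "p<0.05"
    return f"p={p:.3f}"  # not significant: report the exact value


for p in (0.0002, 0.002, 0.457):
    print(p, "->", format_p(p))
```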
On average those whose health status is very good score 6.566 in terms of satisfaction. (p<0.001)
On average those whose health status is good score 0.499 less than those whose health status is very good in terms of satisfaction. (p<0.001)
On average those whose health status is fair score 1.195 less than those whose health status is very good in terms of satisfaction. (p<0.001)
On average those whose health status is poor score 2.232 less than those whose health status is very good in terms of satisfaction. (p<0.001)
6. Step: Write down: Did the result support or refute your hypothesis?
In this section always repeat your previous hypothesis and state if the result supports or refutes your hypothesis. At the end of the sentence write the level of significance (p value).
In conclusion, the result supports my hypothesis stating that “Those who have a very good health status are significantly more satisfied with their financial situation of household than those who have a good health status.” (p<0.001)
We got to this conclusion because of the b1 coefficient and its p value. The coefficient b1 (Good) is -0.499, which means that those whose health status is good score 0.499 less in terms of satisfaction than those whose health status is very good. This result is significant (p<0.001), so it is unlikely to be due to chance.
Example 4: marital status
1. Step: Create a hypothesis: Those who are married are more satisfied with their financial situation of the household than those who are divorced.
Note: When you are writing your own research, at this point you have to check the assumptions of the regression, and you might have to filter your data for a certain group of people. In this example, I did not filter the data.
2. Step: Create the dummy variables: Recode the categorical variable into dummy variables. Transform – Create Dummy Variables.
Slide the variable that you want to recode as a dummy variable into the “Create Dummy Variables for:” box and give a name to your new dummy variable in the “Root Names (One Per Selected Variable)” box. In this case: marital_status
Note: It is important to mention that this method is not always a good choice. If your main goal is to recode the variable in a different way, use Transform – Recode into Different Variables. Example: if you have 5 categories and you want to recode 1-4 to 1 and 5 to 0, use Recode into Different Variables.
3. Step: Select the omitted category. I selected “married” as the reference category since in my hypothesis I want to test if those who are married are significantly more satisfied with the financial situation of their household than those who are divorced.
Note: The hypothesis has to be in line with the variables introduced in the regression. The omitted category is the one to which you want to compare the other categories. In this case, we want to compare those who are divorced to those who are married, so the married category is the omitted category.
4. Step: Run the regression: Analyze – Regression – Linear
Dependent: satisfaction
Independent: living together as married, divorced, separated, widowed, single
You only want to analyze the valid answers, so the dummy variables for “don’t know” and similar answers should not be put in the regression. When you have a categorical variable with more than two categories, you have to leave one category out; that is the omitted category, in this case “married”.
5. Step: Interpret the Coefficients table: b0, b1, b2, … check the p values
b0: On average married people score 6.004 in terms of satisfaction.
b1: At first glance, those who are living together as married score 0.022 higher in terms of satisfaction with the financial situation of their household than those who are married. Since the p value is above 0.05, this result might be due to chance. So, we cannot state that those who are living together are more satisfied with their financial situation than those who are married. (p=0.512)
b2: Those who are divorced score 0.729 lower than those who are married in terms of satisfaction with the financial situation of their household. (p<0.001)
b3: Those who are separated score 0.594 lower than those who are married in terms of satisfaction with the financial situation of their household. (p<0.001)
b4: Those who are widowed score 0.686 lower than those who are married in terms of satisfaction with the financial situation of their household. (p<0.001)
b5: At first glance, it seems that those who are single score 0.023 higher than those who are married in terms of satisfaction with the financial situation of their household, but because the p value is above 0.05 this result might be due to chance. Thus, we cannot state that those who are single are more satisfied with the financial situation of their household than those who are married. (p=0.236)
6. Step: Write down: Did the result support or refute your hypothesis?
In this section always repeat your previous hypothesis and state if the result supports or refutes your hypothesis. At the end of the sentence write the level of significance (p value).
In conclusion, the result supports my hypothesis stating that “Those who are married are more satisfied with their financial situation of the household than those who are divorced.” (p<0.001)
We got to this conclusion because of the b2 coefficient and its p value. The coefficient b2 (Divorced) is -0.729, which means that those who are divorced score 0.729 lower in terms of satisfaction than those who are married. This result is significant (p<0.001), so it is unlikely to be due to chance.