# Control variable in linear regression

A control variable is a variable that is held constant in a statistical analysis. It is used to reduce the effect of confounding variables, which can interfere with the relationship between the independent variable and dependent variable.

For example, if you want to study the relationship between exercise and weight loss, you might include age and gender as control variables. This is because age and gender can influence weight loss, and by including them as control variables, you can isolate the effect of exercise on weight loss.

To use a control variable in SPSS, you need to include it as an independent variable in your analysis along with the variables you are interested in studying. Then, when you run your statistical tests, you can assess the effect of the independent variables while controlling for the effect of the control variables.
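Outside SPSS, the same idea can be sketched in a few lines of Python. This is a minimal illustration, not the procedure SPSS uses internally: the helper `ols`, the variable names `group`, `control` and `outcome`, and the simulated data are all assumptions made for this example. It fits one regression without the control variable and one with it, so the two b1 values can be compared.

```python
import random

def ols(predictors, y):
    """Return [b0, b1, ...] from ordinary least squares via the normal equations."""
    n = len(y)
    # Design matrix with a leading column of ones for the intercept b0.
    X = [[1.0] + [p[i] for p in predictors] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Simulated data (an assumption for illustration): the control drives the
# outcome, the group itself has no effect, but the groups differ on the control.
random.seed(0)
group = [i % 2 for i in range(200)]                      # 0/1 independent variable
control = [random.gauss(11 + 4 * g, 2) for g in group]   # control variable
outcome = [20 + 4 * c + random.gauss(0, 5) for c in control]

without_control = ols([group], outcome)         # [b0, b1]
with_control = ols([group, control], outcome)   # [b0, b1, b2]

print("b1 without the control:", round(without_control[1], 2))
print("b1 with the control:   ", round(with_control[1], 2))
print("b2 (control):          ", round(with_control[2], 2))
```

In this simulated setup the group has no effect of its own, so b1 should be large without the control and close to zero once the control is included.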

## When to use a control variable?

Use a control variable when you want to address alternative explanations by removing confounding effects, or when you want to improve efficiency by eliminating some kind of distortion.

**Use a control variable:**

- If the two independent variables correlate with each other and they both have an effect on the dependent variable.
- When the variable has an effect on the dependent variable.

**Another thing that you have to take into consideration:**

1. The **time** order of the variables: if the new variable precedes the original independent variable in time, it is reasonable to include it in the regression.

2. If the new variable follows the original variable in time, it is not reasonable to include the new variable in the model.

Example: what does it mean that one variable follows the other in time? Starting from birth, a student first has a father who completed x years of school (PEDUCATION), and only later in life does the student join a class (CLASS). Put simply: first you have a father, then you get into a class. So the CLASS variable follows the PEDUCATION variable in time.

**Note:** There are two cases in which it is still reasonable to include a new independent variable that follows the original variable in time, because you want to test an explanation:

**1. The change in b1 supports the alternative explanation:**

Dependent: test result. Independent: class, peducation

Hypothesis: The new teaching method is much more effective than the old one.

**Alternative explanation:** The educational background of the parents has an effect on the students' test scores, so we include a new control variable called PEDUCATION.

In the second regression, which includes both variables, b1 is no longer significant. This supports the alternative explanation. (This example is worked through in detail in the Example section.)

**2. There is no change in b1 and the results still support the alternative explanation:**

Hypothesis: In the south-east region of Hungary people live in a “subculture” where suicide is widely accepted as a norm (theory of socialization), and this is why the suicide rate in this region is higher than in other regions of Hungary.

- Regression 1: Dependent: suicide rate; Independent: region
- Regression 2: Dependent: suicide rate; Independent: region; Control variable: geographical mobility

**Alternative explanation:** The geographical mobility of people living in the south-east region is higher than in other regions, and this high level of geographical mobility has an effect on the suicide rate. People who move away from their original hometowns face uncertainty in their new life circumstances, and as strangers they do not find supportive relationships. This results in stress that they cannot cope with.

After introducing geographical mobility as a control variable, b1 decreased only slightly and remained significant, and geographical mobility is also significant. The results therefore support our hypothesis, and the alternative explanation as well.

In the second regression, besides the original two variables (suicide rate, region), we included one more independent variable: geographical mobility. Before we included the control variable, region had an effect on the suicide rate, and after including it the effect of region on the suicide rate was still significant: **the original effect did not disappear**. The effect of the control variable, geographical mobility, was also significant. (Thus, both independent variables have an effect on the suicide rate.)

**In conclusion,** the results support our alternative explanation that geographical mobility has an effect on the suicide rate. People living in the south-east region are more prone to commit suicide not only because of the social or cultural meanings attributed to suicide (socialization: through generations they have learned that this is the right way to cope with stress), but also because they have higher geographical mobility. Thus, geographical mobility increases the risk of suicide, and living in the south-east region of Hungary also increases this risk.

**Do not introduce the variable**

**An independent variable** (the control variable) can only distort the effect that another independent variable has on the dependent variable if it correlates with the other independent variable and it also has an effect on the dependent variable.

So, when the new independent variable correlates with the other independent variable, but it does not have an effect on the dependent variable then we do not have to include it in our model.
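This rule can also be written as a formula. The following is a sketch, assuming one independent variable X, one candidate control variable Z and a dependent variable Y; these symbols and the superscripts are introduced here only for illustration:

$$b_1^{(1)} = b_1^{(2)} + b_2^{(2)} \cdot d$$

Here $b_1^{(1)}$ is the coefficient of X when Z is left out of the regression, $b_1^{(2)}$ and $b_2^{(2)}$ are the coefficients of X and Z when both are included, and $d$ is the slope obtained by regressing Z on X. The distortion term $b_2^{(2)} \cdot d$ is zero whenever Z has no effect on Y ($b_2^{(2)} = 0$) or Z does not correlate with X ($d = 0$), which is why both conditions have to hold before the control variable can change anything.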


**Always raise the question**

Are there any other alternative explanations that you can think of besides your original explanation?

## Example

We will see an example of how to check an alternative explanation by using a control variable.

Example: Our main goal is to find out if a new teaching method is more effective than the old one or not.

**Hypothesis:** The new teaching method is more effective than the old one.

We have responses from students from two different classes. In one class, called the experiment group and coded as 1, a new teaching method was introduced. In the other class, called the control group and coded as 0, this teaching method was not introduced; the students still learn with the old method. At the end of the year, to assess their performance, we asked the students to take a test with a maximum score of 100, and we compared the two groups.

**Step 1:** introducing the first independent variable

Dependent: TEST_RESULT (achievement scores)

Independent: CLASS

**Interpretation of the first coefficients table**

b1: Those who study with the new teaching method score 16.567 points higher on average than those who study with the old method (p < 0.05). The p value shows that this result is unlikely to be due to chance, but can we be sure that the new teaching method is really what produces it? That is, did the students who learned with the new method get significantly better results because of the method itself?

**Alternative explanation:** What if the parents of the students who learned with the new teaching method are more educated than the parents of those who studied in the other class? More educated parents could help their children more in learning the class material, and that could be why their children achieved better scores on the test than the students in the control group.

**Question:** What if the results (b1 and its p value) that we see in our coefficients table are due to this alternative explanation, and not to the new teaching method being more effective than the old one? Is it the educational background of the parents or the new teaching method that explains the difference in the students' achievement scores?

**The solution:** We need to remove the correlation between the two independent variables (CLASS, which refers to the new and old teaching method, and the educational background of the parents). If we remove this correlation, the disturbing factor between our independent variable and dependent variable disappears. Thus, we need to include PEDUCATION in our regression as a control variable.

**Step 2:** introducing the second independent variable

We introduce a new independent variable, PEDUCATION, which refers to the educational background of the parents, more specifically the father's educational background, measured in years.

Dependent: test result

Independent: class, peducation

By using a control variable, we can hold this disturbing factor (our control variable) at a constant level. In this way, we break the systematic relationship between the disturbing variable (PEDUCATION) and the main independent variable (CLASS) by turning the former into a constant. This works because a constant cannot correlate with anything. Holding the educational background constant means that, hypothetically, the parents in both classes have the same educational background.

This can be achieved by including the disturbing variable as an additional independent variable in the analysis. This way we make the two classes identical on average PEDUCATION. If we still find that the class using the new method performs better on the test, this can no longer be explained by differences in PEDUCATION, simply because there is no difference: the average level of PEDUCATION is now the same. In this way we can test our alternative explanation.

**How do we interpret b1?**

$$\hat{Y} = b_0 + b_1 \cdot \text{CLASS} + b_2 \cdot \text{PEDUCATION}$$

How much is the average of the test result in the control group – in the class where the students studied with the old method (category 0)? To answer the question you need to substitute the respective value of the class (0) and you have to calculate the average of the other independent variable (PEDUCATION) in the control group.

Equation 1:

$$\hat{Y}_{CO} = b_0 + b_2 \cdot \text{PEDUCATION}_{CO}$$

CO means in the control group.

$\hat{Y}_{CO}$: the average test result in the control group

$\text{PEDUCATION}_{CO}$: the average educational background of the father in the control group

What is the average test result in the experiment group (the class where the students studied with the new method) if the education of the parents were the same as in the control group?

We substitute the respective value of the class (1), and we replace the actual average of PEDUCATION with the average of the control group.

Equation 2:

$$\hat{Y}_{EX} = b_0 + b_1 + b_2 \cdot \text{PEDUCATION}_{CO}$$

EX means in the experiment group.

$\hat{Y}_{EX}$: the average test result in the experiment group in the hypothetical case where the educational background of the father is the same in the experiment group and in the control group.

Let's subtract equation 1 from equation 2.

$$\hat{Y}_{EX} - \hat{Y}_{CO} = (b_0 + b_1 + b_2 \cdot \text{PEDUCATION}_{CO}) - (b_0 + b_2 \cdot \text{PEDUCATION}_{CO}) = b_1$$

As we see, the difference between $\hat{Y}_{EX}$ and $\hat{Y}_{CO}$ is equal to $b_1$.

**b1:** is the difference between the average test results of the two groups (experiment group and control group) in a hypothetical case where the parents' educational background is the same in the two groups.

**b1 (dummy): the difference in the average level of the dependent variable between the two groups in a hypothetical case where there is no difference between the groups on the other independent variable.**

So, this is a hypothetical case in which we hold the effect of the parents' educational background fixed at a constant level. b1 shows the effect of the new teaching method that is independent of the educational background of the parents. b1 is also called the net effect.

**How do we interpret b2?**

In the interpretation of the first variable we imagined that there is no difference between the two groups. In reality there is one: the parents of the students in the two classes have different average educational backgrounds, and b2 with its p value shows that this background has a significant effect on the test result.

So, instead of equation 2, we have to use the following equation:

Equation 3:

$$\hat{Y}_{EX} = b_0 + b_1 + b_2 \cdot \text{PEDUCATION}_{EX}$$

$\text{PEDUCATION}_{EX}$: the average educational background of the fathers in the experiment group

$\hat{Y}_{EX}$: the average test score of the students in this class

Let's subtract equation 1 from equation 3.

$$\hat{Y}_{EX} - \hat{Y}_{CO} = (b_0 + b_1 + b_2 \cdot \text{PEDUCATION}_{EX}) - (b_0 + b_2 \cdot \text{PEDUCATION}_{CO}) = b_1 + b_2 \cdot \text{PEDUCATION}_{EX} - b_2 \cdot \text{PEDUCATION}_{CO} = b_1 + b_2 (\text{PEDUCATION}_{EX} - \text{PEDUCATION}_{CO})$$

As we see, the difference between the two classes is made up of two parts. The first part is $b_1$, the net effect of CLASS, which refers to the introduction of the teaching method, independently of the effect of the father's educational background. The second part is $b_2 (\text{PEDUCATION}_{EX} - \text{PEDUCATION}_{CO})$, the bias caused by the different educational backgrounds.

The two parts together give the regression coefficient obtained when only one independent variable (CLASS) is included in the regression (see the first regression).

Thus, the b1 of the first regression contains both the net effect (or true effect) and the bias.

So, in our first regression b1, the effect of the class, was 16.57, while in the second regression, where we have two independent variables, b1, the true/net effect, is 0.01.

$b_2$: the effect of the parents' education on the test result is 4.14. With CLASS held constant, each additional year of the father's education goes with a test result that is on average 4.14 points higher. Multiplying it by the difference in the average level of the parents' education between the two groups (experiment and control group) gives the bias.

$\text{PEDUCATION}_{EX} - \text{PEDUCATION}_{CO} = 15.03 - 11.03 = 4$

On average, the parents of the students in the experiment group had gone to school for 4 more years than the parents of the students in the control group.

b1 of the first regression: $16.57 = 0.01 + 4.14 \cdot (15.03 - 11.03)$

**b2 (scale): the difference in the average level of the dependent variable that goes with a one-unit increase in this independent variable (PEDUCATION), when the other independent variable (CLASS: 1 = new method, also called experiment group; 0 = old method, also called control group) is held constant.**
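The decomposition of the first regression's b1 into a net effect and a bias can be checked numerically. The sketch below first repeats the arithmetic with the coefficients reported in this example, then verifies on simulated data that the same identity holds for least-squares estimates. The simulated data, the helper functions and all variable names are assumptions made for illustration; the closed-form two-predictor formulas are standard least-squares algebra, not SPSS output.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Centered sum of cross-products of u and v."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Arithmetic with the values reported in the text: net effect + b2 * gap.
reported = 0.01 + 4.14 * (15.03 - 11.03)
print("reported decomposition:", round(reported, 2))

# Simulated data (hypothetical): class dummy, father's education, test score.
random.seed(1)
cls = [0] * 50 + [1] * 50
ped = [random.gauss(11 + 4 * c, 2) for c in cls]
score = [10 + 2 * c + 4 * p + random.gauss(0, 5) for c, p in zip(cls, ped)]

# Regression 1: score on class only.
b1_short = cross(cls, score) / cross(cls, cls)

# Regression 2: score on class and education (closed form for two predictors).
den = cross(cls, cls) * cross(ped, ped) - cross(cls, ped) ** 2
b1_long = (cross(ped, ped) * cross(cls, score)
           - cross(cls, ped) * cross(ped, score)) / den
b2_long = (cross(cls, cls) * cross(ped, score)
           - cross(cls, ped) * cross(cls, score)) / den

# Bias: b2 times the gap in average education between the two groups.
ped_ex = mean([p for p, c in zip(ped, cls) if c == 1])
ped_co = mean([p for p, c in zip(ped, cls) if c == 0])
bias = b2_long * (ped_ex - ped_co)

print("b1 in regression 1:", round(b1_short, 4))
print("net effect + bias: ", round(b1_long + bias, 4))
```

The last two printed values should agree up to floating-point rounding, because for a dummy variable the identity is exact, not approximate.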

**Conclusion**

Here you need to look at two things when comparing the second regression to the first: whether the p value of the class changed (here it is no longer significant), and the significance levels of the other variables.

**In conclusion,** after including this third variable, the effect that we saw previously (the original effect of class on the test result) **disappeared**. So, we can state that the result refutes our hypothesis, because the p value of the class became greater than 0.05. This shows us that the results refute our hypothesis stating that “the new teaching method is more effective than the old one.”

PEDUCATION has an effect on the dependent variable (p < 0.05). Thus, the apparent effect of the learning method (CLASS) is due to the different educational background of the parents. The difference in the test results between the experiment and control group is not due to the new learning method being better; it can be explained by the fact that in the experiment group the students' parents were more educated and could help their children more than the parents of the students in the other class (control group).

**Note:** After including a new variable, always ask yourself: do you see any change in the p value? Does the original independent variable still have an effect on the dependent variable? Usually it is not this simple, though. Conclusions have to be drawn carefully, because after including a new control variable the change that we see in the p values can in certain cases mean that our hypothesis is right, and in other cases that it is wrong.

## Another example: immigrant-bribe

**Hypothesis:** Those whose father is an immigrant believe more strongly in the idea that “it is justifiable if someone accepts a bribe in the course of their duties” than those whose father is not an immigrant.

**In model 1** we see that those whose father is an immigrant score 0.319 higher on the justification of a bribe than those whose father is not an immigrant (p < 0.05). So, it seems that those whose father is an immigrant think that accepting a bribe is more justifiable than those whose father is not an immigrant.

As we see, our hypothesis is supported by the results. But are we sure that this is true? What if the hypothesis only seems to be right because of the age composition of our sample, with older people being more willing to think that a bribe is acceptable?

Variables: V202, V244, V242

So, our **alternative explanation** is: it is not whether someone's father is an immigrant that has an effect on the justification of the bribe, but age.

**In model 2, b1:** By adding age to the model as a control variable, SPSS creates a hypothetical situation. The hypothetical case that you have to imagine is one where there is no difference in the average age of the respondents between those whose father is an immigrant and those whose father is not. This hypothetical case is shown by b1.

**b1:** So, we interpret b1 as the difference in average justifiability between those whose father is an immigrant and those whose father is not, in a hypothetical case where there is no age difference between the two groups. But we also need to see that this coefficient is not statistically significant. So we cannot actually state that 0.235 is the difference in average justifiability between the two groups in that hypothetical case.

In this way, we hold the age variable at a constant level, which means that in this hypothetical case there is no difference between the two groups regarding age. Here you have to imagine that the respondents in group 1 and in group 0 are on average the same age. Now that we hold the age variable constant, we can test whether the relationship between having an immigrant father and the justifiability of the bribe still stands if the average age in our sample is the same in the two groups. As we see, the p value of the father-immigrant variable changed from model 1 to model 2. So, whether a person's father is an immigrant is not statistically significant when we want to explain which factors affect our dependent variable (the justifiability of a bribe).

**In model 2, b2:** is the coefficient of age. It shows how much the average justifiability score changes with each additional year of age when we compare respondents with the same value on the father-immigrant variable. Its reported size is 0.008 per year.

**In conclusion,** the results support the alternative explanation stating that it is not whether someone's father is an immigrant that has an effect on the justification of the bribe, but age.

We state this because the original effect disappeared and age is statistically significant (p < 0.05).