We will see an example of how to check an alternative explanation by using a control variable. 

Example: Our main goal is to find out whether a new teaching method is more effective than the old one.

Hypothesis: The new teaching method is more effective than the old one.

We have responses from students in two different classes. In one class, called the experiment group and coded as 1, a new teaching method was introduced. In the other class, called the control group and coded as 0, the new method was not introduced; the students still learned with the old method. To assess their performance, at the end of the year we asked the students to take a test on which the maximum achievable score was 100, and we then compared the two groups.

1. REGRESSION

Dependent: TEST_RESULT (achievement scores)

Independent: CLASS

Interpretation of the first coefficients table

b1: Those who study with the new teaching method score on average 16,567 points higher than those who study with the old method (p<0,05). The p value shows that this difference is unlikely to be due to chance, but are we sure that it is caused by the teaching method? Are we sure that the new teaching method is effective, meaning that students got significantly better results because they learned with the new method rather than with the old one?
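A minimal sketch of the first regression in Python, using a small made-up data set (the values are hypothetical and are not the data behind the coefficients table). It illustrates that when the only independent variable is a 0/1 dummy, b1 is simply the difference between the two group averages and b0 is the average of the control group. The helper name simple_ols is our own.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (b0, b1) of the least-squares line y = b0 + b1*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Hypothetical toy data: CLASS is 0 (old method) or 1 (new method)
class_ = [0, 0, 0, 0, 1, 1, 1, 1]
test_result = [55, 60, 62, 63, 70, 78, 80, 84]

b0, b1 = simple_ols(class_, test_result)

# Group averages computed directly, for comparison with b0 and b1
mean_co = mean(y for c, y in zip(class_, test_result) if c == 0)
mean_ex = mean(y for c, y in zip(class_, test_result) if c == 1)

print("b0:", b0, "average of control group:", mean_co)
print("b1:", b1, "difference of averages:", mean_ex - mean_co)
```

The regression therefore tells us that the groups differ, but not why they differ.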

Alternative explanation: What if the parents of the students who learned with the new teaching method are more educated than the parents of those who studied in the other class? More educated parents could help their children more in learning the materials for the class, and that could be why their children achieved better scores on the test than the students in the control group.

Question: What if the results (b1 and its p value) that we see in our coefficients table are due to this alternative explanation and not to the new teaching method being more effective than the old one? Is it the educational background of the parents or the new teaching method that explains the difference in the students’ achievement scores?

The solution: We need to remove the relationship (the correlation) between the two independent variables: class, which refers to the new and old teaching method, and the educational background of the parents. If we remove this correlation, the disturbing factor between our independent variable and our dependent variable disappears. Thus, we need to include PEDUCATION in our regression as a control variable.

2. REGRESSION

We introduce a new independent variable: PEDUCATION. It refers to the educational background of the parents (more specifically, the father’s educational background), measured in years.

Dependent: test result

Independent: class, peducation

By using a control variable, we can keep the disturbing factor (our control variable) at a constant level. In this way, we break the systematic relationship between the disturbing variable (PEDUCATION) and the main independent variable (CLASS) by turning the former into a constant, because a constant cannot correlate with anything. Keeping the educational background at a constant level means that, hypothetically, the parents in both classes have the same educational background.

This can be achieved by including the disturbing variable as an additional independent variable in the analysis. In this way we make the two classes identical in their average PEDUCATION. If we still find that the class using the new method performs better on the test, this can no longer be explained by differences in PEDUCATION, simply because there is no difference: the average level of PEDUCATION is now the same. So, this way we can test our alternative explanation.
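The idea can be sketched in Python on a small made-up data set (hypothetical values, not the data behind the coefficients tables). The toy scores are constructed so that they depend mainly on PEDUCATION, and the fathers in class 1 are more educated. The helper ols_two_predictors is our own; it solves the least-squares problem for exactly two independent variables.

```python
from statistics import mean

def ols_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2; returns (b0, b1, b2)."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

# Hypothetical toy data
class_ = [0, 0, 0, 0, 1, 1, 1, 1]
peducation = [9, 11, 12, 12, 14, 15, 15, 16]   # father's education in years
test_result = [57, 63, 68.5, 67.5, 75, 81, 80.5, 83.5]

# Without the control variable, b1 is the raw difference of group averages
b1_simple = (mean(y for c, y in zip(class_, test_result) if c == 1)
             - mean(y for c, y in zip(class_, test_result) if c == 0))

# With PEDUCATION included as a control variable
b0, b1_net, b2 = ols_two_predictors(class_, peducation, test_result)

print("b1 without control variable:", b1_simple)
print("b1 with control variable:   ", b1_net)
print("b2 (PEDUCATION):            ", b2)
```

In this toy example most of the raw difference between the classes disappears once PEDUCATION is held constant, which is the pattern the text goes on to describe.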

How do we interpret b1?

Ŷ=b0+b1*CLASS+b2*PEDUCATION

What is the average test result in the control group, that is, in the class where the students studied with the old method (category 0)? To answer the question, we substitute the respective value of the class (0) and the average of the other independent variable (PEDUCATION) in the control group.

1. equation: ŶCO=b0+b2*PEDUCATIONCO

 CO means in the control group

ŶCO: the average test result in the control group

PEDUCATIONCO: the average educational background of the father in the control group

What would the average test result be in the experiment group, that is, in the class where the students studied with the new method, if the education of the parents were the same as in the control group?

We substitute the respective value of the class (1), and we replace the actual average of PEDUCATION in the experiment group with the average of the control group.

2. equation: ŶEX=b0+b1+b2*PEDUCATIONCO

EX means in the experiment group

ŶEX: the average test result in the experiment group in the hypothetical case where the educational background of the father is the same in the experiment group as in the control group.

Let’s subtract equation 1 from equation 2.

ŶEX – ŶCO = (b0+b1+b2*PEDUCATIONCO) – (b0+b2*PEDUCATIONCO) = b1

As we see, the difference ŶEX – ŶCO is equal to b1.

b1: the difference between the average test results of the two groups (experiment group and control group) in a hypothetical case where the parents’ educational background is the same in the two groups.

b1 (dummy): the difference in the average level of the dependent variable between the two groups in a hypothetical case where the groups do not differ in the other independent variable.

So, this is a hypothetical case in which we keep the parents’ educational background fixed at a constant level. b1 shows the effect of the new teaching method that is independent of the educational background of the parents. b1 is also called the net effect.
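The subtraction above can be checked numerically with a short Python sketch. The values of b1 and b2 are the ones reported in the text (written with decimal points); b0 is not given in the text, so the value used here is purely hypothetical, and it cancels out anyway. The function name predict is our own.

```python
def predict(b0, b1, b2, class_, peducation):
    """Predicted test result: Y-hat = b0 + b1*CLASS + b2*PEDUCATION."""
    return b0 + b1 * class_ + b2 * peducation

B1 = 0.01   # net effect of CLASS in the 2. regression (from the text)
B2 = 4.14   # effect of PEDUCATION in the 2. regression (from the text)
B0 = 20.0   # hypothetical intercept, not reported in the text

differences = []
for peducation in (8, 11.03, 15.03):
    # Same PEDUCATION in both groups; only CLASS changes from 0 to 1
    y_ex = predict(B0, B1, B2, 1, peducation)
    y_co = predict(B0, B1, B2, 0, peducation)
    differences.append(y_ex - y_co)
    print("PEDUCATION =", peducation, "difference =", round(y_ex - y_co, 6))
```

Whatever level of PEDUCATION we hold constant, the difference between the two predicted averages is b1, up to floating-point rounding.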

How do we interpret b2?

While in the interpretation of b1 we imagined that there is no difference between the two groups, in reality there is one: the parents of the students in the two classes have different educational backgrounds on average. In addition, b2 and its p value show that the parents’ education is significantly related to the test result.

So, instead of using the 2. equation, we have to use the following equation:

3. equation: ŶEX=b0+b1+b2*PEDUCATIONEX

PEDUCATIONEX: the average educational background of the fathers in the experiment group

ŶEX: the average test score of the students in this class

Let’s subtract the 1st equation from the 3rd equation.

ŶEX – ŶCO = (b0+b1+b2*PEDUCATIONEX) – (b0+b2*PEDUCATIONCO) = b1+b2*PEDUCATIONEX – b2*PEDUCATIONCO = b1+b2*(PEDUCATIONEX – PEDUCATIONCO)

As we see, the difference between the two classes is made up of two parts. The first part is b1, the net effect of CLASS, which refers to the introduction of the new teaching method, independently of the effect of the father’s educational background. The second part is b2*(PEDUCATIONEX – PEDUCATIONCO), which is the bias caused by the different educational backgrounds.

Together, the two parts give the regression coefficient that we obtain when only one independent variable (class) is included in the regression; see 1. regression.
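This decomposition holds exactly for least-squares estimates, which we can check with a Python sketch on a small made-up data set (hypothetical values, not the data behind the coefficients tables). The helper names centered and dot are our own.

```python
from statistics import mean

def centered(values):
    """Subtract the average from every value."""
    m = mean(values)
    return [v - m for v in values]

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy data
class_ = [0, 0, 0, 0, 1, 1, 1, 1]
peducation = [9, 11, 12, 12, 14, 15, 15, 16]
test_result = [57, 63, 68.5, 67.5, 75, 81, 80.5, 83.5]

c, p, y = centered(class_), centered(peducation), centered(test_result)

# 1. regression: TEST_RESULT on CLASS only
b1_simple = dot(c, y) / dot(c, c)

# 2. regression: TEST_RESULT on CLASS and PEDUCATION
det = dot(c, c) * dot(p, p) - dot(c, p) ** 2
b1_net = (dot(p, p) * dot(c, y) - dot(c, p) * dot(p, y)) / det
b2 = (dot(c, c) * dot(p, y) - dot(c, p) * dot(c, y)) / det

# Difference in average PEDUCATION between the two groups
mean_ex = mean(v for v, g in zip(peducation, class_) if g == 1)
mean_co = mean(v for v, g in zip(peducation, class_) if g == 0)
bias = b2 * (mean_ex - mean_co)

print("b1 in the 1. regression:", b1_simple)
print("net effect + bias:      ", b1_net + bias)
```

Both printed numbers agree, so the coefficient of the first regression is the net effect plus the bias.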

Thus, b1 in the 1. regression contains both the net effect (or true effect) and the bias.

So, in our 1st regression b1, the effect of the class, was 16,57, while in the second regression, where we have two independent variables, b1, the true/net effect, is 0,01.

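b2: the effect of the parents’ education on the test result is 4,14. That is, within the same class, each additional year of the father’s education goes together with a test result that is on average 4,14 points higher. To obtain the bias, we multiply b2 by the difference in the average level of the parents’ education between the two groups (experiment and control group).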

PEDUCATIONEX – PEDUCATIONCO= 15,03 – 11,03 = 4.

On average, the parents of the students in the experiment group had gone to school for 4 more years than the parents of the students in the control group.

b1 in the 1. regression = net effect + bias: 16,57 = 0,01 + 4,14 * (15,03 – 11,03)
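The arithmetic can be reproduced in a few lines of Python, using only the numbers reported in the text (written with decimal points instead of decimal commas). The variable names are our own.

```python
b1_net = 0.01            # b1 in the 2. regression (net effect of CLASS)
b2 = 4.14                # effect of PEDUCATION in the 2. regression
peducation_ex = 15.03    # average father's education, experiment group
peducation_co = 11.03    # average father's education, control group

# Bias caused by the different educational backgrounds
bias = b2 * (peducation_ex - peducation_co)

# Net effect plus bias should give b1 of the 1. regression
b1_simple = b1_net + bias

print("bias:", round(bias, 2))
print("b1 in the 1. regression:", round(b1_simple, 2))
```

Almost the whole coefficient of the first regression is bias; only 0,01 of it is the net effect of the teaching method.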

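b2 (scale): the change in the average level of the dependent variable for a one-unit increase in this independent variable (PEDUCATION), with the other independent variable (CLASS: 1-new method, also called experiment group, 0-old method, also called control group) kept constant. In the decomposition it is multiplied by the difference in the average level of PEDUCATION between the two groups of CLASS.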

Conclusion

Here you need to base your conclusions on two things: the change in the p value of the class when comparing the second regression to the first (here it is not significant anymore), and the significance levels of the other coefficients.

In conclusion, after including this third variable, the effect that we saw previously (the original effect of class on test result) disappeared: the p value of the class became greater than 0,05. So, the result does not support our hypothesis stating that “the new teaching method is more effective than the old one.”

PEDUCATION has an effect on the dependent variable (p<0,05). Thus, the apparent effect of the learning method (CLASS) is due to the different educational background of the parents. The difference in the test results between the experiment and control group is not due to the new learning method being better; it can be explained by the fact that in the experiment group the students’ parents were more educated and could help their children more than the parents of the students in the other class (control group).

Note: After including a new variable, always ask yourself: do you see any change in the p value? Does the original independent variable still have an effect on the dependent variable? Usually it is not this simple, though. Conclusions have to be drawn carefully, because the change that we see in the p values after including a new control variable can mean in certain cases that our hypothesis is right and in other cases that it is wrong.
