# Simple Linear Regression

## When not to use it?

Do not use simple linear regression when your dependent variable is categorical (nominal or ordinal); in that case use a Chi-square test or logistic regression instead. It is also not the right tool when your main goal is not prediction, or when the assumptions of linear regression are not fulfilled.

## When to use it?

Use it when you have a scale dependent variable and you want to answer the following question: for every one-unit increase in X, how much does Y change on average?

A regression lets you look at one independent variable (simple linear regression) or several (multiple linear regression) and see how the independent variable affects the dependent variable.

**Dependent variable**: This is the variable whose values we want to explain. We call it a dependent variable because its values depend on something else. We denote it as Y. Other names: predicted, outcome, or output variable.

**Independent variable**: This is the variable that explains the variability of the dependent variable. Its values do not depend on the other variable in the model. We denote it as X. Other names: input variable, covariate, feature, predictor variable.

In short: the variable being predicted is the dependent variable, and the variable used to predict it is the independent variable.

**y = b0 + b1·x + E**

y: dependent variable

b0: intercept or constant

b1: coefficient of x, or slope (we usually pay more attention to this one than to the intercept)

x: independent variable

E: the error term, the part of y that the line does not explain. The regression line is chosen so that the sum of the squared errors is as small as possible.
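To make the equation concrete, here is a minimal sketch of how b0 and b1 can be estimated by ordinary least squares. It is written in plain Python rather than SPSS, the function name `fit_line` is only illustrative, and the five data points are made up; they are not taken from GDP.sav.

```python
# Minimal ordinary least squares fit for y = b0 + b1 * x + E.
# The data below are hypothetical and only serve as an illustration.

def fit_line(x, y):
    """Return (b0, b1) that minimize the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through (mean_x, mean_y).
    b0 = mean_y - b1 * mean_x
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical independent variable
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical dependent variable

b0, b1 = fit_line(x, y)
# Residuals are the estimates of the error term E for each observation.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

When the model includes an intercept, the residuals of a least squares fit sum to zero (up to rounding), which is a quick sanity check on the result.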

## Step 1: Open GDP.sav

In our GDP.sav data file we have two variables (GDP and Suicide) that were assessed in 100 different countries.

## Step 2: Create a scatter plot and form your hypothesis

Graphs – Legacy Dialogs – Scatter/Dot – Simple Scatter

Put the dependent variable on the Y axis: Suicide

Put the independent variable on the X axis: GDP

Click Paste, then run the pasted command in the syntax window.

Look at the relationship between the two variables, then state a hypothesis.

Countries with a high GDP and a high suicide rate are in the upper right corner. Countries with a low GDP and a low suicide rate are in the bottom left corner. As a hypothesis we could say that **as the GDP increases, the suicide rate increases too**.

Another example of how to state a hypothesis:

Question: How do people's salaries depend on their experience?

Possible hypotheses:

1. The more experienced you are, the higher your salary is.
2. The less experience you have, the higher your salary is.
3. Statistically there is no significant relationship between experience and salary. (This means that the variation in Y is unrelated to the variation in X.)

## Step 4: Perform simple linear regression

Analyze – Regression – Linear

Go to Statistics and mark the following ones:

**Interpret the Mean:** In the 100 countries examined, the average suicide rate is 38 and the average GDP is 6029 dollars.

**Interpret the correlations table**: The correlations table displays Pearson correlation coefficients, significance values, and the number of cases with non-missing values.

What does this show us? Is there a significant relationship between the two variables? (Sig., the significance level.) If there is a relationship, what type is it: positive or negative? Weak, moderate or strong? (Pearson Correlation, denoted as r.)

If the Pearson Correlation is close to -1 or 1, there is a strong linear relationship between the two variables. (A Pearson Correlation of 0.423, for example, indicates a moderate correlation.)

If the Pearson Correlation is close to 0, there is no linear relationship.

If **Sig. (1-tailed)**, the p value, is < 0.05, there is a statistically significant relationship between the two variables.

If Sig. (1-tailed), the p value, is ≥ 0.05, we cannot conclude that there is a relationship between the two variables.

Note: It can also happen that there is a relationship but the sample size is too small for the test to detect it.

Note: SPSS uses asterisks to indicate whether the test is significant. One asterisk (*) means that the correlation is significant at the 0.05 level. Two asterisks (**) mean that the correlation is significant at the 0.01 level.
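The note about sample size can be illustrated with the t statistic that underlies the significance test of a Pearson correlation, t = r·√(n−2) / √(1−r²). The sketch below is plain Python, not SPSS output. It uses r = 0.565, the value that appears in the Coefficients table of this example, together with n = 100; the second sample size (n = 10) is a made-up comparison.

```python
import math

def t_from_r(r, n):
    """t statistic for testing whether a Pearson correlation differs from 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.565                  # correlation between GDP and Suicide in this example
t_100 = t_from_r(r, 100)   # the 100 countries in the data file
t_10 = t_from_r(r, 10)     # hypothetical: same r, but only 10 countries

print(f"n = 100: t = {t_100:.2f}")
print(f"n = 10:  t = {t_10:.2f}")
```

The same correlation gives a much smaller t statistic in the small sample, which is why a real relationship can fail to reach significance when there are only a few cases.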

For now we don’t have to do anything with this one. We will use it later in the multiple regression.

In this table (the Model Summary) we interpret the R square, which in simple linear regression is the square of the correlation coefficient between the two variables. It indicates how well your regression equation fits your data. Values closer to 1 indicate a better fit.

**R** is the square root of R square and is the correlation between the observed and predicted values of the dependent variable.

**R square:** The R square answers the question of how much of the movement in the dependent variable is explained by movement in the independent variables. (As a rule of thumb, you want this to be more than 30% or 0.3.) Note that this is an overall measure of the strength of association, and does not reflect the extent to which any particular independent variable is associated with the dependent variable.

R square is also called the coefficient of determination. It is the proportion of the variance in the dependent variable that is explained by the independent variables.

So, 31.9% of the variability in the suicide rate can be accounted for by the GDP. The GDP therefore explains only part of it; 68.1% is left to other factors.

Example 1: The value of R square is 0.688, which is almost 70%. This means that almost 70% of the variance observed in the dependent variable is explained by the independent variables. This is a very high R square value, so the independent variable explains a lot of the variance in the dependent variable.

Example 2: The value of R square is 0.073, which is 7.3%. This means the GDP (independent variable) accounts for only 7.3% of the variation in the suicide rate (dependent variable), so 92.7% of the variation in the suicide rate is explained by other factors. In this case GDP does not explain much.
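As a rough illustration of where R square comes from, the following plain Python sketch computes it as 1 minus the ratio of unexplained to total variation, and checks that in simple linear regression it equals the squared Pearson correlation. The six data points are hypothetical and are not from GDP.sav.

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]     # hypothetical independent variable
y = [2.0, 4.1, 5.2, 8.3, 8.9, 12.5]    # hypothetical dependent variable

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)   # total variation in y
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least squares line.
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
predicted = [b0 + b1 * xi for xi in x]

# R square: share of the total variation that the line explains.
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
r_square = 1 - ss_res / syy

# Pearson correlation between x and y.
r = sxy / math.sqrt(sxx * syy)

print(f"R square = {r_square:.3f}, r squared = {r ** 2:.3f}")
```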

**Adjusted R square**: This is the R square corrected for the number of independent variables relative to the sample size. We interpret it instead of the R square when the sample size is small. The difference between the Adjusted R square and the R square gets smaller as the sample size gets larger.
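The standard formula is Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the sample size and k is the number of independent variables. The sketch below applies it in plain Python to the R square of 0.319 from this example; only n = 100 comes from the example, the other sample sizes are hypothetical and are there to show how the gap shrinks.

```python
def adjusted_r_square(r_square, n, k):
    """Adjusted R square for n cases and k independent variables."""
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)

r_square = 0.319   # R square reported for the GDP example
k = 1              # one independent variable (GDP)

gaps = []
for n in (10, 30, 100, 1000):   # only n = 100 matches the example data
    adj = adjusted_r_square(r_square, n, k)
    gaps.append(r_square - adj)
    print(f"n = {n:4d}: adjusted R square = {adj:.3f}")
```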

**Std. Error of the Estimate**: This is a measure of variability. It tells you how much inaccuracy to expect in your predictions, in the units of the dependent variable. Smaller numbers mean more accurate predictions, larger numbers mean less accurate ones.

The ANOVA table shows us whether our model as a whole is statistically significant. Look at the “**Regression**” row and go to the “**Sig.**” column. *p* < 0.05 indicates that, overall, the regression model statistically significantly predicts the outcome variable.

**df**: These are the degrees of freedom associated with the sources of variance. The total variance has N-1 degrees of freedom, the regression (model) has as many degrees of freedom as there are independent variables (k), and the residual has N-k-1. In our example, assuming no missing values, N=100 and k=1, so the df are 99 (total), 1 (regression) and 98 (residual). As an illustration from a different data set with multiple predictors: with N=200 students and 4 independent variables (**math**, **female**, **socst** and **read**), the total df is 199. Counting the intercept, which is included automatically unless you explicitly omit it, the model estimates 5 parameters, so the model has 5-1=4 degrees of freedom. The residual df is the total df minus the model df: 199 – 4 = 195.
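The degrees of freedom, and the F value that the Sig. column is based on, can be reproduced from N, the number of independent variables and R square. The sketch below is plain Python under those assumptions, with an illustrative function name; it is not SPSS output. For the GDP example it takes R square as 0.565², the squared correlation reported in this tutorial, and assumes no missing cases.

```python
def anova_summary(r_square, n, k):
    """Degrees of freedom and F statistic for a regression with k predictors."""
    df_regression = k
    df_residual = n - k - 1
    df_total = n - 1
    f_value = (r_square / df_regression) / ((1 - r_square) / df_residual)
    return df_regression, df_residual, df_total, f_value

# GDP example: 100 countries, one independent variable.
df_reg, df_res, df_tot, f_value = anova_summary(0.565 ** 2, n=100, k=1)
print(df_reg, df_res, df_tot, round(f_value, 2))

# The multiple regression illustration: 200 students, 4 independent variables.
# The R square of 0.5 is a placeholder; it does not affect the df.
print(anova_summary(0.5, n=200, k=4)[:3])
```

With a single independent variable, F equals the square of the t statistic of the slope, so the ANOVA table and the Sig. of the slope in the Coefficients table lead to the same conclusion.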

**Interpreting the coefficients table**:

The first row (**constant**) represents the constant, also referred to in textbooks as the Y intercept: the height of the regression line where it crosses the Y axis. In other words, this is the predicted value of the dependent variable (Suicide in our example) when the independent variable is 0.

**Beta**: These are the standardized coefficients, that is, the coefficients you would obtain if you standardized all of the variables in the regression, including the dependent and all of the independent variables, and then ran the regression. Standardizing puts all of the variables on the same scale, so you can compare the magnitude of the coefficients to see which one has more of an effect. You will also notice that the larger betas are associated with the larger t-values. In simple linear regression the **Standardized Coefficients Beta** is in fact the correlation between the independent (GDP) and the dependent variable (Suicide): if you scroll up to the Correlations table, you will see that the Pearson Correlation has the same value.

The Standardized Coefficient Beta tells you about the importance of each predictor in the model.
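A short sketch can show why, with a single predictor, the standardized Beta is the same number as the Pearson correlation: Beta is the unstandardized slope multiplied by the ratio of the standard deviations of X and Y. The code is plain Python and the data are hypothetical, not taken from GDP.sav.

```python
import math

x = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]      # hypothetical independent variable
y = [10.0, 14.0, 13.0, 20.0, 19.0, 27.0]  # hypothetical dependent variable

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b1 = sxy / sxx                     # unstandardized slope (B)
sd_x = math.sqrt(sxx / (n - 1))    # sample standard deviation of x
sd_y = math.sqrt(syy / (n - 1))    # sample standard deviation of y

beta = b1 * sd_x / sd_y            # standardized coefficient
r = sxy / math.sqrt(sxx * syy)     # Pearson correlation

print(f"Beta = {beta:.4f}, r = {r:.4f}")
```

This equality holds only for simple linear regression; with several independent variables the Betas generally differ from the pairwise correlations.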

The **Coefficients** table provides us with the information needed to predict the dependent variable from the independent variable(s), and to determine whether the independent variable(s) contribute statistically significantly to the model (by looking at the “**Sig.**” column).

If you check the Sig. in this row (the row where the 0.565 is), you can see that the result is statistically significant (SPSS displays p = 0.000, which means p < 0.001, so p < 0.05). This means that the result is very unlikely to have occurred by chance.

**t**: the t statistic, which is the coefficient divided by its standard error. It is used to test whether the coefficient differs from 0.

**b0 (Constant/Intercept)**: This tells us the value of the dependent variable when the independent variable is 0. In most cases the intercept does not have an intuitive interpretation. If 0 lies within the observed range of the independent variable then the intercept has a meaning; otherwise we do not interpret it.

**b1 (Slope)**: This shows us how the independent variable affects the dependent variable: how much the dependent variable changes, on average, if we increase the independent variable by one unit. It shows both the direction and the intensity of the effect.

**Interpretation**: For a 1 dollar increase in the GDP, the suicide rate increases by 0.003 on average. This does not carry much meaning (a 1 dollar increase hardly matters), so in this case it is better to talk about 100 dollars: for every 100 dollar increase in the GDP, the average suicide rate increases by 0.3.

So, in a country where the GDP is 100 dollars higher than in another country, the suicide rate is on average 0.3 higher.
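The rescaling from 1 dollar to 100 dollars is just the slope multiplied by the size of the change in X. A minimal Python sketch, using the slope of 0.003 reported above (the helper name `predicted_change` is only illustrative):

```python
def predicted_change(b1, delta_x):
    """Average change in the dependent variable for a change of delta_x in X."""
    return b1 * delta_x

b1 = 0.003   # slope for GDP reported in this example

change_1 = predicted_change(b1, 1)       # 1 dollar more GDP
change_100 = predicted_change(b1, 100)   # 100 dollars more GDP

print(round(change_1, 3), round(change_100, 3))
```

Note that this only gives the difference between two predictions; predicting the suicide rate itself for a given GDP would also need the intercept b0.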

If b1 is 0 then we can say that there is no linear relationship between the two variables. Keep in mind that there could still be some other, non-linear relationship between them.

## Step 6: Compare the result with your hypothesis

The result is in line with the hypothesis we formed from the scatter plot: as the GDP increases, the suicide rate increases too. It is in contrast with the hypothesis that the suicide rate is higher in poorer countries, so it contradicts the idea that the suicide rate is high because of a poor socio-economic background.

## Other examples

Let’s focus on the three predictors, whether they are statistically significant and, if so, the direction of the relationship. The average class size (**acs_k3**, b=-2.682) is not significant (p=0.055), but only just so, and the coefficient is negative, which would indicate that larger class sizes are related to lower academic performance, which is what we would expect. Next, the effect of **meals** (b=-3.702, p=.000) is significant and its coefficient is negative, indicating that the greater the proportion of students receiving free meals, the lower the academic performance. Please note that we are not saying that free meals are causing lower academic performance. The **meals** variable is highly related to income level and functions more as a proxy for poverty. Thus, higher levels of poverty are associated with lower academic performance. This result also makes sense. Finally, the percentage of teachers with full credentials (**full**, b=0.109, p=.2321) seems to be unrelated to academic performance. This would seem to indicate that the percentage of teachers with full credentials is not an important factor in predicting academic performance; this result was somewhat unexpected.

Should we take these results and write them up for publication? From these results, we would conclude that lower class sizes are related to higher performance, that fewer students receiving free meals is associated with higher performance, and that the percentage of teachers with full credentials was not related to academic performance in the schools. Before we write this up for publication, we should do a number of checks to make sure we can firmly stand behind these results. We start by getting more familiar with the data file, doing preliminary data checking, and looking for errors in the data.
