Assumptions of Linear Regression
When not to use Simple Linear Regression?
Do not use simple linear regression when the dependent variable is categorical (nominal or ordinal); in that case, use a Chi-square test or logistic regression instead. It is also not the right tool when your main goal is unrelated to prediction, or when the assumptions of linear regression are not fulfilled.
When to use it?
Use it when you have a scale dependent variable and you want to answer the following question: for every unit increase of X, how much will Y increase?
A regression lets you look at one independent variable (simple linear regression) or several independent variables (multiple linear regression) and see how they affect the dependent variable.
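To make the "for every unit increase of X, how much will Y increase?" question concrete, here is a minimal sketch that fits a straight line by least squares. The data and the names `fit_line`, `intercept` and `slope` are made up for illustration; SPSS reports the same two quantities in its Coefficients table.

```python
# Hypothetical data: hours studied (x) and exam score (y). Values are made up.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [52.0, 55.0, 61.0, 64.0, 68.0]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


intercept, slope = fit_line(x, y)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

With these made-up numbers the slope is 4.1, which reads as: each additional unit of X is associated with an average increase of about 4.1 units in Y.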
Dependent variable: This is the variable whose values we want to explain. We call it a dependent variable because its values depend on something else. We denote it as Y. Other names: Predicted, Outcome, Output variable.
Independent variable: This is the variable that explains the variability of the dependent variable. Its values are independent. We denote it as X. Other names: Input variable, Covariate, Feature, Predictor variable.
The variable that is predicted is called the dependent variable. The variable that we predict with is the independent variable.
y = b0 + b1*x + E
- y: dependent variable
- b0: intercept or constant
- b1: coefficient of X, or slope (we usually pay more attention to this one than to the intercept)
- x: independent variable
- E: the error term, which we try to minimize
Assumptions of Simple Linear Regression
The following assumptions are associated with a linear regression model:
- The dependent variable has to be a scale variable.
- Linearity: The relationship between X and the mean of Y is linear.
- Outlier condition
- Homoscedasticity and nearly normal residuals (independence of errors): the variance of the residuals is the same for any value of X.
- Normality: the variables (X, Y) have to be normally distributed.
1. The dependent variable has to be a scale variable.
2. Linearity
The relationship between the dependent and the independent variable should be linear. Check this using a scatterplot of the data.
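A scatterplot is judged by eye, but the same idea can be roughed out numerically. The sketch below is an illustration rather than a standard test: it fits a line, then averages the residuals in the lower, middle and upper third of X. If the relationship is linear the three averages stay near zero; a curved relationship leaves a systematic pattern. The helper names and both data sets are hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def mean_residual_by_third(x, y):
    """Average residual in the lower, middle and upper third of x."""
    intercept, slope = fit_line(x, y)
    # Pair each x with its residual and sort by x.
    pairs = sorted((xi, yi - (intercept + slope * xi)) for xi, yi in zip(x, y))
    n = len(pairs)
    k = n // 3
    groups = [pairs[:k], pairs[k:n - k], pairs[n - k:]]
    return [sum(r for _, r in g) / len(g) for g in groups]


# Made-up data: one exactly linear relationship, one clearly curved.
x = list(range(1, 13))
y_linear = [2.0 + 3.0 * xi for xi in x]
y_curved = [float(xi ** 2) for xi in x]

linear_pattern = mean_residual_by_third(x, y_linear)
curved_pattern = mean_residual_by_third(x, y_curved)
print("linear:", [round(v, 2) for v in linear_pattern])
print("curved:", [round(v, 2) for v in curved_pattern])
```

For the curved data the averages come out positive, negative, positive: the line underestimates Y at both ends and overestimates it in the middle, which is the same U-shape you would see in the scatterplot.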
3. Outlier condition: check for outliers and extreme values (leverage and influential points)
Check the scatterplot.
Outliers in regression are observations that fall far from the “cloud” of points. These points are especially important because they can have a strong influence on the regression line.
So, there are three types of "deviant" data points: influential data points, high-leverage data points, and outliers.
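The three kinds of deviant points can also be quantified. The sketch below computes, for simple linear regression, each point's leverage, its standardized residual, and its Cook's distance (a common measure of influence). The data are made up, the function name `diagnostics` is hypothetical, and the cut-offs mentioned afterwards are rules of thumb rather than fixed thresholds.

```python
import math


def diagnostics(x, y):
    """Return a list of (leverage, standardized residual, Cook's distance)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Mean squared error with n - 2 degrees of freedom (two coefficients).
    mse = sum(e * e for e in resid) / (n - 2)
    rows = []
    for xi, e in zip(x, resid):
        # Leverage: how far this x lies from the mean of x.
        h = 1.0 / n + (xi - x_mean) ** 2 / sxx
        # Residual scaled by its own estimated standard deviation.
        std_resid = e / math.sqrt(mse * (1.0 - h))
        # Cook's distance with p = 2 estimated coefficients.
        cooks = (e * e / (2.0 * mse)) * h / (1.0 - h) ** 2
        rows.append((h, std_resid, cooks))
    return rows


# Made-up data: the last point sits far to the right and off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 20.0]

rows = diagnostics(x, y)
for xi, (h, r, d) in zip(x, rows):
    print(f"x={xi:5.1f}  leverage={h:.3f}  std_resid={r:6.2f}  cooks_d={d:.3f}")
```

Commonly quoted rules of thumb flag a leverage above 4/n (twice the number of coefficients divided by n) and a Cook's distance above 1. In this example the last point exceeds both, so it is both a high-leverage and an influential point.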
4. Homoscedasticity and nearly normal residuals
Homoscedasticity = constant variability of the residuals
For large sample sizes, moderate departures from normality of the residuals matter less.
Check the normality of the residuals using a normal probability plot of the residuals or a histogram. Check the constant variability using a residuals plot (plotting the predicted values against the residuals).
Analyze-Regression-Linear: in the Plots window, mark the Normal Probability Plot.
Question: Are the theoretical residuals normally distributed? We don’t know the theoretical residuals, we only have the observed residuals. Residuals should be nearly normally distributed, centered at 0. This may not be satisfied if there are unusual observations that don’t follow the trend of the rest of the data.
The variability of points around the least squares line (regression line) should be roughly constant. This implies that the variability of residuals around the 0 line should be roughly constant as well.
Put into Y: *ZRESID – standardized residuals
Put into X: *ZPRED – standardized predicted values
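Outside SPSS, the two quantities behind this plot can be sketched by hand. The code below assumes that standardized predicted values are the predictions rescaled to mean 0 and standard deviation 1, and that standardized residuals are the residuals divided by the standard error of the estimate; treat both definitions as assumptions about what *ZPRED and *ZRESID contain. As a rough stand-in for reading the plot, it then compares the spread of the residuals in the upper and lower half of the predicted values. The function names and data are made up.

```python
import math
from statistics import mean, stdev


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    x_mean, y_mean = mean(x), mean(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def zpred_zresid(x, y):
    """Standardized predicted values and standardized residuals."""
    intercept, slope = fit_line(x, y)
    pred = [intercept + slope * xi for xi in x]
    resid = [yi - pi for yi, pi in zip(y, pred)]
    # Standard error of the estimate, with n - 2 degrees of freedom.
    se = math.sqrt(sum(e * e for e in resid) / (len(x) - 2))
    pred_mean, pred_sd = mean(pred), stdev(pred)
    zpred = [(p - pred_mean) / pred_sd for p in pred]
    zresid = [e / se for e in resid]
    return zpred, zresid


def spread_ratio(zpred, zresid):
    """Residual spread in the upper half of predictions over the lower half."""
    ordered = [r for _, r in sorted(zip(zpred, zresid))]
    half = len(ordered) // 2
    return stdev(ordered[half:]) / stdev(ordered[:half])


# Made-up data: constant noise versus noise that grows with x (a "fan" shape).
x = list(range(1, 11))
signs = [1 if i % 2 == 0 else -1 for i in range(len(x))]
y_const = [2.0 * xi + s for xi, s in zip(x, signs)]
y_fan = [2.0 * xi + s * 0.5 * xi for xi, s in zip(x, signs)]

zp_const, zr_const = zpred_zresid(x, y_const)
zp_fan, zr_fan = zpred_zresid(x, y_fan)
ratio_const = spread_ratio(zp_const, zr_const)
ratio_fan = spread_ratio(zp_fan, zr_fan)
print(f"constant noise: ratio = {ratio_const:.2f}")
print(f"growing noise:  ratio = {ratio_fan:.2f}")
```

A ratio close to 1 is what roughly constant variability looks like; a ratio well above 1 corresponds to a residual plot that fans out to the right.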
What we are looking for here is whether the points more or less fall on the line. We see that there is some deviation (towards the center), but generally the points do seem to fall on the line. So we would assume that we have a normal distribution here: the observed standardized residuals are normally distributed. We then check whether the observed unstandardized residuals are normally distributed. (You can see them in the Data View window; SPSS creates them after you run the regression.)
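The "points fall on the line" reading can be sketched in code as well. A normal probability (P-P) plot pairs the observed cumulative proportion of each sorted residual with the cumulative probability a normal distribution would give it. The sketch assumes the plotting position (i - 0.5)/n for the observed proportion; statistical packages may use a slightly different formula. The names `pp_points` and `max_gap` and both data sets are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def pp_points(residuals):
    """Return (observed, expected) cumulative probabilities for a P-P plot."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    standard_normal = NormalDist()
    points = []
    for i, r in enumerate(sorted(residuals), start=1):
        observed = (i - 0.5) / n  # assumed plotting position
        expected = standard_normal.cdf((r - m) / s)
        points.append((observed, expected))
    return points


def max_gap(points):
    """Largest vertical distance between the points and the diagonal."""
    return max(abs(obs - exp) for obs, exp in points)


# Made-up residuals: one set shaped like a normal distribution, one skewed.
probs = [(i - 0.5) / 20 for i in range(1, 21)]
normal_like = [NormalDist().inv_cdf(p) for p in probs]
skewed = [math.exp(2.0 * v) for v in normal_like]

gap_normal = max_gap(pp_points(normal_like))
gap_skewed = max_gap(pp_points(skewed))
print(f"normal-like residuals: max gap = {gap_normal:.3f}")
print(f"skewed residuals:      max gap = {gap_skewed:.3f}")
```

Points that hug the diagonal give a small gap; strongly skewed residuals pull the points away from the line and the gap grows.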
5. Test of normality for the variables (independent and dependent)
We use the Kolmogorov-Smirnov or the Shapiro-Wilk test.
Analyze-Descriptive Statistics-Explore: in the Plots window, mark Normality Plots with Tests.
If the Sig. value is above 0.05, we assume that the variable is normally distributed.
If it is 0.05 or less, the variable is not normally distributed.
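As an illustration of what the Kolmogorov-Smirnov test measures, the sketch below computes its distance statistic D: the largest gap between the sample's cumulative distribution and a normal distribution with the same mean and standard deviation. Turning D into a Sig. value requires tables or a statistics package and is not attempted here. The second function only restates the decision rule above. The names `ks_statistic` and `normality_decision` and the sample values are made up.

```python
from statistics import NormalDist, mean, stdev


def ks_statistic(sample):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    n = len(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    d = 0.0
    for i, value in enumerate(sorted(sample), start=1):
        cdf = fitted.cdf(value)
        # The empirical CDF jumps from (i - 1)/n to i/n at this value.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


def normality_decision(sig, alpha=0.05):
    """Apply the rule: Sig. above alpha means normality is assumed."""
    if sig > alpha:
        return "normally distributed"
    return "not normally distributed"


# Made-up measurements, for illustration only.
sample = [4.8, 5.1, 5.3, 5.6, 5.9, 6.0, 6.2, 6.4, 6.9, 7.5]
d = ks_statistic(sample)
print(f"D = {d:.3f}")
print(normality_decision(0.20))
print(normality_decision(0.05))
```

The boundary case follows the text: a Sig. value of exactly 0.05 counts as not normally distributed.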