### Regression Analysis – Linear Regression, SSE, Assumptions of Linear Regression, the Error Term, and the Best-Fit Line

**Regression Analysis** :- In statistics, **regression analysis** is a statistical process for estimating the relationships among variables. In this post we discuss only linear regression.

**Regression analysis is used to:**

- Predict the value of a dependent variable based on the value of at least one independent variable

- Explain the impact of changes in an independent variable on the dependent variable

**Dependent variable**: the variable we wish to explain (also called the **endogenous variable**)

**Independent variable**: the variable used to explain (also called the **exogenous variable**)

- The relationship between X and Y is described by a linear function
- Changes in Y are assumed to be caused by changes in X

**Linear regression population equation model:**

Y = β0 + β1X + ε

Now let's have a look at the above equation:

**Y** = Dependent variable

**X** = independent Variable

**β0 and β1** = the population model coefficients: β0 is the intercept and β1 is the slope

**ε** = a random error term. Let's understand the error term in a broad manner, because it matters a lot when we try to learn the best-fit line or calculate SSE –

Here the error is the distance between the predicted point **(on the true regression line)** and the observed point. It is also called the **disturbance**; put more simply, it is the vertical distance (downward or upward) of a data point from the best-fit line. That is, the point lies a certain distance above or below the line, and if you were to draw a vertical segment from the point to the line, that distance would be the error.
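As a small sketch of this idea (the line coefficients and data points below are made up purely for illustration), the vertical error of each observed point from a given line can be computed like this:

```python
# Hypothetical line y = 2 + 3x (coefficients chosen for illustration only)
b0, b1 = 2.0, 3.0

# Made-up observed (x, y) data points
points = [(1.0, 5.5), (2.0, 7.5), (3.0, 11.5)]

# Error (residual) = observed y minus predicted y on the line;
# a point above the line gives a positive error, below gives a negative one
errors = [y - (b0 + b1 * x) for x, y in points]
print(errors)
```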

The above equation is similar to the straight-line equation y = mx + c: it has an intercept and a slope, with the error term ε shifting each observed point above or below the line.

The above equation represents the population regression model, while the simple linear regression model provides the estimates of the population regression model and can be written like this:

Ŷi = b0 + b1Xi

###### Where

###### Yi = estimated Y value for the ith observation

###### Xi = value of X for the ith observation

###### b0 and b1 = the estimates of the regression intercept and the regression slope
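The estimated equation can be applied directly once b0 and b1 are known; in this sketch the values of b0 and b1 are placeholders, not estimates fitted from real data:

```python
b0, b1 = 1.5, 0.8  # placeholder estimates, for illustration only

def predict(x_i):
    """Estimated Y for the ith observation: Yi = b0 + b1 * Xi."""
    return b0 + b1 * x_i

print(predict(10.0))  # 1.5 + 0.8 * 10 = 9.5
```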

**Assumptions for linear regression :-**

- X is non-random, i.e., it has no variance
- the error term is random, which also makes Y random. Hence, for the same value of X, two different observations may have different values of Y, because of the error term
- the error has mean 0, and its standard deviation is independent of X
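These assumptions can be illustrated with a quick simulation (the coefficients β0 = 2, β1 = 3 and the unit noise scale are arbitrary choices for the sketch):

```python
import random

random.seed(42)
b0, b1 = 2.0, 3.0  # arbitrary population coefficients for this sketch

# Same X value, two different observations: Y differs only because of the error term
x = 4.0
y1 = b0 + b1 * x + random.gauss(0, 1)
y2 = b0 + b1 * x + random.gauss(0, 1)
print(y1 != y2)  # almost surely True

# Over many draws the error term averages out to roughly zero
errors = [random.gauss(0, 1) for _ in range(100_000)]
print(abs(sum(errors) / len(errors)) < 0.05)
```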

**Fitting the regression equation (least-squares estimates), or finding the best-fit line :-**

###### Getting the estimates of β0 and β1 means finding the best straight line that can be drawn through the scatter plot of X vs Y.

In simple words, the least-squares method finds the estimates of β0 and β1 that minimize **SSE**. **SSE** is the sum of the squares of the errors, and the minimum value of **SSE** gives the best-fitted line. To understand **SSE** in a simpler manner: take every error point, square it, sum all the squares, and minimize that sum; the line that achieves this minimum is the line of best fit.
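A minimal sketch of the standard least-squares formulas, b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1·x̄, along with the resulting SSE, on a small made-up data set:

```python
# Made-up data, roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Least-squares estimates of slope and intercept
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

# SSE: square each error, then sum; least squares minimizes this quantity
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
print(b0, b1, sse)
```

Any other line through these points would produce a larger SSE; that is exactly what makes this line the best fit.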

**Hence, in layman's language, regression analysis helps in modelling and predictive analysis: it tells us what would happen to the value of Y if X were to go up or down by some amount.**