
Module-1 : EMC Data Science Certification

Module 2 : Data Science and Machine Learning 



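Linear Regression: Linear regression models a linear relationship between a dependent variable and one or more independent variables. When its parameters are estimated by least squares, it is also called Ordinary Least Squares (OLS) regression. The fitted model is also used to assess "goodness of fit", that is, how well the model explains the observed data.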


Y = b0 + b1x1 + b2x2 + ... + bnxn


In the linear model, the bi's represent the unknown parameters: the intercept b0 and one coefficient for each of the n independent variables. The estimates for these parameters are chosen so that, on average, the model provides a reasonable estimate of the dependent variable, for example a person's income based on age and education. In other words, the fitted model should minimize the overall error between the linear model and the actual observations. Ordinary Least Squares (OLS), a common technique for estimating the parameters, does this by minimizing the sum of squared differences between the observed and predicted values.
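The least squares idea above can be sketched in a few lines of plain Python. This is only an illustration, not a production method: the helper names solve_linear_system and fit_ols and the age, education and income figures are made up for the example, and the data is generated without noise so that the estimates can be compared with the values used to build it. The fit solves the normal equations (X'X)b = X'y, which is one standard way to obtain the OLS estimates.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [a | b] so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [b0, b1, ..., bn] minimizing the sum of squared errors."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 carries the intercept b0
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: (age, years of education) -> income in thousands.
# Built exactly from income = 5 + 0.5*age + 2*education, with no noise.
people = [(25, 12), (32, 16), (47, 12), (51, 18), (38, 14), (29, 20)]
income = [5 + 0.5 * age + 2.0 * edu for age, edu in people]

b0, b1, b2 = fit_ols(people, income)
print(b0, b1, b2)  # estimates of the intercept and the two coefficients
```

Because the made-up data contains no noise, the estimates should recover the generating values up to floating-point error. With real data the residuals are non-zero and the estimates are only the best linear fit.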


The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable. Conversely, a larger (insignificant) p-value means the data do not provide enough evidence that changes in the predictor are associated with changes in the response.

Significance of the estimated coefficients: Are the t-statistics greater than 2 in magnitude, corresponding to p-values less than 0.05? If they are not, you should probably try to refit the model with the least significant variable excluded, which is the "backward stepwise" approach to model refinement.


Remember that the t-statistic is just the estimated coefficient divided by its own standard error. Thus, it measures "how many standard errors from zero" the estimated coefficient is, and it is used to test whether the true value of the coefficient is non-zero, in order to confirm that the independent variable really belongs in the model.
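As a small worked illustration of "coefficient divided by standard error", the sketch below computes the t-statistic of the slope in a regression with a single independent variable, where the standard error has a simple closed form. The function name slope_t_statistic and the five data points are invented for the example.

```python
import math


def slope_t_statistic(x, y):
    """Return (b1, standard error of b1, t-statistic) for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # OLS slope estimate
    b0 = mean_y - b1 * mean_x      # OLS intercept estimate
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    se_b1 = math.sqrt(s2 / sxx)    # standard error of the slope
    return b1, se_b1, b1 / se_b1


# Hypothetical observations that lie close to the line y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b1, se_b1, t_stat = slope_t_statistic(x, y)
print(b1, se_b1, t_stat)
```

Here the slope is large relative to its standard error, so the t-statistic is far above 2 in magnitude. With several independent variables the same ratio is used, but each standard error comes from the full model rather than from this one-variable formula.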


The p-value is the probability of observing a t-statistic that large or larger in magnitude given the null hypothesis that the true coefficient value is zero. If the p-value is greater than 0.05 (which occurs roughly when the t-statistic is less than 2 in absolute value), the estimated coefficient may differ from zero only by chance, so there is not enough evidence that the variable has a real effect.
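To make the link between the t-statistic and the p-value concrete, here is a rough sketch that computes a two-sided p-value by numerically integrating the Student t density with Simpson's rule. The function names t_pdf and two_sided_p_value and the step count are choices made for this illustration. In practice a statistics library would normally supply this calculation.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|) under the null hypothesis; steps must be even."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    # Simpson's rule for the area under the density between 0 and |t_stat|.
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3
    # The density is symmetric, so both tails together are 1 - 2 * central.
    return max(0.0, 1.0 - 2.0 * central)


# The degrees of freedom are the number of observations minus the number
# of fitted parameters; 30 is an arbitrary value chosen for illustration.
for t_value in (1.0, 2.0, 3.0):
    print(t_value, two_sided_p_value(t_value, 30))
```

With 30 degrees of freedom the p-value crosses 0.05 at a t-statistic slightly above 2, which is why "greater than 2 in magnitude" is only a rule of thumb. The exact threshold depends on the degrees of freedom.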


There's nothing magical about the 0.05 criterion, but in practice it usually turns out that a variable whose estimated coefficient has a p-value greater than 0.05 can be dropped from the model without affecting the error measures very much. Try it and see.
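One way to "try it and see" is to fit the model with and without a variable and compare an error measure. The sketch below does this for made-up data in which the response depends on x1 only and x2 is an unrelated column. The helper names, the data and the use of root mean squared error (RMSE) on the fitting data are all assumptions of this example.

```python
import math


def mean(values):
    return sum(values) / len(values)


def rmse(actual, predicted):
    """Root mean squared error between observations and predictions."""
    return math.sqrt(mean([(a - p) ** 2 for a, p in zip(actual, predicted)]))


def fit_one(x, y):
    """OLS fit of y = b0 + b1*x."""
    mx, my = mean(x), mean(y)
    b1 = (sum((a - mx) * (c - my) for a, c in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    return my - b1 * mx, b1


def fit_two(x1, x2, y):
    """OLS fit of y = b0 + b1*x1 + b2*x2 via the centered normal equations."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2     # zero only if x1 and x2 are collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2


# Hypothetical data: y depends on x1 only; x2 is an unrelated variable.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [3, 1, 4, 1, 5, 9, 2, 6]
noise = [0.1, -0.2, 0.15, -0.05, 0.05, -0.1, 0.2, -0.15]
y = [1 + 2 * a + e for a, e in zip(x1, noise)]

b0, b1, b2 = fit_two(x1, x2, y)
rmse_full = rmse(y, [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)])

c0, c1 = fit_one(x1, y)
rmse_reduced = rmse(y, [c0 + c1 * a for a in x1])

print(rmse_full, rmse_reduced)
```

On the fitting data the larger model can never have the higher error, so the useful question is how small the gap is. When dropping a variable barely changes the error, as is expected for an unrelated column like x2 here, the simpler model is usually preferable.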

