Regression Models Using Numerous Variables

April 16, 2017 admin

Analysis, Models, SAS Programming, Statistics

Assessing Regression Models Using Numerous Variables

This analysis of the Ames, Iowa housing data set builds regression models for house sale price using numerous variables. These range from highly correlated, continuous variables at one end of the continuum to categorical, weakly correlated variables at the other. Each model is assessed, and the statistical and ODS output is reviewed and interpreted. Additionally, linear regression models based on two and three variables are fitted, along with an evaluation of how the added variables contribute to predicting the sale price of a house. Finally, an assessment of whether the model is appropriately specified and the next steps are provided.

Part A – Step 1

The variable I chose was one created in the first installment: Total Floor Square Footage (TotalFlrSF).

Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > |t|
Intercept	1	11406	3242.59761	3.52	0.0004
TotalFlrSF	1	113.30303	2.05569	55.12	<.0001

Equation

The general form of the simple linear regression model is y = β₀ + β₁x + ε

Based on the above parameter estimates, the equation is:

Sale Price = $11,406 + $113.30303 x TotalFlrSF

Thus, for each one-unit (square foot) increase in TotalFlrSF, the predicted sale price increases by $113.30. This assumes all values are greater than zero, but even at zero square feet the equation gives a SalePrice of $11,406. As we are evaluating houses, this is not logical on its face, but it could apply if a house was not livable and was considered a ‘tear down’ where someone would spend the time and money required to build a new house. However, this would be outside the norm and would require a different equation to determine these types of sale prices.
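As a quick illustration of how the fitted equation is used, the following Python sketch plugs a square footage into the Step 1 estimates. The function name `predict_sale_price` and the example house sizes are assumptions made for illustration only; the original analysis was run in SAS.

```python
# Coefficients copied from the Parameter Estimates table above.
INTERCEPT = 11406.0
SLOPE_TOTAL_FLR_SF = 113.30303


def predict_sale_price(total_flr_sf: float) -> float:
    """Predicted sale price in dollars for a given total floor square footage."""
    return INTERCEPT + SLOPE_TOTAL_FLR_SF * total_flr_sf


if __name__ == "__main__":
    # Hypothetical house sizes, chosen only to show the arithmetic.
    for sq_ft in (0, 1500, 1501):
        print(sq_ft, round(predict_sale_price(sq_ft), 2))
```

The difference between the last two predictions is the slope, which is the "per unit" reading of the coefficient described above.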

Model Adequacy

The automatic generated ODS output:

regression models adequacy

The ODS output produced by SAS shows a cluster of points in the residual versus predicted value plot, which indicates an issue, as the residuals should be scattered randomly without any pattern. The Q-Q plot, titled “Quantile”, deviates from the reference line and has heavy tails, which means extreme residuals occur more often than a normal distribution would predict. The Predicted Value plot also shows issues, as the points do not fall on the line.
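To make the Q-Q plot reading concrete, here is a minimal Python sketch of how the plotted points can be built from a set of residuals. The function `qq_points`, the plotting-position convention, and the demo residuals are all assumptions for illustration; SAS may use a slightly different convention for its ODS panel.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    n = len(residuals)
    center, spread = mean(residuals), stdev(residuals)
    # Standardize and sort the residuals to get the sample quantiles.
    sample = sorted((r - center) / spread for r in residuals)
    # Plotting positions (i - 0.5) / n are one common convention.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))


if __name__ == "__main__":
    # Hypothetical residuals with two extreme values to mimic heavy tails.
    demo = [-9.0, -1.2, -0.8, -0.3, -0.1, 0.0, 0.2, 0.4, 0.9, 1.1, 8.5]
    for theo, samp in qq_points(demo):
        print(f"{theo:6.2f} {samp:6.2f}")
```

With heavy tails, the smallest sample quantile falls below its theoretical value and the largest falls above it, which is the bend away from the reference line at both ends.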

Assess the Goodness-of-Fit of this model

Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	1	9.518383E12	9.518383E12	3037.86	<.0001
Error	2928	9.174155E12	3133249517
Corrected Total	2929	1.869254E13

Root MSE	55975	R-Square	0.5092
Dependent Mean	180796	Adj R-Sq	0.5090
Coeff Var	30.96054

Regression Models – TotalSqFt Fit Plot

The p-value for the F test (< .0001) is small, which means we can reject the null hypothesis that the slope is equal to zero. The t tests in the parameter estimates table likewise show that the intercept and TotalFlrSF are each significant. Additionally, the statistical summaries show an R-Square of .5092 and an Adjusted R-Square of .5090. R-Square represents the proportion of variability in the dependent variable that is explained by the regression model, so this single variable accounts for roughly half of the variability in sale price.
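The summary statistics can be recomputed from the Analysis of Variance table, which is a useful check on how they relate to one another. The sketch below is in Python rather than SAS, and the function name `fit_statistics` is an assumption for illustration; the inputs are the rounded values printed in the table, so the results agree only to the displayed precision.

```python
import math

# Values copied from the Analysis of Variance table above.
SS_MODEL = 9.518383e12
SS_ERROR = 9.174155e12
SS_TOTAL = 1.869254e13
DF_MODEL, DF_ERROR, DF_TOTAL = 1, 2928, 2929
DEPENDENT_MEAN = 180796.0


def fit_statistics():
    """Recompute the summary statistics from the sums of squares."""
    ms_model = SS_MODEL / DF_MODEL
    ms_error = SS_ERROR / DF_ERROR
    root_mse = math.sqrt(ms_error)
    return {
        "f_value": ms_model / ms_error,
        "r_square": SS_MODEL / SS_TOTAL,
        "adj_r_square": 1 - ms_error / (SS_TOTAL / DF_TOTAL),
        "root_mse": root_mse,
        "coeff_var": 100 * root_mse / DEPENDENT_MEAN,
    }


if __name__ == "__main__":
    for name, value in fit_statistics().items():
        print(f"{name}: {value:.4f}")
```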

Fitted regression model over the scatter plot: The Fit Plot shows all of the data along with the 95% confidence and prediction limits. We can clearly see how much of the data falls outside the confidence limits.
Assessment of the normality of the residuals using a Q-Q plot and/or histogram of the standardized residuals: The residual histogram is roughly bell shaped, showing a narrow spread in the residuals with a high peak. The Q-Q plot helps detect departures from normality: if the residuals are normal, the points will cluster tightly around the reference line. Here the points deviate from the line and show heavy tails, which means extreme residuals are more likely than under a normal distribution. Thus, the normality assumption does not hold for this model.
Assessment of homoscedasticity by plotting the predictor variable against the standardized residuals: Homoscedasticity, also referred to as the assumption of constant error variance, means the spread of the residuals should be about the same across the range of the predictor. A plot that flares out as x increases indicates that the variance is not constant. The residuals here fan out as TotalFlrSF increases, and the points on the Predicted Value versus Sale Price plot show a definite pattern around the 45-degree line rather than scattering evenly about it. Thus, based on the above, this model does not appear to be homoscedastic.
Check for potential outliers using Cook’s Distance: Cook’s Distance measures the influence of each data point on the fitted values, with a lowest possible value of zero and larger values indicating more influence. As we can see, only two points exceed 0.2, and both are less than one, the commonly used limit. Thus, these two points could be outliers that influence the results, and they should be investigated. While the plot does not list the actual Cook’s D scores, the values show how much the vector of fitted values moves when the observation is deleted.
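For readers who want to see what sits behind the Cook’s D panel, the following is a minimal Python sketch for a simple linear regression. The function `cooks_distance` and the demo data are assumptions for illustration, not the SAS computation used in the original analysis; the formula combines each residual with the leverage of its x value.

```python
def cooks_distance(x, y):
    """Cook's D for each observation in a simple linear regression of y on x."""
    n, p = len(x), 2  # p counts the intercept and the slope
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each point: how far its x value sits from the mean of x.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]


if __name__ == "__main__":
    # Hypothetical data: nine points on a line plus one distant, off-line point.
    xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
    ys = [3, 5, 7, 9, 11, 13, 15, 17, 19, 10]
    for xi, d in zip(xs, cooks_distance(xs, ys)):
        print(xi, round(d, 3))
```

In this made-up example only the last point has a Cook’s D above one, which is the kind of observation the text suggests investigating.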

Model Adequacy Conclusions

The model using TotalFlrSF has normality issues based on the above analysis. The ODS plots show clusters, a heavy-tailed Q-Q plot, and Predicted versus Sale Price points that are not on the line. Additionally, the Residual plot does not appear to be completely random, and the Fit Plot has more of a sideways funnel pattern than the even scatter we would expect. However, the p-value is < .0001, which means the variable is significant and we can reject the null hypothesis. Additionally, the Adjusted R-Square is .509, meaning about half of the variability in sale price is explained. The matrix of graphs and plots provides a high-level overview of the relationship between total square footage in a house and sale price.

Part A – Step 2

The Best X to Predict Y in the Regression Model

The ‘best’ simple linear regression model to predict Sales Price using the R-square option reflects the following output:

Variables in Model	R-Square	Adjusted R-Square	C(p)
TotalFlrSF	0.5118	0.5116	3161.986
GrLivArea	0.5006	0.5004	3289.886
GarageArea	0.4235	0.4233	4169.764
TotalBsmtSF	0.421	0.4207	4199.299
FirstFlrSF	0.4115	0.4112	4307.718

As it happens, Total Floor Square Footage, which was discussed in Part A Step 1, has the highest R-Square and Adjusted R-Square values. So as not to repeat the same outputs, I’ll use GrLivArea for the rest of this task.

Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > |t|
Intercept	1	13290	3269.70277	4.06	<.0001
GrLivArea	1	111.69400	2.06607	54.06	<.0001

Equation & Interpret each coefficient:

The equation for GrLivArea is SalePrice = $13,290 + $111.694 x GrLivArea, which reflects that for each unit increase in GrLivArea the Sale Price increases by $111.69. This seems reasonable and logical.

R² measures the proportion of the variability in y that is accounted for once x has been considered and is often called the proportion of variation explained by the regressor x (Montgomery, D. Introduction to Linear Regression Analysis. p. 36). So, the variables with R² closest to one explain the most variability in y.
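For reference, the two statistics reported in the table can be written in terms of the sums of squares from the ANOVA output. The notation below is an assumption for illustration: n is the number of observations and p is the number of estimated parameters, including the intercept.

```latex
R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}}
    = 1 - \frac{SS_{\text{Error}}}{SS_{\text{Total}}},
\qquad
R^2_{\text{adj}} = 1 - \frac{SS_{\text{Error}} / (n - p)}{SS_{\text{Total}} / (n - 1)}
```

The adjusted version divides each sum of squares by its degrees of freedom, which is why it is always slightly below R² when the model has at least one predictor.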

Variables in Model	R-Square	Adjusted R-Square	C(p)	Interpretation	Overlap?
TotalFlrSF	0.5118	0.5116	3161.986	Makes logical sense in terms in explaining the variability in sales price	Overlap
GrLivArea	0.5006	0.5004	3289.886	Logical but overlaps with TotalFlrSF	Overlap
GarageArea	0.4235	0.4233	4169.764	Not logical on its own as a house needs to go with a garage
TotalBsmtSF	0.421	0.4207	4199.299	Logical but overlaps with TotalFlrSF and GrLivArea	Overlap
FirstFlrSF	0.4115	0.4112	4307.718	Would have thought this to be stronger than TotalBsmtSF; Overlaps with TotalFlrSF, GrLivArea, TotalBsmtSF	Overlap
TotalBath	0.3868	0.3866	4589.011	Logical as it plays an important part in the sales price of a house	Overlap
TotalFullBath	0.359	0.3588	4906.859	Logical and expected that it would have a lower R-Square than total number of bathrooms	Overlap
HouseAge	0.3213	0.321	5338.312	Logical in that newer houses would likely have a higher sales price	Overlap
YearBuilt	0.3209	0.3206	5342.262	Expected to be very similar to HouseAge which it is so no surprise	Overlap
YearRemodel	0.2894	0.2891	5702.33	Logical in terms of explaining the variability in sales price; somewhat associated with houseage and yrbuilt
MasVnrArea	0.2843	0.284	5760.722	Not logical as I would have thought it to have lesser impact on the sales price than some of the variables with a lesser R
TotRmsAbvGrd	0.2523	0.252	6126.552	Logical in terms of explaining the variability in sales price	Overlap
BsmtFinSF1	0.1966	0.1963	6762.719	Logical as I think people are more concerned about the bsmt sq ft over being a finished area or not	Overlap
LotFrontage	0.127	0.1266	7557.59	Logical as the sales price and lotfrontage would be less likely to explain the variability
WoodDeckSF	0.117	0.1166	7671.945	Logical as it ranks higher than the amount of porch space but surprised that it is ranked higher than LotArea
LotArea	0.1027	0.1023	7835.853	Not logical as I would have thought the LotArea would be more directly related to sales price
SecondFlrSF	0.0636	0.0633	8281.516	Logical as likely to be similar to the first floor and is related to TotalFlrSF	Overlap
TotalHalfBath	0.0544	0.054	8387.015	Logical as sales price is likely to be more affected by totalbath and fullbaths	Overlap
BsmtUnfSF	0.0386	0.0382	8567.744	Logical as I think people are more concerned about the bsmt sq ft over being a finished area or not	Overlap
TotalPorchSF	0.0293	0.0289	8673.387	Logical as it is of little importance to the sales price
BedroomAbvGr	0.0188	0.0184	8793.586	Logical and has less impact than TotalRmsAbvGrd which makes sense	Overlap
PoolArea	0.0048	0.0044	8953.456	Logical as would expect it to have little impact on sales value
LowQualFinSF	0.0013	0.0009	8994.168	Logical based on what I would expect people to report on this variable and other variable that may be related i.e. sale condition
MoSold	0.0007	0.0003	9000.444	Logical – when a house is sold should have minimal impact on the sales price	Overlap
YrSold	0.0006	0.0001	9002.185	Logical – when a house is sold should have minimal impact on the sales price	Overlap
MiscVal	0.0003	-0.0001	9005.503	Logical given the few items and their value to the overall sales price
BsmtFinSF2	0	-0.0004	9008.506	Logical given the dataset and the potential duplication and confusion on this variable	Overlap

In what sense is the model the ‘best’ model:

GrLivArea is the ‘best’ model in the sense that it relates logically to the sale price of a house: people plan to live in the house, so they are more likely to spend additional money for a larger living space. It has an Adjusted R-Square of .5004, the highest after TotalFlrSF. The p-value is small, so we can reject the null hypothesis.

Anything funny about it from an interpretation standpoint?

The equation is ‘funny’ if you consider a house with zero GrLivArea, which would result in a sale price of $13,290. While one could perhaps justify it based on the price of the land alone, it is illogical for a house. The increase in value of ~$111.70 for a one-unit change, on the other hand, is logical. Overall, we might expect the model to perform well given its Adjusted R-Square, yet we still have the same issues and concerns based on the plots.

Goodness-of-Fit

Goodness-of-fit statistics on the GrLivArea are as follows:

Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	1	9.33763E12	9.33763E12	2922.59	<.0001
Error	2928	9.354907E12	3194981962
Corrected Total	2929	1.869254E13

Root MSE	56524	R-Square	0.4995
Dependent Mean	180796	Adj R-Sq	0.4994
Coeff Var	31.26405

The F Value of 2,922.59 is significantly greater than one, and its p-value is small, reflecting that the variable is significant. Based on the above, there is a linear relationship between SalePrice and GrLivArea. The Adjusted R-Square of .4994 reflects the proportion of variability in SalePrice that is accounted for by GrLivArea. While there is only a small difference between R-Square and Adjusted R-Square here, Adjusted R-Square does penalize us for adding variables: it will increase only if the added variable reduces the residual mean square. Because the p-value is small, we can reject the null hypothesis.
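The way Adjusted R-Square reacts to extra variables can be sketched with the sums of squares from the table above. This is a Python illustration, not the SAS output; the function `adjusted_r_square` is an assumed name, and the second call describes a purely hypothetical extra predictor that leaves the error sum of squares unchanged.

```python
# Values copied from the GrLivArea Analysis of Variance table above.
SS_ERROR = 9.354907e12
SS_TOTAL = 1.869254e13
N_OBS = 2930  # Corrected Total DF (2929) plus one


def adjusted_r_square(ss_error, ss_total, n_obs, n_predictors):
    """Adjusted R-Square for a model with an intercept and n_predictors variables."""
    df_error = n_obs - n_predictors - 1
    df_total = n_obs - 1
    return 1 - (ss_error / df_error) / (ss_total / df_total)


if __name__ == "__main__":
    one_var = adjusted_r_square(SS_ERROR, SS_TOTAL, N_OBS, 1)
    # Hypothetical: a second predictor that explains nothing new.
    useless_extra = adjusted_r_square(SS_ERROR, SS_TOTAL, N_OBS, 2)
    print(round(one_var, 4), round(useless_extra, 4))
```

Because the error sum of squares is divided by fewer degrees of freedom in the second case, the adjusted value drops even though the fit is no worse, which is the penalty at work.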

The ODS output is as follows:

Regression Models – SaS ODS output

The Predicted Value versus Residual plot has a definite pattern and seems to fan out, showing that the variance is increasing. The ODS Quantile or Q-Q plot indicates that the normality assumption does not hold, as the points deviate from the line (see red area) and the distribution is heavy tailed. Similar to the Cook’s D plot for TotalFlrSF, we have three influential points, two of which exceed 0.2. It would be prudent to look into these three points to ensure they are valid. Additionally, the Residuals for Sale Price are not random as we would expect and again show a bit of a pattern. The Fit Plot also has a distinct pattern to it. Thus, we have normality issues.

Regression Models – FitPlot GrLivArea

Regression Models – Residuals GrLivArea

The residuals for SalePrice against GrLivArea are again similar to those for TotalFlrSF: they are not evenly spread across the plot but flare out a bit as GrLivArea increases. Thus, this model does not appear to be homoscedastic.

Comments on Adequacy

GrLivArea has the second-highest Adjusted R-Square value of all the continuous variables, accounting for .4994 of the variability and leaving another .5006 unexplained. Perhaps one of the categorical variables will assist in reducing the unexplained portion. Again, we have conflicting information between the Adjusted R-Square and the ODS plots: GrLivArea has a small p-value, but we have normality issues based on the ODS plots.

The Best X to Predict Y Conclusions

The GrLivArea variable has a small p-value, which means that the variable is significant. Additionally, it has a reasonable Adjusted R-Square value; however, the ODS outputs show that we have normality issues. There are patterns in the Predicted Value versus Residual plot where there shouldn’t be any, the Q-Q plot deviates from the line, and the Predicted Value versus Sale Price points are not on the line, all of which show that the model is not ideal. The residual plot also has a pattern, and the fit plot fans out. Thus, the residuals of the GrLivArea model are not normal.

Part A – Step 3

Categorical value in a Regression Model:

Based on week one, Exterior1 is a categorical variable with the following histogram:

Regression Models- Exterior Frequency

Equation

Evaluating a few of the variables we get the following result:

Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > |t|	Variance Inflation
Intercept	1	150475	4457.27517	33.76	<.0001	0
Exterior1Category	1	2575.12360	357.88376	7.20	<.0001	1.00000

Root MSE	79203	R-Square	0.0174
Dependent Mean	180796	Adj R-Sq	0.0170
Coeff Var	43.81456

SalePrice = $150,475 + $2,575.12 x Exterior1Category

Interpret Coefficient

Siding is only one of sixteen categories within the Exterior1 variable. Taken literally, a one-unit change in Exterior1Category results in an increase in value of $2,575. Since a house cannot have more than one siding, this is illogical as a per-unit effect; it simply means that a house with siding, on average, has an increased sales price of $2,575. The p-value is also less than .0001, which means we can reject the null hypothesis and conclude that there is a linear relationship between the Exterior1 category and SalePrice.

Looking at the R-Square, we see that Exterior1 explains only 1.74% of the variability in SalePrice (1.70% by Adjusted R-Square), which is low.

Anything funny about the coefficient interpretation

I can’t see anything funny about the coefficient other than its low value in explaining variability. The variance inflation is one, as expected for a model with a single predictor.

Generate ODS output and assesses the Goodness-of-fit

Regression Models – Exterior ODS

Reviewing the ODS plots, we don’t seem to have a pattern in the Predicted Value versus Residual plot, unlike in the GrLivArea ODS plot. However, we do have another heavy-tailed Q-Q plot that deviates from the line and a cluster in the predicted values. Cook’s D has many more spikes, reflecting that more observations are influencing the results, although all are still less than one on the axis scale. Additionally, we see skewed residuals.

Model Adequacy

Regression Models – FitPlot Exterior

Regression Models – Residuals Exterior

Reviewing the plots and various statistics, the residual plot can’t fan out due to the type of variable, and there does not appear to be any linear relationship across the different Exterior1 categories. The fit plot shows a slight positive correlation with a bit of a ‘u’ shape in the data. It appears to be bimodal, which is typical of a categorical variable. Overall, the residuals are not normally distributed and Exterior1 is not a good predictor of SalePrice, as reflected in the various plots, statistics and reasons provided.

Does the predicted model go through the mean value of Y for each category group?

The Fit Plot line above does appear to pass through the mean value of each category, and the plot shows the distribution of values for each category. As mentioned above, the confidence limits and the slight slope of the line reflect a slight correlation. The largest amount of variability in SalePrice is in category 16 of Exterior1Category, which is wood siding. Having the line go through the mean doesn’t really provide any value, as the categories are simply listed in order of their assigned code.

Is this good or bad / why or why not?

I don’t think having the line adds any value and in fact, causes confusion and possible incorrect conclusions that there is some type of linear relationship when in fact there isn’t one. Additionally, it doesn’t provide any insight on which categories perform better than others or the volume of data in each.
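One alternative to fitting a line through numeric category codes is to look at the mean sale price within each category, which is also what a regression with one indicator (dummy) variable per category would predict. The Python sketch below is only an illustration of that idea: the function `category_means`, the category labels and the prices are hypothetical and are not taken from the Ames data.

```python
from collections import defaultdict


def category_means(categories, prices):
    """Mean sale price per category (the fitted values under dummy coding)."""
    totals = defaultdict(lambda: [0.0, 0])
    for category, price in zip(categories, prices):
        totals[category][0] += price
        totals[category][1] += 1
    return {category: total / count for category, (total, count) in totals.items()}


if __name__ == "__main__":
    # Hypothetical exterior categories and sale prices, for illustration only.
    exterior = ["VinylSiding", "VinylSiding", "WoodSiding", "WoodSiding", "Brick"]
    sale_price = [200000, 220000, 140000, 150000, 190000]
    for name, avg in category_means(exterior, sale_price).items():
        print(name, avg)
```

Unlike the single slope, these means do not depend on the order in which the categories happen to be numbered, and they show directly which categories sell for more.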

Categorical Value Conclusion

While the Exterior1 category analysis is interesting to conduct, its ability to predict SalePrice is low, with an Adjusted R-Square of 0.0170. Additionally, the plots and charts provided insight into the Cook’s D observations, with a few influential peaks, and a heavy-tailed Q-Q plot. No surprise, there does not appear to be a meaningful linear relationship between Exterior1 and SalePrice. Finally, the Fit Plot would not be my preferred method of assessing categorical data, as it adds very little value and can be misleading.

Part A – Task 4 Analysis

Of the three models described above, the best model is the one using TotalFlrSF, as it has the highest Adjusted R-Square at .509 and, of the three, its errors appear the most independent of the predictor and thus closest to homoscedastic. Cook’s D has only a few data points that have a notable influence. These points, along with the outliers, need some further research, but I do not expect them to change the outcomes of the above analysis. Additionally, the F value is significantly greater than one. Again, we have conflicting information between the plots and the R-Square values, which needs to be further investigated.

Part B – Task 5 Regression Model Using Two Variables

Variables:

The two variables are TotalFlrSF and GrLivArea.

Equation:

Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > |t|	Variance Inflation
Intercept	1	11688	3238.62266	3.61	0.0003	0
TotalFlrSF	1	185.03078	22.40381	8.26	<.0001	119.15477
GrLivArea	1	-71.69170	22.29838	-3.22	0.0013	119.15477

Root MSE	55886	R-Square	0.5109
Dependent Mean	180796	Adj R-Sq	0.5106
Coeff Var	30.91129

Sale Price = $11,688 + $185.03078 x TotalFlrSF – $71.69170 x GrLivArea

For each unit increase in TotalFlrSF the SalePrice increases by $185.03, while each unit increase in GrLivArea decreases SalePrice by $71.69, in each case holding the other variable constant. The above equation differs from the simple linear regression models in the negative value each unit of GrLivArea has on SalePrice. This is not intuitive, but given the unusual choice of variables and their obvious overlap, it makes logical sense. Additionally, the Adjusted R-Square is .5106, again reflecting the amount of variability that is accounted for in using these two variables. However, this Adjusted R-Square has only improved slightly from the simple linear regression model using TotalFlrSF, which had an Adjusted R-Square of .5090. Again, this is not unexpected due to the overlap in the variables.
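One way to see why the negative GrLivArea coefficient still makes sense is to add the two slopes together. If the two area measures rise together by one square foot, which is an assumption that seems plausible here given how heavily they overlap, the net effect lands very close to the slope from the one-variable model. The Python sketch below only does that arithmetic; the function name is an assumption for illustration.

```python
# Coefficients copied from the Parameter Estimates tables above.
SIMPLE_SLOPE_TOTAL_FLR_SF = 113.30303   # TotalFlrSF alone (Step 1)
TWO_VAR_SLOPE_TOTAL_FLR_SF = 185.03078  # TotalFlrSF with GrLivArea in the model
TWO_VAR_SLOPE_GR_LIV_AREA = -71.69170   # GrLivArea with TotalFlrSF in the model


def combined_effect_per_sq_ft():
    """Net change in predicted price when both areas grow by one square foot."""
    return TWO_VAR_SLOPE_TOTAL_FLR_SF + TWO_VAR_SLOPE_GR_LIV_AREA


if __name__ == "__main__":
    print(round(combined_effect_per_sq_ft(), 2), SIMPLE_SLOPE_TOTAL_FLR_SF)
```

The two-variable model is therefore splitting essentially the same per-square-foot effect between two nearly interchangeable predictors, which is also why the individual coefficients are unstable.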

By adding one unit of TotalFlrSF with all other factors remaining constant, we can see how the SalePrice changes. However, one unit of TotalFlrSF and one unit of GrLivArea are not the same thing, and we need to ensure that the two are not confused.

Goodness-of-Fit

Regression Models – ODS Multiple Variables

Regression Models – ODS Multiple Variables – TotalFlrArea and GrLivArea

Based on the above, we have small p-values reflecting that each variable is significant and thus, we can reject the null hypothesis. However, unlike the prior models, we have a Variance Inflation value of 119.15477, which is far outside the acceptable range of 1 to 5. Thus, we have multicollinearity issues. We can see a lot of similarities with the TotalFlrSF model: the fanning out pattern in the Predicted Value versus Residual plot, a Q-Q plot that again deviates from the line, and a Cook’s D plot with only two influential points exceeding 0.2, both of which are less than one. The Adjusted R-Square is the largest of all the models analyzed so far, with a value of 0.5106. Homoscedasticity requires the spread of the errors to stay constant rather than flare out as x increases. Again, there is a clear clustered pattern here. Above, we can compare the residual plots for TotalFlrSF and GrLivArea which, for the most part, appear to be identical.
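The size of the Variance Inflation value can be translated into a correlation. The variance inflation factor for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors; with exactly two predictors, that R² is the squared correlation between them. The Python sketch below applies this to the value in the table; the function names are assumptions for illustration.

```python
import math


def vif_from_r_square(r_square):
    """Variance inflation factor given the R-Square of a predictor on the others."""
    return 1.0 / (1.0 - r_square)


def implied_correlation(vif):
    """With exactly two predictors, the absolute correlation implied by a VIF."""
    return math.sqrt(1.0 - 1.0 / vif)


if __name__ == "__main__":
    # VIF copied from the two-variable Parameter Estimates table above.
    print(round(implied_correlation(119.15477), 4))
    # The upper end of the 1 to 5 range corresponds to an R-Square of 0.8.
    print(round(vif_from_r_square(0.8), 2))
```

A VIF near 119 therefore corresponds to TotalFlrSF and GrLivArea being almost perfectly correlated, which fits the observation that their plots look nearly identical.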

Better Fit

As mentioned earlier, we have only accounted for .5106 of the variability leaving another .4894 unexplained. While this has increased from the prior models it is only a small increase. Thus, a significant amount of additional work is required to reduce this value.

Regression Model Using Two Variables Conclusion

Combining the two continuous explanatory variables TotalFlrSF and GrLivArea provided similar results to the simple linear regression model using TotalFlrSF. This is not surprising, as GrLivArea is encapsulated in TotalFlrSF; as such, the equation has a negative component and only a small increase in the Adjusted R-Square value. The p-values reflect that each variable is significant and thus, we can reject the null hypothesis. However, the variance inflation exceeds our normal range, reflecting that we have multicollinearity issues, which isn’t a surprise. Evaluating the plots for the two-variable model is again very similar to the analysis of TotalFlrSF, with a fanning out pattern in the Predicted Value versus Residual plot. Thus, we again have normality issues.

Part B – Step 6 Using Three Variables

Variables:

The three variables are TotalFlrSF, GrLivArea and MiscVal.

Equation:

Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > |t|	Variance Inflation
Intercept	1	11105	3227.36083	3.44	0.0006	0
TotalFlrSF	1	186.43890	22.31325	8.36	<.0001	119.17355
GrLivArea	1	-72.39791	22.20695	-3.26	0.0011	119.15954
MiscVal	1	-9.14957	1.82008	-5.03	<.0001	1.00470

Root MSE	55886	R-Square	0.5151
Dependent Mean	180796	Adj R-Sq	0.5146
Coeff Var	30.78393

Sales Price = $11,105 + $186.4389 x TotalFlrSF – $72.39791 x GrLivArea – $9.14957 x MiscVal

For each unit increase in TotalFlrSF the SalePrice increases by $186.44, a small increase of $1.41 over the equation in part 5. Similarly, each unit of GrLivArea decreases SalePrice by $72.40 (a further decline from $71.69 in part 5), and now there is another decrease of $9.15 per unit of MiscVal. The decrease in SalePrice with MiscVal is not as intuitive as the effects of TotalFlrSF and GrLivArea; however, due to the low volume and value of the MiscVal data points, it is understandable that the variable would have little impact on SalePrice.

Regression Models – ODS Multiple Variables TotalFlrSF, GrLivArea and MiscVal.

Goodness-of-Fit

We can see a lot of similarities with part 5: the p-values are small, reflecting that each variable is significant, and thus we reject the null hypothesis. The Variance Inflation is 119.17355 for TotalFlrSF and 119.15954 for GrLivArea, both of which exceed the upper boundary of our 1-5 range, while MiscVal is close to the bottom of our acceptable range with a value of 1.00470. We again see the Predicted Value versus Residual plot fanning out, which reflects that the variance is increasing. The axis scale of the Cook’s D plot has changed, but it still has only a few influential points. The Q-Q plot again deviates from the line and is heavy tailed. The Predicted Value versus Sale Price plot shows a cluster of values. Thus, we still have the same normality issues.

Adjusted R-Square increased from .5106 to .5146, again reflecting the amount of variability that is accounted for in using these three variables. However, this is a very small increase of 0.0040 over the Adjusted R-Square in part 5, and it represents the full value of adding the MiscVal variable. The Adjusted R-Square has only improved slightly over both the simple linear regression model using TotalFlrSF and the two-variable model using TotalFlrSF and GrLivArea.

Regression Models – Residuals Multiple Variables TotalFlrSF, GrLivArea and MiscVal.

The Residual plots for each variable provide more of a view than in part 5 of how the residuals change across each variable, but they still do not appear to be random, with distinct patterns for each variable. However, we can see that the MiscVal residuals do not show any easy-to-see linear relationship, and MiscVal has only a few values. Thus, we still have normality issues.

Changes & Better Fit

Overall, the additional variable, MiscVal, has not made any real impact in predicting SalePrice. This is no surprise based on its low correlation value. While our Adjusted R-Square did increase, the increase of .0040 was small, showing us that adding variables does not necessarily improve the results or assist us in creating a better fitting equation and explaining the variability. While Adjusted R-Square takes on increased importance as a criterion for comparing the models, primarily because we do not want to make incorrect decisions by simply adding variables, the plots and charts are helpful in understanding the relationships and how they impact the model.

Using Three Variables Conclusion

Adding additional variables does not necessarily improve results. By adding a weakly correlated variable, MiscVal, the other variables within the equation received minor adjustments. Each of the variables has a small p-value, so we can reject the null hypothesis. Similar to the prior analysis, there are not a lot of changes, with the exception being the variance inflation, where MiscVal is close to one while the other two variables remain near 119. The review of the plots reflects similar patterns, with the exception of the residual plot, where MiscVal has a different pattern but still not the randomness we hope for. However, the analysis remained relatively the same as with only two variables in part 5. Additionally, adding the MiscVal variable had a minor effect in explaining the variability and did not have any significant impact on SalePrice.

Conclusion

By observing the various models and the effects of adding and reviewing various variables, we can easily grasp the changes through the various plots, equations and summary statistics. Categorical variables have a much different look and resulting plots than continuous variables. Additionally, evaluating the Adjusted R-Square and F-Value provides insight into how the model is performing even when the impact is difficult to ascertain through the plots.

The model may not be appropriately specified, as all variables are treated as equally important, which may not be the case. Additionally, we could be adding new variables without considering their association with other variables, as shown between TotalFlrSF and GrLivArea: removing GrLivArea would have a significant effect on the coefficient for TotalFlrSF.

Next steps in the modeling process would be to investigate the influential points flagged by Cook’s D in several parts above, and to determine whether the Adjusted R-Square can be improved by evaluating different variables, so as to account for more of the variability, as none of the models has exceeded 51.46%. Finally, transforming the data so that the conflicts between the R-Square values and the ODS outputs are reduced would be really helpful in creating a useful model we are confident in.
