Variable Transformations: Continuous & Categorical
The Ames, Iowa housing data set has been used to develop various regression models that estimate the sale price of a house from numerous variables, ranging from highly correlated continuous variables to categorical variables with low correlations. In this assessment, variable transformations and comparisons of Y versus log(Y) are conducted, along with a review of the statistical and ODS outputs used to analyze and interpret the results. Additionally, outliers are identified and a determination is made on whether they should be kept or removed, along with the rationale behind the process. Finally, the ways variable transformations and outlier deletion impact the modeling process and the results are discussed, along with potential next steps.
Part A.1: Appending New Variables
As D. Montgomery explains in Introduction to Linear Regression Analysis (2012), there are a few major assumptions in regression analysis, including:
• The relationship between the response y and the regressors is linear, at least approximately
• The error term has zero mean
• The error term has constant variance
• The errors are uncorrelated
• The errors are normally distributed
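These assumptions can be checked numerically once a model is fit. A minimal Python sketch on simulated data (the report itself uses SAS; the coefficients and noise level below are illustrative, not estimates from the Ames data):

```python
import numpy as np

# Simulate a house-price relationship that satisfies the assumptions by
# construction: linear mean, independent normal errors with constant variance.
rng = np.random.default_rng(0)
x = rng.uniform(500, 4000, 200)                    # hypothetical floor areas
y = 11406 + 113.3 * x + rng.normal(0, 5000, 200)   # linear truth plus noise

slope, intercept = np.polyfit(x, y, 1)             # ordinary least squares
residuals = y - (intercept + slope * x)

# With an intercept in the model, OLS forces the residuals to have zero mean.
mean_resid = residuals.mean()
```

Constant variance and normality would then be assessed from a residual-versus-fitted plot and a Q-Q plot, exactly the ODS panels discussed throughout this report.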
A small subset of the PROC PRINT output showing the new transformed variables log_SalePrice and log_TotalFlrSF is shown below:
Obs   TotalFullBath   TotalHalfBath   TotalBath   log_SalePrice   log_TotalFlrSF
  1               3               2           5         13.5345          8.37008
  2               4               1           5         13.5211          8.40649
  3               4               1           5         13.3455          8.19616
  4               3               1           4         13.3294          7.81197
  5               3               1           4         13.3239          7.76811
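The appended variables are natural-log transforms (SAS's log() function is base e). A quick Python sketch, with a hypothetical price/area pair chosen to roughly reproduce observation 1 above:

```python
import math

# Hypothetical observation, back-transformed from obs 1's logged values.
sale_price, total_flr_sf = 755000, 4316

log_sale_price = math.log(sale_price)      # ≈ 13.5345
log_total_flr_sf = math.log(total_flr_sf)  # ≈ 8.3703
```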
Part A: Task 2
Fit four models based on pairwise combinations of the newly appended variables, including the variable transformations.
Model A: TotalFlrSF to Predict SalePrice
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   9.518383E12      9.518383E12   3037.86   <.0001
Error           2928   9.174155E12      3133249517
Corrected Total 2929   1.869254E13

Root MSE          55975   R-Square   0.5092
Dependent Mean   180796   Adj R-Sq   0.5090
Coeff Var      30.96054

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1   11406                3242.59761          3.52     0.0004
TotalFlrSF    1   113.30303               2.05569         55.12     <.0001
Equation & Coefficient Interpretation
Utilizing the standard equation SalePrice = β0 + β1x + ε, our equation becomes:
• SalePrice = $11,406 + 113.30303 × TotalFlrSF
This reflects that each one-unit increase in TotalFlrSF results in an average increase in the mean sale price of $113.30. At a TotalFlrSF of zero, the model still predicts a SalePrice of $11,406. As we are evaluating houses, a zero floor area would only be plausible if, for example, a house was not livable and is considered a 'tear down', where the buyer would spend the time and money required to build a new house. However, this would be outside the norm and would require a different equation to price such sales.
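As a worked example of the Model A equation (the 2,000 sq ft figure is a hypothetical house, not a data point):

```python
# Model A coefficients from the parameter estimates above.
intercept, slope = 11406, 113.30303
predicted_price = intercept + slope * 2000   # ≈ $238,012 for 2,000 sq ft
```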
Model B: log_TotalFlrSF to Predict SalePrice
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   9.16712E12       9.16712E12    2817.86   <.0001
Error           2928   9.525417E12      3253216313
Corrected Total 2929   1.869254E13

Root MSE          57037   R-Square   0.4904
Dependent Mean   180796   Adj R-Sq   0.4902
Coeff Var      31.54769

Parameter Estimates
Variable          DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept          1   -1068768             23563             -45.36   <.0001
log_TotalFlrSF     1   172187                3243.70186        53.08   <.0001
Equation & Coefficient Interpretation
The standard equation is SalePrice = β0 + β1x + ε:
• SalePrice = −$1,068,768 + 172,187 × log_TotalFlrSF
The above equation reflects that if TotalFlrSF increases by one percent, we can expect the average SalePrice to increase by roughly $1,721.87 (172,187/100).
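The divide-by-100 reading of a level-log coefficient is an approximation; the exact change for a 1% increase in the regressor uses ln(1.01). A quick Python check:

```python
import math

b1 = 172187                           # log_TotalFlrSF coefficient from above
exact_change = b1 * math.log(1.01)    # ≈ $1,713 per 1% increase in TotalFlrSF
approx_change = b1 / 100              # ≈ $1,722, the rule-of-thumb value
```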
Model C: TotalFlrSF to Predict log_SalePrice
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   240.66674        240.66674     2865.46   <.0001
Error           2928   245.91951        0.08399
Corrected Total 2929   486.58626

Root MSE        0.28981   R-Square   0.4946
Dependent Mean 12.02097   Adj R-Sq   0.4944
Coeff Var       2.41086

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1   11.16922             0.01679           665.30   <.0001
TotalFlrSF    1   0.00056973           0.00001064         53.53   <.0001
Equation & Coefficient Interpretation
The standard equation is log_SalePrice = β0 + β1x + ε:
• log_SalePrice = 11.16922 + 0.00056973 × TotalFlrSF
The above equation reflects that each one-unit increase in TotalFlrSF is expected to increase the mean SalePrice by approximately 0.056973% (0.00056973 × 100%).
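The multiply-by-100 reading of a log-level coefficient is likewise an approximation; the exact percentage change per extra square foot is exp(b1) − 1. A quick Python check:

```python
import math

b1 = 0.00056973                         # TotalFlrSF coefficient from Model C
pct_change = (math.exp(b1) - 1) * 100   # ≈ 0.0570% per additional square foot
```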
Model D: log_TotalFlrSF to Predict log_SalePrice
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   258.85775        258.85775     3328.24   <.0001
Error           2928   227.72851        0.07778
Corrected Total 2929   486.58626

Root MSE        0.27888   R-Square   0.5320
Dependent Mean 12.02097   Adj R-Sq   0.5318
Coeff Var       2.31998

Parameter Estimates
Variable          DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept          1   5.38090              0.11521           46.70    <.0001
log_TotalFlrSF     1   0.91499              0.01586           57.69    <.0001
Equation & Coefficient Interpretation
The standard equation is:
• log_SalePrice = 5.38090 + 0.91499 × log_TotalFlrSF
The above equation reflects that a one-percent increase in TotalFlrSF is expected to increase the mean SalePrice by approximately 0.91499 percent.
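In the log-log model, the coefficient is an elasticity; the exact effect of a 1% increase in TotalFlrSF is the price ratio 1.01^b1. A quick Python check:

```python
b1 = 0.91499                          # log_TotalFlrSF coefficient from Model D
price_ratio = 1.01 ** b1              # ≈ 1.0091
pct_change = (price_ratio - 1) * 100  # ≈ 0.915% per 1% increase in floor area
```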
Summary:
Based on the above, Model D is the best-fitting model as it has the highest Adjusted R-Square value and the largest F Value. The Adjusted R-Square and F Value have been chosen as the key metrics as they provide a macro-level view of how the model fits. The Adjusted R-Square measures how close the data points are to the fitted regression line and reflects the amount of variability of the response data around its mean. Thus, in Model D, we have accounted for 53.2% of the variability but still have 46.8% unexplained.
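The Adjusted R-Square can be reproduced from the R-Square and the degrees of freedom in the ANOVA table (n − 1 = 2929, one regressor). A Python sketch for Model D:

```python
# Adjusted R-Square = 1 - (1 - R^2) * (n - 1) / (n - p - 1), for Model D.
n, p, r2 = 2930, 1, 0.5320
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # ≈ 0.5318, matching the output
```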
Evaluating the models based on the various plots, we can see that:
• The t-values in each of the models range in magnitude from 45.36 to 665.3
• The p-values are all statistically significant
• Model A & Model B: The residual plots are similar and have distinct groupings that do not appear random. Model B's residuals have more of a fan shape than Model A's; there is little difference in the Q-Q plots, but both have a heavy tail. The predicted versus actual sale price does not follow the 45-degree line.
• Model C & Model D: The residual versus predicted plots look similar but still do not appear random. The Q-Q plots are light-tailed. The predicted versus actual sale price appears to follow the 45-degree line a bit more than in Models A & B.
• Models A & C each have a few Cook's D values that exceed .20, reflecting that these models contain influential data points
• Models B & D have no values that exceed .20, reflecting that they have few influential data points
• The residuals for log_SalePrice also look the best out of all of the models, as they appear more random
• Model D follows the 45-degree line better than the other models but still has linearity issues
Concerns about the 'best' fitting model
My concern with the best-fitting model, Model D, is that while it is the best fitting based on the Adjusted R-Square and F Value, the impact of its few influential data points has not yet been analyzed. Additionally, ~47% of the variability remains unexplained, and the residuals do not appear random, showing a distinct grouping of data. The residual plot still seems to have a pattern, the Q-Q plot shows that the data does not yet fit the line, and outliers remain in the residuals. Finally, we need to step back and ensure that the transformation actually adds value, and we must not forget to check normality before applying the transformation. So, while we have improved the model, we still have additional considerations to address.
Concerns about using Variable Transformations
One concern about using variable transformations is that comparing the original data to the transformed data is difficult due to the change in scales. On the other hand, transforming variables can make continuous non-normal data approximately normal, increasing the validity of the associated statistical analysis. Personally, I prefer Model A because it is less complicated to understand, including for any consumer considering the purchase of a house. Model D is not as intuitive because of its scales, making interpretation more difficult.
How is the interpretation of the log (SalePrice) model different from the price model?
In the price model, a one-unit change in the independent variable (TotalFlrSF) changes the expected value of the dependent variable by the regression coefficient, holding all other predictors constant. In the log(SalePrice) model, the coefficient instead reflects an approximate percentage change in SalePrice; in the discussion above, I expressed the equations on a percentage basis for ease of understanding.
Conclusion
Using log_SalePrice and log_TotalFlrSF provides a different perspective and method of calculation from the prior equations. While variable transformations are not as intuitive, they do provide a better model in terms of Adjusted R-Square and F Value. Additionally, the plots of the data are more aligned with what we would like to see: the residuals look more random than in the other models, the Q-Q plot is light-tailed, the predicted versus actual sale price fits the 45-degree line better, and Cook's D shows no influential values that exceed .20. Based on the above, Model D is the best-fitting model as it has the highest Adjusted R-Square value and the largest F Value, though it accounts for only 53.2% of the variability. Comparing the models also reveals commonalities: Models A and B have distinct groupings that do not appear random; in Models C and D, the Q-Q plot is lighter-tailed; Models A and C each have a few Cook's D values that exceed .20; and Models B and D have no values that exceed .20.
Part A.3: Correlate Continuous Variables
The below table reflects the correlations of SalePrice, log_SalePrice, and sqrt_SalePrice with the continuous variables:
TotalFlrSF has the largest correlation of all three variables. This is logical, given how square footage ties into the sale price value perceived by consumers.
The above plots are very different. In SalePrice vs. TotalFlrSF the line is not linear, and the outliers contribute to this nonlinearity. In SalePrice vs. log_TotalFlrSF we have a smooth curved line that is not straight. log_SalePrice vs. TotalFlrSF is not smooth but, for a portion, seems to be linear along the 45-degree line. Finally, in TotalFlrSF vs. sqrt_SalePrice we do not have a smooth line, but it follows the 45-degree line over a larger portion of the plot than the previous plots and thus appears to be our best-fitting relationship. The outliers seem to have a large effect on how the curve changes direction, which is in line with the insights provided by the analysis and Cook's D.
For constant variance, examine the regression line: in SalePrice vs. TotalFlrSF the variance increases with the fitted values and violates the constant-variance assumption. With log_SalePrice, the variance is more constant, so the assumption is now met, whereas in the previous models it was not.
Part A.4: Log improvements
The conditions where the log transformation of the response variable (Y) can improve the model fit include when the effect of the outliers is reduced. When applied to the independent variable (in this case, TotalFlrSF), it can help create a linear relationship with the dependent variable. Variable transformation of the data, while perhaps less intuitive, assists in providing a more effective and useful model.
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   11722622         11722622      3327.03   <.0001
Error           2928   10316657         3523.44838
Corrected Total 2929   22039279

Root MSE        59.35864   R-Square   0.5319
Dependent Mean 416.26208   Adj R-Sq   0.5317
Coeff Var       14.25992

Parameter Estimates
Variable           DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept           1   30.16558             6.78295            4.45    <.0001
Sqrt_TotalFlrSF     1   10.11869             0.17543           57.68    <.0001
Based on the parameter estimates above, the equation is:
• sqrt_SalePrice = 30.16558 + 10.11869 × sqrt_TotalFlrSF
Thus, a one-unit increase in sqrt_TotalFlrSF increases the mean sqrt_SalePrice by approximately 10.12, holding all other predictors constant.
Comparison of Sqrt_SalePrice to log_SalePrice
ODS output with the variable transformations based on SQRT
Comparing sqrt_SalePrice to log_SalePrice, the Residual vs. Predicted Value and RStudent plots appear to have more of a fan shape than the log_SalePrice ODS plots. Additionally, the Q-Q plot has a different fit to the line, with sqrt_SalePrice appearing to have a heavier tail than log_SalePrice. The predicted values for sqrt_SalePrice appear more dispersed than for log_SalePrice and include more outliers. Cook's D has a very different scale, indicating that the log_SalePrice outliers have more influence than the sqrt_SalePrice outliers. We also see a small increase in the Adjusted R-Square, from 0.4944 for log_SalePrice to 0.5317 for sqrt_SalePrice. Finally, we see a larger F Value for sqrt_SalePrice, 3327.03 versus 2865.46 for log_SalePrice.
In comparing the residuals for sqrt_SalePrice versus log_SalePrice, a cluster of values still occurs. The log_SalePrice residuals appear more compressed, but both are very similar in how the outliers appear.
The sqrt_SalePrice fit plot appears more compressed, with fewer values outside the 95% prediction limits. However, it has significantly more outliers. The two are still very similar, but sqrt_SalePrice has more of a fan shape.
Log Improvements Conclusion
Utilizing variable transformations to create log_SalePrice, we have seen the impact on SalePrice and TotalFlrSF. The residual and predicted values changed to more of a fan shape when the values were transformed. Additionally, the log_SalePrice Q-Q plot fit the line significantly better with the log values than with the non-transformed values. The residuals for log_SalePrice and log_TotalFlrSF appeared much more random than in any of the other three models, and while they still do not appear completely random, as we would prefer, the difference is easy to spot. The Fit Plot also improved, with more data points remaining within the 95% prediction limits. The transformed models also had the highest Adjusted R-Square and F Values. Evaluating the continuous variables with log_SalePrice was interesting in how the regression line smoothed out when log_TotalFlrSF was used, versus other variables that appeared choppy with drastic changes in direction. A square root transformation of SalePrice was also conducted. While somewhat similar to log_SalePrice, there were some modest changes to the plots, including changes in the Residual vs. Predicted values, the tails in the Q-Q plot, and the Cook's D scale, along with a differently shaped cluster in the residuals plot and a fit plot that appears more compressed but has more outliers. Both the F Value (3327.03 versus 2865.46) and the Adjusted R-Square (0.5317 versus 0.4944) were larger for sqrt_SalePrice than for log_SalePrice.
Part B Step 5: Outliers
To assist in finding outliers, PROC UNIVARIATE with the normal and plot options was utilized:
proc univariate normal plot data=ames;
   var SalePrice;
   histogram SalePrice / normal;
run;
This allows us to see the outliers and ranges:


Based on the above values, SalePrice was evaluated using the 99% quantile value of $457,347 along with the 1% quantile value of $61,500. The outliers can easily be seen in the extreme high and low values of the distribution and probability plots, along with the quantiles table.
The two-standard-deviation rule states that roughly 95% of all observations will fall within two standard deviations of the mean. Suppose we were to apply this rule by locating all observations two or more standard deviations from the mean and removing them to minimize issues with non-normal distributions, regardless of whether the observations are legitimate. I do not believe it should be used here, as many values exceed two standard deviations from the mean. Removing all such data would result in a large loss of observations and would affect the bias, accuracy, and power of the results. Outliers that are illegitimately included in the data, i.e., errors, should be removed.
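The cost of the two-standard-deviation rule can be illustrated on simulated prices: with a roughly normal sample, about 5% of observations would be flagged. A Python sketch (the mean and SD are loosely matched to the report, not computed from the Ames data):

```python
import numpy as np

rng = np.random.default_rng(1)
prices = rng.normal(180796, 79887, 2930)   # simulated, not the actual data

z = (prices - prices.mean()) / prices.std()
flagged = int((np.abs(z) > 2).sum())       # observations beyond two SDs
share = flagged / prices.size              # close to 0.05 for normal data
```

For a sample of 2,930 houses, that rule would discard on the order of 130 observations even when every one of them is legitimate.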
Based on the above-described quantile values of 1% and 99%, a code was assigned to the observations with the lowest and highest values of SalePrice:


After applying the codes and removing the observations that met the above criteria, we can see the results. The outlier_def codes are as follows:
• Code 1 reflects SalePrice <= 61500
• Code 2 reflects SalePrice > 61500 & SalePrice < 457347
• Code 3 reflects SalePrice >= 457347
Using these extreme-value cutoffs, 60 observations were found (30 at each tail), and I was confident in removing them as they are a small percentage of the data.
outlier_def   Frequency   Percent   Cumulative Frequency   Cumulative Percent
          1          30      1.02                     30                 1.02
          2        2870     97.95                   2900                98.98
          3          30      1.02                   2930               100.00
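The coding scheme above can be sketched in a few lines; a Python version with hypothetical prices (the cutoffs are the report's 1%/99% values):

```python
import numpy as np

low, high = 61500, 457347                                  # 1% and 99% cutoffs
prices = np.array([50000, 61500, 160000, 457347, 500000])  # illustrative only

# Code 1: low tail (<= low); Code 3: high tail (>= high); Code 2: kept.
codes = np.where(prices <= low, 1, np.where(prices >= high, 3, 2))
kept = prices[codes == 2]   # only code-2 observations remain in the data
```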
(Distribution plots for outlier_def=1 and outlier_def=3 omitted.)
Quantiles for Normal Distribution
Percent    Observed    Estimated
   1.0     61500.0      -5048.18
   5.0     87500.0      49394.14
  10.0    105250.0      78417.14
  25.0    129500.0     126913.30
  50.0    160000.0     180796.06
  75.0    213500.0     234678.82
  90.0    281356.5     283174.98
  95.0    335000.0     312197.98
  99.0    457347.0     366640.30
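The "Estimated" column evaluates the fitted normal distribution at each percent. A Python sketch using the table's mean; the standard deviation (~79,887) is an assumption backed out from the spacing of the estimated quantiles, since PROC UNIVARIATE's sigma estimate is not shown here:

```python
from statistics import NormalDist

# Fitted normal: mu from the 50% estimated quantile; sigma recovered
# (approximately) from the other estimated quantiles in the table.
dist = NormalDist(mu=180796.06, sigma=79887)

est_99 = dist.inv_cdf(0.99)   # ≈ 366,640, matching the table
est_01 = dist.inv_cdf(0.01)   # ≈ -5,048: the normal fit is poor in the low tail
```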
Outliers Conclusion
As reflected above, the determination of which observations to remove was based on the SalePrice values at the 99% and 1% levels. In reality, we would not simply delete these values but would flag them as observations to be investigated and evaluated as to whether they are errors or legitimate data points. Observations that are errors would be removed, and legitimate observations would remain intact.
For the purposes of understanding the impact outliers can have on our model, we simply removed all of the values beyond this threshold. After removing the selected outliers, we compared three models (GrLivArea alone; TotalFlrSF and GrLivArea; and TotalFlrSF, GrLivArea, and MiscVal) to determine whether the removal of the outliers helped the model become more linear. In each case, the Adjusted R-Square and F Values decreased, but the plots improved, reflecting positive changes.
Step 6: Cleaned Data
With the outliers removed, we now have a new quantiles table and distribution plot for SalePrice.
Quantiles (Definition 5)
Level        Quantile
100% Max       455000
99%            405000
95%            320000
90%            275500
75% Q3         212900
50% Median     160000
25% Q1         130000
10%            107950
5%              91000
1%              75000
0% Min          62383
Going back to the analysis on GrLivArea and applying the same data analysis, this time with the cleaned data set, we have the following results for the highest-ranked variables:
Number in Model   R-Square   Adjusted R-Square       C(p)   Variables in Model
              1     0.4503              0.4501   574.7237   GrLivArea
              1     0.2179              0.2176   1815.424   MasVnrArea
              1     0.1274              0.1270   2298.385   YearBuilt
              1     0.1270              0.1267   2300.429   HouseAge
              1     0.0681              0.0677   2615.237   YearRemodel
              1     0.0636              0.0632   2638.890   TotalBath
              1     0.0608              0.0604   2654.258   GarageArea
Comparison to the Full Data Set
Evaluating the same variables as used with the full data set, starting with the variable GrLivArea:
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   30406                3352.95617        8.61     <.0001
GrLivArea    1   99.03272                2.25147       43.99     <.0001

Root MSE          59250   R-Square   0.4503
Dependent Mean   178239   Adj R-Sq   0.4501
Coeff Var      29.70718
Equation & Interpret each Coefficient
The new equation for GrLivArea is SalePrice = $30,406 + $99.03272 × GrLivArea, which reflects that for each one-unit increase in GrLivArea, the average SalePrice increases by $99.03. This seems reasonable and logical.
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2   6.604169E12      3.302084E12   1258.76   <.0001
Error           2867   7.520959E12      2623285390
Corrected Total 2869   1.412513E13

Root MSE          51218   R-Square   0.4675
Dependent Mean   178437   Adj R-Sq   0.4672
Coeff Var      28.70369

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept     1   13708                3958.59881        3.46     0.0005     0
TotalFlrSF    1   14.03133             1.91693           7.32     <.0001     1.01991
GrLivArea     1   96.26726             2.00021          48.13     <.0001     1.01991
Equation & Interpret each Coefficient
The new equation is:
• SalePrice = $13,708 + $14.03 × TotalFlrSF + $96.27 × GrLivArea
For each one-unit increase in TotalFlrSF, the SalePrice increases by $14.03, and for each one-unit increase in GrLivArea it increases by $96.27; both seem reasonable and logical. This equation differs from the full-data version, where the GrLivArea coefficient was negative.
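As a worked prediction from the two-regressor equation (the 1,500 and 2,000 sq ft inputs are hypothetical):

```python
# Coefficients from the cleaned-data TotalFlrSF + GrLivArea model above.
b0, b_flr, b_liv = 13708, 14.03, 96.27
predicted = b0 + b_flr * 1500 + b_liv * 2000   # ≈ $227,293
```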
Variables:
The three variables are TotalFlrSF, GrLivArea and MiscVal.
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3   6.607801E12      2.2026E12     839.75    <.0001
Error           2866   7.517327E12      2622933420
Corrected Total 2869   1.412513E13

Root MSE          51215   R-Square   0.4678
Dependent Mean   178437   Adj R-Sq   0.4672
Coeff Var      28.70176

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept     1   13634                3958.83629        3.44     0.0006     0
TotalFlrSF    1   14.21735             1.92331           7.39     <.0001     1.02684
GrLivArea     1   96.19807             2.00094          48.08     <.0001     1.02079
MiscVal       1   -2.16107             1.83648          -1.18     0.2394     1.00713
Equation:
• SalePrice = $13,634 + $14.21735 × TotalFlrSF + $96.19807 × GrLivArea − $2.16107 × MiscVal
For each one-unit increase in TotalFlrSF, the SalePrice increases by $14.22, a large decrease from the $186.44 estimate when the full dataset was utilized. Additionally, the $72.40 per-unit decrease in SalePrice for GrLivArea with the full dataset has changed to a $96.20 per-unit increase. Finally, SalePrice declines by $2.16 per unit increase in MiscVal, a smaller decline than the $9.15 decrease with the full dataset. MiscVal has little impact on SalePrice, so this value is logical.
Summary Table:

Cleaned Data                                                                        Adj R-Square   F Value   P Value
SalePrice = $30,406 + $99.03 × GrLivArea                                            0.4501         1934.76   <.0001
SalePrice = $13,708 + $14.03 × TotalFlrSF + $96.27 × GrLivArea                      0.4675         1258.76   <.0001
SalePrice = $13,634 + $14.22 × TotalFlrSF + $96.20 × GrLivArea − $2.16 × MiscVal    0.4672          839.75   <.0001

Full Data
SalePrice = $13,290 + $111.69 × GrLivArea                                           0.4994         2922.59   <.0001
SalePrice = $11,688 + $185.03 × TotalFlrSF − $71.69 × GrLivArea                     0.5106         1528.94   <.0001
SalePrice = $11,105 + $186.44 × TotalFlrSF − $72.40 × GrLivArea − $9.15 × MiscVal   0.5146         1036.17   <.0001
Analysis
GrLivArea: As we can see above, the goodness-of-fit decreased in each model, both in terms of the Adjusted R-Square and the F Value. The GrLivArea plots using the cleaned data still look quite similar to those from the full data set. The residuals still have a fan shape, reflecting that the constant-variance assumption is violated. The Q-Q plot has smoother tails but still departs from the line and still has outliers. Cook's D is similar, though its range has increased. The residuals are more compressed with fewer outliers, and the fit plot also has fewer outliers but is still fan shaped; overall, the plots look very similar, with minimal changes.
TotalFlrSF and GrLivArea show more noticeable changes, with some minor slimming in the tails of the Q-Q plot and fewer top outliers in the predicted versus actual SalePrice, with a bit more compression. Cook's D looks very similar, with only two points > .20. The residual histogram appears quite different, with more bars in the middle. Finally, the residuals by regressors have less of a funnel shape, appear more random, and have fewer outliers.
With TotalFlrSF, GrLivArea, and MiscVal, we again see slimmer tails in the Q-Q plot, and the predicted versus actual sale price appears more compressed with fewer outliers. The Cook's D scale changed significantly, from 3 to .4. The residual histogram again shows more bars in the middle. Finally, the regressors are less fan shaped and appear more random and compressed, with fewer outliers for TotalFlrSF, similar to GrLivArea. MiscVal has fewer outliers, with the remaining ones closer to the line.
Overall, while the Adjusted R-Square and F Values have decreased, the graphical plots with the cleaned data appear to be in better shape. While the alterations are not major, all of the small differences in each of the plots suggest that the relationships between the variables have improved. There is still room for improvement, but the simple removal of the data beyond the 99% and 1% marks has made some positive improvements. Thus, the process was worthwhile and provided some positive improvements, but we still have issues to address.
Step 7: Influential Points
Using a threshold of .69 for the DFFITS value and removing the observations that exceeded it, the data changes as follows:
(Before and after DFFITS-removal plots omitted.)
Thus, based on the above, eight observations were removed using the .69 threshold.
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             21   7.27839E12       3.4659E11     172.43    <.0001
Error           2334   4.691511E12      2010073476
Corrected Total 2355   1.19699E13

Root MSE          44834   R-Square   0.6081
Dependent Mean   178179   Adj R-Sq   0.6045
Coeff Var      25.16229
Additionally, the ODS output is as follows:
Thus, by removing the additional eight observations, we can see that the Predicted Value vs. Residual plot has fewer outliers and its scale has changed from 600,000 to 400,000. Additionally, in the Q-Q plot, the outliers at the bottom of the plot have disappeared. Similarly, the Predicted Value vs. SalePrice plot also shows a change in scale from 600,000 to 400,000, with the observations now more compressed along the 45-degree line. The Cook's D plot scale has changed from .15 to a maximum of .025, with one value at the maximum and most values less than .015. We also see the Adjusted R-Square increase to .6045 from .5576, with an F Value of 172.43.
Overall, the removal of these additional eight points has again made a substantial improvement to the model, making it more linear.
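DFFITS itself can be computed by hand: DFFITS_i is the externally studentized residual times sqrt(h_ii / (1 − h_ii)), where h_ii is the leverage. A Python sketch on simulated data (the report's SAS run applied the .69 cutoff to the Ames models; the data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(500, 4000, n)
y = 30000 + 100 * x + rng.normal(0, 20000, n)   # simulated price relationship

X = np.column_stack([np.ones(n), x])            # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages h_ii
p = X.shape[1]

s2 = resid @ resid / (n - p)                              # full-sample MSE
s2_i = ((n - p) * s2 - resid**2 / (1 - h)) / (n - p - 1)  # leave-one-out MSE
t_ext = resid / np.sqrt(s2_i * (1 - h))                   # external studentization
dffits = t_ext * np.sqrt(h / (1 - h))

flagged = np.abs(dffits) > 0.69                 # the report's threshold
```

The leave-one-out variance is what distinguishes DFFITS from an ordinary standardized residual: each point is judged against a fit that excludes it, so a single extreme house cannot mask its own influence.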
Step 8: Conclusion
Overall, transforming variables and detecting and deleting outliers are two effective ways to create a better-fitting model. Naturally, understanding the goals of the model is key in determining whether they should be employed and to what degree. Additionally, the use of these techniques is more of an art than a science. If the model already meets the needs of its purpose, then perhaps these methods are not required. However, if the model is not adequate, then utilizing these techniques to make the model more linear is appropriate.
Employing a process to help determine which observations are outliers requiring evaluation can be easily accomplished, as outlined within this report. While we have been heavy-handed in the removal of outliers, in reality each outlier needs to be evaluated to see whether it is valid or an error. If an observation is an error, it should be removed. However, removing legitimate observations will distort the model, so they should not be removed.
If done correctly, transformation and deletion of outliers will benefit the model by creating a better, more appropriate model that is more linear. However, they require additional processes and a certain finesse to determine the appropriate threshold values. As easily observed from the above discussion, the choice of thresholds can drastically change the number of observations affected and needs to be carefully selected and employed.
In summary, in Part A of this report we appended new variables, log_SalePrice and log_TotalFlrSF. We then paired these variables in four different models: TotalFlrSF to predict SalePrice, log_TotalFlrSF to predict SalePrice, TotalFlrSF to predict log_SalePrice, and log_TotalFlrSF to predict log_SalePrice. For each combination of variables, an equation and an interpretation of the coefficients were provided, along with an analysis of the SAS-generated Output Delivery System (ODS) plots. Additional metrics such as the Adjusted R-Square and F Value were included in determining the fit of each model. We found that the variable pair of log_SalePrice and log_TotalFlrSF was the 'best' fitting model based on an in-depth analysis of both the ODS plots and the goodness-of-fit statistics. We also correlated TotalFlrSF with log_SalePrice and evaluated how well the observations fit the line.
Next, we evaluated the conditions under which the log transformation improved the fit and expanded the analysis to include a square root transformation. Again, an analysis of the ODS plots was provided along with goodness-of-fit measures such as the Adjusted R-Square, which reflected improvements for sqrt_SalePrice over log_SalePrice. We observed the differences in the residual and fit plots, which were minor but still showed some improvements for sqrt_SalePrice over log_SalePrice.
In the second part of the report, we focused on outliers. We chose the 99% and 1% levels as the basis for determining which observations to remove, based on the distribution and probability plots of SalePrice. As mentioned above, in reality we would not be this heavy-handed in their removal; instead, each observation would be evaluated to determine whether it is an error or legitimate. However, for the purposes of understanding the impact outliers can have on our model, we simply removed all of the values beyond these thresholds. After removing the selected outliers, we had a dataset of 2,870 observations, on which we compared GrLivArea alone; TotalFlrSF and GrLivArea; and TotalFlrSF, GrLivArea, and MiscVal, to determine whether the removal of the outliers helped the model become more linear. In each case, the Adjusted R-Square and F Values decreased, but the plots improved, reflecting positive changes.
Finally, we undertook another method of removing outliers based on DFFITS, a statistic that reflects how influential an observation is within a regression. Based on our threshold of .69, we removed eight observations, reducing our dataset to 2,862. While the number of observations removed is minimal, the changes in the various ODS plots were easy to spot, and the Adjusted R-Square improved to .6045. Thus, removing these influential observations clearly had an impact on making the model more linear.
Next steps could include an evaluation of which regressors have the largest influence on the regression equation and ensuring the chosen variables are logical. Additionally, obtaining additional data to validate the model would be helpful in understanding how well the model performs in predicting sale prices. In some cases, data is split so that one portion is used to build the model and the remainder to determine how predictive it is. While there are pros and cons to such methods, in our study of the Ames housing data we did not employ this technique, and hence additional data would be helpful in determining the performance of our model.