
HOUSING PRICE PREDICTION

 Houses in Ames, Iowa

Anyone who hopes to own a house at some point wants to know what it will cost. For this project, I will build a model to predict housing prices using a Kaggle dataset of houses in Ames, Iowa. The dataset was compiled by Dean De Cock for the sole purpose of data science education. It contains 81 features, unlike the Boston Housing Dataset, which offers only 14 features for regression.

WHAT IS REGRESSION?

Regression is an analysis technique that estimates relationships between variables by measuring how one variable affects another. It uses a mathematical method to predict a continuous outcome (y) from a predictor variable (x). There are many kinds of regression analysis, such as linear regression, logistic regression, ridge regression, lasso regression, and more. For this project we will focus on linear regression but also experiment with other models.


Linear regression is a technique for finding linear relationships between a target variable and one or more predictor variables. There are three types of linear regression: simple, multiple, and multivariate. For this project we will use simple linear regression, which finds the relationship between two continuous variables.


The linear regression equation is: y = mx+b

In other words, linear regression is the process of finding the line that best fits the given data. We judge how well a line fits by measuring its residuals, the vertical distances between the data points and the line; the line with the smallest total squared distance fits best. The mathematical notation is as follows:


SS(fit) = Σ(residual²) = Σ(y_true − y_pred)²
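To make this concrete, here is a minimal sketch (with made-up numbers, not the Ames data) of fitting a line by least squares with NumPy and computing SS(fit):

```python
# Minimal least-squares sketch: illustrative data only (not the Ames dataset).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # predictor values (made up)
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])  # target values (made up)

m, b = np.polyfit(x, y, deg=1)           # slope m and intercept b of the best-fit line
residuals = y - (m * x + b)              # vertical distances from points to the line
ss_fit = np.sum(residuals ** 2)          # SS(fit): sum of squared residuals

print(f"y = {m:.3f}x + {b:.3f}, SS(fit) = {ss_fit:.4f}")
```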


We can use four different methods to evaluate the effectiveness of our model:


1. R-squared, or the coefficient of determination. R-squared is the square of the correlation coefficient, a value that tells us how strong a linear relationship is. It is the proportion of the variation in the dependent variable that is predictable from the independent variable. The mathematical notation for the coefficient of determination is:


R² = (Var(mean) − Var(fit)) / Var(mean)

Or

R² = (SS(mean) − SS(fit)) / SS(mean)


If R² = 1, the model predicts the data perfectly.


2. Mean Absolute Error (MAE). It is the sum of the absolute distances from the points to the line, divided by the number of points. The drawback of MAE is that it is not differentiable at zero. The mathematical notation for MAE is:

MAE = Σ|y_true − y_pred| / n, where n is the total number of data points


3. Mean Squared Error (MSE). It is the sum of the squared distances from the points to the line, divided by the number of points. Squaring solves the differentiability problem of MAE. The mathematical notation for MSE is:

MSE = Σ(y_true − y_pred)² / n


4. Root Mean Squared Error (RMSE). It is the standard deviation of the residuals; it measures how spread out the residuals are, and therefore how tightly the data is concentrated around the line of best fit. The mathematical notation is:

RMSE = √(Σ(y_true − y_pred)² / n) = √MSE
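As a sketch, all four metrics can be computed with scikit-learn, assuming y_true and y_pred hold the actual and predicted sale prices (the values below are placeholders, not real results):

```python
# Evaluation-metric sketch; y_true/y_pred are placeholder arrays, not real results.
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_true = np.array([200000, 150000, 320000, 180000])  # actual prices (placeholder)
y_pred = np.array([210000, 145000, 300000, 190000])  # model predictions (placeholder)

r2 = r2_score(y_true, y_pred)              # coefficient of determination
mae = mean_absolute_error(y_true, y_pred)  # mean absolute error
mse = mean_squared_error(y_true, y_pred)   # mean squared error
rmse = np.sqrt(mse)                        # root mean squared error

print(f"R2={r2:.4f}  MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
```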


EXPERIMENT 1: 
DATA UNDERSTANDING AND PRE-PROCESSING

By running train.shape, we find that our dataset has 81 features. We then checked for features with null values and noticed that many features have a large number of them.

To improve prediction performance and accuracy, I decided to remove features with more than 400 null values, which dropped 5 features and left 76. For this experiment I will work only with continuous features, but for fun we also looked at how many categorical values this dataset holds: roughly 80% of the features contain categorical values.


We excluded the categorical features and replaced the null values in the continuous features with 0. A final null check returned an empty series.
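A sketch of these pre-processing steps, assuming the Kaggle train.csv has been loaded into a DataFrame (the 400-null threshold matches the text; the exact code may differ from the original notebook):

```python
# Pre-processing sketch for experiment 1 (assumed reconstruction, not the original notebook).
import pandas as pd

train = pd.read_csv("train.csv")                    # Kaggle Ames training data
print(train.shape)                                  # (rows, 81 columns)

null_counts = train.isnull().sum()
too_sparse = null_counts[null_counts > 400].index   # features with 400+ null values
train = train.drop(columns=too_sparse)              # 81 -> 76 features

continuous = train.select_dtypes(include="number")  # keep continuous features only
continuous = continuous.fillna(0)                   # replace remaining nulls with 0
print(continuous.isnull().sum().sum())              # 0: the null check comes back empty
```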


NOTE: The figure below shows the number of null values for some of the features:

BsmtExposure: Refers to walkout or garden level walls

BsmtFinType1: Rating of basement finished area

BsmtFinType2: Rating of basement finished area (if multiple types)

Electrical: Electrical system

FireplaceQu: Fireplace quality

GarageType: Garage location

GarageYrBlt: Year garage was built

We can also suspect multicollinearity between features such as GarageType and GarageYrBlt, because they have an equal number of null values (the same houses are missing a garage).

[Figure: null-value counts per feature]

MODELING AND EVALUATION

For experiment 1 modeling, the features are only lightly pre-processed. As discussed above, we removed the categorical features and the features with 400+ null values. This experiment shows where we stand with the dataset before doing much pre-processing. We create a linear regression model for the continuous features pre-processed in the step above, and the evaluation looks as follows:
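A sketch of this modeling step, assuming `continuous` is the numeric DataFrame from the pre-processing above and a held-out test split is used for scoring (the original split may differ):

```python
# Experiment 1 modeling sketch (assumed train/test split).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = continuous.drop(columns=["SalePrice"])  # predictors
y = continuous["SalePrice"]                 # target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(f"R2: {model.score(X_test, y_test):.4f}")  # coefficient of determination
```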

[Figure: Experiment 1 evaluation, continuous features only]

As displayed in the figure above, with the features we pre-processed we get a coefficient of determination of 64.51, which is not bad for a dataset that has not been thoroughly pre-processed.


I would like to see what the coefficient of determination would be if we included the categorical features. Therefore, I used LabelEncoder to transform the non-numerical values into numerical labels and then created a linear regression model. The model still excludes features with more than 400 null values. The evaluation looks as follows:
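A sketch of the encoding step, assuming `train` still holds the dataset with the sparse features already dropped:

```python
# Label-encoding sketch: map each categorical column to integer labels.
from sklearn.preprocessing import LabelEncoder

encoded = train.copy()
for col in encoded.select_dtypes(include="object").columns:
    # astype(str) turns NaN into the string "nan" so the encoder can handle it
    encoded[col] = LabelEncoder().fit_transform(encoded[col].astype(str))
encoded = encoded.fillna(0)  # remaining numeric nulls -> 0, as before
```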

[Figure: Experiment 1 evaluation, continuous + label-encoded categorical features]

I am surprised that the R-squared score increased by 19.21 when we included the categorical features, because linear regression is supposed to work best with continuous values. Comparing the other metrics, the Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error all performed better with the feature set that included the categorical variables.


Now, I want to experiment with this dataset using a smaller, hand-selected set of features.

[Figure: correlation heatmap of continuous features]

EXPERIMENT 2.1:

DATA UNDERSTANDING AND PREPROCESSING

For experiment 2, I am going to look for continuous features that are important factors for SalePrice. To identify which features matter most, we will look at a heat map of the continuous features (a sketch of the heatmap code follows the key below).

According to the heatmap, there is high correlation between SalePrice and the features 'TotalBsmtSF', 'GrLivArea', '1stFlrSF', and 'GarageArea'.

Key:

TotalBsmtSF: Total square feet of basement area

GrLivArea: Above grade (ground) living area square feet

1stFlrSF: First Floor square feet

GarageArea: Size of garage in square feet
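A sketch of how such a heatmap can be drawn, assuming `continuous` is the numeric DataFrame from experiment 1 and seaborn is available:

```python
# Correlation-heatmap sketch for the continuous features.
import matplotlib.pyplot as plt
import seaborn as sns

corr = continuous.corr()                      # pairwise correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)  # warm = positive, cool = negative
plt.title("Correlation of continuous features")
plt.show()

# Features most correlated with SalePrice:
print(corr["SalePrice"].sort_values(ascending=False).head(8))
```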

MODELING AND EVALUATION

I went ahead and created a linear regression model. The coefficient of determination is 62.79, which is no better than the experiment we ran with all the continuous features. Therefore, I am going to retry this experiment with a different, hand-picked set of features to improve the R-squared score.

[Figure: Experiment 2.1 evaluation]

EXPERIMENT 2.2:

DATA UNDERSTANDING AND PREPROCESSING

On my second try, I dropped features at both extremes: those highly correlated with SalePrice and those with very little correlation to it. To be safe, I normalized the sale price to avoid a skewed distribution. Then I created a heatmap to check the correlations between the remaining features. Using that visualization, I picked the most important factors for housing price: 'OverallQual', 'YearBuilt', '1stFlrSF', 'GrLivArea', 'FullBath', and 'GarageArea' (a sketch of these steps follows the key below).

Key:

OverallQual: Rates the overall material and finish of the house

YearBuilt: Original construction date

1stFlrSF: First Floor square feet

GrLivArea: Above grade (ground) living area square feet

FullBath: Full bathrooms above grade

GarageArea: Size of garage in square feet
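A sketch of these steps, assuming "normalized" means a log transform of SalePrice and reusing the label-encoded DataFrame from experiment 1 (both are my assumptions):

```python
# Experiment 2.2 sketch: log-transformed target and six hand-picked features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

features = ["OverallQual", "YearBuilt", "1stFlrSF", "GrLivArea", "FullBath", "GarageArea"]
X = encoded[features]
y = np.log1p(encoded["SalePrice"])  # log(1 + price) to reduce skew (assumed transform)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(f"R2: {model.score(X_test, y_test):.4f}")
```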

MODELING AND EVALUATION

Using these features, I created a linear regression model. It turned out to have a better coefficient of determination, 74.79, which beats our first attempt at this experiment by about 12 points.


Let's look at our evaluations for experiments 2.1 and 2.2.

[Figure: Experiment 2.2 evaluation]

EXPERIMENT 3:

Now, we will try different modeling techniques and see whether we can improve our R-squared score.

This experiment is going to be a little different: I will use the feature set from experiment 1 and test different modeling techniques.

  • SGD stands for Stochastic Gradient Descent. It supports different loss functions and penalties for fitting linear regression models.

  • Ridge regression performs L2 regularization: it adds a penalty proportional to the sum of the squared coefficients to the optimization objective.

  • Lasso regression stands for Least Absolute Shrinkage and Selection Operator. It performs L1 regularization: it adds a penalty proportional to the sum of the absolute values of the coefficients to the optimization objective.

Source

MODELING AND EVALUATION

This experiment doesn't require new pre-processing because I will reuse the dataset I prepared for experiment one.

RECAP: The feature set excludes features with more than 400 null values, and the remaining null values are replaced by 0. The dataset includes both continuous and categorical features.


Look at the slides below to compare the Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error for the three models.
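A sketch of how the comparison can be run, reusing the label-encoded experiment 1 feature set; standardizing the inputs for SGD is my assumption, since SGD is sensitive to feature scale:

```python
# Model-comparison sketch for experiment 3 (assumed reconstruction).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor, Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error

X = encoded.drop(columns=["SalePrice"])
y = encoded["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "SGD": make_pipeline(StandardScaler(), SGDRegressor(random_state=42)),  # scaled for stability
    "Ridge": Ridge(),   # L2 penalty
    "Lasso": Lasso(),   # L1 penalty
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    print(f"{name}: MAE={mae:,.0f}  MSE={mse:,.0f}  RMSE={np.sqrt(mse):,.0f}")
```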


[Slide: SGD Regressor evaluation metrics]

The three metrics are very close for the Ridge and Lasso regression models; the SGD regression result differs slightly but is still very close. Let us see what the metrics look like if we try random forest regression. We are going to use a feature selection algorithm for the random forest, picking better features with the algorithm shown below:

[Figure: feature selection output]

[Figure: random forest regression evaluation]
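Since the original feature selection algorithm is only visible in the screenshot, here is one way it might look, assuming importance-based selection with SelectFromModel and the train/test split from the comparison above:

```python
# Random-forest sketch with importance-based feature selection (assumed approach).
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Keep only features whose importance exceeds the mean importance.
selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=42))
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train_sel, y_train)
print(f"R2: {rf.score(X_test_sel, y_test):.4f}")
```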

The coefficient of determination for random forest regression is 97.56, which is clearly better than Lasso regression's 83.72. The other three evaluation metrics also improved. Therefore, we can say that taking the experiment 1 features (both continuous and categorical) and modeling them with a random forest regressor improved the R-squared score.


CONCLUSION

All three experiments are different. Experiment 1 used a linear regression model on continuous features only, and we got a low coefficient of determination. In the same experiment, we used a linear regression model on a feature set that includes the categorical variables and observed that the R-squared score increased. Experiment 1 required a few pre-processing steps: we removed features with more than 400 null values and label-encoded the non-numerical categorical variables. For experiment 2, we created a heatmap to see the correlations between features and built linear regression models on selected continuous features. In experiment 2.1, I used the features most highly correlated with SalePrice, and the coefficient of determination turned out to be low. Therefore, we ran experiment 2.2 with a feature set that mixes continuous and categorical variables and avoids both extremes of correlation, and the coefficient of determination improved.


I was also curious about the results with different regression models. Therefore, for experiment 3 we used the same feature set from experiment 1 and modeled it with Lasso, SGD, Ridge, and random forest regression. In my experiments, Lasso and Ridge were close on all the evaluation metrics we used, but random forest regression gave the highest coefficient of determination. This could be because we used a feature selection algorithm for that model.

The most surprising finding was that my linear regression model performed better with the categorical features included, because by definition linear regression is supposed to work better with continuous values than categorical ones. Overall, this project gave me good insight into how regression models work and how to experiment with a dataset to improve predictions.

