This report documents how I built and compared a linear regression model and a regularized regression model with the caret package in R to predict house prices in India. The dataset, available at https://data.world/dataindianset2000/house-price-india, contains data from 2016 to 2017. All steps are covered, from reading the data into RStudio to evaluating the two models.
library(readr)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ purrr 1.0.1
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(caret)
## Warning: package 'caret' was built under R version 4.3.1
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(knitr)
The 2016 and 2017 data were saved separately in two .csv files. They were loaded into two data frames and stacked together using the bind_rows() command.
df1 <- read_csv("House Price India.csv",show_col_types = FALSE)
df2 <- read_csv("House Price India 2.csv",show_col_types = FALSE)
# join data frame
full_df <- bind_rows(df1,df2)
Using the glimpse() function, it can be seen that the dataset consists of 17,594 rows and 23 columns. All columns are of type double (dbl), which is suitable for building regression models.
# show all columns of the dataset
glimpse(full_df)
## Rows: 17,594
## Columns: 23
## $ id <dbl> 6762810145, 6762810635, 676281…
## $ Date <dbl> 42491, 42491, 42491, 42491, 42…
## $ `number of bedrooms` <dbl> 5, 4, 5, 4, 3, 3, 5, 3, 3, 4, …
## $ `number of bathrooms` <dbl> 2.50, 2.50, 2.75, 2.50, 2.00, …
## $ `living area` <dbl> 3650, 2920, 2910, 3310, 2710, …
## $ `lot area` <dbl> 9050, 4000, 9480, 42998, 4500,…
## $ `number of floors` <dbl> 2.0, 1.5, 1.5, 2.0, 1.5, 1.0, …
## $ `waterfront present` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `number of views` <dbl> 4, 0, 0, 0, 0, 0, 2, 0, 2, 0, …
## $ `condition of the house` <dbl> 5, 5, 3, 3, 4, 4, 3, 5, 4, 5, …
## $ `grade of the house` <dbl> 10, 8, 8, 9, 8, 9, 10, 8, 8, 7…
## $ `Area of the house(excluding basement)` <dbl> 3370, 1910, 2910, 3310, 1880, …
## $ `Area of the basement` <dbl> 280, 1010, 0, 0, 830, 900, 0, …
## $ `Built Year` <dbl> 1921, 1909, 1939, 2001, 1929, …
## $ `Renovation Year` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `Postal Code` <dbl> 122003, 122004, 122004, 122005…
## $ Lattitude <dbl> 52.8645, 52.8878, 52.8852, 52.…
## $ Longitude <dbl> -114.557, -114.470, -114.468, …
## $ living_area_renov <dbl> 2880, 2470, 2940, 3350, 2060, …
## $ lot_area_renov <dbl> 5400, 4000, 6600, 42847, 4500,…
## $ `Number of schools nearby` <dbl> 2, 2, 1, 3, 1, 1, 3, 3, 1, 2, …
## $ `Distance from the airport` <dbl> 58, 51, 53, 76, 51, 67, 72, 71…
## $ Price <dbl> 2380000, 1400000, 1200000, 838…
No missing values were found:
# Check data completeness (percentage of complete rows)
completeness <- full_df %>%
  complete.cases() %>%
  mean() * 100
cat(paste("Data completeness: ",completeness,"%"))
## Data completeness: 100 %
The id and Date columns were excluded, as they are not relevant for prediction.
# exclude the id and Date columns (columns 1 and 2)
prep_df <- full_df[,-c(1,2)]
The density plot of the price shows that the distribution is highly right-skewed.
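A minimal sketch of how such a density plot can be produced with ggplot2 (already loaded above), using the Price column of prep_df:
# Density plot of the raw price, illustrating the right skew
ggplot(prep_df, aes(x = Price)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  labs(title = "Distribution of House Prices", x = "Price", y = "Density")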
To improve the performance of the regression models, the skewness was reduced by applying a log transformation to the price column. In all further steps, this log-transformed column was used as the target column.
prep_df$log_price <- log(prep_df$Price)
A function for splitting the data into train and test sets was created.
split_data <- function(data, train_size, seed){
  set.seed(seed)
  n <- nrow(data)
  # sample row indices for the train set
  id <- sample(1:n, size = n * train_size)
  train_set <- data[id, ]
  test_set <- data[-id, ]
  cat(paste("nrow train set: ", nrow(train_set), "\n"))
  cat(paste("nrow test set: ", nrow(test_set), "\n"))
  return(list(train = train_set,
              test = test_set))
}
The function was then used to split the data into train and test sets of 80% and 20% of the rows, respectively.
split_df <- split_data(prep_df,0.8,42)
## nrow train set: 14075
## nrow test set: 3519
train_df <- split_df$train
test_df <- split_df$test
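As an aside, caret also provides createDataPartition(), which draws a split stratified on the outcome; a minimal sketch of an equivalent 80/20 split (the custom function above was kept so the row counts reported here stay reproducible):
# Alternative: outcome-stratified 80/20 split with caret
set.seed(42)
idx <- createDataPartition(prep_df$log_price, p = 0.8, list = FALSE)
train_alt <- prep_df[idx, ]
test_alt <- prep_df[-idx, ]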
First, all features except the price column were used to train a linear regression model. Based on the model summary, the features were then filtered by excluding those with p-value > 0.05, which indicates a weaker statistically significant relationship with the target variable. The resampling method was k-fold cross-validation with k = 5, and standardization was applied as a pre-processing step.
ctrl <- trainControl(method = "cv",
                     number = 5, # 5-fold cross-validation
                     verboseIter = TRUE)
set.seed(6)
model <- train(log_price ~ .,
               data = train_df[,-21], # drop the original Price column (column 21)
               method = "lm",
               preProcess = c("center","scale"), # standardization
               trControl = ctrl
               )
## + Fold1: intercept=TRUE
## - Fold1: intercept=TRUE
## + Fold2: intercept=TRUE
## - Fold2: intercept=TRUE
## + Fold3: intercept=TRUE
## - Fold3: intercept=TRUE
## + Fold4: intercept=TRUE
## - Fold4: intercept=TRUE
## + Fold5: intercept=TRUE
## - Fold5: intercept=TRUE
## Aggregating results
## Fitting final model on full training set
summary(model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3479 -0.1558  0.0009  0.1472  1.2687 
## 
## Coefficients: (1 not defined because of singularities)
##                                               Estimate Std. Error  t value
## (Intercept)                                 13.0437520  0.0020992 6213.790
## `\`number of bedrooms\``                    -0.0114139  0.0027488   -4.152
## `\`number of bathrooms\``                    0.0500129  0.0038444   13.009
## `\`living area\``                            0.1392096  0.0062271   22.355
## `\`lot area\``                               0.0192663  0.0029750    6.476
## `\`number of floors\``                       0.0342804  0.0029738   11.527
## `\`waterfront present\``                     0.0338284  0.0023185   14.591
## `\`number of views\``                        0.0452024  0.0025385   17.807
## `\`condition of the house\``                 0.0459678  0.0023465   19.590
## `\`grade of the house\``                     0.1878391  0.0039059   48.091
## `\`Area of the house(excluding basement)\`` -0.0053355  0.0055485   -0.962
## `\`Area of the basement\``                          NA         NA       NA
## `\`Built Year\``                            -0.0921998  0.0032487  -28.381
## `\`Renovation Year\``                        0.0196999  0.0022568    8.729
## `\`Postal Code\``                           -0.0320279  0.0022646  -14.143
## Lattitude                                    0.1786522  0.0023199   77.008
## Longitude                                   -0.0097144  0.0025935   -3.746
## living_area_renov                            0.0690609  0.0036339   19.004
## lot_area_renov                              -0.0075016  0.0029990   -2.501
## `\`Number of schools nearby\``               0.0007150  0.0021007    0.340
## `\`Distance from the airport\``             -0.0001906  0.0021000   -0.091
##                                             Pr(>|t|)    
## (Intercept)                                  < 2e-16 ***
## `\`number of bedrooms\``                    3.31e-05 ***
## `\`number of bathrooms\``                    < 2e-16 ***
## `\`living area\``                            < 2e-16 ***
## `\`lot area\``                              9.72e-11 ***
## `\`number of floors\``                       < 2e-16 ***
## `\`waterfront present\``                     < 2e-16 ***
## `\`number of views\``                        < 2e-16 ***
## `\`condition of the house\``                 < 2e-16 ***
## `\`grade of the house\``                     < 2e-16 ***
## `\`Area of the house(excluding basement)\`` 0.336267    
## `\`Area of the basement\``                        NA    
## `\`Built Year\``                             < 2e-16 ***
## `\`Renovation Year\``                        < 2e-16 ***
## `\`Postal Code\``                            < 2e-16 ***
## Lattitude                                    < 2e-16 ***
## Longitude                                   0.000181 ***
## living_area_renov                            < 2e-16 ***
## lot_area_renov                              0.012382 *  
## `\`Number of schools nearby\``              0.733582    
## `\`Distance from the airport\``             0.927679    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.249 on 14055 degrees of freedom
## Multiple R-squared:  0.7794, Adjusted R-squared:  0.7791 
## F-statistic:  2613 on 19 and 14055 DF,  p-value: < 2.2e-16
It can be seen from the training result above that the features without significance stars in the coefficients section (`Area of the house(excluding basement)`, `Number of schools nearby`, and `Distance from the airport`) have p-values well above 0.05, while the coefficient of `Area of the basement` is undefined because of a singularity (it is linearly dependent on the other area columns). These four features were therefore removed before building the final models.
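As a cross-check, features whose p-values exceed the cutoff can be listed programmatically from the model summary; a small sketch (the 0.05 cutoff is the same one used above; terms dropped for singularities do not appear in this matrix):
# List model terms with p-value above 0.05
coefs <- summary(model)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] > 0.05]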
The final linear regression model was built keeping the same resampling (5-fold cross-validation) and pre-processing (standardization). Rsquared was used as the metric for evaluation.
set.seed(6)
linreg_model <- train(log_price ~ .
                      - (`Area of the house(excluding basement)`
                         + `Area of the basement`
                         + `Number of schools nearby`
                         + `Distance from the airport`),
                      data = train_df[,-21], # drop the original Price column
                      method = "lm",
                      preProcess = c("center","scale"), # standardization
                      trControl = ctrl
                      )
## + Fold1: intercept=TRUE
## - Fold1: intercept=TRUE
## + Fold2: intercept=TRUE
## - Fold2: intercept=TRUE
## + Fold3: intercept=TRUE
## - Fold3: intercept=TRUE
## + Fold4: intercept=TRUE
## - Fold4: intercept=TRUE
## + Fold5: intercept=TRUE
## - Fold5: intercept=TRUE
## Aggregating results
## Fitting final model on full training set
summary(linreg_model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.34736 -0.15586  0.00075  0.14694  1.26769 
## 
## Coefficients:
##                                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                   13.043752   0.002099 6214.222  < 2e-16 ***
## `\`number of bedrooms\``      -0.011380   0.002748   -4.141 3.48e-05 ***
## `\`number of bathrooms\``      0.050591   0.003797   13.323  < 2e-16 ***
## `\`living area\``              0.135335   0.004742   28.543  < 2e-16 ***
## `\`lot area\``                 0.019202   0.002973    6.458 1.10e-10 ***
## `\`number of floors\``         0.033017   0.002672   12.358  < 2e-16 ***
## `\`waterfront present\``       0.033727   0.002316   14.564  < 2e-16 ***
## `\`number of views\``          0.045677   0.002491   18.338  < 2e-16 ***
## `\`condition of the house\``   0.046142   0.002339   19.724  < 2e-16 ***
## `\`grade of the house\``       0.187317   0.003867   48.437  < 2e-16 ***
## `\`Built Year\``              -0.092176   0.003248  -28.377  < 2e-16 ***
## `\`Renovation Year\``          0.019717   0.002257    8.737  < 2e-16 ***
## `\`Postal Code\``             -0.032124   0.002261  -14.206  < 2e-16 ***
## Lattitude                      0.178901   0.002306   77.581  < 2e-16 ***
## Longitude                     -0.010193   0.002543   -4.008 6.16e-05 ***
## living_area_renov              0.068527   0.003592   19.079  < 2e-16 ***
## lot_area_renov                -0.007575   0.002997   -2.527   0.0115 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.249 on 14058 degrees of freedom
## Multiple R-squared:  0.7793, Adjusted R-squared:  0.7791 
## F-statistic:  3103 on 16 and 14058 DF,  p-value: < 2.2e-16
linreg_rsquared_train <- round(summary(linreg_model)$r.squared,4)
cat(paste("Linear Regression Train Rsquared: ",linreg_rsquared_train))
## Linear Regression Train Rsquared: 0.7793
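Note that summary(linreg_model)$r.squared is the Rsquared of the final fit on the full training set; the cross-validated estimate can be read off with caret's getTrainPerf(), for example:
# Cross-validated RMSE, Rsquared and MAE of the selected model
getTrainPerf(linreg_model)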
Similarly, a regularized regression model (glmnet) was trained, using Rsquared as the metric for final model selection.
set.seed(6)
glmnet_model <- train(log_price ~ .
                      - (`Area of the house(excluding basement)`
                         + `Area of the basement`
                         + `Number of schools nearby`
                         + `Distance from the airport`),
                      data = train_df[,-21], # drop the original Price column
                      method = "glmnet",
                      metric = "Rsquared",
                      preProcess = c("center","scale"), # standardization
                      trControl = ctrl
                      )
## + Fold1: alpha=0.10, lambda=0.0753
## - Fold1: alpha=0.10, lambda=0.0753
## + Fold1: alpha=0.55, lambda=0.0753
## - Fold1: alpha=0.55, lambda=0.0753
## + Fold1: alpha=1.00, lambda=0.0753
## - Fold1: alpha=1.00, lambda=0.0753
## + Fold2: alpha=0.10, lambda=0.0753
## - Fold2: alpha=0.10, lambda=0.0753
## + Fold2: alpha=0.55, lambda=0.0753
## - Fold2: alpha=0.55, lambda=0.0753
## + Fold2: alpha=1.00, lambda=0.0753
## - Fold2: alpha=1.00, lambda=0.0753
## + Fold3: alpha=0.10, lambda=0.0753
## - Fold3: alpha=0.10, lambda=0.0753
## + Fold3: alpha=0.55, lambda=0.0753
## - Fold3: alpha=0.55, lambda=0.0753
## + Fold3: alpha=1.00, lambda=0.0753
## - Fold3: alpha=1.00, lambda=0.0753
## + Fold4: alpha=0.10, lambda=0.0753
## - Fold4: alpha=0.10, lambda=0.0753
## + Fold4: alpha=0.55, lambda=0.0753
## - Fold4: alpha=0.55, lambda=0.0753
## + Fold4: alpha=1.00, lambda=0.0753
## - Fold4: alpha=1.00, lambda=0.0753
## + Fold5: alpha=0.10, lambda=0.0753
## - Fold5: alpha=0.10, lambda=0.0753
## + Fold5: alpha=0.55, lambda=0.0753
## - Fold5: alpha=0.55, lambda=0.0753
## + Fold5: alpha=1.00, lambda=0.0753
## - Fold5: alpha=1.00, lambda=0.0753
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.1, lambda = 0.000753 on full training set
glmnet_model
## glmnet
##
## 14075 samples
## 20 predictor
##
## Pre-processing: centered (16), scaled (16)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 11262, 11259, 11258, 11261, 11260
## Resampling results across tuning parameters:
##
##   alpha  lambda        RMSE       Rsquared   MAE      
##   0.10   0.0007530241  0.2492722  0.7788389  0.1905556
##   0.10   0.0075302413  0.2493201  0.7788018  0.1906322
##   0.10   0.0753024127  0.2539750  0.7758858  0.1948874
##   0.55   0.0007530241  0.2492717  0.7788279  0.1905758
##   0.55   0.0075302413  0.2497767  0.7782362  0.1912372
##   0.55   0.0753024127  0.2762759  0.7489048  0.2147028
##   1.00   0.0007530241  0.2492738  0.7788233  0.1905852
##   1.00   0.0075302413  0.2505547  0.7772489  0.1919947
##   1.00   0.0753024127  0.2977742  0.7228463  0.2320649
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.0007530241.
# Rsquared of the row matching the selected tuning parameters (bestTune)
best_row <- glmnet_model$results$alpha == glmnet_model$bestTune$alpha &
  glmnet_model$results$lambda == glmnet_model$bestTune$lambda
glmnet_rsquared_train <- round(glmnet_model$results$Rsquared[best_row],4)
cat(paste("Regularized Regression Train Rsquared: ",glmnet_rsquared_train))
## Regularized Regression Train Rsquared: 0.7788
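Because both models were trained with the same seed and the same trainControl object, their cross-validation folds match, so caret's resamples() can summarize the two side by side; a brief sketch:
# Compare cross-validated metrics of the two models fold by fold
cv_compare <- resamples(list(linear = linreg_model, glmnet = glmnet_model))
summary(cv_compare)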
In this step, the two models were used to predict the log-transformed house price on the test dataset, with Rsquared as the evaluation metric. The predictions were produced with the code below.
p_linreg <- predict(linreg_model,newdata = test_df)
p_glmnet <- predict(glmnet_model,newdata = test_df)
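Since the models predict the log of the price, predictions can be mapped back to the original price scale with exp() whenever actual prices are needed; note that this naive back-transform targets the conditional median rather than the mean, so it tends to slightly understate average prices.
# Back-transform predictions to the original price scale
price_pred_linreg <- exp(p_linreg)
price_pred_glmnet <- exp(p_glmnet)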
Rsquared values on the test dataset were then calculated for both models as 1 - SSR/SST.
# Evaluate: Rsquared = 1 - SSR/SST on the test set
error_linreg <- test_df$log_price - p_linreg
error_glmnet <- test_df$log_price - p_glmnet
ssr_linreg <- sum(error_linreg^2) # sum of squared residuals
ssr_glmnet <- sum(error_glmnet^2)
sst <- sum((test_df$log_price - mean(test_df$log_price))^2) # total sum of squares
linreg_rsquared_test <- round(1 - ssr_linreg/sst,4)
glmnet_rsquared_test <- round(1 - ssr_glmnet/sst,4)
cat(paste("Linear regression Test Rsquared: ", linreg_rsquared_test))
## Linear regression Test Rsquared: 0.764
cat(paste("Regularized regression Test Rsquared: ", glmnet_rsquared_test))
## Regularized regression Test Rsquared: 0.7642
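As a sanity check, caret's postResample() computes the same test-set metrics directly; its Rsquared is the squared correlation between predictions and observations, so it can differ slightly from the 1 - SSR/SST values above.
# caret's built-in test-set metrics: RMSE, Rsquared, MAE
postResample(pred = p_linreg, obs = test_df$log_price)
postResample(pred = p_glmnet, obs = test_df$log_price)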
result <- data.frame(
c(linreg_rsquared_train,glmnet_rsquared_train),
c(linreg_rsquared_test,glmnet_rsquared_test)
)
colnames(result) <- c("Train Rsquared", "Test Rsquared")
row.names(result) <- c("Linear Regression", "Regularized Regression")
kable(result,caption = "Rsquared Comparison")
|                        | Train Rsquared | Test Rsquared |
|:-----------------------|---------------:|--------------:|
| Linear Regression      |         0.7793 |        0.7640 |
| Regularized Regression |         0.7788 |        0.7642 |

Table: Rsquared Comparison
As the table above shows, both models generalize well, with train and test Rsquared values that are close to each other. Since the two performed almost identically, linear regression may be the more efficient choice for this problem, as its training process is less complex (no tuning of alpha and lambda is required).