This report documents how I built and compared a linear regression model and a regularized regression model with the caret package in R to predict house prices in India. The dataset, available at https://data.world/dataindianset2000/house-price-india, contains data from 2016 to 2017. All steps are covered, from reading the data into RStudio to evaluating the two models.
library(readr)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ purrr 1.0.1
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(caret)
## Warning: package 'caret' was built under R version 4.3.1
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(knitr)
The 2016 and 2017 data were saved separately in two .csv files. They were loaded into two data frames and stacked together using the bind_rows() command.
df1 <- read_csv("House Price India.csv",show_col_types = FALSE)
df2 <- read_csv("House Price India 2.csv",show_col_types = FALSE)
# join data frame
full_df <- bind_rows(df1,df2)
Using the glimpse() function, it can be seen that the dataset consists of 17,594 rows and 23 columns. All columns are of type double (dbl), which is suitable for building regression models.
# show all columns of the dataset
glimpse(full_df)
## Rows: 17,594
## Columns: 23
## $ id <dbl> 6762810145, 6762810635, 676281…
## $ Date <dbl> 42491, 42491, 42491, 42491, 42…
## $ `number of bedrooms` <dbl> 5, 4, 5, 4, 3, 3, 5, 3, 3, 4, …
## $ `number of bathrooms` <dbl> 2.50, 2.50, 2.75, 2.50, 2.00, …
## $ `living area` <dbl> 3650, 2920, 2910, 3310, 2710, …
## $ `lot area` <dbl> 9050, 4000, 9480, 42998, 4500,…
## $ `number of floors` <dbl> 2.0, 1.5, 1.5, 2.0, 1.5, 1.0, …
## $ `waterfront present` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `number of views` <dbl> 4, 0, 0, 0, 0, 0, 2, 0, 2, 0, …
## $ `condition of the house` <dbl> 5, 5, 3, 3, 4, 4, 3, 5, 4, 5, …
## $ `grade of the house` <dbl> 10, 8, 8, 9, 8, 9, 10, 8, 8, 7…
## $ `Area of the house(excluding basement)` <dbl> 3370, 1910, 2910, 3310, 1880, …
## $ `Area of the basement` <dbl> 280, 1010, 0, 0, 830, 900, 0, …
## $ `Built Year` <dbl> 1921, 1909, 1939, 2001, 1929, …
## $ `Renovation Year` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `Postal Code` <dbl> 122003, 122004, 122004, 122005…
## $ Lattitude <dbl> 52.8645, 52.8878, 52.8852, 52.…
## $ Longitude <dbl> -114.557, -114.470, -114.468, …
## $ living_area_renov <dbl> 2880, 2470, 2940, 3350, 2060, …
## $ lot_area_renov <dbl> 5400, 4000, 6600, 42847, 4500,…
## $ `Number of schools nearby` <dbl> 2, 2, 1, 3, 1, 1, 3, 3, 1, 2, …
## $ `Distance from the airport` <dbl> 58, 51, 53, 76, 51, 67, 72, 71…
## $ Price <dbl> 2380000, 1400000, 1200000, 838…
No missing values were found:
# Check data completeness (percentage of complete rows)
completeness <- full_df %>%
  complete.cases() %>%
  mean() * 100
cat(paste("Data completeness: ",completeness,"%"))
## Data completeness: 100 %
The id and Date columns were excluded, as they are not relevant for prediction.
# exclude the id and Date columns (columns 1 and 2)
prep_df <- full_df[,-c(1,2)]
The density plot of the price shows that the distribution is highly right-skewed.
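A minimal sketch of how such a density plot can be produced with ggplot2 (already loaded above), using the Price column of prep_df:
# Density plot of the raw price, illustrating the right skew
ggplot(prep_df, aes(x = Price)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  labs(title = "Distribution of House Prices", x = "Price", y = "Density")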
To improve the performance of the regression models, the skewness was reduced by applying a log transformation to the price column. In all further steps, this log-transformed column was used as the target column.
prep_df$log_price <- log(prep_df$Price)
A function for splitting the data into train and test sets was created.
split_data <- function(data, train_size, seed){
  set.seed(seed)
  n <- nrow(data)
  # sample row indices for the train set
  id <- sample(1:n, size = n * train_size)
  train_set <- data[id, ]
  test_set <- data[-id, ]
  cat(paste("nrow train set: ", nrow(train_set), "\n"))
  cat(paste("nrow test set: ", nrow(test_set), "\n"))
  return(list(train = train_set,
              test = test_set))
}
The function was then used to split the data into train and test sets of 80% and 20% of the rows, respectively.
split_df <- split_data(prep_df,0.8,42)
## nrow train set: 14075
## nrow test set: 3519
train_df <- split_df$train
test_df <- split_df$test
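As an aside, caret also provides createDataPartition(), which draws a split stratified on the outcome; a minimal sketch of an equivalent 80/20 split (the custom function above was kept so the row counts reported here stay reproducible):
# Alternative: outcome-stratified 80/20 split with caret
set.seed(42)
idx <- createDataPartition(prep_df$log_price, p = 0.8, list = FALSE)
train_alt <- prep_df[idx, ]
test_alt <- prep_df[-idx, ]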
First, all features except the price column were used to train a linear regression model. Based on the model summary, the features were then filtered by excluding those with p-value > 0.05, which indicates a weaker statistically significant relationship with the target variable. The resampling method was k-fold cross-validation with k = 5, and standardization was applied as a pre-processing step.
ctrl <- trainControl(method = "cv",
                     number = 5, # 5-fold cross-validation
                     verboseIter = TRUE)
set.seed(6)
model <- train(log_price ~ .,
               data = train_df[,-21], # drop the original Price column (column 21)
               method = "lm",
               preProcess = c("center","scale"), # standardization
               trControl = ctrl
               )
## + Fold1: intercept=TRUE
## - Fold1: intercept=TRUE
## + Fold2: intercept=TRUE
## - Fold2: intercept=TRUE
## + Fold3: intercept=TRUE
## - Fold3: intercept=TRUE
## + Fold4: intercept=TRUE
## - Fold4: intercept=TRUE
## + Fold5: intercept=TRUE
## - Fold5: intercept=TRUE
## Aggregating results
## Fitting final model on full training set
summary(model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3479 -0.1558  0.0009  0.1472  1.2687 
## 
## Coefficients: (1 not defined because of singularities)
##                                               Estimate Std. Error  t value
## (Intercept)                                 13.0437520  0.0020992 6213.790
## `\`number of bedrooms\``                    -0.0114139  0.0027488   -4.152
## `\`number of bathrooms\``                    0.0500129  0.0038444   13.009
## `\`living area\``                            0.1392096  0.0062271   22.355
## `\`lot area\``                               0.0192663  0.0029750    6.476
## `\`number of floors\``                       0.0342804  0.0029738   11.527
## `\`waterfront present\``                     0.0338284  0.0023185   14.591
## `\`number of views\``                        0.0452024  0.0025385   17.807
## `\`condition of the house\``                 0.0459678  0.0023465   19.590
## `\`grade of the house\``                     0.1878391  0.0039059   48.091
## `\`Area of the house(excluding basement)\`` -0.0053355  0.0055485   -0.962
## `\`Area of the basement\``                          NA         NA       NA
## `\`Built Year\``                            -0.0921998  0.0032487  -28.381
## `\`Renovation Year\``                        0.0196999  0.0022568    8.729
## `\`Postal Code\``                           -0.0320279  0.0022646  -14.143
## Lattitude                                    0.1786522  0.0023199   77.008
## Longitude                                   -0.0097144  0.0025935   -3.746
## living_area_renov                            0.0690609  0.0036339   19.004
## lot_area_renov                              -0.0075016  0.0029990   -2.501
## `\`Number of schools nearby\``               0.0007150  0.0021007    0.340
## `\`Distance from the airport\``             -0.0001906  0.0021000   -0.091
##                                             Pr(>|t|)    
## (Intercept)                                  < 2e-16 ***
## `\`number of bedrooms\``                    3.31e-05 ***
## `\`number of bathrooms\``                    < 2e-16 ***
## `\`living area\``                            < 2e-16 ***
## `\`lot area\``                              9.72e-11 ***
## `\`number of floors\``                       < 2e-16 ***
## `\`waterfront present\``                     < 2e-16 ***
## `\`number of views\``                        < 2e-16 ***
## `\`condition of the house\``                 < 2e-16 ***
## `\`grade of the house\``                     < 2e-16 ***
## `\`Area of the house(excluding basement)\`` 0.336267    
## `\`Area of the basement\``                        NA    
## `\`Built Year\``                             < 2e-16 ***
## `\`Renovation Year\``                        < 2e-16 ***
## `\`Postal Code\``                            < 2e-16 ***
## Lattitude                                    < 2e-16 ***
## Longitude                                   0.000181 ***
## living_area_renov                            < 2e-16 ***
## lot_area_renov                              0.012382 *  
## `\`Number of schools nearby\``              0.733582    
## `\`Distance from the airport\``             0.927679    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.249 on 14055 degrees of freedom
## Multiple R-squared:  0.7794, Adjusted R-squared:  0.7791 
## F-statistic:  2613 on 19 and 14055 DF,  p-value: < 2.2e-16
It can be seen from the training result above that the features without significance stars in the coefficients section (`Area of the house(excluding basement)`, `Number of schools nearby`, and `Distance from the airport`) have p-values well above 0.05, while the coefficient of `Area of the basement` is undefined because of a singularity (it is linearly dependent on the other area columns). These four features were therefore removed before building the final models.
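As a cross-check, features whose p-values exceed the cutoff can be listed programmatically from the model summary; a small sketch (the 0.05 cutoff is the same one used above; terms dropped for singularities do not appear in this matrix):
# List model terms with p-value above 0.05
coefs <- summary(model)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] > 0.05]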
The final linear regression model was built keeping the same resampling (5-fold cross-validation) and pre-processing (standardization). Rsquared was used as the metric for evaluation.
set.seed(6)
linreg_model <- train(log_price ~ .
                      - (`Area of the house(excluding basement)`
                         + `Area of the basement`
                         + `Number of schools nearby`
                         + `Distance from the airport`),
                      data = train_df[,-21], # drop the original Price column
                      method = "lm",
                      preProcess = c("center","scale"), # standardization
                      trControl = ctrl
                      )
## + Fold1: intercept=TRUE
## - Fold1: intercept=TRUE
## + Fold2: intercept=TRUE
## - Fold2: intercept=TRUE
## + Fold3: intercept=TRUE
## - Fold3: intercept=TRUE
## + Fold4: intercept=TRUE
## - Fold4: intercept=TRUE
## + Fold5: intercept=TRUE
## - Fold5: intercept=TRUE
## Aggregating results
## Fitting final model on full training set
summary(linreg_model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.34736 -0.15586  0.00075  0.14694  1.26769 
## 
## Coefficients:
##                                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                   13.043752   0.002099 6214.222  < 2e-16 ***
## `\`number of bedrooms\``      -0.011380   0.002748   -4.141 3.48e-05 ***
## `\`number of bathrooms\``      0.050591   0.003797   13.323  < 2e-16 ***
## `\`living area\``              0.135335   0.004742   28.543  < 2e-16 ***
## `\`lot area\``                 0.019202   0.002973    6.458 1.10e-10 ***
## `\`number of floors\``         0.033017   0.002672   12.358  < 2e-16 ***
## `\`waterfront present\``       0.033727   0.002316   14.564  < 2e-16 ***
## `\`number of views\``          0.045677   0.002491   18.338  < 2e-16 ***
## `\`condition of the house\``   0.046142   0.002339   19.724  < 2e-16 ***
## `\`grade of the house\``       0.187317   0.003867   48.437  < 2e-16 ***
## `\`Built Year\``              -0.092176   0.003248  -28.377  < 2e-16 ***
## `\`Renovation Year\``          0.019717   0.002257    8.737  < 2e-16 ***
## `\`Postal Code\``             -0.032124   0.002261  -14.206  < 2e-16 ***
## Lattitude                      0.178901   0.002306   77.581  < 2e-16 ***
## Longitude                     -0.010193   0.002543   -4.008 6.16e-05 ***
## living_area_renov              0.068527   0.003592   19.079  < 2e-16 ***
## lot_area_renov                -0.007575   0.002997   -2.527   0.0115 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.249 on 14058 degrees of freedom
## Multiple R-squared:  0.7793, Adjusted R-squared:  0.7791 
## F-statistic:  3103 on 16 and 14058 DF,  p-value: < 2.2e-16
linreg_rsquared_train <- round(summary(linreg_model)$r.squared,4)
cat(paste("Linear Regression Train Rsquared: ",linreg_rsquared_train))
## Linear Regression Train Rsquared: 0.7793
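Note that summary(linreg_model)$r.squared is the Rsquared of the final fit on the full training set; the cross-validated estimate can be read off with caret's getTrainPerf(), for example:
# Cross-validated RMSE, Rsquared and MAE of the selected model
getTrainPerf(linreg_model)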
Similarly, a regularized regression model (glmnet) was trained, using Rsquared as the metric for final model selection.
set.seed(6)
glmnet_model <- train(log_price ~ .
                      - (`Area of the house(excluding basement)`
                         + `Area of the basement`
                         + `Number of schools nearby`
                         + `Distance from the airport`),
                      data = train_df[,-21], # drop the original Price column
                      method = "glmnet",
                      metric = "Rsquared",
                      preProcess = c("center","scale"), # standardization
                      trControl = ctrl
                      )
## + Fold1: alpha=0.10, lambda=0.0753
## - Fold1: alpha=0.10, lambda=0.0753
## + Fold1: alpha=0.55, lambda=0.0753
## - Fold1: alpha=0.55, lambda=0.0753
## + Fold1: alpha=1.00, lambda=0.0753
## - Fold1: alpha=1.00, lambda=0.0753
## + Fold2: alpha=0.10, lambda=0.0753
## - Fold2: alpha=0.10, lambda=0.0753
## + Fold2: alpha=0.55, lambda=0.0753
## - Fold2: alpha=0.55, lambda=0.0753
## + Fold2: alpha=1.00, lambda=0.0753
## - Fold2: alpha=1.00, lambda=0.0753
## + Fold3: alpha=0.10, lambda=0.0753
## - Fold3: alpha=0.10, lambda=0.0753
## + Fold3: alpha=0.55, lambda=0.0753
## - Fold3: alpha=0.55, lambda=0.0753
## + Fold3: alpha=1.00, lambda=0.0753
## - Fold3: alpha=1.00, lambda=0.0753
## + Fold4: alpha=0.10, lambda=0.0753
## - Fold4: alpha=0.10, lambda=0.0753
## + Fold4: alpha=0.55, lambda=0.0753
## - Fold4: alpha=0.55, lambda=0.0753
## + Fold4: alpha=1.00, lambda=0.0753
## - Fold4: alpha=1.00, lambda=0.0753
## + Fold5: alpha=0.10, lambda=0.0753
## - Fold5: alpha=0.10, lambda=0.0753
## + Fold5: alpha=0.55, lambda=0.0753
## - Fold5: alpha=0.55, lambda=0.0753
## + Fold5: alpha=1.00, lambda=0.0753
## - Fold5: alpha=1.00, lambda=0.0753
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.1, lambda = 0.000753 on full training set
glmnet_model
## glmnet
##
## 14075 samples
## 20 predictor
##
## Pre-processing: centered (16), scaled (16)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 11262, 11259, 11258, 11261, 11260
## Resampling results across tuning parameters:
##
##   alpha  lambda        RMSE       Rsquared   MAE      
##   0.10   0.0007530241  0.2492722  0.7788389  0.1905556
##   0.10   0.0075302413  0.2493201  0.7788018  0.1906322
##   0.10   0.0753024127  0.2539750  0.7758858  0.1948874
##   0.55   0.0007530241  0.2492717  0.7788279  0.1905758
##   0.55   0.0075302413  0.2497767  0.7782362  0.1912372
##   0.55   0.0753024127  0.2762759  0.7489048  0.2147028
##   1.00   0.0007530241  0.2492738  0.7788233  0.1905852
##   1.00   0.0075302413  0.2505547  0.7772489  0.1919947
##   1.00   0.0753024127  0.2977742  0.7228463  0.2320649
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.0007530241.
# Rsquared of the row matching the selected tuning parameters (bestTune)
best_row <- glmnet_model$results$alpha == glmnet_model$bestTune$alpha &
  glmnet_model$results$lambda == glmnet_model$bestTune$lambda
glmnet_rsquared_train <- round(glmnet_model$results$Rsquared[best_row],4)
cat(paste("Regularized Regression Train Rsquared: ",glmnet_rsquared_train))
## Regularized Regression Train Rsquared: 0.7788
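Because both models were trained with the same seed and the same trainControl object, their cross-validation folds match, so caret's resamples() can summarize the two side by side; a brief sketch:
# Compare cross-validated metrics of the two models fold by fold
cv_compare <- resamples(list(linear = linreg_model, glmnet = glmnet_model))
summary(cv_compare)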
In this step, the two models were used to predict the log-transformed house price on the test dataset, with Rsquared as the evaluation metric. The predictions were produced with the code below.
p_linreg <- predict(linreg_model,newdata = test_df)
p_glmnet <- predict(glmnet_model,newdata = test_df)
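Since the models predict the log of the price, predictions can be mapped back to the original price scale with exp() whenever actual prices are needed; note that this naive back-transform targets the conditional median rather than the mean, so it tends to slightly understate average prices.
# Back-transform predictions to the original price scale
price_pred_linreg <- exp(p_linreg)
price_pred_glmnet <- exp(p_glmnet)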
Rsquared values on the test dataset were then calculated for both models as 1 - SSR/SST.
# Evaluate: Rsquared = 1 - SSR/SST on the test set
error_linreg <- test_df$log_price - p_linreg
error_glmnet <- test_df$log_price - p_glmnet
ssr_linreg <- sum(error_linreg^2) # sum of squared residuals
ssr_glmnet <- sum(error_glmnet^2)
sst <- sum((test_df$log_price - mean(test_df$log_price))^2) # total sum of squares
linreg_rsquared_test <- round(1 - ssr_linreg/sst,4)
glmnet_rsquared_test <- round(1 - ssr_glmnet/sst,4)
cat(paste("Linear regression Test Rsquared: ", linreg_rsquared_test))
## Linear regression Test Rsquared: 0.764
cat(paste("Regularized regression Test Rsquared: ", glmnet_rsquared_test))
## Regularized regression Test Rsquared: 0.7642
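As a sanity check, caret's postResample() computes the same test-set metrics directly; its Rsquared is the squared correlation between predictions and observations, so it can differ slightly from the 1 - SSR/SST values above.
# caret's built-in test-set metrics: RMSE, Rsquared, MAE
postResample(pred = p_linreg, obs = test_df$log_price)
postResample(pred = p_glmnet, obs = test_df$log_price)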
result <- data.frame(
c(linreg_rsquared_train,glmnet_rsquared_train),
c(linreg_rsquared_test,glmnet_rsquared_test)
)
colnames(result) <- c("Train Rsquared", "Test Rsquared")
row.names(result) <- c("Linear Regression", "Regularized Regression")
kable(result,caption = "Rsquared Comparison")
|                        | Train Rsquared | Test Rsquared |
|:-----------------------|---------------:|--------------:|
| Linear Regression      |         0.7793 |        0.7640 |
| Regularized Regression |         0.7788 |        0.7642 |

Table: Rsquared Comparison
As the table above shows, both models generalize well, with train and test Rsquared values that are close to each other. Since the two performed almost identically, linear regression may be the more efficient choice for this problem, as its training process is less complex (no tuning of alpha and lambda is required).