Smart Investment Decisions: Which house should I buy in London

Abstract

This report estimates housing prices in London with estimation engines to guide investment decisions. Transaction data from houses sold in London in 2019, together with detailed information on these houses, are used to build seven different estimation engines: linear regression, LASSO regression, k-nearest neighbours, a regression tree, random forest, gradient boosting, and stacking. The best model is then used to select the 200 most promising houses out of 2,000 houses that are currently on sale, by calculating the deviation between the predicted price and the asking price.

Introduction

London house prices have risen substantially above general inflation since 1995. Given this positive development, investing in London property seems very lucrative. However, house prices have crashed twice since 1990, and the current uncertainty in the British economy caused by Brexit and the COVID-19 pandemic may affect house prices. Recent government initiatives to lower property prices add a further element of uncertainty. This may lead to lower house prices, making property highly interesting as a mid- to long-term investment. Since some houses are overpriced, however, it is crucial to gather and evaluate detailed information on each property before taking an investment decision. The purpose of this project is to build an estimation engine to guide investment decisions in the London housing market. This estimation engine predicts a price based on detailed information on the property for sale, such as location, size, and energy efficiency, which I compare to the asking price. I use publicly available data on 2019 transactions in London from the Land Registry's Price Paid Data, Energy Performance Certificate (EPC) data, and public transport information to determine the effects of the various variables. In this report, I explain how I obtained the data, which machine learning algorithms I used, and how I tuned them. Finally, I apply the estimation engine to recommend the top 200 out of 2,000 houses currently on the market for sale.

Body

Data Used

For my project, I combine three datasets. I use publicly available data on transactions that occurred in London in 2019 from the Land Registry's Price Paid Data, which tracks property sales in England and Wales and includes details on property types. I merge this with detailed information about each property from a publicly available Energy Performance Certificate (EPC) dataset, from which I retrieve variables including size, number of bedrooms, and energy ratings. Finally, I add public transport information such as the nearest station, walking distance to the station, and the number of lines serving each property. I clean the dataset, making sure that the correct data type is assigned to each variable, and remove variables with too many missing datapoints.

#read in the data
library(data.table)
library(tidyverse) #the cleaning below uses dplyr verbs and the pipe
london_house_prices_2019_training<-read.csv("data/training_data_assignment_with_prices.csv")
london_house_prices_2019_out_of_sample<-read.csv("data/test_data_assignment.csv")

#fix dates
london_house_prices_2019_training <- london_house_prices_2019_training %>% mutate(date=as.Date(date))
#change characters to factors
london_house_prices_2019_training <- london_house_prices_2019_training %>% mutate_if(is.character,as.factor)
london_house_prices_2019_out_of_sample<-london_house_prices_2019_out_of_sample %>% mutate_if(is.character,as.factor)
#remove address2 and town because of missingness
london_house_prices_2019_training <- london_house_prices_2019_training %>% select(-c(town, address2))
london_house_prices_2019_out_of_sample<-london_house_prices_2019_out_of_sample %>% select(-c(town, address2))

#make sure out of sample data and training data have the same levels for factors 
a<-union(levels(london_house_prices_2019_training$postcode_short),levels(london_house_prices_2019_out_of_sample$postcode_short))
london_house_prices_2019_out_of_sample$postcode_short <- factor(london_house_prices_2019_out_of_sample$postcode_short, levels = a)
london_house_prices_2019_training$postcode_short <- factor(london_house_prices_2019_training$postcode_short, levels = a)
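
The three-way merge described above can be sketched in miniature with base R's `merge()`; the join key (`address`) and all values below are hypothetical stand-ins, since the real join columns depend on the raw files.

```r
# Minimal sketch of combining the three data sources; the join key
# (address) and all values are made-up stand-ins for the real columns.
ppd  <- data.frame(address = c("1 A St", "2 B Rd", "3 C Ln"),
                   price   = c(500000, 650000, 420000))
epc  <- data.frame(address = c("1 A St", "2 B Rd", "3 C Ln"),
                   total_floor_area      = c(70, 95, NA),
                   current_energy_rating = c("C", "D", "C"))
tube <- data.frame(address = c("1 A St", "2 B Rd", "3 C Ln"),
                   distance_to_station = c(0.4, 1.1, 0.8))

houses <- merge(merge(ppd, epc, by = "address"), tube, by = "address")
houses <- houses[complete.cases(houses), ]  # drop rows with missing data
houses$current_energy_rating <- as.factor(houses$current_energy_rating)
nrow(houses)  # 2 properties remain with complete information
```

In the actual pipeline the same steps are expressed with dplyr joins and `mutate_if(is.character, as.factor)`.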

Visualize data

Before visualizing the data, I calculate the overall average price and the average price per square metre.

london_house_prices_2019_training %>% summarise(average_price =mean(price), average_price_sqmtr =mean(price/total_floor_area))
##   average_price average_price_sqmtr
## 1        593791                6343

To get a good initial understanding on the structure of the data and the relationship between housing prices and the explanatory variables, I create a few visualizations:

First, I plot the distribution of the prices and detect that they are right skewed, with a mean price of roughly 594,000 pounds.

library(patchwork)
p1 <- ggplot(london_house_prices_2019_training, aes(x=price, fill=(price < 2000000)))+
  geom_histogram()+
  labs(y="Number of Properties", x="House Price", title="Distribution of House Prices in London")+
  theme_classic()+ #add theme
  scale_x_continuous(labels=scales::dollar_format())+
  theme(legend.position = "none")+
  scale_fill_manual(values=c("dark grey", "#5691B0" ))+
  geom_vline(xintercept = 593790.9, color = 'red', linetype = 'dashed') +
  annotate(geom="text", x = 2000790.9, y = 8500, label='Average Price\n = ~594,000', color = 'red', size=3.5) +
  NULL
  
p2 <- ggplot(london_house_prices_2019_training, aes(x=price, fill=(price < 2000000)))+
  geom_histogram()+
  labs(y="Number of Properties", x="House Price", title="Deep Dive:", subtitle= "Distribution of House Prices (< 2 Mio) in London")+
  theme_classic()+ #add theme
  theme(legend.position = "none")+
  scale_x_continuous(labels=scales::dollar_format(), limits = c(0,2000000))+
  scale_fill_manual(values=c("dark grey", "#5691B0" ))+
  NULL

p1+p2

Second, I plot the average price per property type as well as the frequency of each property type. Detached houses, the least frequently sold property type, are on average the most expensive, while flats, the most frequently sold property type, are on average the cheapest properties to buy.

p1 <- london_house_prices_2019_training %>% 
  group_by(property_type) %>% 
  summarise(average_price=mean(price)) %>% 
  ggplot(aes(y=average_price, x=reorder(property_type, -average_price), fill=property_type))+ 
  geom_col()+
  labs(y="Average Price", x="Property Type", title="Average Price per Property Type")+
  scale_x_discrete(labels = c('Detached','Terraced',  'Semi- Detached',"Flats/Maisonettes"))+
  scale_fill_manual(values=c("#317395","#B5D7E9", "#76B3D3", "#5691B0" ))+
  theme_classic()+ #add theme
  theme(legend.position = "none")+
  scale_y_continuous(labels=scales::dollar_format())+
  NULL

p2 <- london_house_prices_2019_training %>% 
  group_by(property_type) %>% 
  summarise(count=n()) %>% 
  ggplot(aes(x=reorder(property_type, -count), y=count, fill=property_type))+
  geom_col()+
  scale_x_discrete(labels = c("Flats/Maisonettes", 'Terraced', 'Semi- Detached', 'Detached'))+
  scale_fill_manual(values=c("#317395", "#B5D7E9", "#76B3D3", "#5691B0"))+
  theme_classic()+ #add theme
  theme(legend.position = "none")+
  scale_y_continuous(labels=scales::comma)+
  labs(y="Number of Properties", x="Property Type", title="Frequency of Property Types sold in 2019")+
  NULL

p1+p2

To understand the influence of a property's size and zone, I plot these variables and detect a strong relationship between size and price, as well as between London zone and price.

london_house_prices_2019_training %>%
  mutate(london_zone2=as.factor(london_zone)) %>% 
  ggplot(aes(y=price, x=total_floor_area, colour=london_zone2))+ 
  #geom_smooth()+
  geom_point(alpha=0.35)+
  labs(x="Size of Flat (in Sqmtr)", y="Price", title="Positive Relationship between Property Price and Size", colour="London Zone")+
  theme_classic()+ #add theme
  scale_y_continuous(labels=scales::dollar_format())+
  NULL

Then, I visualize the positive relationship between average income and price.

london_house_prices_2019_training %>% 
  ggplot(aes(y=price, x=average_income))+ 
  geom_point(alpha=0.35)+
  geom_smooth()+
  labs(x="Average Income", y="Price", title="Relationship between Average Income and Price")+
  theme_classic()+ #add theme
  scale_y_continuous(labels=scales::dollar_format(), limits=c(0, 3000000))+
  scale_x_continuous(labels=scales::comma)+
  NULL

Next, I look at the average price per square metre by district.

london_house_prices_2019_training %>% 
  mutate(price_floor = price/total_floor_area) %>%
  group_by(district) %>% 
  summarise(average_price_floor=mean(price_floor)) %>% 
  ggplot(aes(x=average_price_floor, y=reorder(district, average_price_floor)))+ 
  geom_col()+
  labs(x="Average Price per Squaremetre", y="", title="Average Price per Squaremetre by District")+
  theme_classic()+ #add theme
  NULL

Finally, I check if there are strong correlations between the variables. Some variables, such as total floor area, number of habitable rooms, and current CO2 emissions, correlate strongly. Whilst it is important to keep such correlations in mind, they do not constitute a problem when building models for prediction.

# produce a correlation table using GGally::ggcor()

library("GGally")
london_house_prices_2019_training %>% 
  select(-ID) %>% #keep Y variable last
  ggcorr(method = c("pairwise", "pearson"), layout.exp = 2,label_round=2, label = TRUE,label_size = 2,hjust = 1,nbreaks = 5,size = 2,angle = -20)

Tuning the Models

Before creating models, I split the data into training and testing data. I use the training dataset to build my models and subsequently test them on the testing data. Having an outcome variable for the testing data, I can detect whether my model overfits before using it for predictions on unlabelled data (houses with no transaction price).

#let's do the initial split
set.seed(1)
library(rsample)
train_test_split <- initial_split(london_house_prices_2019_training, prop = 0.75) #training set contains 75% of the data
train_data <- training(train_test_split)
test_data <- testing(train_test_split)

As a first step, I set a seed and define a cross-validation scheme that I use for all my models. Cross-validation is a technique to test the predictive power of a model on data that were not used to fit it, which is valuable when the number of observations is limited. The seed makes it possible to replicate the models with exactly the same results.

#Define control variables
set.seed(1)#because I use cross-validation and want to be able to replicate the model
control <- trainControl(
    method="cv", #cross-fold validation
    number=10,
    verboseIter=TRUE) #by setting this to true the model will report its progress after each estimation
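
To make the mechanics concrete, 10-fold cross-validation can be sketched by hand on simulated data; this is a toy illustration of what caret does internally, not part of the actual pipeline.

```r
# Toy sketch of 10-fold CV: each row is assigned to one fold, the model is
# fit on the other nine folds, and the error is measured on the held-out fold.
set.seed(1)
n     <- 100
x     <- runif(n)
y     <- 2 * x + rnorm(n, sd = 0.1)
folds <- sample(rep(1:10, length.out = n))  # assign each row to one of 10 folds

rmse_per_fold <- sapply(1:10, function(k) {
  fit  <- lm(y ~ x, data = data.frame(x, y)[folds != k, ])   # train on 9 folds
  pred <- predict(fit, newdata = data.frame(x = x[folds == k]))
  sqrt(mean((y[folds == k] - pred)^2))                       # error on held-out fold
})
mean(rmse_per_fold)  # the cross-validated RMSE that caret reports
```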

I tune all models by maximizing R² and minimizing RMSE. R² is the share of variance in the data explained by the model: if a model has an R² of 80%, for example, it explains 80% of the price differences between the houses in my dataset. RMSE, on the other hand, stands for root mean squared error and measures the typical prediction error of the model.
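
Both metrics can be computed directly from a handful of made-up predictions. (caret's `RMSE()` works exactly like this; its `R2()` uses the squared correlation between predicted and actual values, a closely related definition.)

```r
# Hand-computing the two tuning metrics on made-up numbers
actual    <- c(300000, 450000, 600000)
predicted <- c(320000, 430000, 590000)

rmse <- sqrt(mean((actual - predicted)^2))  # typical size of the prediction error
r2   <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

round(rmse)   # about 17,321
round(r2, 2)  # 0.98
```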

Linear Regression

The first model I create is a linear regression, which looks for the line of best fit between the variables. I use a stepwise approach when selecting the variables: I include all of them and subsequently remove insignificant ones. The only variables I exclude from the beginning are illogical variables such as latitude and longitude, variables with missing data, and variables that are missing from the dataset on which I will make the final predictions. Latitude and longitude are illogical for a linear regression because I assume prices to be higher in the centre of London, so a linear relationship between longitude/latitude and price is unlikely.

1 Linear Regression

#we are going to train the model and report the results using k-fold cross validation
model_lm_0<-train(
    price ~ 
    num_tube_lines 
    +num_rail_lines
    +num_light_rail_lines
    +distance_to_station 
    #+nearest_station #not used: new stations cause problems in out of sample data
    +type_of_closest_station
    +whether_old_or_new
    +freehold_or_leasehold
    +london_zone
    #+postcode_short #too many variables
    #+local_aut #not in out of sample data
    +average_income
    +total_floor_area
    +number_habitable_rooms 
    +property_type
    +tenure
    +current_energy_rating
    +energy_consumption_potential
    +energy_consumption_current
    +windows_energy_eff
    +co2_emissions_potential
    +co2_emissions_current
    +water_company
    ,
    train_data,
    method = "lm",
    trControl = control
    )

# summary of the results
model_lm_0$results

After excluding all insignificant variables, I create interactions between the most important variables (e.g., total floor area, number of habitable rooms, and London zone), as well as non-linear terms (e.g., (total floor area)²). Then, I replace the geographical categorical variable postcode_short with London zone because, while many postcodes turn out to be insignificant, London zone seems to be a good indicator of the geographical distribution of prices.
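
To see what such terms do, one can inspect the columns that a small interaction-with-polynomial formula expands into (the tiny data frame below is made up; the real model uses the full training data):

```r
# The formula london_zone * poly(total_floor_area, 2) expands into the
# main effect, the linear and squared floor-area terms, and their interactions.
toy <- data.frame(london_zone      = c(1, 2, 3, 4),
                  total_floor_area = c(50, 70, 90, 120))
colnames(model.matrix(~ london_zone * poly(total_floor_area, 2), data = toy))
# intercept, london_zone, two polynomial terms, and two interaction terms
```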

2 Linear Regression

#we are going to train the model and report the results using k-fold cross validation
model_lm<-train(
    price ~ 
    num_tube_lines 
    +district:property_type
    +london_zone*poly(total_floor_area,2)*number_habitable_rooms 
    +average_income
    +energy_consumption_potential
    +energy_consumption_current
    +current_energy_rating
    +windows_energy_eff
    +co2_emissions_potential
    +co2_emissions_current
    +water_company
    ,
    train_data,
    method = "lm",
    trControl = control
    )
# summary of the results
model_lm$results
##   intercept   RMSE Rsquared    MAE RMSESD RsquaredSD MAESD
## 1      TRUE 233128    0.804 117778  40439     0.0284  7436

Then, I plot the importance of each variable in the final model:

# we can check variable importance as well
importance <- varImp(model_lm, scale=TRUE)
plot(importance)

Prediction lm

Below, I use the predict function to test the performance of the model on the testing data and summarize the performance of the linear regression model.

# We can predict the testing values

predictions_lm <- predict(model_lm,test_data)

lm_results<-data.frame(RMSE = RMSE(predictions_lm, test_data$price), #typical prediction error
                            Rsquare = R2(predictions_lm, test_data$price)) #variance explained by the model
lm_results                         
##     RMSE Rsquare
## 1 208392   0.835

The performance of the model (in R²): Training: 0.8102; Testing: 0.8359.

LASSO

As a second model, I perform a LASSO regression using the same variables as in the linear regression. LASSO regression is a type of linear regression that shrinks the impact of the variables (regularization) and eliminates insignificant variables (parameter selection). To do so, it artificially introduces a bias (lambda) that adds a penalty to the coefficient of each variable. As a consequence, all variables get smaller coefficients and some go down to exactly zero, resulting in a simpler model with less variance but a higher bias. I optimize the model by calculating the error of the model (RMSE, root mean squared error) as well as its explanatory power (R²) for different penalties (lambda). Finally, I select the model with the lowest RMSE and highest R².
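
The core of the penalty can be illustrated with the soft-thresholding operator that the LASSO effectively applies to each coefficient; this is a sketch of the idea behind glmnet, not its actual implementation.

```r
# Soft-thresholding: shrink every coefficient towards zero by lambda, and
# set it exactly to zero once the penalty exceeds its magnitude.
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1)
# -> -2, 0, 0, 1: small coefficients are eliminated, large ones are shrunk
```

This is why a larger lambda produces a simpler model: more coefficients fall below the penalty and drop out entirely.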

#split data into training & testing -> already done
#we optimize lambda over this sequence
lambda_seq <- seq(0, 1000, length = 100)

#we use cross fold validation
set.seed(1)
control <- trainControl(
  method="cv",
  number = 10,
  verboseIter = FALSE)

#LASSO regression to select the best lambda
set.seed(1)
lasso_fit <- train(price ~
    num_tube_lines #not significant
    +whether_old_or_new #not significant
    +freehold_or_leasehold #not significant
    +distance_to_station
    +district:property_type
    +london_zone*poly(total_floor_area,2)*number_habitable_rooms 
    +average_income
    +energy_consumption_potential
    +energy_consumption_current #new
    +current_energy_rating #new
    +windows_energy_eff
    +co2_emissions_potential
    +co2_emissions_current #new
    +water_company,
    data=train_data,
    method="glmnet",
    preProc = c("center", "scale"), #standardize the data before running the LASSO regression
    trControl = control,
    tuneGrid = expand.grid(alpha = 1, lambda = lambda_seq) #alpha=1 runs a LASSO regression; alpha=0 would run ridge regression
)

coef(lasso_fit$finalModel, lasso_fit$bestTune$lambda)
## 166 x 1 sparse Matrix of class "dgCMatrix"
##                                                                       1
## (Intercept)                                                    594480.4
## num_tube_lines                                                  15627.7
## whether_old_or_newY                                               224.5
## freehold_or_leaseholdL                                         -16424.4
## distance_to_station                                             -2672.4
## london_zone                                                   -183472.0
## poly(total_floor_area, 2)1                                    1014162.8
## poly(total_floor_area, 2)2                                     321756.7
## number_habitable_rooms                                         -77333.0
## average_income                                                  60282.3
## energy_consumption_potential                                   -31968.4
## energy_consumption_current                                      -3455.7
## current_energy_ratingC                                           7340.9
## current_energy_ratingD                                          11157.3
## current_energy_ratingE                                           1872.5
## current_energy_ratingF                                          -6706.5
## current_energy_ratingG                                          -6338.3
## windows_energy_effGood                                           6860.1
## windows_energy_effPoor                                          12230.9
## windows_energy_effVery Good                                      5449.2
## windows_energy_effVery Poor                                     15294.0
## co2_emissions_potential                                         52468.6
## co2_emissions_current                                           19618.9
## water_companyEssex & Suffolk Water                               1951.1
## water_companyLeep Utilities                                       480.6
## water_companySES Water                                          16647.1
## water_companyThames Water                                       19487.2
## districtBarking and Dagenham:property_typeD                       741.2
## districtBarnet:property_typeD                                    8899.5
## districtBexley:property_typeD                                    6705.5
## districtBrent:property_typeD                                       17.9
## districtBromley:property_typeD                                   4916.5
## districtCamden:property_typeD                                   20854.7
## districtCity of London:property_typeD                               .  
## districtCroydon:property_typeD                                      .  
## districtEaling:property_typeD                                   -2317.4
## districtEnfield:property_typeD                                   1697.5
## districtGreenwich:property_typeD                                -2060.0
## districtHackney:property_typeD                                      .  
## districtHammersmith and Fulham:property_typeD                       .  
## districtHaringey:property_typeD                                     .  
## districtHarrow:property_typeD                                   15067.9
## districtHavering:property_typeD                                  8804.1
## districtHillingdon:property_typeD                               11477.0
## districtHounslow:property_typeD                                 -1948.6
## districtIslington:property_typeD                                 3586.6
## districtKensington and Chelsea:property_typeD                   26187.9
## districtKingston upon Thames:property_typeD                     17165.4
## districtLambeth:property_typeD                                      .  
## districtLewisham:property_typeD                                 -3538.1
## districtMerton:property_typeD                                    7032.9
## districtNewham:property_typeD                                    -830.4
## districtRedbridge:property_typeD                                 -505.6
## districtRichmond upon Thames:property_typeD                     21716.2
## districtSouthwark:property_typeD                                 1695.9
## districtSutton:property_typeD                                   -2205.3
## districtTower Hamlets:property_typeD                                .  
## districtWaltham Forest:property_typeD                           -2466.1
## districtWandsworth:property_typeD                                1418.8
## districtWestminster:property_typeD                                  .  
## districtBarking and Dagenham:property_typeF                     -2026.1
## districtBarnet:property_typeF                                   -3295.1
## districtBexley:property_typeF                                   -6900.2
## districtBrent:property_typeF                                     1751.8
## districtBromley:property_typeF                                 -10417.3
## districtCamden:property_typeF                                   17719.7
## districtCity of London:property_typeF                            6747.4
## districtCroydon:property_typeF                                 -11591.3
## districtEaling:property_typeF                                   -3341.8
## districtEnfield:property_typeF                                  -3003.8
## districtGreenwich:property_typeF                                -4605.3
## districtHackney:property_typeF                                   5717.0
## districtHammersmith and Fulham:property_typeF                    6835.5
## districtHaringey:property_typeF                                  2921.2
## districtHarrow:property_typeF                                   -3248.2
## districtHavering:property_typeF                                 -1685.6
## districtHillingdon:property_typeF                               -2903.0
## districtHounslow:property_typeF                                 -4087.1
## districtIslington:property_typeF                                 7575.7
## districtKensington and Chelsea:property_typeF                   58576.9
## districtKingston upon Thames:property_typeF                     -6698.3
## districtLambeth:property_typeF                                    190.1
## districtLewisham:property_typeF                                 -9213.2
## districtMerton:property_typeF                                   -5475.4
## districtNewham:property_typeF                                   -2019.2
## districtRedbridge:property_typeF                                -6521.9
## districtRichmond upon Thames:property_typeF                     -1321.1
## districtSouthwark:property_typeF                                 3226.7
## districtSutton:property_typeF                                  -13613.8
## districtTower Hamlets:property_typeF                            -5952.0
## districtWaltham Forest:property_typeF                            -377.0
## districtWandsworth:property_typeF                                 432.1
## districtWestminster:property_typeF                              51586.5
## districtBarking and Dagenham:property_typeS                     -2821.9
## districtBarnet:property_typeS                                    9386.9
## districtBexley:property_typeS                                   -9391.0
## districtBrent:property_typeS                                      941.9
## districtBromley:property_typeS                                  -4364.0
## districtCamden:property_typeS                                    4957.4
## districtCity of London:property_typeS                               .  
## districtCroydon:property_typeS                                 -14299.3
## districtEaling:property_typeS                                     545.1
## districtEnfield:property_typeS                                    229.0
## districtGreenwich:property_typeS                                -8143.4
## districtHackney:property_typeS                                   4730.4
## districtHammersmith and Fulham:property_typeS                    4410.2
## districtHaringey:property_typeS                                  -583.5
## districtHarrow:property_typeS                                    3694.2
## districtHavering:property_typeS                                  6868.8
## districtHillingdon:property_typeS                                5663.6
## districtHounslow:property_typeS                                   732.9
## districtIslington:property_typeS                                10416.9
## districtKensington and Chelsea:property_typeS                   21702.0
## districtKingston upon Thames:property_typeS                      6164.4
## districtLambeth:property_typeS                                  -5793.6
## districtLewisham:property_typeS                                 -9075.0
## districtMerton:property_typeS                                   -3421.5
## districtNewham:property_typeS                                   -1764.4
## districtRedbridge:property_typeS                                -8344.5
## districtRichmond upon Thames:property_typeS                     18367.4
## districtSouthwark:property_typeS                                 4246.4
## districtSutton:property_typeS                                   -7473.4
## districtTower Hamlets:property_typeS                                .  
## districtWaltham Forest:property_typeS                            2508.4
## districtWandsworth:property_typeS                                   .  
## districtWestminster:property_typeS                              36645.3
## districtBarking and Dagenham:property_typeT                     -3750.0
## districtBarnet:property_typeT                                    2038.4
## districtBexley:property_typeT                                   -9459.4
## districtBrent:property_typeT                                     5873.7
## districtBromley:property_typeT                                  -9290.5
## districtCamden:property_typeT                                   17439.6
## districtCity of London:property_typeT                               .  
## districtCroydon:property_typeT                                 -19230.9
## districtEaling:property_typeT                                     925.4
## districtEnfield:property_typeT                                   -909.1
## districtGreenwich:property_typeT                                -8340.4
## districtHackney:property_typeT                                   5746.9
## districtHammersmith and Fulham:property_typeT                   19177.7
## districtHaringey:property_typeT                                  3045.0
## districtHarrow:property_typeT                                     945.0
## districtHavering:property_typeT                                  3049.0
## districtHillingdon:property_typeT                                 -95.0
## districtHounslow:property_typeT                                  3878.1
## districtIslington:property_typeT                                11445.4
## districtKensington and Chelsea:property_typeT                  104020.5
## districtKingston upon Thames:property_typeT                      1782.9
## districtLambeth:property_typeT                                  -6490.8
## districtLewisham:property_typeT                                -13920.0
## districtMerton:property_typeT                                   -7423.7
## districtNewham:property_typeT                                  -13263.4
## districtRedbridge:property_typeT                                -9798.2
## districtRichmond upon Thames:property_typeT                     15582.9
## districtSouthwark:property_typeT                                    .  
## districtSutton:property_typeT                                  -11001.5
## districtTower Hamlets:property_typeT                             -551.7
## districtWaltham Forest:property_typeT                             398.7
## districtWandsworth:property_typeT                                 456.0
## districtWestminster:property_typeT                              35411.7
## london_zone:poly(total_floor_area, 2)1                        -791217.7
## london_zone:poly(total_floor_area, 2)2                        -272639.0
## london_zone:number_habitable_rooms                             103682.4
## poly(total_floor_area, 2)1:number_habitable_rooms             -401559.0
## poly(total_floor_area, 2)2:number_habitable_rooms              -68996.2
## london_zone:poly(total_floor_area, 2)1:number_habitable_rooms  399638.7
## london_zone:poly(total_floor_area, 2)2:number_habitable_rooms   57036.3

Predict lasso

#lasso_fit$results
plot(lasso_fit)

predictions_lasso <- predict(lasso_fit, test_data)
lasso_results <- data.frame( RMSE =RMSE(predictions_lasso, test_data$price),
                             Rsquare =R2(predictions_lasso, test_data$price))

lasso_results
##     RMSE Rsquare
## 1 207501   0.836

The performance of the model (in R²): Training: 0.8026; Testing: 0.8361.

KNN

As a third model, I use k-Nearest Neighbours (k-NN). It predicts a value based on a datapoint's k nearest neighbours: the price of a property is predicted from the k most similar properties in the training dataset. To select the variables, I use the knowledge gained from the linear regression and take its significant variables. Furthermore, I include latitude and longitude, as this model does not assume a linear relationship between the dependent and the independent variables. Finally, I remove the interaction variables, as this model captures such relationships by itself. Before running the model, I make sure to standardize the variables, because the distance between points should not be influenced by the different units of the variables. Then, I optimize my model over different numbers of neighbours (k).

To do so, I first try 10 values for k with tuneLength, and then optimize over values close to the best-performing k with tuneGrid.
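
The prediction step of k-NN can be sketched by hand on a tiny made-up training set, which also shows why standardization matters before distances are computed:

```r
# Toy k-NN regression: standardize the features, find the k closest
# training points, and average their prices (illustrative data only).
train_x <- scale(matrix(c(50, 1,
                          70, 2,
                          90, 3,
                          120, 5), ncol = 2, byrow = TRUE))  # floor area, zone
train_y <- c(400000, 500000, 650000, 900000)

# standardize the new property with the training means and sds
new_x <- (c(75, 2) - attr(train_x, "scaled:center")) / attr(train_x, "scaled:scale")

# Euclidean distances in standardized feature space
d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))

k <- 2
mean(train_y[order(d)[1:k]])  # predicted price: 575,000, the average of the 2 nearest
```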

KNN: 10 random values for k

#knn
# selecting the best k with the lowest RMSE
set.seed(1) #because I use cross-validation and want to be able to replicate the model

knn_fit_1 <- train(
  price ~ 
    
        #  distance_to_station #not significant
   num_tube_lines #not significant
    
    +latitude
    +longitude
  
    +district
    +property_type
    +london_zone
    +total_floor_area
    +number_habitable_rooms 
    +average_income
    
    +energy_consumption_potential 
    +windows_energy_eff
    +co2_emissions_potential
    +water_company
  ,
  train_data,
                   
                method = "knn", 
                 trControl = control, #use the same as I used in linear regression
                 tuneLength = 10, #number of parameter values train function will try
                 preProcess = c("center", "scale"), #center and scale the data in k-nn this is pretty important
                 metric="RMSE" #caret's default metric for regression is already RMSE; I set it explicitly

  )

print(knn_fit_1)
## k-Nearest Neighbors 
## 
## 10499 samples
##    13 predictor
## 
## Pre-processing: centered (52), scaled (52) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 9450, 9450, 9449, 9448, 9450, 9449, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE    Rsquared  MAE   
##    5  267358  0.746     125679
##    7  268003  0.750     124951
##    9  268917  0.754     125417
##   11  272270  0.753     126574
##   13  275337  0.750     127294
##   15  278772  0.746     128228
##   17  279448  0.749     128974
##   19  280789  0.749     129780
##   21  283118  0.748     130544
##   23  285960  0.745     131288
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
plot(knn_fit_1)

KNN: tuneGrid

The best k seems to be between 1 and 7, therefore I use tuneGrid to get the best k.

#knn2
Grid_knn <- expand.grid(k=seq(1, 7, 1))
# selecting the best k with the highest R²
set.seed(1) #because I use cross-validation and want to be able to replicate the model

knn_fit_2 <- train(  
  price ~ 
    
        #  distance_to_station #not significant
   num_tube_lines #not significant
    
    +latitude
    +longitude
  
    +district
    +property_type
    +london_zone
    +total_floor_area
    +number_habitable_rooms 
    +average_income
    
    +energy_consumption_potential 
    +windows_energy_eff
    +co2_emissions_potential
    +water_company
  ,
  train_data,

    method = "knn", 
     trControl = control, #use the same as I used in linear regression
     tuneGrid = Grid_knn, #search k = 1 to 7 around the previous optimum
     preProcess = c("center", "scale"), #center and scale the data in k-nn this is pretty important
     metric="RMSE") #caret's default metric is accuracy for classification and RMSE for regression; I set RMSE explicitly

print(knn_fit_2)
## k-Nearest Neighbors 
## 
## 10499 samples
##    13 predictor
## 
## Pre-processing: centered (52), scaled (52) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 9450, 9450, 9449, 9448, 9450, 9449, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE    Rsquared  MAE   
##   1  313051  0.658     149998
##   2  281924  0.709     134824
##   3  275575  0.725     130291
##   4  271762  0.734     127669
##   5  267358  0.746     125679
##   6  265788  0.752     124721
##   7  268003  0.750     124951
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 6.
plot(knn_fit_2)

Prediction KNN

#predict the price of each house in the test data set
#recall that the output of "train" function (knn_fit) automatically keeps the best model 
knn_prediction <- predict(knn_fit_2, newdata = test_data)

knn_results<-data.frame(RMSE = RMSE(knn_prediction, test_data$price), Rsquare = R2(knn_prediction, test_data$price))

knn_results
##     RMSE Rsquare
## 1 260115   0.751

The number of neighbours that optimizes RMSE is k = 6.

The performance of the model (in R²):

- Training: 0.7524809
- Testing: 0.7509709

Regression Tree Model

The fourth model is a regression tree, which splits the data multiple times on various variables at respective cut-off values. Each split creates subsets that are divided again on another variable. Splitting stops after a predefined number of splits or when other pre-set parameters are reached. One of these parameters is the complexity parameter (cp): a split is only executed if it improves the fit by more than the cp value, which stabilizes the tree. I optimize the model (low RMSE and high R²) over this parameter. A tree detects interactions between variables as well as nonlinear relationships, so I do not need to create interaction variables myself; I use all the significant variables from the linear regression but drop the interaction terms. On the other hand, this model is not as good at capturing linear relationships and is quite unstable due to its tendency to overfit (high variance): if the data change even slightly, a very different tree can be fitted.
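The splitting rule can be sketched in a few lines. Below is a minimal base-R illustration with toy data (the variable names and values are hypothetical): among candidate cut-offs on one variable, a regression tree picks the one that minimizes the summed squared error around the two resulting group means.

```r
# Toy data: a roughly linear area-price relationship with noise
set.seed(1)
area  <- runif(200, 30, 250)
price <- 3000 * area + rnorm(200, sd = 20000)

# SSE of the two-group split at a given cut-off
sse_for_cut <- function(cut) {
  left  <- price[area <= cut]
  right <- price[area >  cut]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

cuts     <- seq(40, 240, by = 1)          # candidate cut-offs
sse      <- sapply(cuts, sse_for_cut)
best_cut <- cuts[which.min(sse)]
best_cut   # the first split a tree would make on this variable
```

rpart repeats this search over all variables at every node, and cp decides whether the resulting error reduction is large enough to keep the split.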

Tree 1

#no need to scale the data

set.seed(12) #because I use cross-validation and want to be able to replicate the model
model_tree_1 <- train(
  price ~ 
    
        #  distance_to_station #not significant
   num_tube_lines #not significant
    
    +latitude
    +longitude
    
    +district
    +property_type
    +london_zone
    +total_floor_area
    +number_habitable_rooms 
    +average_income
    
    +energy_consumption_potential 
    +windows_energy_eff
    +co2_emissions_potential
    +water_company
  ,
  train_data,
  method = "rpart", 
  metric= "RMSE",
  trControl = control, #I use the same as in lm  
  tuneLength= 30
  
    )

#You can view how the tree performs
model_tree_1$results
##         cp   RMSE Rsquared    MAE RMSESD RsquaredSD MAESD
## 1  0.00164 264365    0.742 141511  38949     0.0554  5740
## 2  0.00186 266556    0.738 142519  39248     0.0550  5941
## 3  0.00198 267970    0.735 143592  39405     0.0556  6322
## 4  0.00202 268254    0.735 143988  39548     0.0559  6702
## 5  0.00273 272978    0.726 147145  38766     0.0526  7032
## 6  0.00283 273553    0.725 147789  39076     0.0525  7533
## 7  0.00289 273818    0.725 148295  38975     0.0521  7398
## 8  0.00318 276656    0.719 150837  39662     0.0559  7903
## 9  0.00408 278710    0.714 152584  38184     0.0530  7414
## 10 0.00443 281791    0.708 153687  39279     0.0513  7013
## 11 0.00456 282010    0.708 153932  39086     0.0518  6708
## 12 0.00462 282636    0.707 154158  39313     0.0526  6738
## 13 0.00510 285748    0.701 155335  40228     0.0553  6965
## 14 0.00571 288520    0.695 156575  41560     0.0596  7778
## 15 0.00619 293865    0.682 157678  41021     0.0672  7851
## 16 0.00632 294350    0.680 158019  40738     0.0679  7639
## 17 0.00798 297009    0.675 162522  42001     0.0736  7525
## 18 0.00857 299525    0.670 164962  44730     0.0729  8582
## 19 0.00891 300604    0.667 165095  44285     0.0747  8356
## 20 0.00966 303419    0.659 168440  44942     0.0764  9998
## 21 0.01282 315338    0.632 174684  44254     0.0797 11046
## 22 0.01352 319583    0.621 177083  42237     0.0855 10596
## 23 0.01359 319583    0.621 177083  42237     0.0855 10596
## 24 0.01832 327989    0.601 183778  41518     0.0846  8308
## 25 0.02398 332593    0.589 185846  41072     0.0773  8996
## 26 0.03224 345498    0.563 188729  45582     0.0573  9130
## 27 0.04310 358185    0.527 196768  42264     0.0782  9802
## 28 0.07590 376073    0.474 215196  36290     0.0903 13055
## 29 0.16197 422396    0.345 235651  66448     0.0874 13000
## 30 0.31416 465807    0.276 254039  70395     0.0327 26705
#summary(model2_tree)

#You can view the final tree
rpart.plot(model_tree_1$finalModel)

#you can also visualize the variable importance
importance <- varImp(model_tree_1, scale=TRUE)
plot(importance)

Tree 2

The best cp seems to be between 0.000 and 0.0001, therefore I use tuneGrid to get the best cp.

#no need to scale the data

#run model
set.seed(1) #because I use cross-validation and want to be able to replicate the model
model_tree_2 <- train(
  price ~ 
    
        #  distance_to_station #not significant
   num_tube_lines #not significant
    
    +latitude
    +longitude
  
    +district
    +property_type
    +london_zone
    +total_floor_area
    +number_habitable_rooms 
    +average_income
    
    +energy_consumption_potential 
    +windows_energy_eff
    +co2_emissions_potential
    +water_company
  ,
  train_data,
 
 
  method = "rpart",
  metric="RMSE",
  trControl = control, #I use the same as in lm  
  tuneGrid= expand.grid(cp=seq(0.000, 0.0001, 0.00001))
    )

#You can view how the tree performs
model_tree_2$results
##       cp   RMSE Rsquared    MAE RMSESD RsquaredSD MAESD
## 1  0e+00 243218    0.786 117368  38730     0.0502  6176
## 2  1e-05 242872    0.787 116591  38809     0.0500  6334
## 3  2e-05 242700    0.787 116504  38766     0.0497  6220
## 4  3e-05 242835    0.787 116555  38590     0.0498  6031
## 5  4e-05 243097    0.786 117376  38655     0.0500  6023
## 6  5e-05 243311    0.785 117656  38492     0.0499  5799
## 7  6e-05 243437    0.785 117937  38576     0.0500  6003
## 8  7e-05 243507    0.785 118145  38853     0.0507  6221
## 9  8e-05 243700    0.784 118314  38962     0.0517  6125
## 10 9e-05 243853    0.784 118551  38927     0.0523  6049
## 11 1e-04 243958    0.784 118951  38977     0.0526  6112
#summary(model_tree_2)
plot(model_tree_2)

#You can view the final tree
rpart.plot(model_tree_2$finalModel)

#you can also visualize the variable importance
importance <- varImp(model_tree_2, scale=TRUE)
plot(importance)

The best cross-validated R² is 0.7869522, achieved at cp = 2e-05.

Prediction

# We can predict the testing values
predictions_tree <- predict(model_tree_2,test_data)

tree_results<-data.frame(RMSE = RMSE(predictions_tree, test_data$price), #prediction error on the test set
                            Rsquare = R2(predictions_tree, test_data$price)) #share of variance explained
tree_results                         
##     RMSE Rsquare
## 1 229721     0.8

The performance of the model (in R²):

- Training: 0.7869522
- Testing: 0.8000181

We need to be careful about drawing conclusions from tree models because of their high variance: if the data change even slightly, a very different tree can be fitted (you could test this by rerunning with different seeds).

Random Forest

Random forest is an ensemble learning method that grows many trees and averages their individual predictions. It corrects the regression tree's tendency to overfit the training dataset. To optimize RMSE and R², I tune the number of variables considered at each split (mtry) and the split rule, where "variance" is selected over "extratrees". To save computational power I keep the minimum node size at 5, the default for regression.
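The variance-reduction idea behind the forest can be sketched without the ranger package. Below is a minimal base-R illustration with toy data (all names and values hypothetical; feature subsampling via mtry is omitted): averaging many unstable predictors fitted on bootstrap resamples, i.e. bagging, yields a far more stable prediction than any single one.

```r
# Toy data and the point we want to predict at
set.seed(1)
x  <- runif(300, 30, 250)
y  <- 3000 * x + rnorm(300, sd = 40000)
x0 <- 120

# A deliberately unstable "tree": a one-split stump fitted on a bootstrap resample
one_stump <- function() {
  idx <- sample(seq_along(x), replace = TRUE)
  cut <- median(x[idx])
  if (x0 <= cut) mean(y[idx][x[idx] <= cut]) else mean(y[idx][x[idx] > cut])
}

single <- replicate(200, one_stump())                      # lone trees
bagged <- replicate(200, mean(replicate(25, one_stump()))) # forests of 25 trees

var(bagged) < var(single)   # the averaged prediction varies much less
```

This is why the forest's test error is so much better than the single tree's: each tree is noisy, but their errors partly cancel in the average.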

RF 1

#random forest is an ensemble method 

set.seed(1)
rf_fit <- train(
  price ~ 
    
        #  distance_to_station #not significant
   num_tube_lines #not significant
    
    +latitude
    +longitude
    
    +district
    +property_type
    +london_zone
    +total_floor_area
    +number_habitable_rooms 
    +average_income
    
    +energy_consumption_potential 
    +windows_energy_eff
    +co2_emissions_potential
    +water_company
  ,
  train_data,
 
  
  method = "ranger",
  metric="RMSE", 
  trControl = control,
  tuneLength= 10,
  importance = 'permutation')

print(rf_fit)
## Random Forest 
## 
## 10499 samples
##    13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 9450, 9450, 9449, 9448, 9450, 9449, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE    Rsquared  MAE   
##    2    variance    326210  0.768     155557
##    2    extratrees  367703  0.710     180575
##    7    variance    219461  0.840      98794
##    7    extratrees  239633  0.815     106881
##   13    variance    206101  0.850      95272
##   13    extratrees  216523  0.838      97900
##   18    variance    203677  0.852      95053
##   18    extratrees  210693  0.844      96271
##   24    variance    202533  0.852      95173
##   24    extratrees  208625  0.845      95546
##   29    variance    202227  0.851      95203
##   29    extratrees  207037  0.847      95353
##   35    variance    201855  0.851      95260
##   35    extratrees  205923  0.848      95334
##   40    variance    202009  0.851      95441
##   40    extratrees  207782  0.844      95555
##   46    variance    202827  0.849      95701
##   46    extratrees  205341  0.848      95319
##   52    variance    202402  0.850      95710
##   52    extratrees  205920  0.846      95628
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 35, splitrule = variance
##  and min.node.size = 5.
plot(rf_fit)


RF 2

The best .mtry seems to lie between 25 and 29, therefore I use tuneGrid to find the best value.

#random forest is an ensemble method 
gridRF <- data.frame(.mtry = c(25:29), .splitrule="variance", .min.node.size = 5)


set.seed(1)
rf_fit_2 <- train(price~
  
   distance_to_station
   +latitude
   +longitude
    
   #+num_tube_lines #not significant
    
   # +whether_old_or_new #not significant
   +freehold_or_leasehold
  
   +district
   +property_type
   +london_zone
   +total_floor_area
   +number_habitable_rooms 
   
   +energy_consumption_potential 
   +windows_energy_eff
   +co2_emissions_potential
   +water_company,
  
    train_data, 
                                  
    method = "ranger",
    metric="RMSE", #optimize for RMSE
    trControl = control, #same as lm
    tuneGrid = gridRF, #tuneGrid specifies the exact parameter values to try, while tuneLength only sets how many default values caret tries
    importance = 'permutation',
    verbose = FALSE)

print(rf_fit_2)
## Random Forest 
## 
## 10499 samples
##    13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 9450, 9450, 9449, 9448, 9450, 9449, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE    Rsquared  MAE  
##   25    208283  0.843     98899
##   26    208096  0.843     98988
##   27    207808  0.843     98923
##   28    208763  0.841     98830
##   29    208463  0.842     98984
## 
## Tuning parameter 'splitrule' was held constant at a value of variance
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 27, splitrule = variance
##  and min.node.size = 5.
plot(rf_fit_2)

The best .mtry is 27.

Prediction RF

rf_prediction <- predict(rf_fit_2, newdata = test_data)

rf_results<-data.frame(RMSE = RMSE(rf_prediction, test_data$price), Rsquare = R2(rf_prediction, test_data$price))

rf_results
##     RMSE Rsquare
## 1 194590   0.855

Gradient Boosting Machine

Gradient boosting is also a tree-based ensemble learning method. It builds the model iteratively: each step adds the next tree that most reduces the combined model's overall error (RMSE). To optimize RMSE and R², I tune the maximum number of nodes per tree and the number of trees; more trees reduce the training error but can lead to over-fitting and require substantial computational power. In addition, the learning rate (shrinkage) and the minimum number of observations in a tree's terminal nodes could be tuned. I use a slow learning rate of 0.05, as recommended when growing many trees, and, given the size of my training data, a minimum of 10 observations per terminal node.
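The boosting loop itself can be sketched in base R. Below is a minimal toy illustration (all names and values hypothetical; the weak learner is a one-split stump rather than a full tree): each iteration fits the stump to the current residuals and adds a shrunken copy of it, so the training error falls step by step.

```r
# Toy data with a nonlinear component
set.seed(1)
x <- runif(300, 30, 250)
y <- 3000 * x + 1e5 * sin(x / 40) + rnorm(300, sd = 30000)

# Weak learner: the best single split on the residuals
fit_stump <- function(x, r) {
  cuts <- quantile(x, probs = seq(0.1, 0.9, 0.1))
  sse  <- sapply(cuts, function(cc) {
    l <- r[x <= cc]; rt <- r[x > cc]
    sum((l - mean(l))^2) + sum((rt - mean(rt))^2)
  })
  cc <- cuts[which.min(sse)]
  list(cut = cc, left = mean(r[x <= cc]), right = mean(r[x > cc]))
}

shrinkage <- 0.05                      # the slow learning rate
pred      <- rep(mean(y), length(y))   # start from the global mean
rmse_path <- numeric(100)
for (m in 1:100) {
  r    <- y - pred                     # current residuals
  s    <- fit_stump(x, r)
  pred <- pred + shrinkage * ifelse(x <= s$cut, s$left, s$right)
  rmse_path[m] <- sqrt(mean((y - pred)^2))
}

rmse_path[1] > rmse_path[100]   # training RMSE decreases over the iterations
```

With squared loss, each shrunken stump provably does not increase the training error, which is why n.trees mainly trades off training fit against over-fitting and runtime.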

modelLookup("gbm")

#Usual trainControl - take the same

#Expand the search grid (see above for definitions)
grid<-expand.grid(interaction.depth = seq(4, 8, by = 2), #2. interaction.depth (Maximum nodes per tree) - number of splits it has to perform on a tree (starting from a single node).
                  n.trees = seq(500, 1500, by = 500), #Number of trees (the number of gradient boosting iteration) i.e. N. Increasing N reduces the error on training set, but setting it too high may lead to over-fitting.
                  shrinkage =0.05, #the learning rate; shrinkage scales down each tree's contribution. Use a small shrinkage (slow learning rate) when growing many trees.
                  n.minobsinnode = 10)#the minimum number of observations in trees' terminal nodes. Set n.minobsinnode = 10. When working with small training samples it may be vital to lower this setting to five or even three.

set.seed(1)
#Train for gbm
gbmFit1 <- train(price~
    
    distance_to_station 
    +latitude
    +longitude
    
    +freehold_or_leasehold
  
    +district
    +property_type
    +london_zone
    +total_floor_area
    +number_habitable_rooms 
    
    +energy_consumption_potential 
    +windows_energy_eff
    +co2_emissions_potential
    +water_company,
  
     train_data,
                 
                 method = "gbm", 
                 trControl = control,#same as for lm
                 tuneGrid =grid,
                   metric = "RMSE",
                 verbose = FALSE
                 )


print(gbmFit1)
modelLookup("gbm")
##   model         parameter                   label forReg forClass probModel
## 1   gbm           n.trees   # Boosting Iterations   TRUE     TRUE      TRUE
## 2   gbm interaction.depth          Max Tree Depth   TRUE     TRUE      TRUE
## 3   gbm         shrinkage               Shrinkage   TRUE     TRUE      TRUE
## 4   gbm    n.minobsinnode Min. Terminal Node Size   TRUE     TRUE      TRUE
#Usual trainControl - take the same

#Expand the search grid (see above for definitions)
grid<-expand.grid(interaction.depth = 8,
                  n.trees = 1500,
                  shrinkage =0.05, 
                  n.minobsinnode = 10)#the minimum number of observations in trees' terminal nodes. Set n.minobsinnode = 10. When working with small training samples it may be vital to lower this setting to five or even three.

set.seed(1)
#Train for gbm
gbmFit1 <- train(price~
    
    distance_to_station 
    +latitude
    +longitude
    

    +freehold_or_leasehold
  
    +district
    +property_type
    +london_zone
    +total_floor_area
    +number_habitable_rooms 
    
    +energy_consumption_potential 
    +windows_energy_eff
    +co2_emissions_potential
    +water_company,
  
     train_data,
                 
                 method = "gbm", 
                 trControl = control,#same as for lm
                 tuneGrid =grid,
                   metric = "RMSE",
                 verbose = FALSE
                 )


print(gbmFit1)
## Stochastic Gradient Boosting 
## 
## 10499 samples
##    13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 9450, 9450, 9449, 9448, 9450, 9449, ... 
## Resampling results:
## 
##   RMSE    Rsquared  MAE  
##   210923  0.84      99060
## 
## Tuning parameter 'n.trees' was held constant at a value of 1500
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.05
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10

RMSE was used to select the optimal model using the smallest value. The final values used for the model were n.trees = 1500, interaction.depth = 8, shrinkage = 0.05 and n.minobsinnode = 10.

gbm_prediction <- predict(gbmFit1, newdata = test_data)

gbm_results<-data.frame(RMSE = RMSE(gbm_prediction, test_data$price), Rsquare = R2(gbm_prediction, test_data$price))

gbm_results

Performance of the model (in R²):

- Training: 0.8403008
- Testing: 0.8472716 (test RMSE: 201039.8)

Stacking

Finally, I combine all the models that I trained and make a final prediction based on the individual models' predictions. This ensemble learning method is called stacking and often outperforms each individual model.
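The mechanics of stacking can be sketched with two simple base models in base R. Below is a minimal toy illustration (all names and values hypothetical; caretEnsemble-style cross-validated predictions are replaced by a simple holdout for brevity): base models are fitted on one slice of the training data, a linear meta-model learns how to weight their predictions on a held-out slice, and the combination is evaluated on a test set.

```r
# Toy data
set.seed(1)
n <- 400
d <- data.frame(x = runif(n, 30, 250))
d$y <- 3000 * d$x + 1e5 * sin(d$x / 40) + rnorm(n, sd = 30000)
base_idx <- 1:200; meta_idx <- 201:300; test_idx <- 301:400

# Two simple base models (stand-ins for the tuned models above)
m_lin  <- lm(y ~ x,          data = d[base_idx, ])
m_poly <- lm(y ~ poly(x, 3), data = d[base_idx, ])

# Meta-model: regress the outcome on the base models' held-out predictions
meta_df <- data.frame(p1 = predict(m_lin,  d[meta_idx, ]),
                      p2 = predict(m_poly, d[meta_idx, ]),
                      y  = d$y[meta_idx])
meta <- lm(y ~ p1 + p2, data = meta_df)

# Stacked prediction on the test set
test_df <- data.frame(p1 = predict(m_lin,  d[test_idx, ]),
                      p2 = predict(m_poly, d[test_idx, ]))
stack_pred <- predict(meta, test_df)

rmse <- function(p, obs) sqrt(mean((p - obs)^2))
c(lin   = rmse(predict(m_lin,  d[test_idx, ]), d$y[test_idx]),
  poly  = rmse(predict(m_poly, d[test_idx, ]), d$y[test_idx]),
  stack = rmse(stack_pred,                     d$y[test_idx]))
```

The meta-model's coefficients show how much weight each base model earns; the stack is typically at least as good as its best member.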

#number of folds in cross validation
CVfolds <- 5

#Define folds
set.seed(1)
  #create five folds with no repeats
indexPreds <- createMultiFolds(train_data$price, CVfolds,times = 1) 
#Define traincontrol using folds
ctrl <- trainControl(method = "cv",  number = CVfolds, returnResamp = "final", savePredictions = "final", index = indexPreds,sampling = NULL)

#LINEAR REGRESSION
model_lm<-train(
    price ~ 
   num_tube_lines 
   +distance_to_station
   
    +district:property_type
    +london_zone*poly(total_floor_area,2)*number_habitable_rooms 
   
   +average_income
    
    +energy_consumption_potential 
    +windows_energy_eff
    +co2_emissions_potential
    +water_company
    
    ,
    train_data,
   method = "lm",
    trControl = ctrl
   )

# LASSO
lasso_fit <- train(price ~

   num_tube_lines 
   +distance_to_station
   
    +district:property_type
    +london_zone*poly(total_floor_area,2)*number_habitable_rooms 

   +average_income
    
    +energy_consumption_potential 
    +windows_energy_eff
    +co2_emissions_potential
    +water_company,
    
        data=train_data,
        method="glmnet",
        preProc = c("center", "scale"), #This option standardizes the data before running the LASSO regression if alpha = 0 ->RIDGE REG
  
   trControl = ctrl,
  tuneGrid = expand.grid(alpha = 1, lambda = 40.40404) #insert the optimized lambda from the earlier LASSO tuning
)

# TREE
model_tree_2 <- train(
  price ~ 
    
   num_tube_lines 
    
    +latitude
    +longitude
    
    +district
    +property_type
    +london_zone
    +total_floor_area
    +number_habitable_rooms 
    +average_income
    
    +energy_consumption_potential 
    +windows_energy_eff
    +co2_emissions_potential
    +water_company
  ,
  train_data,
 
  method = "rpart",
  metric="RMSE",
  trControl = ctrl,  
  tuneGrid= expand.grid(cp=0.00002)
    )

#KNN

knn_fit_2 <- train(  
  price ~ 
    
   num_tube_lines 
    
    +latitude
    +longitude

    +district
    +property_type
    +london_zone
    +total_floor_area
    +number_habitable_rooms 
    +average_income
    
    +energy_consumption_potential 
    +windows_energy_eff
    +co2_emissions_potential
    +water_company
  ,
  train_data,

    method = "knn", 
     trControl = ctrl,
     tuneGrid = expand.grid(k=6), #the best k from the earlier tuning
     preProcess = c("center", "scale"), #center and scale the data in k-nn this is pretty important
     metric="RMSE") #caret's default metric is accuracy for classification and RMSE for regression; I set RMSE explicitly


# Random Forest
rf_fit_2 <- train(price~
  
   distance_to_station
   +latitude
   +longitude
    
 
   +freehold_or_leasehold
  
   +district
   +property_type
   +london_zone
   +total_floor_area
   +number_habitable_rooms 
   
   +energy_consumption_potential 
   +windows_energy_eff
   +co2_emissions_potential
   +water_company,
  
    train_data, 
                                  
    method = "ranger",
    metric="RMSE", 
    trControl = ctrl, 
    tuneGrid = data.frame(.mtry = 27, .splitrule="variance", .min.node.size = 5),
    importance = 'permutation')


# Gradient
grid<-expand.grid(interaction.depth = 8, #seq(6, 10, by = 2), #2. interaction.depth (Maximum nodes per tree) - number of splits it has to perform on a tree (starting from a single node).
                  n.trees = 1500,##Number of trees (the number of gradient boosting iteration) i.e. N. Increasing N reduces the error on training set, but setting it too high may lead to over-fitting.
                  shrinkage =0.05, #It is considered as a learning rate, use a small shrinkage (slow learn rate) when growing many trees. 
                  n.minobsinnode = 10)#the minimum number of observations in trees' terminal nodes. 

gbmFit1 <- train(price~
    
    distance_to_station 
    +latitude
    +longitude
    
    +freehold_or_leasehold
  
    +district
    +property_type
    +london_zone
    +total_floor_area
    +number_habitable_rooms 
    
    +energy_consumption_potential 
    +windows_energy_eff
    +co2_emissions_potential
    +water_company,
  
     train_data,
                 
                 method = "gbm", 
                 trControl = ctrl,
                 tuneGrid =grid,
                   metric = "RMSE"
                 )
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1 247437688705.4287             nan     0.0500 16801252301.2286
##      2 232000278268.7830             nan     0.0500 15690971801.8620
##      3 218828715591.3823             nan     0.0500 14106227202.8752
##      4 205149091076.5482             nan     0.0500 12788910307.6939
##      5 193468845035.5358             nan     0.0500 9757749739.6899
##      6 183062089130.4374             nan     0.0500 10848202067.8948
##      7 172291539975.4364             nan     0.0500 9470000832.9087
##      8 163214068742.9066             nan     0.0500 7615975344.3065
##      9 154737960754.7700             nan     0.0500 7875555203.9193
##     10 147833823108.9143             nan     0.0500 7109522111.4091
##     20 95624937375.2434             nan     0.0500 2488688721.7182
##     40 58092570913.3721             nan     0.0500 950912601.8645
##     60 45720667173.3667             nan     0.0500 216688799.9767
##     80 40422896772.1081             nan     0.0500 137066553.2981
##    100 37088400609.1000             nan     0.0500 -80460469.0406
##    120 34632966244.9075             nan     0.0500 -120556986.8314
##    140 32705038931.6038             nan     0.0500 -82286260.6530
##    160 31108822663.6984             nan     0.0500 -50359260.6044
##    180 29804648110.2538             nan     0.0500 -45773553.0638
##    200 28475772744.5148             nan     0.0500 -61852314.9401
##    220 27377462100.7449             nan     0.0500 -47809341.3956
##    240 26490908447.8670             nan     0.0500 -128925155.8264
##    260 25556126149.4352             nan     0.0500 -19549531.7794
##    280 24635458756.1371             nan     0.0500 -78076118.9399
##    300 23948105351.6986             nan     0.0500 -38316972.5993
##    320 23272549122.9257             nan     0.0500 -33706969.5887
##    340 22490097524.4572             nan     0.0500 -59175625.4364
##    360 21913320289.2327             nan     0.0500 -81212438.8221
##    380 21274755158.0386             nan     0.0500 -36026982.3494
##    400 20599748547.6230             nan     0.0500 -14355642.3414
##    420 20068631707.7723             nan     0.0500 -12469755.2147
##    440 19542858228.6832             nan     0.0500 -11172592.0571
##    460 18957357908.1856             nan     0.0500 -30209488.4526
##    480 18520113900.8436             nan     0.0500 -44097434.8410
##    500 18106351966.9998             nan     0.0500 -2689730.3581
##    520 17729385466.6864             nan     0.0500 -23576089.0579
##    540 17364459474.5803             nan     0.0500 -23042515.0439
##    560 16993376893.4093             nan     0.0500 -26829867.2448
##    580 16637945433.6468             nan     0.0500 -5031381.5125
##    600 16241509125.1088             nan     0.0500 -28517106.0280
##    620 15949782148.7863             nan     0.0500 -19627916.7605
##    640 15644076619.6955             nan     0.0500 -19562244.3131
##    660 15367835705.5778             nan     0.0500 -32529324.8970
##    680 15121015306.8522             nan     0.0500 -6995581.9433
##    700 14829581815.4196             nan     0.0500 12299534.4644
##    720 14584374758.2405             nan     0.0500 -18207259.6416
##    740 14361383025.5789             nan     0.0500 -10153063.1291
##    760 14113383270.6949             nan     0.0500 -2342932.3574
##    780 13859412434.8414             nan     0.0500 -21503185.5878
##    800 13608157151.5257             nan     0.0500 -17122922.7747
##    820 13405498629.3242             nan     0.0500 -11937472.9078
##    840 13194032278.9143             nan     0.0500 -29046910.0443
##    860 12991025036.5170             nan     0.0500 -12257807.3376
##    880 12772064407.5574             nan     0.0500 -4436759.1798
##    900 12605245762.4946             nan     0.0500 -2316528.2865
##    920 12436950735.2306             nan     0.0500 -5876477.6883
##    940 12248571760.3645             nan     0.0500 -4719345.9821
##    960 12071145325.9307             nan     0.0500 -8056584.7381
##    980 11872934646.7866             nan     0.0500 -12635342.6009
##   1000 11691570997.7380             nan     0.0500 -12517932.8066
##   1020 11533281478.7139             nan     0.0500 -11155896.7995
##   1040 11358480874.8904             nan     0.0500 -7089073.4319
##   1060 11233333550.3298             nan     0.0500 -9573067.8657
##   1080 11087773656.8590             nan     0.0500 -11676314.5897
##   1100 10939282478.3953             nan     0.0500 -10325639.8591
##   1120 10806951942.8404             nan     0.0500 -10566758.2213
##   1140 10673602322.8670             nan     0.0500 -8394480.4835
##   1160 10532386780.4276             nan     0.0500 -12659095.1066
##   1180 10388414557.8044             nan     0.0500 5537969.6325
##   1200 10267046954.5759             nan     0.0500 -5298494.5268
##   1220 10138555309.0737             nan     0.0500 -1417217.0564
##   1240 10007393587.1272             nan     0.0500 -7221218.4085
##   1260 9884409148.0953             nan     0.0500 -3893119.3180
##   1280 9766316088.5224             nan     0.0500 -6982833.8691
##   1300 9641018051.4913             nan     0.0500 -6058292.3689
##   1320 9525771025.4948             nan     0.0500 -13425171.2655
##   1340 9428161845.2522             nan     0.0500 -12974105.7873
##   1360 9310307904.4361             nan     0.0500 -4800012.5692
##   1380 9197895335.2128             nan     0.0500 -5212347.4768
##   1400 9095293289.4673             nan     0.0500 -11650171.7175
##   1420 8985995031.2347             nan     0.0500 2916449.3046
##   1440 8885559645.2912             nan     0.0500 -7129216.5155
##   1460 8790457619.8345             nan     0.0500 -9212977.3824
##   1480 8696663815.7667             nan     0.0500 -6896679.9748
##   1500 8597522170.1863             nan     0.0500 -6298537.6785
## 
## (verbose gbm training traces for the remaining cross-validation folds omitted)
#combine the results into a list for stacking
#make sure to use the method names from above
multimodel <- list(
  lm = model_lm,
  gbm = gbmFit1,
  knn = knn_fit_2,
  glmnet = lasso_fit,
  rpart = model_tree_2,
  ramger = rf_fit_2  # "ramger" is a typo for ranger, kept so the name matches the output below
)
class(multimodel) <- "caretList"
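Conceptually, stacking trains a second-stage model on the base models' predictions. The sketch below is a minimal hand-rolled version of that idea on simulated data — not the report's dataset, and not caretEnsemble's exact resampling scheme:

```r
# Hand-rolled illustration of the idea behind caretStack: regress the
# outcome on the base models' predictions (simulated data).
set.seed(1)
y      <- rnorm(100, mean = 500000, sd = 100000)  # "true" prices
pred_a <- y + rnorm(100, sd = 50000)              # a good base model
pred_b <- y + rnorm(100, sd = 80000)              # a weaker base model

meta    <- lm(y ~ pred_a + pred_b)  # the level-1 (stacking) model
stacked <- predict(meta)

# in-sample, the stack cannot do worse than either base model alone,
# since (intercept = 0, coef_a = 1, coef_b = 0) is one of its options
sqrt(mean((stacked - y)^2)) <= sqrt(mean((pred_a - y)^2))
```

In caretEnsemble the level-1 model is fitted on out-of-fold predictions rather than in-sample fits, which is what protects the stack from simply rewarding the most overfit base model.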
class(multimodel) <- "caretList"

The figures below visually compare the six models used for stacking in terms of RMSE and R².

#we can visualize the differences in performance of each algorithm for each fold
dotplot(resamples(multimodel), metric = "Rsquared")  #metric can be "MAE", "RMSE", or "Rsquared"
splom(resamples(multimodel), metric = "Rsquared")

dotplot(resamples(multimodel), metric = "RMSE")
splom(resamples(multimodel), metric = "RMSE")

The matrix below shows the correlation between the models that I stacked together. The linear regression correlates strongly with the LASSO regression (0.963), but also with gradient boosting (0.926) and random forest (0.893). The selection of very similar variables throughout the modelling process could explain this.

modelCor(resamples(multimodel))
##           lm   gbm   knn glmnet rpart ramger
## lm     1.000 0.926 0.752  0.963 0.726  0.893
## gbm    0.926 1.000 0.738  0.792 0.577  0.907
## knn    0.752 0.738 1.000  0.706 0.905  0.951
## glmnet 0.963 0.792 0.706  1.000 0.766  0.807
## rpart  0.726 0.577 0.905  0.766 1.000  0.840
## ramger 0.893 0.907 0.951  0.807 0.840  1.000
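Low correlation between base models is what makes stacking worthwhile: averaging two equally good predictors whose errors have variance σ² and correlation ρ yields error variance σ²(1 + ρ)/2, so less correlated pairs reduce more variance. (Strictly, modelCor reports the correlation of resampled performance, not of prediction errors, but the intuition carries over.) A quick numeric illustration:

```r
# error variance of the average of two equally good predictors (error sd = 1)
# as a function of the correlation rho between their errors
avg_var <- function(rho) (1 + rho) / 2

avg_var(0.58)  # roughly the rpart-gbm figure above: variance cut to 0.79
avg_var(0.96)  # roughly the lm-glmnet figure above: almost no gain (0.98)
```

This is why the weakly correlated pairs (such as rpart and gbm at 0.577) contribute most to the ensemble.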

Now, I fit the stacked model:

#we can now use stacking with the list of models
library(caretEnsemble)
model_list <- caretStack(multimodel,
                         trControl = ctrl,
                         method = "lm",
                         metric = "RMSE")

summary(model_list)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3263032   -51716     1810    50039  4408661 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.36e+04   3.47e+03   -9.67  < 2e-16 ***
## lm           1.28e+00   1.16e-01   11.06  < 2e-16 ***
## gbm          3.80e-01   2.01e-02   18.95  < 2e-16 ***
## knn          1.36e-01   1.28e-02   10.59  < 2e-16 ***
## glmnet      -1.13e+00   1.17e-01   -9.68  < 2e-16 ***
## rpart        6.63e-02   1.32e-02    5.01  5.4e-07 ***
## ramger       3.28e-01   2.71e-02   12.11  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2e+05 on 10492 degrees of freedom
## Multiple R-squared:  0.854,  Adjusted R-squared:  0.854 
## F-statistic: 1.02e+04 on 6 and 10492 DF,  p-value: <2e-16
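The stacked prediction is just this fitted linear combination of the six base-model predictions. As an illustration only — coefficients rounded from the summary above, base-model predictions made up:

```r
# reconstruct one stacked prediction by hand from the rounded coefficients
stack_predict <- function(p) {
  -3.36e4 + 1.28 * p["lm"] + 0.380 * p["gbm"] + 0.136 * p["knn"] -
    1.13 * p["glmnet"] + 0.0663 * p["rpart"] + 0.328 * p["ramger"]
}

# hypothetical base-model predictions for a single house (GBP)
base_preds <- c(lm = 500000, gbm = 510000, knn = 480000,
                glmnet = 495000, rpart = 470000, ramger = 505000)
unname(stack_predict(base_preds))  # 502931
```

Note the large negative glmnet weight: because the LASSO predictions correlate strongly with the linear regression's, the meta-model mostly uses glmnet as a correction to lm rather than as an independent signal.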
#predict the price of each house in the test data set
#recall that caretStack automatically keeps the best fitted stacking model
all_prediction <- predict(model_list, newdata = test_data)

all_results <- data.frame(RMSE = RMSE(all_prediction, test_data$price),
                          Rsquare = R2(all_prediction, test_data$price))

all_results
##     RMSE Rsquare
## 1 187360   0.866

Selection of the model

To compare the performance of the different models I look at RMSE and R², the same metrics I used to tune them. The best model is the one with the highest R² and the lowest RMSE: a high R² indicates that a large portion of the overall variability in the dataset is explained by the model, while RMSE measures the size of the prediction error.
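Concretely, both metrics can be computed by hand. The sketch below uses toy vectors and assumes caret's default R² definition (squared correlation between predictions and observations):

```r
# RMSE and R-squared computed directly on toy data
actual    <- c(300000, 450000, 520000, 610000)
predicted <- c(320000, 430000, 500000, 650000)

rmse <- sqrt(mean((predicted - actual)^2))  # same as caret::RMSE
r2   <- cor(predicted, actual)^2            # caret::R2 with formula = "corr"

round(rmse)  # 26458
round(r2, 3)
```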

data.frame(name = c("Linear Regression", "LASSO", "KNN", "Regression Tree",
                    "Random Forest", "Gradient Boosting", "Stacked Model"),
           RMSE_Training = c(227300, 234251, 268281, 252842, 207808, 210764, 199600),
           RSquared_Training = c(0.8133, 0.8026, 0.7472, 0.7694, 0.8432, 0.8403, 0.8541))
##                name RMSE_Training RSquared_Training
## 1 Linear Regression        227300             0.813
## 2             LASSO        234251             0.803
## 3               KNN        268281             0.747
## 4   Regression Tree        252842             0.769
## 5     Random Forest        207808             0.843
## 6 Gradient Boosting        210764             0.840
## 7     Stacked Model        199600             0.854

Stacking is clearly the best method: it explains 85.4% of the overall variability and has the lowest error. Comparing these training figures with the test-set results (RMSE of 187,360 and R² of 0.866 for the stacked model), most models perform slightly better on the testing data than on the training data. This is unexpected but might be caused by the seed I used when splitting the data into training and testing sets; a different seed would likely give slightly different results.

Pick investments

To select 200 of the 2,000 houses currently on the market, I applied my best model, the stacked model. I added the predicted price to the dataset and calculated the profit margin by comparing the predicted price to the asking price ((predicted price − asking price) / asking price). Finally, I ranked the properties by this profit margin and selected the top 200. These 200 properties give an average expected return of 69.63%, whereas the model predicts an average return of only 3.78% over the whole dataset.

numchoose=200
oos<-london_house_prices_2019_out_of_sample

#predict the value of houses
oos$predict <- predict(model_list,oos)

#Choose the ones you want to invest here

#Let's find the profit margin given our predicted price and asking price
oos_data<- oos%>%
  mutate(profitMargin=(predict-asking_price)/asking_price)%>%
  arrange(-profitMargin)

#Make sure you choose exactly 200 of them
oos_data$buy=0
oos_data[1:numchoose,]$buy=1

#let's find the actual profit

oos_data<-oos_data%>%
  mutate(actualProfit=buy*profitMargin)

#if we invest in everything
mean(oos_data$profitMargin)
## [1] 0.0378
#just invest in those we chose
sum(oos_data$actualProfit)/numchoose
## [1] 0.696

As a check, I calculate how much profit I would make on the training dataset. Unsurprisingly, the margin is even higher here, since the model has already seen these transactions:

##repeat for the training data
numchoose=200

#predict the value of houses
train_data$predict <- predict(model_list,train_data)

#Choose the ones you want to invest here
#Let's find the profit margin given our predicted price and asking price
train_data_pred<- train_data%>%
  mutate(profitMargin=(predict-price)/price)%>%
  arrange(-profitMargin)

#Make sure you choose exactly 200 of them
train_data_pred$buy=0
train_data_pred[1:numchoose,]$buy=1


#let's find the actual profit
train_data_pred<-train_data_pred%>%
  mutate(actualProfit=buy*profitMargin)

#average profit of the chosen houses spread over the whole training set
mean(train_data_pred$actualProfit)
## [1] 0.0182
#just invest in those we chose
sum(train_data_pred$actualProfit)/numchoose
## [1] 0.955

Conclusion

To sum up, I used the available features of houses and the historic data of all transactions in London in 2019 to build seven different estimation engines. I used linear regression, LASSO, k-NN, and regression trees, as well as the ensemble methods random forest, gradient boosting, and stacking. While all of them explain at least 75% of the variation in London house prices, the best performing model, the stacked model, explains 85.4%. This estimation engine was subsequently used to select the 200 most promising houses to invest in, with a predicted return of 69.63%. Limitations of this project are the limited information available on the specific houses as well as the assumption that the asking price will not change. In addition, with more computing power I would tune additional parameters of the models I used. Other information that could be useful includes commuting time to the centre, especially for properties located far from it, as well as distance to supermarkets and other facilities. The architectural style, the brightness of the rooms, and the interior design could also be significant.