House Prices
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The objective is to predict SalePrice, which is provided in train.csv but missing from test.csv. The submission should have just two columns, Id and SalePrice.
To get a clue as to which of the many predictors to focus on, fit a linear model on everything and inspect the significance codes:
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- lm(SalePrice ~ . , data = train)
summary(model)
Next, limit the model to the predictors flagged with three asterisks (significant at the 0.001 level). A snag is that one of the rows in test.csv has "NA" for KitchenQual, which needs to be replaced by some other value for predict not to throw an error.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- lm(SalePrice ~ OverallQual + LotArea + RoofMatl + ExterQual +
BsmtQual + BsmtFinSF1 + X1stFlrSF + X2ndFlrSF + KitchenQual, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test[96, "KitchenQual"] <- "TA" # replace the lone "NA" level with the modal value
# BsmtFinSF1 is read in as a factor (its "NA" strings are not in na.strings);
# as.numeric on a factor returns level codes, so convert via character first,
# then set the one genuine NA to 0 so predict does not return NA for that row
test$BsmtFinSF1 <- as.numeric(as.character(test$BsmtFinSF1))
test$BsmtFinSF1[is.na(test$BsmtFinSF1)] <- 0
test$SalePrice <- predict(object = model, newdata = test)
write.csv(test[,c("Id","SalePrice")], "submission.csv", quote = F, row.names = F)
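Before submitting, it helps to sanity-check the model on a holdout split of train.csv. The competition scores on RMSE of the logged prices, so a rough local estimate can be computed like this (a sketch added here, not part of the original workflow; the reduced formula and seed are arbitrary choices):

```r
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
set.seed(42)                                    # arbitrary seed, for reproducibility
idx  <- sample(nrow(train), 0.8 * nrow(train))  # 80/20 train/validation split
fit  <- lm(SalePrice ~ OverallQual + LotArea + X1stFlrSF + X2ndFlrSF,
           data = train[idx, ])                 # numeric predictors only, to dodge unseen factor levels
pred <- pmax(predict(fit, newdata = train[-idx, ]), 1)  # guard against non-positive predictions
sqrt(mean((log(pred) - log(train$SalePrice[-idx]))^2))  # RMSLE-style validation error
```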
Analyzing a categorical variable
This needs table to turn the factor levels into counts, and then barplot or pie can be used to plot them.
pie(table(train$MSZoning))
pie(table(train$Street))
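The barplot alternative mentioned above works on the same table output, for example:

```r
barplot(table(train$MSZoning), main = "MSZoning", ylab = "Count")
barplot(table(train$Street), main = "Street", ylab = "Count")
```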
summary(train)
Id MSSubClass MSZoning LotFrontage LotArea
Min. : 1.0 Min. : 20.0 C (all): 10 NA :259 Min. : 1300
1st Qu.: 365.8 1st Qu.: 20.0 FV : 65 60 :143 1st Qu.: 7554
Median : 730.5 Median : 50.0 RH : 16 70 : 70 Median : 9478
Mean : 730.5 Mean : 56.9 RL :1151 80 : 69 Mean : 10517
3rd Qu.:1095.2 3rd Qu.: 70.0 RM : 218 50 : 57 3rd Qu.: 11602
Max. :1460.0 Max. :190.0 75 : 53 Max. :215245
(Other):809
Street Alley LotShape LandContour Utilities LotConfig
Grvl: 6 Grvl: 50 IR1:484 Bnk: 63 AllPub:1459 Corner : 263
Pave:1454 NA :1369 IR2: 41 HLS: 50 NoSeWa: 1 CulDSac: 94
Pave: 41 IR3: 10 Low: 36 FR2 : 47
Reg:925 Lvl:1311 FR3 : 4
Inside :1052
LandSlope Neighborhood Condition1 Condition2 BldgType
Gtl:1382 NAmes :225 Norm :1260 Norm :1445 1Fam :1220
Mod: 65 CollgCr:150 Feedr : 81 Feedr : 6 2fmCon: 31
Sev: 13 OldTown:113 Artery : 48 Artery : 2 Duplex: 52
Edwards:100 RRAn : 26 PosN : 2 Twnhs : 43
Somerst: 86 PosN : 19 RRNn : 2 TwnhsE: 114
Gilbert: 79 RRAe : 11 PosA : 1
(Other):707 (Other): 15 (Other): 2
HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd
1Story :726 Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
2Story :445 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967
1.5Fin :154 Median : 6.000 Median :5.000 Median :1973 Median :1994
SLvl : 65 Mean : 6.099 Mean :5.575 Mean :1971 Mean :1985
SFoyer : 37 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
1.5Unf : 14 Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
(Other): 19
RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
Flat : 13 CompShg:1434 VinylSd:515 VinylSd:504 BrkCmn : 15
Gable :1141 Tar&Grv: 11 HdBoard:222 MetalSd:214 BrkFace:445
Gambrel: 11 WdShngl: 6 MetalSd:220 HdBoard:207 NA : 8
Hip : 286 WdShake: 5 Wd Sdng:206 Wd Sdng:197 None :864
Mansard: 7 ClyTile: 1 Plywood:108 Plywood:142 Stone :128
Shed : 2 Membran: 1 CemntBd: 61 CmentBd: 60
(Other): 2 (Other):128 (Other):136
MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
0 :861 Ex: 52 Ex: 3 BrkTil:146 Ex:121 Fa: 45 Av:221
108 : 8 Fa: 14 Fa: 28 CBlock:634 Fa: 35 Gd: 65 Gd:134
180 : 8 Gd:488 Gd: 146 PConc :647 Gd:618 NA: 37 Mn:114
72 : 8 TA:906 Po: 1 Slab : 24 NA: 37 Po: 2 NA: 38
NA : 8 TA:1282 Stone : 6 TA:649 TA:1311 No:953
120 : 7 Wood : 3
(Other):560
BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF
ALQ:220 Min. : 0.0 ALQ: 19 Min. : 0.00 Min. : 0.0
BLQ:148 1st Qu.: 0.0 BLQ: 33 1st Qu.: 0.00 1st Qu.: 223.0
GLQ:418 Median : 383.5 GLQ: 14 Median : 0.00 Median : 477.5
LwQ: 74 Mean : 443.6 LwQ: 46 Mean : 46.55 Mean : 567.2
NA : 37 3rd Qu.: 712.2 NA : 38 3rd Qu.: 0.00 3rd Qu.: 808.0
Rec:133 Max. :5644.0 Rec: 54 Max. :1474.00 Max. :2336.0
Unf:430 Unf:1256
TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF
Min. : 0.0 Floor: 1 Ex:741 N: 95 FuseA: 94 Min. : 334
1st Qu.: 795.8 GasA :1428 Fa: 49 Y:1365 FuseF: 27 1st Qu.: 882
Median : 991.5 GasW : 18 Gd:241 FuseP: 3 Median :1087
Mean :1057.4 Grav : 7 Po: 1 Mix : 1 Mean :1163
3rd Qu.:1298.2 OthW : 2 TA:428 NA : 1 3rd Qu.:1391
Max. :6110.0 Wall : 4 SBrkr:1334 Max. :4692
X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
Median : 0 Median : 0.000 Median :1464 Median :0.0000
Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
BsmtHalfBath FullBath HalfBath BedroomAbvGr
Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces
Min. :0.000 Ex:100 Min. : 2.000 Maj1: 14 Min. :0.000
1st Qu.:1.000 Fa: 39 1st Qu.: 5.000 Maj2: 5 1st Qu.:0.000
Median :1.000 Gd:586 Median : 6.000 Min1: 31 Median :1.000
Mean :1.047 TA:735 Mean : 6.518 Min2: 34 Mean :0.613
3rd Qu.:1.000 3rd Qu.: 7.000 Mod : 15 3rd Qu.:1.000
Max. :3.000 Max. :14.000 Sev : 1 Max. :3.000
Typ :1360
FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
Ex: 24 2Types : 6 NA : 81 Fin:352 Min. :0.000
Fa: 33 Attchd :870 2005 : 65 NA : 81 1st Qu.:1.000
Gd:380 Basment: 19 2006 : 59 RFn:422 Median :2.000
NA:690 BuiltIn: 88 2004 : 53 Unf:605 Mean :1.767
Po: 20 CarPort: 9 2003 : 50 3rd Qu.:2.000
TA:313 Detchd :387 2007 : 49 Max. :4.000
NA : 81 (Other):1103
GarageArea GarageQual GarageCond PavedDrive WoodDeckSF
Min. : 0.0 Ex: 3 Ex: 2 N: 90 Min. : 0.00
1st Qu.: 334.5 Fa: 48 Fa: 35 P: 30 1st Qu.: 0.00
Median : 480.0 Gd: 14 Gd: 9 Y:1340 Median : 0.00
Mean : 473.0 NA: 81 NA: 81 Mean : 94.24
3rd Qu.: 576.0 Po: 3 Po: 7 3rd Qu.:168.00
Max. :1418.0 TA:1311 TA:1326 Max. :857.00
OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
Median : 25.00 Median : 0.00 Median : 0.00 Median : 0.00
Mean : 46.66 Mean : 21.95 Mean : 3.41 Mean : 15.06
3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
Max. :547.00 Max. :552.00 Max. :508.00 Max. :480.00
PoolArea PoolQC Fence MiscFeature MiscVal
Min. : 0.000 Ex: 2 GdPrv: 59 Gar2: 2 Min. : 0.00
1st Qu.: 0.000 Fa: 2 GdWo : 54 NA :1406 1st Qu.: 0.00
Median : 0.000 Gd: 3 MnPrv: 157 Othr: 2 Median : 0.00
Mean : 2.759 NA:1453 MnWw : 11 Shed: 49 Mean : 43.49
3rd Qu.: 0.000 NA :1179 TenC: 1 3rd Qu.: 0.00
Max. :738.000 Max. :15500.00
MoSold YrSold SaleType SaleCondition SalePrice
Min. : 1.000 Min. :2006 WD :1267 Abnorml: 101 Min. : 34900
1st Qu.: 5.000 1st Qu.:2007 New : 122 AdjLand: 4 1st Qu.:129975
Median : 6.000 Median :2008 COD : 43 Alloca : 12 Median :163000
Mean : 6.322 Mean :2008 ConLD : 9 Family : 20 Mean :180921
3rd Qu.: 8.000 3rd Qu.:2009 ConLI : 5 Normal :1198 3rd Qu.:214000
Max. :12.000 Max. :2010 ConLw : 5 Partial: 125 Max. :755000
(Other): 9
Titanic
https://www.kaggle.com/c/titanic
Introductory Example
Kaggle provides a training set (which includes the outcome, Survived, along with various predictors), a test set which doesn't include Survived, and a simple example file, gender_submission.csv, to show how results should be submitted.
Writing submission file
The model used for gender_submission.csv is that all female passengers survived and all male passengers didn't.
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(test$Sex == "female")
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
Checking accuracy of model
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Prediction <- as.integer(train$Sex == "female")
(sum(train$Survived == train$Prediction)/length(train$Survived)) * 100
This shows that assuming all females survived and all males died is about 79% accurate on the training set.
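The accuracy calculation is repeated throughout what follows; a small helper (added here for convenience, not part of the original notes) keeps it in one place:

```r
# percentage of rows where the prediction matches the actual outcome
accuracy <- function(actual, predicted) mean(actual == predicted) * 100
accuracy(train$Survived, train$Prediction)
```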
Making the model more complex
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- glm(Survived ~ Sex, data = train)
train$Prediction <- round(predict(model, newdata = train))
(sum(train$Survived == train$Prediction)/length(train$Survived)) * 100
predict(model) assigns 0.1889081 to males and 0.7420382 to females, so rounding produces the same result as assuming all females survived and all males didn't: 78.67565% accuracy.
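Note that glm without a family argument defaults to gaussian, so the model above is effectively the same as lm. A logistic fit (an alternative worth trying, not what was used above) would be:

```r
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- glm(Survived ~ Sex, data = train, family = binomial)
# type = "response" returns probabilities rather than log-odds
train$Prediction <- round(predict(model, newdata = train, type = "response"))
(sum(train$Survived == train$Prediction)/length(train$Survived)) * 100
```

With a single binary predictor the fitted probabilities are just the group survival rates, so the accuracy comes out the same.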
Making age a factor
A snag here is that Age is NA for many entries.
Subset train to only rows with Age provided
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
has_age <- subset(train, !is.na(train$Age))
model <- glm(Survived ~ Sex + Age, data = has_age)
has_age$Prediction <- round(predict(model, newdata = has_age))
(sum(has_age$Survived == has_age$Prediction)/length(has_age$Survived)) * 100
This reduces the accuracy of the model slightly, to 78.0112%.
Subset train to only rows with no Age provided
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
no_age <- subset(train, is.na(train$Age))
model <- glm(Survived ~ Sex, data = no_age)
no_age$Prediction <- round(predict(model, newdata = no_age))
(sum(no_age$Survived == no_age$Prediction)/length(no_age$Survived)) * 100
Passengers with no age provided had a lower survival rate: 0.1290323 for males and 0.6792453 for females. The accuracy of the prediction rose to 81.35593%.
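The two fits can be stitched back together into one prediction over all of train (a sketch combining the subsets above, not part of the original notes):

```r
train   <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
has_age <- subset(train, !is.na(Age))
no_age  <- subset(train, is.na(Age))
model_age    <- glm(Survived ~ Sex + Age, data = has_age)
model_no_age <- glm(Survived ~ Sex, data = no_age)
pred_age    <- round(predict(model_age, newdata = train))    # NA wherever Age is NA
pred_no_age <- round(predict(model_no_age, newdata = train))
train$Prediction <- ifelse(is.na(train$Age), pred_no_age, pred_age)
(sum(train$Survived == train$Prediction)/length(train$Survived)) * 100
```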
Subset to children and adults
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
children <- subset(train, train$Age < 15)
model <- glm(Survived ~ Sex, data = children)
children$Prediction <- round(predict(model, newdata = children))
(sum(children$Survived == children$Prediction)/length(children$Survived)) * 100
The survival rate for males rose to 0.5384615 if they were under 15, while that for females dropped to 0.6153846. The accuracy fell to 57.69231%.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
adults <- subset(train, train$Age >= 15)
model <- glm(Survived ~ Sex, data = adults)
adults$Prediction <- round(predict(model, newdata = adults))
(sum(adults$Survived == adults$Prediction)/length(adults$Survived)) * 100
The survival rate for male adults fell to 0.1739130, while that for females rose to 0.7792793. Accuracy rose to 80.97484%.
Assuming all females and children under 14 survived
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Prediction <- as.integer(train$Sex == "female" | (!is.na(train$Age) & train$Age < 14))
(sum(train$Survived == train$Prediction)/length(train$Survived)) * 100
This increases the accuracy slightly to 79.23681%.
Subsetting by class
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
first <- subset(train, Pclass == 1)
(sum(first$Survived == 1)/length(first$PassengerId)) * 100
first_male <- subset(train, Pclass == 1 & Sex == "male")
(sum(first_male$Survived == 1)/length(first_male$PassengerId)) * 100
first_female <- subset(train, Pclass == 1 & Sex == "female")
(sum(first_female$Survived == 1)/length(first_female$PassengerId)) * 100
62.96296% of first class passengers survived. The survival rate for males rose to 36.88525% and for females to 96.80851%.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
second <- subset(train, Pclass == 2)
(sum(second$Survived == 1)/length(second$PassengerId)) * 100
second_male <- subset(train, Pclass == 2 & Sex == "male")
(sum(second_male$Survived == 1)/length(second_male$PassengerId)) * 100
second_female <- subset(train, Pclass == 2 & Sex == "female")
(sum(second_female$Survived == 1)/length(second_female$PassengerId)) * 100
47.28261% overall, 15.74074% males and 92.10526% females.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
third <- subset(train, Pclass == 3)
(sum(third$Survived == 1)/length(third$PassengerId)) * 100
third_male <- subset(train, Pclass == 3 & Sex == "male")
(sum(third_male$Survived == 1)/length(third_male$PassengerId)) * 100
third_female <- subset(train, Pclass == 3 & Sex == "female")
(sum(third_female$Survived == 1)/length(third_female$PassengerId)) * 100
Only 24.23625% of third class passengers survived: 13.54467% of males and 50% of females.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
above_third <- subset(train, Pclass < 3)
(sum(above_third$Survived == 1)/length(above_third$PassengerId)) * 100
above_third_male <- subset(train, Pclass < 3 & Sex == "male")
(sum(above_third_male$Survived == 1)/length(above_third_male$PassengerId)) * 100
above_third_female <- subset(train, Pclass < 3 & Sex == "female")
(sum(above_third_female$Survived == 1)/length(above_third_female$PassengerId)) * 100
Passengers in better than 3rd class survived at 55.75% overall; 26.95652% of males and 94.70588% of females survived.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
third_children <- subset(train, Pclass == 3 & Age < 14)
(sum(third_children$Survived == 1)/length(third_children$PassengerId)) * 100
above_third_children <- subset(train, Pclass < 3 & Age < 14)
(sum(above_third_children$Survived == 1)/length(above_third_children$PassengerId)) * 100
Only 42.85714% of 3rd class children survived while 95.45455% of children in 2nd and 1st class survived.
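The repeated subsetting above can be condensed into a single cross-tabulation of survival rates by sex and class (the same numbers in one call, added here as a compact alternative):

```r
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
# the mean of a 0/1 column is the survival rate; rows are Sex, columns are Pclass
round(tapply(train$Survived, list(train$Sex, train$Pclass), mean) * 100, 5)
```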
Assuming females and children under 14 in 1st or 2nd class survived, and everyone else (including all 3rd class passengers) died.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Prediction <- as.integer((train$Sex == "female" & train$Pclass < 3) |
(!is.na(train$Age) & train$Age < 14 & train$Pclass < 3))
(sum(train$Survived == train$Prediction)/length(train$Survived)) * 100
This raises the accuracy to 80.02245%.
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer((test$Sex == "female" & test$Pclass < 3) |
(!is.na(test$Age) & test$Age < 14 & test$Pclass < 3))
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
This worked and received a public score of 0.78468, placing me 1783rd on the leaderboard.