Titanic
Introductory Example
Kaggle provides a training set (which includes the result, Survived, along with various predictors), a test set which doesn’t include Survived, and a simple example gender_submission.csv to show how results should be submitted.
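A quick check of the layout of the three files (a minimal sketch, assuming they are in the working directory):
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
gender <- read.csv("gender_submission.csv", header = T)
setdiff(names(train), names(test)) # "Survived" is the only column missing from test
head(gender) # two columns, PassengerId and Survived -- the required submission layout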
Writing a submission file
The model used for gender_submission.csv is that all female passengers survived and all male passengers died.
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(test$Sex == "female")
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
Checking the accuracy of the “all females lived, all males died” assumption.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Prediction <- as.integer(train$Sex == "female")
(sum(train$Survived == train$Prediction)/nrow(train)) * 100
On the training data, this assumption gives 78.67565% accuracy.
Visualisation
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("pie1.svg", width = 11, pointsize = 12, family = "sans")
pie(table(train[,c("Survived", "Sex")]), labels=c("Females Died 81","Females Lived 233","Males Died 468","Males Lived 109"))
dev.off()
Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data. — R’s pie help page.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("barplot1.svg", width = 11, pointsize = 12, family = "sans")
barplot(table(train[,c("Survived", "Sex")]), col = c(4,5))
legend("topleft", legend = c("Died", "Lived"), fill = c(4,5))
dev.off()
Factors can be combined by concatenating strings using paste.
train$ClassSex <- paste(train$Pclass, train$Sex)
svg("barplot2.svg", width = 11, pointsize = 12, family = "sans")
barplot(table(train[,c("Survived", "ClassSex")]), col = c(4,5))
legend("topleft", legend = c("Died", "Lived"), fill = c(4,5))
dev.off()
Indexes
PassengerId is obviously an index. Another is Name. Ticket is not unique, presumably because families travelled on a shared ticket, but most tickets occur only once. Cabin is similar.
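These observations are easy to verify (a quick sketch):
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
length(unique(train$PassengerId)) == nrow(train) # TRUE: one value per row
length(unique(train$Name)) == nrow(train) # TRUE: every name is unique too
sum(table(train$Ticket) == 1) # most tickets occur exactly once
sum(table(train$Cabin) == 1) # likewise for the cabins that are recorded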
Best Subset Selection
leaps
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
library(leaps)
regfit.full <- regsubsets(Survived ~ ., train)
This throws the following error:
Reordering variables and trying again:
Error in leaps.exhaustive(a, really.big) :
Exhaustive search will be S L O W, must specify really.big=T
In addition: Warning message:
In leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in = force.in, :
1015 linear dependencies found
Using really.big=T crashes my laptop.
The next step is to remove all the ID columns from the data.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
library(leaps)
regfit <- regsubsets(Survived ~ . - PassengerId - Name - Ticket - Cabin - Embarked, train)
reg.summary <- summary(regfit)
reg.summary prints the selection matrix:
Pclass Sexmale Age SibSp Parch Fare
1 ( 1 ) " " "*" " " " " " " " "
2 ( 1 ) "*" "*" " " " " " " " "
3 ( 1 ) "*" "*" "*" " " " " " "
4 ( 1 ) "*" "*" "*" "*" " " " "
5 ( 1 ) "*" "*" "*" "*" " " "*"
6 ( 1 ) "*" "*" "*" "*" "*" "*"
This suggests the following subsets:
- Sex
- Sex, Pclass
- Sex, Pclass, Age
- Sex, Pclass, Age, SibSp
- Sex, Pclass, Age, SibSp, Fare
- Sex, Pclass, Age, SibSp, Fare, Parch
The optimal subset size is suggested by plotting the RSS, adjusted \(R^2\), \(C_p\), and BIC statistics from summary(regfit):
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
library(leaps)
regfit <- regsubsets(Survived ~ . - PassengerId - Name - Ticket - Cabin - Embarked, train)
reg.summary <- summary(regfit)
svg("subset-selection.svg", width = 11, pointsize = 12, family = "sans")
par(mfrow = c(2, 2))
which.min(reg.summary$rss) # 6
plot(reg.summary$rss , xlab = "Number of Variables", ylab = "RSS", type = "l")
points(6, reg.summary$rss[6], col = "red", cex = 2, pch = 20)
which.max(reg.summary$adjr2) # 4
plot(reg.summary$adjr2 , xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
points(4, reg.summary$adjr2[4], col = "red", cex = 2, pch = 20)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
which.min(reg.summary$cp) # 4
points(4, reg.summary$cp[4], col = "red", cex = 2, pch = 20)
which.min(reg.summary$bic) # 4
plot(reg.summary$bic , xlab = "Number of Variables", ylab = "BIC", type = "l")
points(4, reg.summary$bic[4], col = "red", cex = 2, pch = 20)
dev.off()
\(C_p\), AIC, BIC, and Adjusted \(R^2\) are explained on p244 of An Introduction to Statistical Learning with Applications in R.
- RSS: Residual Sum of Squares, the sum of the squared differences between the responses and the fitted values. MSE = RSS/n.
- Rsq: \(R^2\), it always takes on a value between 0 and 1, and is independent of the scale of Y. An \(R^2\) statistic that is close to 1 indicates that a large proportion of the variability in the response is explained by the regression. A number near 0 indicates that the regression does not explain much of the variability in the response; this might occur because the linear model is wrong, or the error variance \(σ^2\) is high, or both.
- \(C_p\): Mallows's \(C_p\) statistic adds a penalty to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error. The penalty increases as the number of predictors in the model increases.
- Bayesian information criterion (BIC): similar to \(C_p\), but the penalty factor of 2 is replaced by \(\log(n)\), so BIC places a heavier penalty on models with many variables and tends to select smaller models.
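For reference, for a least-squares model with \(d\) predictors fitted to \(n\) observations, where \(\hat\sigma^2\) estimates the error variance, ISLR defines

\[
C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat\sigma^2\right), \qquad
\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\,d\hat\sigma^2\right), \qquad
\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}.
\]

Principal components regression, from the pls package, is tried next: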
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
library(pls)
model <- pcr(Survived ~ . - PassengerId - Name - Ticket - Cabin - Embarked, data = train, validation = "CV")
summary(model)
This function has slightly different syntax from other model-fitting functions that we have encountered thus far in this book. In particular, we must pass in an x matrix as well as a y vector, and we do not use the y ∼ x syntax. — ISLR, describing glmnet.
The x matrix can be obtained with x <- model.matrix(object = Y ~ X1 + X2, data = train)[, -1]
NA values need to be removed from y by hand, since model.matrix does that automatically for x.
If alpha=0 then a ridge regression model is fit, and if alpha=1 then a lasso model is fit.
library(glmnet)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train1 <- subset(train, !is.na(Age))
x <- model.matrix(object = Survived ~ Sex + Pclass + Age + SibSp, data = train1)
y <- train1$Survived
model <- glmnet(x, y, alpha = 0, family = "gaussian")
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Age[is.na(test$Age)] <- mean(test$Age, na.rm = TRUE)
# newx must be a model matrix built the same way as x; glmnet also needs a
# lambda, so the smallest value on the fitted path is picked here as an
# arbitrary choice, with 0.5 as the classification cutoff
newx <- model.matrix(object = ~ Sex + Pclass + Age + SibSp, data = test)
test$Survived <- as.integer(predict(object = model, newx = newx, s = min(model$lambda)) > 0.5)
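glmnet fits the lasso by flipping alpha to 1; a sketch reusing the x and y above, with cv.glmnet picking lambda by cross-validation instead of guessing:
cv <- cv.glmnet(x, y, alpha = 1, family = "gaussian")
cv$lambda.min # the lambda with the lowest cross-validated error
coef(cv, s = "lambda.min") # which predictors the lasso keeps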
My first submission file
Predict that all females in better than 3rd class survived, along with everyone under 14 in better than 3rd class:
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
(test$Sex == "female" & test$Pclass < 3) |
(test$Pclass < 3 & !is.na(test$Age) & test$Age < 14)
)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
This worked and received a public score of 0.78468, placing me 1783rd on the leaderboard.
Examining the variables
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- glm(Survived ~ Sex + Age + Pclass + SibSp + Parch + Fare, data = train)
summary(model)
Submission based on glm
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Age[is.na(train$Age)] <- round(mean(train$Age, na.rm = TRUE), digits = 2)
model <- glm(Survived ~ Sex + Age + Pclass + SibSp + Parch + Fare, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Age[is.na(test$Age)] <- round(mean(test$Age, na.rm = TRUE), digits = 2)
test$Predict <- predict(object = model, newdata = test)
test$Survived <- as.integer(test$Predict > 0.75)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
This scored 0.75.
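The 0.75 cutoff was a guess. A sketch sweeping cutoffs on the training data, reusing the model above (training accuracy is optimistic, but it shows the shape):
train$Predict <- predict(object = model, newdata = train)
for (cutoff in seq(0.3, 0.8, by = 0.05)) {
  acc <- mean(as.integer(train$Predict > cutoff) == train$Survived) * 100
  cat(sprintf("cutoff %.2f: %.2f%% accuracy\n", cutoff, acc))
}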
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Age[is.na(train$Age)] <- round(mean(train$Age, na.rm = TRUE), digits = 2)
model <- glm(Survived ~ Sex + Age + Pclass + SibSp + Parch + Fare, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Age[is.na(test$Age)] <- round(mean(test$Age, na.rm = TRUE), digits = 2)
test$Predict <- predict(object = model, newdata = test)
test$Survived <- round(test$Predict, digits = 0)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
This improved the score to 0.76555, but it is still not as good as the first guesstimate.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$AC <- ifelse(!is.na(train$Age) & train$Age < 14, "Child", "Adult")
model <- glm(Survived ~ Sex + AC + Pclass + SibSp + Parch + Fare, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$AC <- ifelse(!is.na(test$Age) & test$Age < 14, "Child", "Adult")
test$Predict <- predict(object = model, newdata = test)
test$Survived <- round(test$Predict, digits = 0)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
This improved the score to 0.77511, still not as good as the first guesstimate.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$AC <- ifelse(!is.na(train$Age) & train$Age < 20, "Child", "Adult")
model <- glm(Survived ~ Sex + AC + Pclass + SibSp + Parch + Fare, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$AC <- ifelse(!is.na(test$Age) & test$Age < 20, "Child", "Adult")
test$Predict <- predict(object = model, newdata = test)
test$Survived <- round(test$Predict, digits = 0)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
This reduced the score to 0.77272.
Tree
library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- tree(Survived ~ Sex + Age + Pclass + SibSp + Parch + Fare, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Predict <- predict(object = model, newdata = test)
test$Survived <- round(test$Predict, digits = 0)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
This produced 0.77511, same as glm.
library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- tree(Survived ~ Sex + Age + Pclass + SibSp + Parch + Fare, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Predict <- predict(object = model, newdata = test)
test$Survived <- as.integer(test$Predict > 0.75)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
No change to score.
library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- tree(Survived ~ Sex + Age + Pclass + SibSp + Parch + Fare, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Predict <- predict(object = model, newdata = test)
test$Survived <- as.integer(test$Predict > 0.8)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
No change.
Visual exploration
library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
titanic_tree <- tree(formula = Survived ~ Sex + Age + Parch + SibSp, data = train, subset = Pclass == 1)
svg("tree_first.svg", width = 11, pointsize = 12, family = "sans")
plot(titanic_tree)
text(titanic_tree, pretty = 0)
dev.off()
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
titanic_tree <- tree(formula = Survived ~ Sex + Age + Parch + SibSp, data = train, subset = Pclass == 2)
svg("tree_second.svg", width = 11, pointsize = 12, family = "sans")
plot(titanic_tree)
text(titanic_tree, pretty = 0)
dev.off()
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
titanic_tree <- tree(formula = Survived ~ Sex + Age + Parch + SibSp, data = train, subset = Pclass == 3)
svg("tree_third.svg", width = 11, pointsize = 12, family = "sans")
plot(titanic_tree)
text(titanic_tree, pretty = 0)
dev.off()
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("barplot_class.svg", width = 11, pointsize = 12, family = "sans")
barplot(table(train$Survived, train$Pclass),
names.arg=c("1st", "2nd", "3rd"),
legend.text = c("Died", "Lived"))
dev.off()
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("barplot_class_grouped.svg", width = 11, pointsize = 12, family = "sans")
barplot(table(train$Survived, train$Pclass),
names.arg=c("1st", "2nd", "3rd"),
legend.text = c("Died", "Lived"),
beside = TRUE)
dev.off()
Odds of survival
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
odds_sex_class <- data.frame(
sex_class = c("1st Female", "2nd Female", "3rd Female", "1st Male Under 18", "2nd Male Under 18", "3rd Male Under 18"),
odds = c(
nrow(subset(train, Sex == "female" & Pclass == 1 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass == 1)),
nrow(subset(train, Sex == "female" & Pclass == 2 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass == 2)),
nrow(subset(train, Sex == "female" & Pclass == 3 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass == 3)),
nrow(subset(train, Sex == "male" & Pclass == 1 & !is.na(Age) & Age < 18 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass == 1 & !is.na(Age) & Age < 18)),
nrow(subset(train, Sex == "male" & Pclass == 2 & !is.na(Age) & Age < 18 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass == 2 & !is.na(Age) & Age < 18)),
nrow(subset(train, Sex == "male" & Pclass == 3 & !is.na(Age) & Age < 18 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass == 3 & !is.na(Age) & Age < 18))
))
svg("barplot_odds_class_sex_grouped.svg", width = 11, pointsize = 12, family = "sans")
barplot(height = odds_sex_class$odds, names.arg = odds_sex_class$sex_class,
width = 6, ylim = c(0.0, 1.0))
dev.off()
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
odds_sex_class <- data.frame(
sex_class = c("1st Female", "2nd Female", "3rd Female", "1st Male", "2nd Male", "3rd Male"),
odds = c(
nrow(subset(train, Sex == "female" & Pclass == 1 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass == 1)),
nrow(subset(train, Sex == "female" & Pclass == 2 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass == 2)),
nrow(subset(train, Sex == "female" & Pclass == 3 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass == 3)),
nrow(subset(train, Sex == "male" & Pclass == 1 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass == 1)),
nrow(subset(train, Sex == "male" & Pclass == 2 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass == 2)),
nrow(subset(train, Sex == "male" & Pclass == 3 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass == 3))
))
svg("barplot_odds_class_sex_age_limit_grouped.svg", width = 11, pointsize = 12, family = "sans")
barplot(height = odds_sex_class$odds, names.arg = odds_sex_class$sex_class,
width = 6, ylim = c(0.0, 1.0))
dev.off()
Next attempts
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
(test$Sex == "female" & test$Pclass < 3) |
(test$Sex == "male" & test$Pclass == 1 & !is.na(test$Age) & test$Age < 28) |
(test$Sex == "male" & test$Pclass == 2 & !is.na(test$Age) & test$Age < 12)
)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
This produced a result of 0.73444, so worse than the first guesstimate.
All children under 7 travelling with a parent survived.
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
(test$Sex == "female" & test$Pclass < 3) |
(test$Pclass < 3 & !is.na(test$Age) & test$Age < 19 & test$Parch > 0) |
(!is.na(test$Age) & test$Age < 7 & test$Parch > 0)
)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
This reduced the score to 0.77272.
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
(test$Sex == "female" & test$Pclass < 3) |
(test$Pclass < 3 & !is.na(test$Age) & test$Age < 19 & test$Parch > 0) |
(test$Sex == "female" & !is.na(test$Age) & test$Age < 3 & test$Parch > 0)
)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
The third rule makes no difference to the first score when girls under 3 from any class are included, and makes it worse when over-3s or boys are included.
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
(test$Sex == "female" & test$Pclass < 3) |
(test$Pclass < 3 & !is.na(test$Age) & test$Age < 19 & (test$Parch > 0 | test$SibSp > 0)) |
(test$Sex == "female" & !is.na(test$Age) & test$Age < 3 & (test$Parch > 0 | test$SibSp > 0))
)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
(test$Sex == "female" & test$Pclass < 3) |
(test$Pclass < 3 & !is.na(test$Age) & test$Age < 23 & (test$Parch > 0 | test$SibSp > 0)) |
(test$Sex == "female" & !is.na(test$Age) & test$Age < 3 & (test$Parch > 0 | test$SibSp > 0))
)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
The tree library does the above calculation diagrammatically:
library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
tree_sex <- tree(formula = Survived ~ Sex, data = train)
svg("tree_sex.svg", width = 11, pointsize = 12, family = "sans")
plot(tree_sex)
text(tree_sex, pretty = 0)
dev.off()
Making the model more complex by introducing age as a factor
nrow(subset(train, Sex == "female" & Pclass < 3 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass < 3))
This shows nearly 95% of women in 1st and 2nd class survived, while only 50% of women travelling 3rd class survived.
nrow(subset(train, Sex == "female" & Pclass == 3 & Age < 6 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass == 3 & Age < 6))
While the odds for women travelling 3rd class were 50%, girls under 6 had a 73% chance of surviving.
nrow(subset(train, Sex == "male" & Pclass < 3 & Age < 18 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass < 3 & Age < 18))
1st and 2nd class male passengers under 18 had an 87% survival rate, which dropped to 23% for 3rd class males under 18.
Not even very young 3rd class males seemed to have had much chance of surviving.
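That impression can be checked in the same style as the other odds (the under-6 cutoff here is an arbitrary choice):
nrow(subset(train, Sex == "male" & Pclass == 3 & Age < 6 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass == 3 & Age < 6))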
library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
tree_parch <- tree(formula = Survived ~ Sex + Pclass + Age + SibSp + Parch, data = train)
svg("tree_parch.svg", width = 11, pointsize = 12, family = "sans")
plot(tree_parch)
text(tree_parch, pretty = 0)
dev.off()
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- glm(Survived ~ Sex, data = train)
train$Prediction <- round(predict(model, newdata = train))
(sum(train$Survived == train$Prediction)/length(train$Survived)) * 100
predict(model) allocates 0.1889081 to males and 0.7420382 to females, so rounding produces the same 78.67565% accuracy as assuming all females survived and all males didn’t.
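The same comparison as a confusion matrix makes the errors visible:
table(Predicted = train$Prediction, Actual = train$Survived)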
Making age a factor
Relationship between Age Group and Survival
A snag here is that Age is NA for many entries.
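How many, exactly:
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
sum(is.na(train$Age)) # 177 of the 891 training rows
mean(is.na(train$Age)) * 100 # roughly 20% of passengers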
Subset train to only rows with Age provided
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
has_age <- subset(train, !is.na(train$Age))
model <- glm(Survived ~ Sex + Age, data = has_age)
has_age$Prediction <- round(predict(model, newdata = has_age))
(sum(has_age$Survived == has_age$Prediction)/length(has_age$Survived)) * 100
This reduces the accuracy of the model slightly, to 78.0112%.
Subset train to only rows with no Age provided
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
no_age <- subset(train, is.na(train$Age))
model <- glm(Survived ~ Sex, data = no_age)
no_age$Prediction <- round(predict(model, newdata = no_age))
(sum(no_age$Survived == no_age$Prediction)/length(no_age$Survived)) * 100
Passengers with no age provided had a lower survival rate, 0.1290323 for males and 0.6792453 for females. The accuracy of the prediction rose to 81.35593%.
Subset to children and adults
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
children <- subset(train, train$Age < 15)
model <- glm(Survived ~ Sex, data = children)
children$Prediction <- round(predict(model, newdata = children))
(sum(children$Survived == children$Prediction)/length(children$Survived)) * 100
The odds for males rose to 0.5384615 if they were under 15 while those for females dropped to 0.6153846. The accuracy fell to 57.69231%.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
adults <- subset(train, train$Age >= 15)
model <- glm(Survived ~ Sex, data = adults)
adults$Prediction <- round(predict(model, newdata = adults))
(sum(adults$Survived == adults$Prediction)/length(adults$Survived)) * 100
The odds for male adults fell to 0.1739130 and for females fell to 0.7792793. Accuracy rose to 80.97484%.
Assuming all females and children under 14 survived
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Prediction <- as.integer(train$Sex == "female" | (!is.na(train$Age) & train$Age < 14))
(sum(train$Survived == train$Prediction)/length(train$Survived)) * 100
This increases the accuracy slightly to 79.23681%.
Subsetting by class
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
first <- subset(train, Pclass == 1)
(sum(first$Survived == 1)/length(first$PassengerId)) * 100
first_male <- subset(train, Pclass == 1 & Sex == "male")
(sum(first_male$Survived == 1)/length(first_male$PassengerId)) * 100
first_female <- subset(train, Pclass == 1 & Sex == "female")
(sum(first_female$Survived == 1)/length(first_female$PassengerId)) * 100
62.96296% of first class passengers survived. The odds for males rose to 36.88525% and females to 96.80851%.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
second <- subset(train, Pclass == 2)
(sum(second$Survived == 1)/length(second$PassengerId)) * 100
second_male <- subset(train, Pclass == 2 & Sex == "male")
(sum(second_male$Survived == 1)/length(second_male$PassengerId)) * 100
second_female <- subset(train, Pclass == 2 & Sex == "female")
(sum(second_female$Survived == 1)/length(second_female$PassengerId)) * 100
47.28261% overall, 15.74074% males and 92.10526% females.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
third <- subset(train, Pclass == 3)
(sum(third$Survived == 1)/length(third$PassengerId)) * 100
third_male <- subset(train, Pclass == 3 & Sex == "male")
(sum(third_male$Survived == 1)/length(third_male$PassengerId)) * 100
third_female <- subset(train, Pclass == 3 & Sex == "female")
(sum(third_female$Survived == 1)/length(third_female$PassengerId)) * 100
Only 24.23625% of third class passengers survived: 13.54467% of males and 50% of females.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
above_third <- subset(train, Pclass < 3)
(sum(above_third$Survived == 1)/length(above_third$PassengerId)) * 100
above_third_male <- subset(train, Pclass < 3 & Sex == "male")
(sum(above_third_male$Survived == 1)/length(above_third_male$PassengerId)) * 100
above_third_female <- subset(train, Pclass < 3 & Sex == "female")
(sum(above_third_female$Survived == 1)/length(above_third_female$PassengerId)) * 100
Passengers in better than 3rd class survived at 55.75% overall: 26.95652% of males and 94.70588% of females.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
third_children <- subset(train, Pclass == 3 & Age < 14)
(sum(third_children$Survived == 1)/length(third_children$PassengerId)) * 100
above_third_children <- subset(train, Pclass < 3 & Age < 14)
(sum(above_third_children$Survived == 1)/length(above_third_children$PassengerId)) * 100
Only 42.85714% of 3rd class children survived while 95.45455% of children in 2nd and 1st class survived.
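All of the rates above can be produced in one call (a sketch using aggregate instead of repeated subsets):
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
aggregate(Survived ~ Pclass + Sex, data = train, FUN = function(s) mean(s) * 100)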
Assuming females and children in 1st or 2nd class survived, and all 3rd class passengers died.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Prediction <- as.integer((train$Sex == "female" & train$Pclass < 3) |
(!is.na(train$Age) & train$Age < 14 & train$Pclass < 3))
(sum(train$Survived == train$Prediction)/length(train$Survived)) * 100
This raises the accuracy to 80.02245%.
Raising Age to 16.
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer((test$Sex == "female" & test$Pclass < 3) |
(!is.na(test$Age) & test$Age < 16 & test$Pclass < 3))
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
This dropped the score to 0.78229.
Raising Age to 15.
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer((test$Sex == "female" & test$Pclass < 3) |
(!is.na(test$Age) & test$Age < 15 & test$Pclass < 3))
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
The score remained at 0.78229, so < 14 seems best.
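For what it is worth (the public score is computed on different passengers), the age cutoff can be swept on the training data:
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
for (cutoff in 12:18) {
  p <- as.integer((train$Sex == "female" & train$Pclass < 3) |
    (!is.na(train$Age) & train$Age < cutoff & train$Pclass < 3))
  cat(sprintf("Age < %d: %.5f%% training accuracy\n", cutoff, mean(p == train$Survived) * 100))
}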
Decision Tree
Some factors need to be removed to stay within tree’s limit of 32 levels per factor predictor.
library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
titanic_tree <- tree(formula = Survived ~ . - PassengerId - Name - Ticket - Cabin, data = train)
svg("titanic_tree.svg", width = 11, pointsize = 12, family = "sans")
plot(titanic_tree)
text(titanic_tree, pretty = 0)
dev.off()
This suggests the following rules:
- Sex == "female" & Pclass < 3
- Sex == "female" & Pclass == 3 & Fare < 20.8
- Females in 3rd class younger than 38.5
- Males in 1st and 2nd class younger than 6.5
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
(test$Sex == "female" & test$Pclass < 3) |
(test$Sex == "female" & test$Pclass == 3 & !is.na(test$Age) & test$Age < 38.5) |
(test$Sex == "male" & test$Pclass < 3 & !is.na(test$Age) & test$Age < 6.5))
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
This produced 0.77272, so still worse than the first guesstimate.
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
(test$Sex == "female" & test$Pclass < 3) |
(test$Sex == "female" & test$Pclass == 3 & !is.na(test$Age) & test$Age < 38.5) |
(test$Sex == "male" & test$Pclass == 2 & !is.na(test$Age) & test$Age < 6.5) |
(test$Sex == "male" & test$Pclass == 1 & !is.na(test$Age) & test$Age < 53))
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
This produced 0.72488, so even worse.
Decision Trees
For 1st and 2nd class females:
nrow(subset(train, Sex == "female" & Survived == 1 & Pclass < 3))/nrow(subset(train, Sex == "female" & Pclass < 3))
This rose to 0.9470588, but for 3rd class it fell to 0.5.
For reference, the signature of tree() from its help page:
tree(formula,
data,
weights,
subset,
na.action = na.pass,
control = tree.control(nobs, ...),
method = "recursive.partition",
split = c("deviance", "gini"),
model = FALSE,
x = FALSE,
y = TRUE,
wts = TRUE,
...)
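The control argument, via tree.control, is the main tuning knob in that signature; a sketch lowering mindev so the tree grows deeper than the default:
library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
deep <- tree(Survived ~ Sex + Age + Pclass + SibSp + Parch + Fare, data = train,
             control = tree.control(nobs = nrow(train), mindev = 0.001))
summary(deep) # more terminal nodes than the default fit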
library(treemap)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Outcome <- factor(ifelse(train$Survived == 1, "Lived", "Died"))
train$Count <- 1 # constant tile size, so areas reflect passenger counts
treemap(
  dtf = train,
  index = c("Outcome", "Sex"),
  vSize = "Count"
)
The full signature of treemap() from its help page:
treemap(dtf,
index,
vSize,
vColor = NULL,
stdErr = NULL,
type = "index",
fun.aggregate = "sum",
title = NA,
title.legend = NA,
algorithm = "pivotSize",
sortID = "-size",
mirror.x = FALSE,
mirror.y = FALSE,
palette = NA,
palette.HCL.options = NULL,
range = NA,
mapping = NA,
n = 7,
na.rm = TRUE,
na.color = "#DDDDDD",
na.text = "Missing",
fontsize.title = 14,
fontsize.labels = 11,
fontsize.legend = 12,
fontcolor.labels = NULL,
fontface.labels = c("bold", rep("plain", length(index) - 1)),
fontfamily.title = "sans",
fontfamily.labels = "sans",
fontfamily.legend = "sans",
border.col = "black",
border.lwds = c(length(index) + 1, (length(index) - 1):1),
lowerbound.cex.labels = 0.4,
inflate.labels = FALSE,
bg.labels = NULL,
force.print.labels = FALSE,
overlap.labels = 0.5,
align.labels = c("center", "center"),
xmod.labels = 0,
ymod.labels = 0,
eval.labels = FALSE,
position.legend = NULL,
reverse.legend = FALSE,
format.legend = NULL,
drop.unused.levels = TRUE,
aspRatio = NA,
vp = NULL,
draw = TRUE,
...
)
Hierarchical Clustering
The signatures of hclust() and its plot method, from their help pages:
hclust(d, method = "complete", members = NULL)
plot(x, labels = NULL, hang = 0.1, check = TRUE,
axes = TRUE, frame.plot = FALSE, ann = TRUE,
main = "Cluster Dendrogram",
sub = NULL, xlab = NULL, ylab = "Height", ...)
Histogram to figure out age
library(dplyr)
library(ggplot2)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$RoundedAge <- as.integer(round(train$Age))
by_age <- train |> group_by(RoundedAge) |>
summarize(total = n(), survived = sum(Survived), odds = (survived/total) * 100)
by_age1 <- by_age[, c("RoundedAge", "odds")]
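The heading promises a chart; a sketch of the odds-by-age plot with base barplot, matching the rest of the document (the file name age_odds.svg is invented here):
svg("age_odds.svg", width = 11, pointsize = 12, family = "sans")
barplot(by_age1$odds, names.arg = by_age1$RoundedAge,
  xlab = "Rounded Age", ylab = "% survived")
dev.off()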
Hierarchical Clustering
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
# dist needs numeric input, so encode Sex as 0/1 and drop rows with no Age
train <- subset(train, !is.na(Age))
train$SexNum <- as.integer(train$Sex == "female")
hc <- hclust(d = dist(train[,c("Survived", "SexNum", "Age")]), method = "ave")
plot(hc)
plot(hc, hang = -1)
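cutree, also from the stats package, converts the dendrogram into cluster labels; a sketch cutting into two groups and cross-tabulating against survival:
groups <- cutree(hc, k = 2)
table(Cluster = groups, Survived = train$Survived)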