Poop Sheet

Titanic

Introductory Example

Kaggle provides a training set (which includes the result, Survived, along with various predictors), a test set (which doesn’t include Survived), and a simple example, gender_submission.csv, to show how results should be submitted.

Writing submission file

The model used for gender_submission.csv is that all female passengers survived and all male passengers didn’t.

test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(test$Sex == "female")
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

Checking the accuracy of the “all females lived, all males died” assumption.

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Prediction <- as.integer(train$Sex == "female")
(sum(train$Survived == train$Prediction)/nrow(train)) * 100

On the training data, this assumption gives 78.67565% accuracy.

Visualisation

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("pie1.svg", width = 11, pointsize = 12, family = "sans")
pie(table(train[,c("Survived", "Sex")]), labels=c("Females Died 81","Females Lived 233","Males Died 468","Males Lived 109"))
dev.off()

Pie of male/female survivors

Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data. — R’s pie help page.
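
Following that advice, a dot-chart version of the same counts might look like the sketch below; the barplot that follows is the route actually taken.

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
counts <- table(train$Sex, train$Survived)
colnames(counts) <- c("Died", "Lived")
# unclass() turns the 2-D table into a plain matrix, which dotchart accepts
dotchart(unclass(counts), xlab = "Passengers")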

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("barplot1.svg", width = 11, pointsize = 12, family = "sans")
barplot(table(train[,c("Survived", "Sex")]), col = c(4,5))
legend("topleft", legend = c("Died", "Lived"), fill = c(4,5))
dev.off()

Barplot of male/female survivors

Factors can be combined by concatenating strings using paste.

train$ClassSex <- paste(train$Pclass, train$Sex)
svg("barplot2.svg", width = 11, pointsize = 12, family = "sans")
barplot(table(train[,c("Survived", "ClassSex")]), col = c(4,5))
legend("topleft", legend = c("Died", "Lived"), fill = c(4,5))
dev.off()

Barplot of male/female survivors subsetted by class

Indexes

PassengerId is obviously an index. Another is Name. While Ticket is not unique (presumably families travelled on a shared ticket), most tickets have a frequency of 1. Cabin is similar.
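
These observations are easy to check (a quick sketch against the standard train.csv):

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
length(unique(train$PassengerId)) == nrow(train)  # TRUE: one row per id
length(unique(train$Name)) == nrow(train)         # TRUE: names are unique in train.csv
table(table(train$Ticket))                        # most tickets occur once
cabins <- droplevels(train$Cabin[train$Cabin != ""])  # ignore the many blank cabins
table(table(cabins))                              # ditto for the cabins that are recorded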

Best Subset Selection

leaps

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
library(leaps)
regfit.full <- regsubsets(Survived ~ ., train)

This throws the following error:

Reordering variables and trying again:
Error in leaps.exhaustive(a, really.big) : 
  Exhaustive search will be S L O W, must specify really.big=T
In addition: Warning message:
In leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in = force.in,  :
  1015  linear dependencies found

Using really.big=T crashes my laptop.

The next step is to remove all the ID columns (plus Embarked) from the data.

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
library(leaps)
regfit <- regsubsets(Survived ~ . - PassengerId - Name - Ticket - Cabin - Embarked, train)
reg.summary <- summary(regfit)
         Pclass Sexmale Age SibSp Parch Fare
1  ( 1 ) " "    "*"     " " " "   " "   " " 
2  ( 1 ) "*"    "*"     " " " "   " "   " " 
3  ( 1 ) "*"    "*"     "*" " "   " "   " " 
4  ( 1 ) "*"    "*"     "*" "*"   " "   " " 
5  ( 1 ) "*"    "*"     "*" "*"   " "   "*" 
6  ( 1 ) "*"    "*"     "*" "*"   "*"   "*" 

This suggests the following subsets:

  1. Sex
  2. Sex, Pclass
  3. Sex, Pclass, Age
  4. Sex, Pclass, Age, SibSp
  5. Sex, Pclass, Age, SibSp, Fare
  6. Sex, Pclass, Age, SibSp, Parch, Fare

The optimal subset size is suggested by the RSS, adjusted \(R^2\), \(C_p\), and BIC statistics in summary(regfit):

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
library(leaps)
regfit <- regsubsets(Survived ~ . - PassengerId - Name - Ticket - Cabin - Embarked, train)
reg.summary <- summary(regfit)
svg("subset-selection.svg", width = 11, pointsize = 12, family = "sans")
par(mfrow = c(2, 2))
which.min(reg.summary$rss) # 6
plot(reg.summary$rss , xlab = "Number of Variables", ylab = "RSS", type = "l")
points(6, reg.summary$rss[6], col = "red", cex = 2, pch = 20)
which.max(reg.summary$adjr2) # 4
plot(reg.summary$adjr2 , xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
points(4, reg.summary$adjr2[4], col = "red", cex = 2, pch = 20)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
which.min(reg.summary$cp) # 4
points(4, reg.summary$cp[4], col = "red", cex = 2, pch = 20)
which.min(reg.summary$bic) # 4
plot(reg.summary$bic , xlab = "Number of Variables", ylab = "BIC", type = "l")
points(4, reg.summary$bic[4], col = "red", cex = 2, pch = 20)
dev.off()

Subset Selection

\(C_p\), AIC, BIC, and Adjusted \(R^2\) are explained on p244 of An Introduction to Statistical Learning with Applications in R.
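
For reference, with \(d\) predictors, \(n\) observations, and \(\hat{\sigma}^2\) an estimate of the error variance, the definitions are:

\[ C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right) \qquad \mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\,d\hat{\sigma}^2\right) \qquad \text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)} \]

For least squares models, AIC is proportional to \(C_p\), so it ranks the subsets the same way.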

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
library(pls)
model <- pcr(Survived ~ . - PassengerId - Name - Ticket - Cabin - Embarked, data = train, validation = "CV")

summary(model) # cross-validation results for each number of components

glmnet

This function has slightly different syntax from other model-fitting functions that we have encountered thus far in this book. In particular, we must pass in an x matrix as well as a y vector, and we do not use the y ∼ x syntax.

The x matrix can be obtained with x <- model.matrix(object = Y ~ X1 + X2, data = train)[, -1].

NA values need to be removed in y since model.matrix does that for x.
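
The mismatch is easy to demonstrate: model.matrix silently drops the rows with missing Age, so its row count no longer matches a y taken from the full data.

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
nrow(model.matrix(Survived ~ Age, data = train))  # 714: rows with NA Age dropped
nrow(train)                                       # 891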

If alpha=0 then a ridge regression model is fit, and if alpha=1 then a lasso model is fit.

library(glmnet)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train1 <- subset(train, !is.na(Age))
x <- model.matrix(object = Survived ~ Sex + Pclass + Age + SibSp, data = train1)[, -1]
y <- train1$Survived
# cv.glmnet cross-validates to choose lambda; plain glmnet fits a whole path
model <- cv.glmnet(x, y, alpha = 0, family = "gaussian")
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Age[is.na(test$Age)] <- mean(test$Age, na.rm = TRUE)
newx <- model.matrix(~ Sex + Pclass + Age + SibSp, data = test)[, -1]
# 0.5 is an assumed cut-off to convert the numeric predictions to 0/1
test$Survived <- as.integer(predict(model, newx = newx, s = "lambda.min") > 0.5)

My first submission file

Assume all females in better than 3rd class survived, along with everyone under 14 in better than 3rd class:

test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
  (test$Sex == "female" & test$Pclass < 3) |
  (test$Pclass < 3 & !is.na(test$Age) & test$Age < 14)
)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

This worked and received a public score of 0.78468, placing me 1783rd on the leaderboard.

Examining the variables

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- glm(Survived ~ Sex + Age + Pclass + SibSp + Parch + Fare, data = train)
summary(model)

Submission based on glm

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Age[is.na(train$Age)] <- round(mean(train$Age, na.rm = TRUE), digits = 2)
model <- glm(Survived ~ Sex + Age + Pclass + SibSp + Parch + Fare, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Age[is.na(test$Age)] <- round(mean(test$Age, na.rm = TRUE), digits = 2)
test$Predict <- predict(object = model, newdata = test)
test$Survived <- as.integer(test$Predict > 0.75)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

This scored 0.75.

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Age[is.na(train$Age)] <- round(mean(train$Age, na.rm = TRUE), digits = 2)
model <- glm(Survived ~ Sex + Age + Pclass + SibSp + Parch + Fare, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Age[is.na(test$Age)] <- round(mean(test$Age, na.rm = TRUE), digits = 2)
test$Predict <- predict(object = model, newdata = test)
test$Survived <- round(test$Predict, digits = 0)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

This improved the score to 0.76555, but it is still not as good as the first guestimate.

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$AC <- ifelse(!is.na(train$Age) & train$Age < 14, "Child", "Adult")
model <- glm(Survived ~ Sex + AC + Pclass + SibSp + Parch + Fare, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$AC <- ifelse(!is.na(test$Age) & test$Age < 14, "Child", "Adult")
test$Predict <- predict(object = model, newdata = test)
test$Survived <- round(test$Predict, digits = 0)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

This improved the score to 0.77511, still not as good as the first guestimate.

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$AC <- ifelse(!is.na(train$Age) & train$Age < 20, "Child", "Adult")
model <- glm(Survived ~ Sex + AC + Pclass + SibSp + Parch + Fare, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$AC <- ifelse(!is.na(test$Age) & test$Age < 20, "Child", "Adult")
test$Predict <- predict(object = model, newdata = test)
test$Survived <- round(test$Predict, digits = 0)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

This reduced the score to 0.77272.

Tree

library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- tree(Survived ~ Sex + Age + Pclass + SibSp + Parch + Fare, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Predict <- predict(object = model, newdata = test)
test$Survived <- round(test$Predict, digits = 0)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

This produced 0.77511, same as glm.

library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- tree(Survived ~ Sex + Age + Pclass + SibSp + Parch + Fare, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Predict <- predict(object = model, newdata = test)
test$Survived <- as.integer(test$Predict > 0.75)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

No change to score.

library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- tree(Survived ~ Sex + Age + Pclass + SibSp + Parch + Fare, data = train)
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Predict <- predict(object = model, newdata = test)
test$Survived <- as.integer(test$Predict > 0.8)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

No change.

Visual exploration

library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
titanic_tree <- tree(formula = Survived ~ Sex + Age + Parch + SibSp, data = train, subset = Pclass == 1)
svg("tree_first.svg", width = 11, pointsize = 12, family = "sans")
plot(titanic_tree)
text(titanic_tree, pretty = 0)
dev.off()

Decision tree for 1st class

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
titanic_tree <- tree(formula = Survived ~ Sex + Age + Parch + SibSp, data = train, subset = Pclass == 2)
svg("tree_second.svg", width = 11, pointsize = 12, family = "sans")
plot(titanic_tree)
text(titanic_tree, pretty = 0)
dev.off()

Decision tree for 2nd class

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
titanic_tree <- tree(formula = Survived ~ Sex + Age + Parch + SibSp, data = train, subset = Pclass == 3)
svg("tree_third.svg", width = 11, pointsize = 12, family = "sans")
plot(titanic_tree)
text(titanic_tree, pretty = 0)
dev.off()

Decision tree for 3rd class

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("barplot_class.svg", width = 11, pointsize = 12, family = "sans")
barplot(table(train$Survived, train$Pclass), 
  names.arg=c("1st", "2nd", "3rd"),
  legend.text = c("Died", "Lived"))
dev.off()

Barplot by class

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("barplot_class_grouped.svg", width = 11, pointsize = 12, family = "sans")
barplot(table(train$Survived, train$Pclass), 
  names.arg=c("1st", "2nd", "3rd"),
  legend.text = c("Died", "Lived"),
  beside = TRUE)
dev.off()

Grouped barplot by class

Odds of survival

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
odds_sex_class <- data.frame(
  sex_class = c("1st Female", "2nd Female", "3rd Female", "1st Male Under 18", "2nd Male Under 18", "3rd Male Under 18"),
  odds = c(
    nrow(subset(train, Sex == "female" & Pclass == 1 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass == 1)),
    nrow(subset(train, Sex == "female" & Pclass == 2 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass == 2)),
    nrow(subset(train, Sex == "female" & Pclass == 3 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass == 3)),
    nrow(subset(train, Sex == "male" & Pclass == 1 & !is.na(Age) & Age < 18 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass == 1 & !is.na(Age) & Age < 18)),
    nrow(subset(train, Sex == "male" & Pclass == 2 & !is.na(Age) & Age < 18 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass == 2 & !is.na(Age) & Age < 18)),
    nrow(subset(train, Sex == "male" & Pclass == 3 & !is.na(Age) & Age < 18 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass == 3 & !is.na(Age) & Age < 18))
))
svg("barplot_odds_class_sex_grouped.svg", width = 11, pointsize = 12, family = "sans")
barplot(height = odds_sex_class$odds, names.arg = odds_sex_class$sex_class,
  ylim = c(0.0, 1.0))
dev.off()

Odds bargraph, males under 18

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
odds_sex_class <- data.frame(
  sex_class = c("1st Female", "2nd Female", "3rd Female", "1st Male", "2nd Male", "3rd Male"),
  odds = c(
    nrow(subset(train, Sex == "female" & Pclass == 1 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass == 1)),
    nrow(subset(train, Sex == "female" & Pclass == 2 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass == 2)),
    nrow(subset(train, Sex == "female" & Pclass == 3 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass == 3)),
    nrow(subset(train, Sex == "male" & Pclass == 1 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass == 1)),
    nrow(subset(train, Sex == "male" & Pclass == 2 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass == 2)),
    nrow(subset(train, Sex == "male" & Pclass == 3 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass == 3))
))
svg("barplot_odds_class_sex_age_limit_grouped.svg", width = 11, pointsize = 12, family = "sans")
barplot(height = odds_sex_class$odds, names.arg = odds_sex_class$sex_class,
  ylim = c(0.0, 1.0))
dev.off()

Odds bargraph, all males
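
The same proportions can be computed far more compactly. A sketch using aggregate (this reproduces the figures above rather than being the method used):

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
# the mean of a 0/1 column is the survival proportion for each Sex/Pclass group
aggregate(Survived ~ Sex + Pclass, data = train, FUN = mean)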

Next attempts

test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
  (test$Sex == "female" & test$Pclass < 3) |
  (test$Sex == "male" & test$Pclass == 1 & !is.na(test$Age) & test$Age < 28) |
  (test$Sex == "male" & test$Pclass == 2 & !is.na(test$Age) & test$Age < 12)
)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

This produced a result of 0.73444, so worse than first guestimate.

All children under 7 travelling with a parent survived.
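
A quick check of that claim against the training data (a sketch; if the claim holds, the count below is zero):

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
nrow(subset(train, !is.na(Age) & Age < 7 & Parch > 0 & Survived == 0))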

test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
  (test$Sex == "female" & test$Pclass < 3) |
  (test$Pclass < 3 & !is.na(test$Age) & test$Age < 19 & test$Parch > 0) |
  (!is.na(test$Age) & test$Age < 7 & test$Parch > 0)
)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

This reduced the score to 0.77272.

test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
  (test$Sex == "female" & test$Pclass < 3) |
  (test$Pclass < 3 & !is.na(test$Age) & test$Age < 19 & test$Parch > 0) |
  (test$Sex == "female" & !is.na(test$Age) & test$Age < 3 & test$Parch > 0)
)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

The third rule makes no difference to the first score if only girls under 3 (any class) are included; the score gets worse if over-threes and boys are included.

test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
  (test$Sex == "female" & test$Pclass < 3) |
  (test$Pclass < 3 & !is.na(test$Age) & test$Age < 19 & (test$Parch > 0 | test$SibSp > 0)) |
  (test$Sex == "female" & !is.na(test$Age) & test$Age < 3 & (test$Parch > 0 | test$SibSp > 0))
)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
  (test$Sex == "female" & test$Pclass < 3) |
  (test$Pclass < 3 & !is.na(test$Age) & test$Age < 23 & (test$Parch > 0 | test$SibSp > 0)) |
  (test$Sex == "female" & !is.na(test$Age) & test$Age < 3 & (test$Parch > 0 | test$SibSp > 0))
)
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

The tree library does the above calculation diagrammatically:

library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
tree_sex <- tree(formula = Survived ~ Sex, data = train)
svg("tree_sex.svg", width = 11, pointsize = 12, family = "sans")
plot(tree_sex)
text(tree_sex, pretty = 0)
dev.off()

Female/Male odds

Making the model more complex by introducing age as a factor

nrow(subset(train, Sex == "female" & Pclass < 3 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass < 3))

This shows nearly 95% of women in 1st and 2nd class survived, while only 50% of women travelling 3rd class survived.

nrow(subset(train, Sex == "female" & Pclass == 3 & Age < 6 & Survived == 1))/nrow(subset(train, Sex == "female" & Pclass == 3 & Age < 6))

While the odds for women travelling 3rd class was 50%, girls under 6 had a 73% chance of surviving.

nrow(subset(train, Sex == "male" & Pclass < 3 & Age < 18 & Survived == 1))/nrow(subset(train, Sex == "male" & Pclass < 3 & Age < 18))

1st and 2nd class male passengers under 18 had an 87% survival rate, which dropped to 23% for 3rd class males under 18.

Not even very young 3rd class males seemed to have had much chance of surviving.

library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
tree_parch <- tree(formula = Survived ~ Sex + Pclass + Age + SibSp + Parch, data = train)
svg("tree_parch.svg", width = 11, pointsize = 12, family = "sans")
plot(tree_parch)
text(tree_parch, pretty = 0)
dev.off()

Decision tree including Sex, Pclass, Age, SibSp and Parch

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- glm(Survived ~ Sex, data = train)
train$Prediction <- round(predict(model, newdata = train))
(sum(train$Survived == train$Prediction)/length(train$Survived)) * 100

predict(model) allocates 0.1889081 to males and 0.7420382 to females, so rounding produces the same 78.67565% accuracy as assuming all females survived and all males didn’t.
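
That allocation can be confirmed directly (a sketch):

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- glm(Survived ~ Sex, data = train)
tapply(predict(model), train$Sex, unique)  # one fitted value per sex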

Making age a factor

Relationship between Age Group and Survival

A snag here is that Age is NA for many entries.
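
Counting the gaps: 177 of the 891 training rows have no Age.

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
sum(is.na(train$Age))  # 177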

Subset train to only rows with Age provided
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
has_age <- subset(train, !is.na(train$Age))
model <- glm(Survived ~ Sex + Age, data = has_age)
has_age$Prediction <- round(predict(model, newdata = has_age))
(sum(has_age$Survived == has_age$Prediction)/length(has_age$Survived)) * 100

This reduces the accuracy of the model slightly, to 78.0112%.

Subset train to only rows with no Age provided
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
no_age <- subset(train, is.na(train$Age))
model <- glm(Survived ~ Sex, data = no_age)
no_age$Prediction <- round(predict(model, newdata = no_age))
(sum(no_age$Survived == no_age$Prediction)/length(no_age$Survived)) * 100

Passengers with no age provided had a lower survival rate, 0.1290323 for males and 0.6792453 for females. The accuracy of the prediction rose to 81.35593%.

Subset to children and adults
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
children <- subset(train, train$Age < 15)
model <- glm(Survived ~ Sex, data = children)
children$Prediction <- round(predict(model, newdata = children))
(sum(children$Survived == children$Prediction)/length(children$Survived)) * 100

The odds for males rose to 0.5384615 if they were under 15 while those for females dropped to 0.6153846. The accuracy fell to 57.69231%.

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
adults <- subset(train, train$Age >= 15)
model <- glm(Survived ~ Sex, data = adults)
adults$Prediction <- round(predict(model, newdata = adults))
(sum(adults$Survived == adults$Prediction)/length(adults$Survived)) * 100

The odds for male adults fell to 0.1739130 and for females fell to 0.7792793. Accuracy rose to 80.97484%.

Assuming all females and children under 14 survived

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Prediction <- as.integer(train$Sex == "female" | (!is.na(train$Age) & train$Age < 14))
(sum(train$Survived == train$Prediction)/length(train$Survived)) * 100

This increases the accuracy slightly to 79.23681%.

Subsetting by class

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
first <- subset(train, Pclass == 1)
(sum(first$Survived == 1)/length(first$PassengerId)) * 100
first_male <- subset(train, Pclass == 1 & Sex == "male")
(sum(first_male$Survived == 1)/length(first_male$PassengerId)) * 100
first_female <- subset(train, Pclass == 1 & Sex == "female")
(sum(first_female$Survived == 1)/length(first_female$PassengerId)) * 100

62.96296% of first class passengers survived. The odds for males rose to 36.88525% and females to 96.80851%.

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
second <- subset(train, Pclass == 2)
(sum(second$Survived == 1)/length(second$PassengerId)) * 100
second_male <- subset(train, Pclass == 2 & Sex == "male")
(sum(second_male$Survived == 1)/length(second_male$PassengerId)) * 100
second_female <- subset(train, Pclass == 2 & Sex == "female")
(sum(second_female$Survived == 1)/length(second_female$PassengerId)) * 100

47.28261% overall, 15.74074% males and 92.10526% females.

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
third <- subset(train, Pclass == 3)
(sum(third$Survived == 1)/length(third$PassengerId)) * 100
third_male <- subset(train, Pclass == 3 & Sex == "male")
(sum(third_male$Survived == 1)/length(third_male$PassengerId)) * 100
third_female <- subset(train, Pclass == 3 & Sex == "female")
(sum(third_female$Survived == 1)/length(third_female$PassengerId)) * 100

Only 24.23625% of third class passengers survived. Only 13.54467% of males and 50% of females.

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
above_third <- subset(train, Pclass < 3)
(sum(above_third$Survived == 1)/length(above_third$PassengerId)) * 100
above_third_male <- subset(train, Pclass < 3 & Sex == "male")
(sum(above_third_male$Survived == 1)/length(above_third_male$PassengerId)) * 100
above_third_female <- subset(train, Pclass < 3 & Sex == "female")
(sum(above_third_female$Survived == 1)/length(above_third_female$PassengerId)) * 100

Passengers in better than 3rd class survived at 55.75% overall; 26.95652% of males and 94.70588% of females survived.

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
third_children <- subset(train, Pclass == 3 & Age < 14)
(sum(third_children$Survived == 1)/length(third_children$PassengerId)) * 100
above_third_children <- subset(train, Pclass < 3 & Age < 14)
(sum(above_third_children$Survived == 1)/length(above_third_children$PassengerId)) * 100

Only 42.85714% of 3rd class children survived while 95.45455% of children in 2nd and 1st class survived.

Assuming females and children in 1st or 2nd class survived, and all 3rd class passengers died.

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Prediction <- as.integer((train$Sex == "female" & train$Pclass < 3) | 
  (!is.na(train$Age) & train$Age < 14 & train$Pclass < 3))
(sum(train$Survived == train$Prediction)/length(train$Survived)) * 100

This raises the accuracy to 80.02245%.

Raising Age to 16.

test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer((test$Sex == "female" & test$Pclass < 3) | 
  (!is.na(test$Age) & test$Age < 16 & test$Pclass < 3))
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

This dropped the score to 0.78229.

Raising Age to 15.

test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer((test$Sex == "female" & test$Pclass < 3) | 
  (!is.na(test$Age) & test$Age < 15 & test$Pclass < 3))
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

The score remained at 0.78229, so < 14 seems best.

Decision Tree

Some factors need to be removed to keep within tree’s 32-level limit on categorical predictors.
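
Which factors are the problem can be seen by counting levels (a sketch):

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
sapply(Filter(is.factor, train), nlevels)  # Name, Ticket and Cabin are far over 32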

library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
titanic_tree <- tree(formula = Survived ~ . - PassengerId - Name - Ticket - Cabin, data = train)
svg("titanic_tree.svg", width = 11, pointsize = 12, family = "sans")
plot(titanic_tree)
text(titanic_tree, pretty = 0)
dev.off()

Titanic Decision Tree from train data

This suggests the following rules:

test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
  (test$Sex == "female" & test$Pclass < 3) |
  (test$Sex == "female" & test$Pclass == 3 & !is.na(test$Age) & test$Age < 38.5) |
  (test$Sex == "male" & test$Pclass < 3 & !is.na(test$Age) & test$Age < 6.5))
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

This produced 0.77272, so still worse than first guestimate.

test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(
  (test$Sex == "female" & test$Pclass < 3) |
  (test$Sex == "female" & test$Pclass == 3 & !is.na(test$Age) & test$Age < 38.5) |
  (test$Sex == "male" & test$Pclass == 2 & !is.na(test$Age) & test$Age < 6.5) |
  (test$Sex == "male" & test$Pclass == 1 & !is.na(test$Age) & test$Age < 53))
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)

This produced 0.72488, so even worse.

Decision Trees

For 1st and 2nd class females

nrow(subset(train, Sex == "female" & Survived == 1 & Pclass < 3))/nrow(subset(train, Sex == "female" & Pclass < 3))

This rose to 0.9470588, but for 3rd class it dropped to 0.5.

For reference, the full signature of tree():

tree(formula,
     data,
     weights,
     subset,
     na.action = na.pass,
     control = tree.control(nobs, ...),
     method = "recursive.partition",
     split = c("deviance", "gini"),
     model = FALSE,
     x = FALSE,
     y = TRUE,
     wts = TRUE,
     ...)
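
Most of these arguments can be left at their defaults; split is the one most likely to be changed. A sketch using the Gini index (this requires a factor response, making it a classification tree rather than the regression trees above):

library(tree)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
# a factor response makes tree() build a classification tree, enabling split = "gini"
gini_tree <- tree(factor(Survived) ~ Sex + Pclass + SibSp, data = train, split = "gini")
summary(gini_tree)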

R Graph Gallery

library(treemap)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Outcome <- factor(ifelse(train$Survived == 1, "Lived", "Died"))
train$Count <- 1  # constant, so each passenger contributes equal area
# index on Outcome (not Survived) so the tiles get readable labels; sizing
# by Survived would give zero-area tiles for the passengers who died
treemap(
  dtf = train,
  index = c("Outcome", "Sex"),
  vSize = "Count"
)

For reference, the full signature of treemap():

treemap(dtf,
       index,
       vSize,
       vColor = NULL,
       stdErr = NULL,
       type = "index",
       fun.aggregate = "sum",
       title = NA,
       title.legend = NA,
       algorithm = "pivotSize",
       sortID = "-size",
       mirror.x = FALSE,
       mirror.y = FALSE,
       palette = NA,
       palette.HCL.options = NULL,
       range = NA,
       mapping = NA,
       n = 7,
       na.rm = TRUE,
       na.color = "#DDDDDD",
       na.text = "Missing",
       fontsize.title = 14,
       fontsize.labels = 11,
       fontsize.legend = 12,
       fontcolor.labels = NULL,
       fontface.labels = c("bold", rep("plain", length(index) - 1)),
       fontfamily.title = "sans",
       fontfamily.labels = "sans",
       fontfamily.legend = "sans",
       border.col = "black",
       border.lwds = c(length(index) + 1, (length(index) - 1):1),
       lowerbound.cex.labels = 0.4,
       inflate.labels = FALSE,
       bg.labels = NULL,
       force.print.labels = FALSE,
       overlap.labels = 0.5,
       align.labels = c("center", "center"),
       xmod.labels = 0,
       ymod.labels = 0,
       eval.labels = FALSE,
       position.legend = NULL,
       reverse.legend = FALSE,
       format.legend = NULL,
       drop.unused.levels = TRUE,
       aspRatio = NA,
       vp = NULL,
       draw = TRUE,
       ...
     )

Hierarchical Clustering

Example signatures, from the hclust and plot.hclust help pages:

hclust(d, method = "complete", members = NULL)

plot(x, labels = NULL, hang = 0.1, check = TRUE,
  axes = TRUE, frame.plot = FALSE, ann = TRUE,
  main = "Cluster Dendrogram",
  sub = NULL, xlab = NULL, ylab = "Height", ...)

Histogram to figure out age

library(dplyr)
library(ggplot2)
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$RoundedAge <- as.integer(round(train$Age))
by_age <- train |> group_by(RoundedAge) |>
  summarize(total = n(), survived = sum(Survived), odds = (survived/total) * 100)
by_age1 <- by_age[,c("RoundedAge","odds")]
# plot the survival odds against rounded age (rows with NA age are dropped)
ggplot(by_age1, aes(x = RoundedAge, y = odds)) + geom_col()

Hierarchical Clustering

train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
# dist() needs numeric columns and no NAs, so code Sex as 0/1 and drop missing ages
cc <- subset(train, !is.na(Age), select = c("Survived", "Sex", "Age"))
cc$Sex <- as.integer(cc$Sex == "female")
hc <- hclust(d = dist(cc), method = "ave")
plot(hc)
plot(hc, hang = -1)
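
To get actual groups out of the fitted dendrogram, cutree can be applied (a sketch, using the hc and cc objects above):

groups <- cutree(hc, k = 2)
table(groups, cc$Survived)  # do two clusters line up with survival?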