Visualisation
R
High-level graphics functions initiate a new plot
curve
svg("curve_sin.svg", width = 11, pointsize = 12, family = "sans")
curve(sin, from = 0, to = 2 * pi, n = 101)
abline(h = 0)
dev.off()
https://www.kaggle.com/learn/intro-to-machine-learning
https://datacarpentry.github.io/R-ecology-lesson/
https://www.stat.cmu.edu/~ryantibs/statcomp/lectures/apply.html
R
- subset
- split
- reorder
- aggregate
- unique
- normalize
- scale
models
All of these take a formula as their first argument, a data frame as their second, subset as their third, weights as their fourth, na.action,…
- lm
- glm
- aov
- loess
- model.matrix
- randomForest (needs to be loaded with library(randomForest))
- lda (needs library(lda))
- naive_bayes (needs library(naivebayes))
- tree (needs to be loaded with library(tree))
- pcr (needs to be loaded with library(pcr)
- plsr
- gam
- gbm
- svm
- survfit
- survdiff
- coxph
- qda
- regsubsets
formula
~ operator
formula = y ~ x1 + x2 + x3
formula = y ~ .
ggplot2
dataframes
Accessing single columns
- Using the dollar sign ($) and the name of the column.
train$Survived
- Using square brackets with the index of the column after the comma.
train[,2]
Accessing groups of columns
train[,c("Sex", "Survived", "Age")]
train[,c(5, 2, 6)]
dplyr
- Rows
- filter()
- arrange()
- distinct()
- count()
- Columns
- mutate()
- select()
- rename()
- relocate()
- Groups
- group_by()
- summarize()
Predictors
Which predictors are associated with the response?
What is the relationship between the response and each predictor?
Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Classifiers
Models
model: an R object, typically returned by ‘lm’ or ‘glm’.
Assessing Model Accuracy
A good classifier is one for which the test error (2.9) is smallest.
Regression Versus Classification Problems
- linear regression
- logistic regression
- Generalized additive models
- boosting
- support vector machines
John Tukey
Tukey introduced the box plot in his 1977 book, “Exploratory Data Analysis”.
He also introduced the word bit as a portmanteau of binary digit and coined the term software.
iris
Iris is one of the databases built into R (ie it can be accessed without loading), and used by Learn R through examples among others to teach the basics.
Other built-in datasets include:
- mtcars: Motor Trend Car Road Tests
- ToothGrowth
- PlantGrowth
- USArrests
The data()
function provides the full list.
Scatterplot
iris$speciesID <- as.numeric(iris$Species) + 1
iris$shape <- ifelse(iris$Sepal.Length < 5.15, 17, 20)
svg("petal-width-length1.svg", width = 11, pointsize = 12, family = "sans")
plot(Petal.Length ~ Petal.Width, data = iris, col = speciesID, pch = shape)
legend("topleft", levels(iris$Species), fill = 2:4)
dev.off()
dendogram
credit
Y is the response and X the predictor.
Credit = read.csv("Credit.csv", header = T, na.strings = "?", stringsAsFactors = T)
house-prices
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The object is to predict SalePrice which is provided in train.csv, but missing in test.csv. The submission should just have two columns, Id and SalePrice.
To get a clue which of the many factors to focus on:
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
model <- lm(SalePrice ~ . , data = train)
summary(model)
Next limiting this to those with three asterisks. A snag is one of the rows in KitchenQual has an “NA” which needs to be replaced by some other value for predict
not to throw an error.
titanic
Introductory Example
Kaggle provides a training set (which includes the result, Survived, along with various predictors), a test set which doesn’t include Survived, and a simple example gender_submission.csv to show how results should be submitted.
Writing submission file
The model used for gender_submission.csv is all female passengers survived and all male passengers didn’t.
test <- read.csv("test.csv", header = T, na.strings = "?", stringsAsFactors = T)
test$Survived <- as.integer(test$Sex == "female")
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = F, row.names = F)
Checking accuracy of “all females lived, all males died” assumption.
train <- read.csv("train.csv", header = T, na.strings = "?", stringsAsFactors = T)
train$Prediction <- as.integer(train$Sex == "female")
(sum(train$Survived == train$Prediction)/nrow(train)) * 100
On the training data, this assumption gives 78.67565% accuracy.
wage
Ordering boxplots
Ordered By Marital Status
Base plot
Wage = read.csv("Wage.csv", header = T, na.strings = "?", stringsAsFactors = T)
new_order <- with(Wage, reorder(maritl , wage, median , na.rm=T))
svg("ordered-maritl.svg", width = 11, pointsize = 12, family = "sans")
plot(new_order, Wage$wage)
dev.off()
ggplot2
library(ggplot2)
Wage = read.csv("Wage.csv", header = T, na.strings = "?", stringsAsFactors = T)
new_order <- with(Wage, reorder(maritl , wage, median , na.rm=T))
svg("ordered-maritl-ggplot2.svg", width = 11, pointsize = 12, family = "sans")
ggplot(data = Wage, mapping = aes(x = new_order, y = wage)) +
geom_boxplot(fill = c(2, 3, 7, 4, 8), alpha = 0.2) +
xlab("Marital Status") +
ylab("Wage")
dev.off()
Finding variable of interest
x must be a numeric vector
d3
geojson
The media type for GeoJSON is “application/geo+json”.
{
"type": "Feature",
"properties": {
"name": "Coors Field",
"amenity": "Baseball Stadium",
"popupContent": "This is where the Rockies play!"
},
"geometry": {
"type": "Point",
"coordinates": [-104.99404, 39.75621]
}
}
- Geometry
- a region of space
- Feature
- a spatially bounded entity
- FeatureCollection
- a list of Features
{
"type": "FeatureCollection",
"features": [{
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [102.0, 0.5]
},
"properties": {
"prop0": "value0"
}
}, {
"type": "Feature",
"geometry": {
"type": "LineString",
"coordinates": [
[102.0, 0.0],
[103.0, 1.0],
[104.0, 0.0],
[105.0, 1.0]
]
},
"properties": {
"prop0": "value0",
"prop1": 0.0
}
}, {
"type": "Feature",
"geometry": {
"type": "Polygon",
"coordinates": [
[
[100.0, 0.0],
[101.0, 0.0],
[101.0, 1.0],
[100.0, 1.0],
[100.0, 0.0]
]
]
},
"properties": {
"prop0": "value0",
"prop1": {
"this": "that"
}
}
}]
}
{
"type": "MultiLineString",
"coordinates": [
[
[170.0, 45.0], [180.0, 45.0]
], [
[-180.0, 45.0], [-170.0, 45.0]
]
]
}
{
"type": "Feature",
"bbox": [-10.0, -10.0, 10.0, 10.0],
"geometry": {
"type": "Polygon",
"coordinates": [
[
[-10.0, -10.0],
[10.0, -10.0],
[10.0, 10.0],
[-10.0, -10.0]
]
]
}
//...
}