Poop Sheet

Visualisation

R

https://www.bigbookofr.com/

High-level graphics functions initiate a new plot

curve
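# Write the plot to an SVG file; width is given in inches
svg("curve_sin.svg", width = 11, pointsize = 12, family = "sans")
curve(sin, from = 0, to = 2 * pi, n = 101)  # high-level call: starts a new plot
abline(h = 0)                               # low-level call: adds a horizontal line at y = 0
dev.off()                                   # close the device so the file is written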

sin curve
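As a rough sketch of the distinction between high-level and low-level graphics functions: a high-level call such as plot() starts a fresh plot, while low-level calls such as lines(), abline() and legend() draw on the plot that is already open. The null PDF device and the cosine curve are assumptions made so the example runs without writing a file; they are not part of the notes.

```r
# Open a null PDF device so nothing is written to disk
# (swap in svg("name.svg", ...) to save the output instead).
pdf(file = NULL)

x <- seq(0, 2 * pi, length.out = 101)

# High-level call: starts a new plot with axes, labels and a box.
plot(x, sin(x), type = "l", xlab = "x", ylab = "f(x)")

# Low-level calls: add to the plot that is already open.
lines(x, cos(x), lty = 2)
abline(h = 0, col = "grey")
legend("bottomleft", legend = c("sin", "cos"), lty = c(1, 2))

dev.off()
```

Calling plot() a second time would discard the current drawing and begin again, which is why the additions are made with low-level functions.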

https://www.kaggle.com/learn/intro-to-machine-learning

https://r4ds.had.co.nz/

https://r4ds.hadley.nz/

https://datacarpentry.github.io/R-ecology-lesson/

https://www.stat.cmu.edu/~ryantibs/statcomp/lectures/apply.html

from data to viz

R Graph Gallery

R

models

Model-fitting functions such as lm take a formula as their first argument, a data frame as their second, subset as their third, weights as their fourth, na.action,…
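A minimal sketch of that argument order, assuming lm as the fitting function and the built-in mtcars data as a stand-in for any data frame (both are illustrative choices, not taken from the notes). The weights argument is left at its default here.

```r
# mtcars ships with R; it stands in for any data frame.
fit <- lm(
  mpg ~ wt + hp,        # formula: response ~ predictors
  data = mtcars,        # data frame holding the variables
  subset = cyl > 4,     # fit only on rows where cyl > 4
  na.action = na.omit   # drop rows with missing values
)

# Number of observations actually used in the fit.
nobs(fit)
```

Naming the arguments, as above, avoids depending on their position, which differs between fitting functions.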

formula

~ operator

formula = y ~ x1 + x2 + x3

formula = y ~ .
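The two formula shapes above can be tried on a built-in dataset. This is only a sketch: mtcars and the column names mpg, wt and hp are assumptions standing in for y, x1 and x2.

```r
# The ~ operator builds a formula object without evaluating the variables.
f <- mpg ~ wt + hp
class(f)      # "formula"
all.vars(f)   # the variable names used in the formula

# In a model call, "." expands to every column of data except the response.
fit_all <- lm(mpg ~ ., data = mtcars)
names(coef(fit_all))
```

With y ~ . the fitted model has one coefficient per remaining numeric column plus the intercept, so it is a quick way to screen many predictors at once.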

ggplot2

Book

dataframes

Accessing single columns

  1. Using the dollar sign ($) and the name of the column. train$Survived
  2. Using square brackets with the index of the column after the comma. train[,2]

Accessing groups of columns

  1. train[,c("Sex", "Survived", "Age")]
  2. train[,c(5, 2, 6)]
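The same four access patterns can be checked on the built-in iris data frame, used here as an assumed stand-in for train so the sketch runs without the Kaggle files.

```r
# Single column: by name with $, or by position after the comma.
by_dollar <- iris$Species
by_index  <- iris[, 5]

# Groups of columns: by a vector of names, or by a vector of positions.
by_names     <- iris[, c("Species", "Sepal.Length")]
by_positions <- iris[, c(5, 1)]

identical(by_dollar, by_index)       # both give the same vector
identical(by_names, by_positions)    # both give the same data frame
```

Selecting a single column with square brackets returns a vector rather than a data frame; adding drop = FALSE inside the brackets keeps the data frame structure.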

dplyr

basics

  1. Rows: filter(), arrange(), distinct(), count()
  2. Columns: mutate(), select(), rename(), relocate()
  3. Groups: group_by(), summarize()
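A short sketch of several of these verbs chained together, assuming the dplyr package is installed and using the built-in mtcars data. The derived column kpl and the result names heavy and by_cyl are made up for the illustration.

```r
library(dplyr)

heavy <- mtcars %>%
  filter(wt > 3) %>%              # rows: keep cars with wt above 3
  arrange(desc(mpg)) %>%          # rows: sort by mpg, highest first
  mutate(kpl = mpg * 0.4251) %>%  # columns: add kilometres per litre
  select(mpg, kpl, cyl, wt)       # columns: keep a subset

by_cyl <- mtcars %>%
  group_by(cyl) %>%               # groups: one group per cylinder count
  summarize(mean_mpg = mean(mpg), n = n())

by_cyl
```

Each verb takes a data frame and returns a data frame, which is what makes the chaining work.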

Predictors

predict

Which predictors are associated with the response?

hatvalues

What is the relationship between the response and each predictor?

Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
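One possible way to start on these questions with a linear model, sketched on the built-in mtcars data. The choice of dataset, the predictors wt and hp, and the new_car values are assumptions for illustration only.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table: estimate, standard error, t value and p-value per term,
# one way to judge which predictors are associated with the response.
coef(summary(fit))

# predict(): fitted response for new predictor values (a hypothetical car).
new_car <- data.frame(wt = 3, hp = 120)
predict(fit, newdata = new_car, interval = "confidence")

# hatvalues(): leverage of each observation. The values sum to the number
# of estimated coefficients (three here: intercept, wt, hp).
lev <- hatvalues(fit)
head(sort(lev, decreasing = TRUE), 3)
```

Plotting the residuals of fit against the fitted values is a common follow-up for judging whether a linear equation summarises the relationship adequately.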

Classifiers

Models

model: an R object, typically returned by ‘lm’ or ‘glm’.

Assessing Model Accuracy

A good classifier is one for which the test error (2.9) is smallest.
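Assuming (2.9) refers to the usual test error rate, the proportion of held-out observations that are misclassified, it can be computed in one line. The labels below are invented for the sketch.

```r
# Hypothetical true and predicted labels for five held-out observations.
actual    <- c("yes", "no", "no",  "yes", "no")
predicted <- c("yes", "no", "yes", "yes", "no")

# Test error rate: share of test observations that are misclassified.
test_error <- mean(predicted != actual)
test_error   # 0.2, since one of the five predictions is wrong
```

The same expression applied to the training data gives the training error, which tends to understate the test error.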

Regression Versus Classification Problems

John Tukey

Tukey introduced the box plot in his 1977 book, “Exploratory Data Analysis”.

He also introduced the word bit as a portmanteau of binary digit and coined the term software.

List Of JavaScript libraries

iris

Iris Flower Classification

Iris is one of the datasets built into R (i.e. it can be accessed without loading), and is used by Learn R through examples, among others, to teach the basics.

R ships with other built-in datasets as well; the data() function provides the full list.
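A quick sketch of both points, using only base R: iris is usable straight away, and the object returned by data() can be queried for dataset names. The variable name available is an assumption for the example.

```r
# iris is available without any library() or data() call.
str(iris)

# data() returns an object whose results matrix lists the datasets;
# the "Item" column holds their names.
available <- data(package = "datasets")$results[, "Item"]
head(available)
"iris" %in% available
```

Calling data() with no arguments in an interactive session opens the full listing for all attached packages.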

Scatterplot

# Map each species to a colour index (2, 3, 4) and choose the plotting
# symbol by sepal length (17 = filled triangle, 20 = small filled circle)
iris$speciesID <- as.numeric(iris$Species) + 1
iris$shape <- ifelse(iris$Sepal.Length < 5.15, 17, 20)
svg("petal-width-length1.svg", width = 11, pointsize = 12, family = "sans")
plot(Petal.Length ~ Petal.Width, data = iris, col = speciesID, pch = shape)
legend("topleft", levels(iris$Species), fill = 2:4)
dev.off()

Petal width to length scatterplot

dendrogram

Wikipedia

R Graph Gallery

credit

Y is the response and X the predictor.

Credit <- read.csv("Credit.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)

house-prices

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The objective is to predict SalePrice, which is provided in train.csv but missing from test.csv. The submission should have just two columns, Id and SalePrice.

To get an idea of which of the many factors to focus on:

train <- read.csv("train.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
# Regress SalePrice on every other column, then inspect the significance codes
model <- lm(SalePrice ~ . , data = train)
summary(model)

The next step is to limit the model to the predictors marked with three asterisks in the summary (p-value below 0.001). A snag is that one of the rows has an “NA” in KitchenQual, which needs to be replaced by some other value for predict not to throw an error.
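The snag can be reproduced on a toy example. Everything here is hypothetical: the two small data frames stand in for train.csv and test.csv, the prices are made up, and replacing the value with the most common training level is just one possible workaround, assuming the offending value sits in the data passed to predict.

```r
# Hypothetical stand-ins for train.csv and test.csv (values are made up).
train <- data.frame(
  SalePrice   = c(200, 150, 120, 210, 160, 155),
  KitchenQual = factor(c("Gd", "TA", "Fa", "Gd", "TA", "TA"))
)
test <- data.frame(KitchenQual = factor(c("Gd", "NA", "TA")))

model <- lm(SalePrice ~ KitchenQual, data = train)

# Terms whose p-value is below 0.001, i.e. those printed with "***".
coefs <- coef(summary(model))
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.001]

# predict() stops because "NA" is a factor level the model never saw.
result <- try(predict(model, newdata = test), silent = TRUE)
inherits(result, "try-error")

# One possible workaround: substitute the most common training level.
most_common <- names(which.max(table(train$KitchenQual)))
fixed <- as.character(test$KitchenQual)
fixed[fixed == "NA"] <- most_common
test$KitchenQual <- factor(fixed, levels = levels(train$KitchenQual))
pred <- predict(model, newdata = test)
pred
```

Because the files are read with na.strings = "?", the text "NA" is kept as an ordinary factor level rather than a missing value, which is why it shows up as an unseen level.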

titanic

Introductory Example

Kaggle provides a training set (which includes the result, Survived, along with various predictors), a test set which doesn’t include Survived, and a simple example gender_submission.csv to show how results should be submitted.

Writing submission file

The model used for gender_submission.csv assumes that all female passengers survived and all male passengers did not.

test <- read.csv("test.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
# Predict 1 (survived) for female passengers and 0 for male passengers
test$Survived <- as.integer(test$Sex == "female")
write.csv(test[, c("PassengerId", "Survived")], "submission.csv", quote = FALSE, row.names = FALSE)

Checking the accuracy of the “all females lived, all males died” assumption:

train <- read.csv("train.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
train$Prediction <- as.integer(train$Sex == "female")
# Percentage of training rows where the prediction matches the recorded outcome
(sum(train$Survived == train$Prediction) / nrow(train)) * 100

On the training data, this assumption gives 78.67565% accuracy.

wage

Ordering boxplots

Ordered By Marital Status

Base plot

Wage <- read.csv("Wage.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
# Reorder the maritl levels by median wage so the boxes appear in ascending order
new_order <- with(Wage, reorder(maritl, wage, median, na.rm = TRUE))
svg("ordered-maritl.svg", width = 11, pointsize = 12, family = "sans")
plot(new_order, Wage$wage)
dev.off()

Ordered boxplot

ggplot2

library(ggplot2)
Wage <- read.csv("Wage.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
# Reorder the maritl levels by median wage so the boxes appear in ascending order
new_order <- with(Wage, reorder(maritl, wage, median, na.rm = TRUE))
svg("ordered-maritl-ggplot2.svg", width = 11, pointsize = 12, family = "sans")
# The fill vector supplies one colour per box, so it assumes five maritl levels
ggplot(data = Wage, mapping = aes(x = new_order, y = wage)) +
  geom_boxplot(fill = c(2, 3, 7, 4, 8), alpha = 0.2) +
  xlab("Marital Status") +
  ylab("Wage")
dev.off()

Ordered boxplot

Finding variable of interest

x must be a numeric vector

d3

Boxplot

geojson

The media type for GeoJSON is “application/geo+json”.

RFC 7946

Using GeoJSON with Leaflet

{
    "type": "Feature",
    "properties": {
        "name": "Coors Field",
        "amenity": "Baseball Stadium",
        "popupContent": "This is where the Rockies play!"
    },
    "geometry": {
        "type": "Point",
        "coordinates": [-104.99404, 39.75621]
    }
}
Geometry: a region of space
Feature: a spatially bounded entity
FeatureCollection: a list of Features
{
       "type": "FeatureCollection",
       "features": [{
           "type": "Feature",
           "geometry": {
               "type": "Point",
               "coordinates": [102.0, 0.5]
           },
           "properties": {
               "prop0": "value0"
           }
       }, {
           "type": "Feature",
           "geometry": {
               "type": "LineString",
               "coordinates": [
                   [102.0, 0.0],
                   [103.0, 1.0],
                   [104.0, 0.0],
                   [105.0, 1.0]
               ]
           },
           "properties": {
               "prop0": "value0",
               "prop1": 0.0
           }
       }, {
           "type": "Feature",
           "geometry": {
               "type": "Polygon",
               "coordinates": [
                   [
                       [100.0, 0.0],
                       [101.0, 0.0],
                       [101.0, 1.0],
                       [100.0, 1.0],
                       [100.0, 0.0]
                   ]
               ]
           },
           "properties": {
               "prop0": "value0",
               "prop1": {
                   "this": "that"
               }
           }
       }]
   }
   {
       "type": "MultiLineString",
       "coordinates": [
           [
               [170.0, 45.0], [180.0, 45.0]
           ], [
               [-180.0, 45.0], [-170.0, 45.0]
           ]
       ]
   }
   {
       "type": "Feature",
       "bbox": [-10.0, -10.0, 10.0, 10.0],
       "geometry": {
           "type": "Polygon",
           "coordinates": [
               [
                   [-10.0, -10.0],
                   [10.0, -10.0],
                   [10.0, 10.0],
                   [-10.0, -10.0]
               ]
           ]
       }
       //...
   }