Poop Sheet

Visualisation

Advanced R

An Introduction to Statistical Learning

R

It is often convenient to view several plots together.

We can achieve this by calling par() with its mfrow argument, which tells R to split the display screen into separate panels so that multiple plots can be viewed simultaneously. For example, par(mfrow = c(2, 2)) divides the plotting region into a 2 × 2 grid of panels.
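As a minimal sketch of the panel layout described above, the following draws four plots on a 2 × 2 grid. The built-in mtcars data and the pdf(NULL) device (used only so the code runs without a display) are choices made for this illustration.

```r
# Throwaway graphics device so the sketch also runs non-interactively (assumption)
pdf(NULL)

# par() returns the previous settings, which makes them easy to restore later
old_par <- par(mfrow = c(2, 2))

# With mfrow, panels are filled row by row
plot(mpg ~ wt, data = mtcars)          # panel 1: scatterplot
hist(mtcars$mpg)                       # panel 2: histogram
boxplot(mpg ~ cyl, data = mtcars)      # panel 3: boxplots by cylinder count
curve(sin, from = 0, to = 2 * pi)      # panel 4: sine curve

panel_layout <- par("mfrow")           # the layout currently in effect

par(old_par)                           # restore the previous layout
dev.off()
```

Saving the value returned by par() and restoring it afterwards keeps later plots from inheriting the grid.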

https://www.bigbookofr.com/

High-level graphics functions initiate a new plot

curve
svg("curve_sin.svg", width = 11, pointsize = 12, family = "sans")
curve(sin, from = 0, to = 2 * pi, n = 101)
abline(h = 0)
dev.off()

sin curve

https://www.kaggle.com/learn/intro-to-machine-learning

https://r4ds.had.co.nz/

https://r4ds.hadley.nz/

https://datacarpentry.github.io/R-ecology-lesson/

https://www.stat.cmu.edu/~ryantibs/statcomp/lectures/apply.html

from data to viz

R Graph Gallery

R

models

Model-fitting functions such as lm() take a formula as their first argument, a data frame as their second, subset as their third, weights as their fourth, na.action,…
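A short sketch of that argument order, using lm() on the built-in mtcars data. The columns chosen, and in particular the use of hp as weights, are arbitrary and serve only to show where each argument goes.

```r
# mtcars is a built-in data frame; the choice of columns is purely illustrative
fit <- lm(
  mpg ~ wt,                # formula: response ~ predictor
  data = mtcars,           # data frame holding the variables
  subset = cyl == 4,       # keep only four-cylinder cars
  weights = hp,            # arbitrary weights, for illustration only
  na.action = na.omit      # drop rows with missing values
)

n_used <- nobs(fit)        # number of observations actually used in the fit
coef(fit)
```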

formula

~ operator

formula = y ~ x1 + x2 + x3

formula = y ~ .
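The two formula styles above can be compared on the built-in mtcars data (an illustrative stand-in, with mpg playing the role of y). The dot expands to every column of the data frame other than the response.

```r
# Explicit list of predictors
fit_some <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# The dot stands for every column of `data` other than the response
fit_all <- lm(mpg ~ ., data = mtcars)

names(coef(fit_some))    # intercept plus the three named predictors
length(coef(fit_all))    # intercept plus one coefficient per remaining column
```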

ggplot2

Book

dataframes

Accessing single columns

  1. Using the dollar sign ($) and the name of the column: train$Survived
  2. Using square brackets with the index of the column after the comma: train[,2]

Accessing groups of columns

  1. train[,c("Sex", "Survived", "Age")]
  2. train[,c(5, 2, 6)]
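The same access patterns can be tried without downloading anything by substituting the built-in iris data frame for train (an assumption made here so the sketch is self-contained).

```r
# Single column, by name and by position (Species is the 5th column of iris)
by_name  <- iris$Species
by_index <- iris[, 5]

# Groups of columns, by name and by position
cols_by_name  <- iris[, c("Species", "Sepal.Length", "Petal.Length")]
cols_by_index <- iris[, c(5, 1, 3)]

identical(by_name, by_index)            # same vector either way
identical(cols_by_name, cols_by_index)  # same data frame either way
```

Selecting a single column with square brackets returns a plain vector; adding drop = FALSE, as in iris[, 5, drop = FALSE], keeps it as a one-column data frame.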

dplyr

basics

  1. Rows: filter(), arrange(), distinct(), count()
  2. Columns: mutate(), select(), rename(), relocate()
  3. Groups: group_by(), summarize()

Predictors

predict

Which predictors are associated with the response?

hatvalues

What is the relationship between the response and each predictor?

Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
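One way to start on these questions is sketched below with a linear model on the built-in mtcars data (an illustrative stand-in; the new_cars values are made up). It uses the predict and hatvalues functions named above.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Which predictors are associated with the response? Inspect the p-values.
coef_table <- summary(fit)$coefficients

# predict(): fitted response for new observations, with a confidence interval
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))  # hypothetical cars
pred <- predict(fit, newdata = new_cars, interval = "confidence")

# hatvalues(): leverage of each observation; the values sum to the
# number of estimated coefficients
lev <- hatvalues(fit)

# A common rule of thumb (not a strict cutoff) flags leverage above twice the mean
high_leverage <- names(lev)[lev > 2 * mean(lev)]
```

Whether a straight line is adequate is usually judged from residual plots, for example plot(fit, which = 1).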

Classifiers

Models

model: an R object, typically returned by ‘lm’ or ‘glm’.

Assessing Model Accuracy

A good classifier is one for which the test error rate (equation 2.9 in An Introduction to Statistical Learning) is smallest.
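The test error rate is the proportion of held-out observations that the classifier gets wrong. A minimal sketch on the built-in iris data follows; the 1.6 cm threshold is a hand-picked, hypothetical rule rather than a fitted model.

```r
set.seed(1)                                   # reproducible split
test_idx <- sample(nrow(iris), 50)            # hold out 50 rows as a test set
test_set <- iris[test_idx, ]

# Hypothetical rule: call a flower "virginica" when its petal is wider than 1.6 cm
predicted <- test_set$Petal.Width > 1.6
actual    <- test_set$Species == "virginica"

# Test error rate: proportion of test observations that are misclassified
test_error <- mean(predicted != actual)
```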

Regression Versus Classification Problems

John Tukey

Tukey introduced the box plot in his 1977 book, “Exploratory Data Analysis”.

He also introduced the word bit as a portmanteau of binary digit and coined the term software.

List Of JavaScript libraries

Python

The Python Tutorial

NumPy quickstart

auto

Auto = read.csv("Auto.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)

advertising

p 121

This is the first linear regression example in An Introduction to Statistical Learning, covered in Chapters 2 and 3.

Linear Regression vs K-Nearest Neighbors

7 Questions

  1. Is there a relationship between sales and advertising budget?
  2. How strong is the relationship?
  3. Which media are associated with sales?
  4. How large is the association between each medium and sales?
  5. How accurately can we predict future sales?
  6. Is the relationship linear?
  7. Is there synergy among the advertising media?
Advertising = read.csv("Advertising.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
head(Advertising)
  X    TV radio newspaper sales
1 1 230.1  37.8      69.2  22.1
2 2  44.5  39.3      45.1  10.4
3 3  17.2  45.9      69.3   9.3
4 4 151.5  41.3      58.5  18.5
5 5 180.8  10.8      58.4  12.9
6 6   8.7  48.9      75.0   7.2
Advertising = read.csv("Advertising.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
svg("advertising1.svg", width = 11, pointsize = 12, family = "sans")
# Sales against TV budget with its least-squares line; the axes are shared by all three media
plot(sales ~ TV, data = Advertising, col = 2, xlab = "$1000", ylab = "Units Sold")
abline(lm(formula = sales ~ TV, data = Advertising), col = 2)
# Each text() call labels a regression line with its intercept
text(x = 10, y = 7.032594, "7.032594")
# Overlay radio
points(sales ~ radio, data = Advertising, col = 3)
abline(lm(formula = sales ~ radio, data = Advertising), col = 3)
text(x = 10, y = 9.31164, "9.31164")
# Overlay newspaper
points(sales ~ newspaper, data = Advertising, col = 4)
abline(lm(formula = sales ~ newspaper, data = Advertising), col = 4)
text(x = 10, y = 12.35141, "12.35141")
legend("bottomright", legend = c("TV", "Radio", "Newspaper"), fill = c(2,3,4))
dev.off()
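The intercept labels in the plot code are typed in by hand. As a hedged sketch, they could instead be read from the fitted models. To stay self-contained, the example below uses only the six rows shown in the head() output, so its intercepts will differ from the full-data values; adv_head, media, and intercepts are names introduced here for illustration.

```r
# First six rows of Advertising, copied from the head() output above
adv_head <- data.frame(
  TV        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9),
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0),
  sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)

# Fit one simple regression per medium and collect the intercepts
media <- c("TV", "radio", "newspaper")
intercepts <- sapply(media, function(m) {
  fit <- lm(reformulate(m, response = "sales"), data = adv_head)
  unname(coef(fit)["(Intercept)"])
})

# Labels that could be passed to text() in place of hard-coded strings
labels <- format(intercepts, digits = 7)
```

With the full data frame in place of adv_head, intercepts["TV"] could supply both the y position and the label for the first text() call.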

linear relations

iris

Iris Flower Classification

iris is one of the datasets built into R (i.e. it can be accessed without loading a package) and is used by Learn R through examples, among others, to teach the basics.

Many other datasets are built in; the data() function provides the full list.
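Called interactively, data() opens a listing. The sketch below shows one way to get the same names as a character vector, restricted to the datasets package that ships with R.

```r
# data() returns an object whose `results` matrix has one row per dataset
available <- data(package = "datasets")$results
dataset_names <- available[, "Item"]

head(dataset_names)
"iris" %in% dataset_names
```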

Scatterplot

iris$speciesID <- as.numeric(iris$Species) + 1
iris$shape <- ifelse(iris$Sepal.Length < 5.15, 17, 20)
svg("petal-width-length1.svg", width = 11, pointsize = 12, family = "sans")
plot(Petal.Length ~ Petal.Width, data = iris, col = speciesID, pch = shape)
legend("topleft", levels(iris$Species), fill = 2:4)
dev.off()

Petal width to length scatterplot

dendrogram

Wikipedia

R Graph Gallery

credit

> Credit = read.csv("Credit.csv", header = T, na.strings = "?", stringsAsFactors = T)
> head(Credit)
   Income Limit Rating Cards Age Education Own Student Married Region Balance
1  14.891  3606    283     2  34        11  No      No     Yes  South     333
2 106.025  6645    483     3  82        15 Yes     Yes     Yes   West     903
3 104.593  7075    514     4  71        11  No      No      No   West     580
4 148.924  9504    681     3  36        11 Yes      No      No   West     964
5  55.882  4897    357     2  68        16  No      No     Yes  South     331
6  80.180  8047    569     4  77        10  No      No      No  South    1151

To compute correlations involving the qualitative columns “Own”, “Student”, “Married”, and “Region”, these must first be converted to numeric dummy (indicator) columns.
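One possible way to build the dummy columns is model.matrix(), sketched here on the six rows shown in the head(Credit) output so that no file is needed. The name credit_head is introduced for illustration.

```r
# Qualitative columns and Balance from the first six rows of Credit
credit_head <- data.frame(
  Own     = c("No", "Yes", "No", "Yes", "No", "No"),
  Student = c("No", "Yes", "No", "No", "No", "No"),
  Married = c("Yes", "Yes", "No", "No", "Yes", "No"),
  Region  = c("South", "West", "West", "West", "South", "South"),
  Balance = c(333, 903, 580, 964, 331, 1151),
  stringsAsFactors = TRUE
)

# model.matrix() expands each factor into 0/1 indicator columns;
# the first column is the intercept, which is dropped here
dummies <- model.matrix(~ Own + Student + Married + Region, data = credit_head)[, -1]

# Correlation of Balance with each dummy column
cor(credit_head$Balance, dummies)
```

A factor with more than two levels yields one indicator column per non-baseline level, so on the full data Region may expand to more than one column.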

house-prices

This is another learning example offered by Kaggle.

The objective is to predict SalePrice, which is provided in train.csv but missing from test.csv. The submission should have just two columns, Id and SalePrice.

train <- read.csv("train.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
ncol(train)

There are 81 columns, an overwhelming number.

Qualitative Summary

summary(train$MSZoning)
C (all)      FV      RH      RL      RM 
     10      65      16    1151     218 

summary(train$Id)

names(train)
 [1] "Id" "MSSubClass" "MSZoning" "LotFrontage"
 [5] "LotArea" "Street" "Alley" "LotShape"
 [9] "LandContour" "Utilities" "LotConfig" "LandSlope"
[13] "Neighborhood" "Condition1" "Condition2" "BldgType"
[17] "HouseStyle" "OverallQual" "OverallCond" "YearBuilt"
[21] "YearRemodAdd" "RoofStyle" "RoofMatl" "Exterior1st"
[25] "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
[29] "ExterCond" "Foundation" "BsmtQual" "BsmtCond"
[33] "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
[37] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating"
[41] "HeatingQC" "CentralAir" "Electrical" "X1stFlrSF"
[45] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
[49] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
[53] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
[57] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt"
[61] "GarageFinish" "GarageCars" "GarageArea" "GarageQual"
[65] "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF"
[69] "EnclosedPorch" "X3SsnPorch" "ScreenPorch" "PoolArea"
[73] "PoolQC" "Fence" "MiscFeature" "MiscVal"
[77] "MoSold" "YrSold" "SaleType" "SaleCondition"
[81] "SalePrice"

titanic

Introductory Example

Kaggle provides a training set (which includes the result, Survived, along with various predictors), a test set which doesn’t include Survived, and a simple example gender_submission.csv to show how results should be submitted.

Writing submission file

The model used for gender_submission.csv is that all female passengers survived and all male passengers did not.

test <- read.csv("test.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
test$Survived <- as.integer(test$Sex == "female")
write.csv(test[,c("PassengerId","Survived")], "submission.csv", quote = FALSE, row.names = FALSE)

Checking accuracy of “all females lived, all males died” assumption.

train <- read.csv("train.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
train$Prediction <- as.integer(train$Sex == "female")
(sum(train$Survived == train$Prediction)/nrow(train)) * 100

On the training data, this assumption gives 78.67565% accuracy.
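Beyond a single accuracy figure, a cross-tabulation shows which kind of mistake the rule makes. The sketch below uses five invented passengers (not Kaggle data) so that it runs without train.csv. The same two lines would apply to the real train data frame.

```r
# Hypothetical passengers, invented for illustration (not Kaggle data)
toy <- data.frame(
  Sex      = c("female", "male", "female", "male", "male"),
  Survived = c(1, 0, 0, 0, 1)
)
toy$Prediction <- as.integer(toy$Sex == "female")

# mean() of a logical vector is the proportion of TRUE values
accuracy <- mean(toy$Survived == toy$Prediction) * 100

# Cross-tabulate predictions against outcomes to see where the rule fails
confusion <- table(Predicted = toy$Prediction, Actual = toy$Survived)
```

The diagonal of confusion counts the correct predictions, and the off-diagonal cells separate wrongly predicted survivors from wrongly predicted deaths.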

wage

Ordering boxplots

Ordered By Marital Status

Base plot

Wage = read.csv("Wage.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
new_order <- with(Wage, reorder(maritl, wage, median, na.rm = TRUE))
svg("ordered-maritl.svg", width = 11, pointsize = 12, family = "sans")
plot(new_order, Wage$wage)
dev.off()

Ordered boxplot
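The key step above is reorder(), which returns the same factor with its levels sorted by a summary statistic of another variable. A self-contained sketch on the built-in iris data follows (a stand-in for Wage; the pdf(NULL) device is used only so the code runs without a display).

```r
# Medians of Sepal.Width per species, for reference
medians <- tapply(iris$Sepal.Width, iris$Species, median)

# reorder() keeps the values but sorts the levels by the summary statistic
by_median <- reorder(iris$Species, iris$Sepal.Width, median)

levels(iris$Species)   # original (alphabetical) order
levels(by_median)      # order of increasing median

pdf(NULL)                                   # throwaway device (assumption)
boxplot(iris$Sepal.Width ~ by_median)       # boxes appear in the new order
dev.off()
```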

ggplot2

library(ggplot2)
Wage = read.csv("Wage.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
new_order <- with(Wage, reorder(maritl, wage, median, na.rm = TRUE))
svg("ordered-maritl-ggplot2.svg", width = 11, pointsize = 12, family = "sans")
ggplot(data = Wage, mapping = aes(x = new_order, y = wage)) + 
  geom_boxplot(fill = c(2, 3, 7, 4, 8), alpha = 0.2) +
  xlab("Marital Status") +
  ylab("Wage")
dev.off()

Ordered boxplot

Finding variable of interest

x must be a numeric vector

d3

Boxplot

geojson

The media type for GeoJSON is “application/geo+json”.

RFC 7946

Using GeoJSON with Leaflet

{
    "type": "Feature",
    "properties": {
        "name": "Coors Field",
        "amenity": "Baseball Stadium",
        "popupContent": "This is where the Rockies play!"
    },
    "geometry": {
        "type": "Point",
        "coordinates": [-104.99404, 39.75621]
    }
}
Geometry: a region of space
Feature: a spatially bounded entity
FeatureCollection: a list of Features
{
       "type": "FeatureCollection",
       "features": [{
           "type": "Feature",
           "geometry": {
               "type": "Point",
               "coordinates": [102.0, 0.5]
           },
           "properties": {
               "prop0": "value0"
           }
       }, {
           "type": "Feature",
           "geometry": {
               "type": "LineString",
               "coordinates": [
                   [102.0, 0.0],
                   [103.0, 1.0],
                   [104.0, 0.0],
                   [105.0, 1.0]
               ]
           },
           "properties": {
               "prop0": "value0",
               "prop1": 0.0
           }
       }, {
           "type": "Feature",
           "geometry": {
               "type": "Polygon",
               "coordinates": [
                   [
                       [100.0, 0.0],
                       [101.0, 0.0],
                       [101.0, 1.0],
                       [100.0, 1.0],
                       [100.0, 0.0]
                   ]
               ]
           },
           "properties": {
               "prop0": "value0",
               "prop1": {
                   "this": "that"
               }
           }
       }]
   }
   {
       "type": "MultiLineString",
       "coordinates": [
           [
               [170.0, 45.0], [180.0, 45.0]
           ], [
               [-180.0, 45.0], [-170.0, 45.0]
           ]
       ]
   }
   {
       "type": "Feature",
       "bbox": [-10.0, -10.0, 10.0, 10.0],
       "geometry": {
           "type": "Polygon",
           "coordinates": [
               [
                   [-10.0, -10.0],
                   [10.0, -10.0],
                   [10.0, 10.0],
                   [-10.0, -10.0]
               ]
           ]
       }
       //...
   }
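Two properties of the last example can be checked by hand: the polygon's ring is closed (its first and last positions match), and the bbox lists the minimum and maximum of the coordinates. A minimal base R sketch follows, with the ring typed in as a matrix rather than parsed from JSON (parsing would need an add-on package).

```r
# Exterior ring of the polygon above; each row is one position (longitude, latitude)
ring <- rbind(
  c(-10, -10),
  c( 10, -10),
  c( 10,  10),
  c(-10, -10)
)

# A linear ring needs four or more positions, with the first and last identical
is_closed  <- all(ring[1, ] == ring[nrow(ring), ])
has_enough <- nrow(ring) >= 4

# bbox order: west, south, east, north
bbox <- c(min(ring[, 1]), min(ring[, 2]), max(ring[, 1]), max(ring[, 2]))
```

The computed bbox matches the "bbox" member given in the Feature above.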