Poop Sheet

Advertising

This is the first linear regression example from Statistical Learning, covered in Chapters 2 and 3.

# Read the data, treating "?" as missing and text columns as factors
Advertising = read.csv("Advertising.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
head(Advertising)
  X    TV radio newspaper sales
1 1 230.1  37.8      69.2  22.1
2 2  44.5  39.3      45.1  10.4
3 3  17.2  45.9      69.3   9.3
4 4 151.5  41.3      58.5  18.5
5 5 180.8  10.8      58.4  12.9
6 6   8.7  48.9      75.0   7.2
# Scatter plot of sales against each budget, with one simple regression line per medium
Advertising = read.csv("Advertising.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
svg("advertising1.svg", width = 11, pointsize = 12, family = "sans")
# TV (col = 2): points, fitted line, and the intercept labelled near the y-axis
plot(sales ~ TV, data = Advertising, col = 2, xlab = "Advertising budget ($1000s)", ylab = "Units Sold")
abline(lm(formula = sales ~ TV, data = Advertising), col = 2)
text(x = 10, y = 7.032594, "7.032594")
# Radio (col = 3)
points(sales ~ radio, data = Advertising, col = 3)
abline(lm(formula = sales ~ radio, data = Advertising), col = 3)
text(x = 10, y = 9.31164, "9.31164")
# Newspaper (col = 4)
points(sales ~ newspaper, data = Advertising, col = 4)
abline(lm(formula = sales ~ newspaper, data = Advertising), col = 4)
text(x = 10, y = 12.35141, "12.35141")
legend("bottomright", legend = c("TV", "Radio", "Newspaper"), fill = c(2,3,4))
dev.off()

linear relations
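
The intercept labels in the plot code are typed in by hand. A minimal sketch of reading them from the fitted model with coef() instead, assuming only the six rows printed by head(Advertising) (the name first_six is made up for this example, and the coefficients it prints will differ from the full 200-row fit):

```r
# Only the six rows printed by head(Advertising); not the full data set
first_six <- data.frame(
  TV    = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  sales = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)

fit <- lm(sales ~ TV, data = first_six)

# coef() returns a named vector with entries "(Intercept)" and "TV"
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["TV"]]

# The label text() would draw, formatted to 7 significant digits
label <- format(intercept, digits = 7)
print(c(intercept = intercept, slope = slope))
```

With the full data, text(x = 10, y = intercept, label) could stand in for the hard-coded text(x = 10, y = 7.032594, "7.032594").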

Multiple Linear Regression

Advertising = read.csv("Advertising.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
model = lm(sales ~ TV + radio + newspaper, data = Advertising)
summary(model)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
radio        0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86 
# Same scatter plot, but each line uses the multiple-regression intercept and that
# medium's partial slope, i.e. the fitted sales with the other two budgets set to zero
Advertising = read.csv("Advertising.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
svg("advertising2.svg", width = 11, pointsize = 12, family = "sans")
plot(sales ~ TV, data = Advertising, col = 2, xlab = "Advertising budget ($1000s)", ylab = "Units Sold")
points(sales ~ radio, data = Advertising, col = 3)
points(sales ~ newspaper, data = Advertising, col = 4)
abline(a = 2.938889, b = 0.045765, col = 2)   # TV
abline(a = 2.938889, b = 0.188530, col = 3)   # radio
abline(a = 2.938889, b = -0.001037, col = 4)  # newspaper
legend("bottomright", legend = c("TV", "Radio", "Newspaper"), fill = c(2,3,4))
dev.off()

multiple linear relations
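
A line drawn with the shared intercept and a single partial slope is the model's prediction when the other two budgets are zero. A small sketch checking that against predict(), assuming only the six rows printed by head(Advertising) (first_six and tv_grid are made-up names, and with six rows the coefficients will not match the summary output):

```r
# Only the six rows printed by head(Advertising); not the full data set
first_six <- data.frame(
  TV        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9),
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0),
  sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)

model <- lm(sales ~ TV + radio + newspaper, data = first_six)
b <- coef(model)

# Points on the "TV" line: intercept + TV slope * budget
tv_grid <- c(0, 100, 200)
by_hand <- b[["(Intercept)"]] + b[["TV"]] * tv_grid

# The same points from predict(), with radio and newspaper held at zero
by_predict <- predict(model,
                      newdata = data.frame(TV = tv_grid, radio = 0, newspaper = 0))

print(all.equal(unname(by_predict), by_hand))
```

Holding the other budgets at their means rather than zero would shift each line up or down without changing its slope.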

Simple regression of sales on TV alone, as fitted for the first plot:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
TV          0.047537   0.002691   17.67   <2e-16 ***

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119,	Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

Simple regression of sales on radio alone:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.31164    0.56290  16.542   <2e-16 ***
radio        0.20250    0.02041   9.921   <2e-16 ***

Residual standard error: 4.275 on 198 degrees of freedom
Multiple R-squared:  0.332,	Adjusted R-squared:  0.3287 
F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16

Simple regression of sales on newspaper alone:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
newspaper    0.05469    0.01658    3.30  0.00115 ** 

Residual standard error: 5.092 on 198 degrees of freedom
Multiple R-squared:  0.05212,	Adjusted R-squared:  0.04733 
F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148

Correlation Matrix

> cor(Advertising$sales, Advertising$TV)
[1] 0.7822244
> cor(Advertising$sales, Advertising$radio)
[1] 0.5762226
> cor(Advertising$sales, Advertising$newspaper)
[1] 0.228299
> cor(Advertising$TV, Advertising$radio)
[1] 0.05480866
> cor(Advertising$TV, Advertising$newspaper)
[1] 0.05664787
> cor(Advertising$radio, Advertising$newspaper)
[1] 0.3541038
> cor(Advertising)
                    X         TV       radio   newspaper       sales
X          1.00000000 0.01771469 -0.11068044 -0.15494414 -0.05161625
TV         0.01771469 1.00000000  0.05480866  0.05664787  0.78222442
radio     -0.11068044 0.05480866  1.00000000  0.35410375  0.57622257
newspaper -0.15494414 0.05664787  0.35410375  1.00000000  0.22829903
sales     -0.05161625 0.78222442  0.57622257  0.22829903  1.00000000

To get rid of the index column:

> cor(Advertising[,c("sales", "TV", "radio", "newspaper")])
              sales         TV      radio  newspaper
sales     1.0000000 0.78222442 0.57622257 0.22829903
TV        0.7822244 1.00000000 0.05480866 0.05664787
radio     0.5762226 0.05480866 1.00000000 0.35410375
newspaper 0.2282990 0.05664787 0.35410375 1.00000000
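
For a least squares fit with a single predictor, Multiple R-squared is the square of the correlation between predictor and response, which ties the correlations here to the three simple regression summaries. A quick arithmetic check using the numbers quoted in these notes, followed by the same identity on a fitted model that assumes only the six head(Advertising) rows (first_six is a made-up name):

```r
# Correlations with sales, copied from the cor() output
r <- c(TV = 0.7822244, radio = 0.5762226, newspaper = 0.228299)

# Multiple R-squared, copied from the three simple regression summaries
r_squared <- c(TV = 0.6119, radio = 0.332, newspaper = 0.05212)

# Squared correlations agree with the reported R-squared up to rounding
print(round(r^2, 4))
print(abs(r^2 - r_squared) < 1e-4)

# Same identity on a fitted model; six rows only, so the value itself differs
first_six <- data.frame(
  TV    = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  sales = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)
fit <- lm(sales ~ TV, data = first_six)
same <- all.equal(summary(fit)$r.squared,
                  cor(first_six$sales, first_six$TV)^2)
print(same)
```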

The correlation matrix can be visualised with heatmap:

Advertising = read.csv("Advertising.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
svg("heatmap1.svg", width = 11, pointsize = 12, family = "sans")
heatmap(cor(Advertising[,c("sales", "TV", "radio", "newspaper")]))
dev.off()

Heatmap
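
By default heatmap() reorders rows and columns by hierarchical clustering and, unless symm = TRUE, scales each row, so the colours no longer map directly onto the correlations. A sketch that keeps the original order and the raw values, assuming the correlation matrix is typed in from the output above (cm and vars are made-up names) and drawing to a null PDF device so that no file is written:

```r
vars <- c("sales", "TV", "radio", "newspaper")

# Correlation matrix typed in from the cor() output, not recomputed
cm <- matrix(
  c(1.00000000, 0.78222442, 0.57622257, 0.22829903,
    0.78222442, 1.00000000, 0.05480866, 0.05664787,
    0.57622257, 0.05480866, 1.00000000, 0.35410375,
    0.22829903, 0.05664787, 0.35410375, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

pdf(NULL)  # null device; open svg() as in the notes to write a file instead
# Rowv = NA and Colv = NA switch off reordering; scale = "none" keeps raw values
hm <- heatmap(cm, Rowv = NA, Colv = NA, scale = "none", symm = TRUE)
dev.off()

# heatmap() returns the row and column order it used
print(hm$rowInd)
```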

Measuring the Quality of Fit

mean squared error (MSE)
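
The training MSE of a fit is the mean of the squared differences between the observed and fitted values. A minimal sketch, assuming only the six rows printed by head(Advertising) (first_six is a made-up name, so the numbers are not the full-data MSE), with the residual standard error from summary() computed alongside for comparison:

```r
# Only the six rows printed by head(Advertising); not the full data set
first_six <- data.frame(
  TV    = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  sales = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)

fit <- lm(sales ~ TV, data = first_six)

# Training MSE: average squared gap between observed and fitted sales
y_hat <- fitted(fit)
mse <- mean((first_six$sales - y_hat)^2)

# Residual standard error: sqrt(RSS / residual degrees of freedom)
rse <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

print(c(mse = mse, rse = rse))
```

The MSE divides the residual sum of squares by n, while the squared residual standard error divides it by the residual degrees of freedom (n - 2 here), so the latter is always the larger of the two.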

K-nearest neighbours

Chapters 2 and 4

Parametric Methods

Chapter 6

Non-Parametric Methods

Chapters 5 and 7

Subset Selection

Lasso

Chapter 6

Least Squares

Chapter 3

Generalized Additive Models

Chapter 7

Trees

Bagging, Boosting

Chapter 8

Support Vector Machines

Chapter 9

Deep Learning

Chapter 10

Jargon

Excluding the index column

An index column can be identified with table, e.g. table(Advertising$X), which shows that every entry occurs exactly once. Moving it into the row names and then dropping the column keeps it out of calculations such as cor.

Advertising = read.csv("Advertising.csv", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
rownames(Advertising) <- Advertising[,c("X")]
Advertising$X <- NULL  # setting row names alone leaves X in the data frame
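
A self-contained sketch of the same idea, assuming a six-row CSV typed in from the head(Advertising) output (csv_text, with_index and no_index are made-up names, and the header row is an assumption about the file). It checks for an index column and then uses the row.names argument of read.csv, which takes the column number to use as row names, so the index never becomes a data column:

```r
# Six rows typed in from head(Advertising); the header row is assumed
csv_text <- "X,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9
6,8.7,48.9,75.0,7.2"

with_index <- read.csv(text = csv_text, header = TRUE)

# Every value occurs once AND the values are simply 1..n
is_index <- all(table(with_index$X) == 1) &&
  identical(with_index$X, seq_len(nrow(with_index)))

# row.names = 1: use the first column as row names instead of data
no_index <- read.csv(text = csv_text, header = TRUE, row.names = 1)

print(is_index)
print(names(no_index))
```

Uniqueness on its own is not enough to call a column an index, since a continuous measurement can also have all-distinct values; the comparison with seq_len() is the stricter check.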