Poop Sheet

Advertising

p 121

This is the first linear regression example from Statistical Learning, covered in Chapters 2 and 3.

Linear Regression vs K-Nearest Neighbors

7 Questions

  1. Is there a relationship between sales and advertising budget?
  2. How strong is the relationship?
  3. Which media are associated with sales?
  4. How large is the association between each medium and sales?
  5. How accurately can we predict future sales?
  6. Is the relationship linear?
  7. Is there synergy among the advertising media?
Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
head(Advertising)
  X    TV radio newspaper sales
1 1 230.1  37.8      69.2  22.1
2 2  44.5  39.3      45.1  10.4
3 3  17.2  45.9      69.3   9.3
4 4 151.5  41.3      58.5  18.5
5 5 180.8  10.8      58.4  12.9
6 6   8.7  48.9      75.0   7.2
Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("advertising1.svg", width = 11, pointsize = 12, family = "sans")
plot(sales ~ TV, data = Advertising, col = 2, xlab = "$1000", ylab = "Units Sold")
abline(lm(formula = sales ~ TV, data = Advertising), col = 2)
text(x = 10, y = 7.032594, "7.032594")
points(sales ~ radio, data = Advertising, col = 3)
abline(lm(formula = sales ~ radio, data = Advertising), col = 3)
text(x = 10, y = 9.31164, "9.31164")
points(sales ~ newspaper, data = Advertising, col = 4)
abline(lm(formula = sales ~ newspaper, data = Advertising), col = 4)
text(x = 10, y = 12.35141, "12.35141")
legend("bottomright", legend = c("TV", "Radio", "Newspaper"), fill = c(2,3,4))
dev.off()

linear relations

Multiple Linear Regression

Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
model = lm(sales ~ TV + radio + newspaper, data = Advertising)
summary(model)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
radio        0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86 

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972,	Adjusted R-squared:  0.8956 
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
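
The coefficient table defines the fitted equation sales = 2.938889 + 0.045765 TV + 0.188530 radio - 0.001037 newspaper. As a rough sketch, the snippet below evaluates that equation by hand for the first row shown by head(Advertising). The variable names (coefs, first_row and so on) are illustrative and not part of the original notes.

```r
# Coefficients copied from the summary(model) output above.
coefs <- c(intercept = 2.938889, TV = 0.045765, radio = 0.188530, newspaper = -0.001037)

# First row of head(Advertising): budgets in $1000s and observed sales.
first_row <- c(TV = 230.1, radio = 37.8, newspaper = 69.2)
observed_sales <- 22.1

# Fitted value = intercept + sum of (slope * budget) over the three media.
fitted_sales <- unname(coefs["intercept"] + sum(coefs[names(first_row)] * first_row))
residual <- observed_sales - fitted_sales

print(round(c(fitted = fitted_sales, residual = residual), 3))
```

With the model object in memory, predict(model, newdata = Advertising[1, ]) should give the same fitted value up to rounding of the printed coefficients.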

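The F-statistic tests the null hypothesis that all slope coefficients are zero. The larger the F-statistic and the smaller its p-value, the stronger the evidence that at least one predictor is related to the response. Here F = 570.3 on 3 and 196 degrees of freedom with p < 2.2e-16, so there is clear evidence of a relationship between advertising and sales.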
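
The F-statistic and the adjusted R-squared in the summary can both be recovered from the multiple R-squared, the number of observations and the number of predictors. The sketch below redoes that arithmetic from the printed values; because R-squared is rounded to four digits, the result matches the reported 570.3 only approximately. Variable names are illustrative.

```r
# Values copied from the summary(model) output above.
r_squared <- 0.8972
n <- 200   # observations: 196 residual df + 3 predictors + 1 intercept
p <- 3     # predictors: TV, radio, newspaper

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))

# Upper-tail probability of an F(p, n - p - 1) distribution.
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

# Adjusted R^2 penalises the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(c(F = f_stat, adj_R2 = adj_r_squared))
print(p_value)
```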

Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("advertising2.svg", width = 11, pointsize = 12, family = "sans")
plot(sales ~ TV, data = Advertising, col = 2, xlab = "$1000", ylab = "Units Sold")
points(sales ~ radio, data = Advertising, col = 3)
points(sales ~ newspaper, data = Advertising, col = 4)
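# Each line uses the multiple-regression intercept and one slope, so it shows
# the fitted effect of that medium with the other two budgets held at zero.
abline(a = 2.938889, b = 0.045765, col = 2)
abline(a = 2.938889, b = 0.188530, col = 3)
abline(a = 2.938889, b = -0.001037, col = 4)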
legend("bottomright", legend = c("TV", "Radio", "Newspaper"), fill = c(2,3,4))
dev.off()

multiple linear relations

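Simple linear regressions of sales on each medium separately, for comparison with the multiple regression above:

sales ~ TV

            Estimate Std. Error t value Pr(>|t|)    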
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
TV          0.047537   0.002691   17.67   <2e-16 ***

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119,	Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

sales ~ radio

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.31164    0.56290  16.542   <2e-16 ***
radio        0.20250    0.02041   9.921   <2e-16 ***

Residual standard error: 4.275 on 198 degrees of freedom
Multiple R-squared:  0.332,	Adjusted R-squared:  0.3287 
F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16

sales ~ newspaper

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
newspaper    0.05469    0.01658    3.30  0.00115 ** 

Residual standard error: 5.092 on 198 degrees of freedom
Multiple R-squared:  0.05212,	Adjusted R-squared:  0.04733 
F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148
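
With a single predictor, the multiple R-squared is the squared correlation between the response and that predictor. As a quick consistency check, the sketch below squares the correlations of sales with each medium (the values reported by cor() in the Correlation Matrix section of these notes) and compares them with the R-squared values printed above. Variable names are illustrative.

```r
# Correlations of sales with each medium, as reported by cor() in these notes.
r <- c(TV = 0.7822244, radio = 0.5762226, newspaper = 0.228299)

# Multiple R-squared from the three simple regressions above.
r_squared_reported <- c(TV = 0.6119, radio = 0.332, newspaper = 0.05212)

# With one predictor, R^2 is the squared correlation.
r_squared_from_cor <- r^2

print(round(cbind(from_cor = r_squared_from_cor, reported = r_squared_reported), 4))
```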

Correlation Matrix

> cor(Advertising$sales, Advertising$TV)
[1] 0.7822244
> cor(Advertising$sales, Advertising$radio)
[1] 0.5762226
> cor(Advertising$sales, Advertising$newspaper)
[1] 0.228299
> cor(Advertising$TV, Advertising$radio)
[1] 0.05480866
> cor(Advertising$TV, Advertising$newspaper)
[1] 0.05664787
> cor(Advertising$radio, Advertising$newspaper)
[1] 0.3541038
> cor(Advertising)
                    X         TV       radio   newspaper       sales
X          1.00000000 0.01771469 -0.11068044 -0.15494414 -0.05161625
TV         0.01771469 1.00000000  0.05480866  0.05664787  0.78222442
radio     -0.11068044 0.05480866  1.00000000  0.35410375  0.57622257
newspaper -0.15494414 0.05664787  0.35410375  1.00000000  0.22829903
sales     -0.05161625 0.78222442  0.57622257  0.22829903  1.00000000

To get rid of the index column:

> cor(Advertising[,c("sales", "TV", "radio", "newspaper")])
              sales         TV      radio  newspaper
sales     1.0000000 0.78222442 0.57622257 0.22829903
TV        0.7822244 1.00000000 0.05480866 0.05664787
radio     0.5762226 0.05480866 1.00000000 0.35410375
newspaper 0.2282990 0.05664787 0.35410375 1.00000000

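Radio and newspaper budgets are correlated (0.35), so newspaper advertising acts as a surrogate for radio advertising: in the simple regression newspaper gets “credit” for the association between radio and sales (p = 0.00115), but once TV and radio are in the model its coefficient is no longer significant (p = 0.86).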
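
One way to see the surrogate effect from the correlation matrix alone is the first-order partial correlation between sales and newspaper after controlling for radio. This is only a sketch: it controls for radio but not TV, so it is not identical to what the multiple regression does, and the variable names are illustrative.

```r
# Pairwise correlations copied from the matrix above.
r_sales_news  <- 0.22829903
r_sales_radio <- 0.57622257
r_radio_news  <- 0.35410375

# First-order partial correlation of sales and newspaper, controlling for radio.
partial <- (r_sales_news - r_sales_radio * r_radio_news) /
  sqrt((1 - r_sales_radio^2) * (1 - r_radio_news^2))

print(round(partial, 3))
```

The result is close to zero (about 0.03, against a raw correlation of 0.23), which is consistent with newspaper carrying little information about sales beyond what radio already explains.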

The correlation matrix can be visualised with heatmap:

Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("heatmap1.svg", width = 11, height = 12, pointsize = 12, family = "sans")
# symm = TRUE treats the matrix as symmetric and switches off the default row scaling
heatmap(cor(Advertising[,c("sales", "TV", "radio", "newspaper")]), symm = TRUE)
dev.off()

Heatmap

Measuring the Quality of a fit

mean squared error (MSE)
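
MSE is the average of the squared differences between observed and predicted values. The sketch below defines a small helper (mse is a hypothetical name, not from the notes) and then recovers the training MSE of the three-predictor model from its reported residual standard error. This is a training MSE, which generally understates the error on unseen data.

```r
# Hypothetical helper: mean of squared differences between observed and predicted values.
mse <- function(observed, predicted) {
  mean((observed - predicted)^2)
}

# Toy check: errors of 1, 0 and -2 give (1 + 0 + 4) / 3.
toy_mse <- mse(c(3, 5, 7), c(2, 5, 9))

# Training MSE of the three-predictor model, recovered from its summary:
# RSE = sqrt(RSS / df), so RSS = RSE^2 * df and MSE = RSS / n.
rse <- 1.686
df_residual <- 196
n <- 200
training_mse <- rse^2 * df_residual / n

print(c(toy = toy_mse, training = training_mse))
```

With the model object in memory, mse(Advertising$sales, fitted(model)) should give the same training value up to rounding.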

K-nearest neighbours

Chapters 2 and 4

Parametric Methods

Chapter 6

Non-Parametric Methods

Chapters 5 and 7

Subset Selection

Lasso

Chapter 6

Least Squares

Chapter 3

Generalized Additive Models

Chapter 7

Trees

Bagging, Boosting

Chapter 8

Support Vector Machines

Chapter 9

Deep Learning

Chapter 10

Jargon

Excluding the index column

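Index columns can be identified using table(), e.g. table(Advertising$X), which shows that every entry occurs exactly once. Assigning the column to rownames keeps the labels, but the column itself still has to be dropped (for example with Advertising$X <- NULL) before calling cor() on the whole data frame.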

Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
rownames(Advertising) <- Advertising[,c("X")]
# Setting row names does not remove the column, so drop it explicitly.
Advertising$X <- NULL
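
A minimal sketch of the check-then-drop idea, using only the six rows shown by head(Advertising) earlier so that it runs without the CSV file. The data frame name adv and the is_index flag are illustrative.

```r
# The six rows printed by head(Advertising) earlier in these notes.
adv <- data.frame(
  X = 1:6,
  TV = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  radio = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9),
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0),
  sales = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)

# An index column has every value exactly once and simply counts the rows.
is_index <- all(table(adv$X) == 1) && all(adv$X == seq_len(nrow(adv)))

if (is_index) {
  rownames(adv) <- adv$X  # keep the labels as row names
  adv$X <- NULL           # drop the column so cor(adv) no longer includes it
}

print(names(adv))
```

When reading the file directly, read.csv("Advertising.csv", row.names = 1) is an alternative that uses the first column as row names and leaves it out of the data frame.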