Advertising
p 121
This is the first linear regression example from Statistical Learning, covered in Chapters 2 and 3.
Linear Regression vs K-Nearest Neighbors
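As a rough illustration of the contrast, the sketch below predicts sales at a TV budget of 100 in two ways: from the least-squares line of sales on TV reported later in these notes (intercept 7.032594, slope 0.047537), and by averaging the sales of the k = 3 nearest TV budgets among the six rows printed by head(Advertising). The helper names knn_predict and lm_predict, the query point 100 and k = 3 are assumptions chosen for illustration; with only six rows the numbers are not meaningful estimates.

```r
# Six rows from head(Advertising); not the full data set.
tv    <- c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7)
sales <- c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)

# Hypothetical helper: average the response of the k nearest neighbours of x0.
knn_predict <- function(x0, x, y, k = 3) {
  nearest <- order(abs(x - x0))[seq_len(k)]
  mean(y[nearest])
}

# Parametric prediction from the simple regression of sales on TV.
lm_predict <- function(x0) 7.032594 + 0.047537 * x0

x0 <- 100
knn_pred <- knn_predict(x0, tv, sales, k = 3)
lm_pred  <- lm_predict(x0)
print(c(knn = knn_pred, lm = lm_pred))
```

The linear model commits to a straight-line form and needs only two numbers to predict; K-nearest neighbours assumes no form and relies on nearby observations instead.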
7 Questions
- Is there a relationship between sales and advertising budget?
- How strong is the relationship?
- Which media are associated with sales?
- How large is the association between each medium and sales?
- How accurately can we predict future sales?
- Is the relationship linear?
- Is there synergy among the advertising media?
Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
head(Advertising)
  X    TV radio newspaper sales
1 1 230.1  37.8      69.2  22.1
2 2  44.5  39.3      45.1  10.4
3 3  17.2  45.9      69.3   9.3
4 4 151.5  41.3      58.5  18.5
5 5 180.8  10.8      58.4  12.9
6 6   8.7  48.9      75.0   7.2
Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
# Write the figure to an SVG file.
svg("advertising1.svg", width = 11, pointsize = 12, family = "sans")
# Sales against TV budget, with the least-squares line; the label shows the fitted intercept.
plot(sales ~ TV, data = Advertising, col = 2, xlab = "$1000", ylab = "Units Sold")
abline(lm(formula = sales ~ TV, data = Advertising), col = 2)
text(x = 10, y = 7.032594, "7.032594")
# Overlay radio on the same axes.
points(sales ~ radio, data = Advertising, col = 3)
abline(lm(formula = sales ~ radio, data = Advertising), col = 3)
text(x = 10, y = 9.31164, "9.31164")
# Overlay newspaper on the same axes.
points(sales ~ newspaper, data = Advertising, col = 4)
abline(lm(formula = sales ~ newspaper, data = Advertising), col = 4)
text(x = 10, y = 12.35141, "12.35141")
legend("bottomright", legend = c("TV", "Radio", "Newspaper"), fill = c(2,3,4))
dev.off()
Multiple Linear Regression
Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
model = lm(sales ~ TV + radio + newspaper, data = Advertising)
summary(model)
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
radio        0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86
Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
The higher the F-statistic and the lower its p-value, the stronger the evidence that at least one of the predictors is related to the response.
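The F-statistic in the summary can be reproduced from the reported R-squared as F = (R^2 / p) / ((1 - R^2) / (n - p - 1)). A minimal sketch, assuming n = 200 observations (196 residual degrees of freedom plus 4 estimated coefficients) and p = 3 predictors; because R-squared is printed to four digits, the result only approximately matches the printed 570.3.

```r
r2 <- 0.8972   # Multiple R-squared from summary(model)
n  <- 200      # assumed: 196 residual df + 4 estimated coefficients
p  <- 3        # TV, radio, newspaper

f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail probability of the F distribution with (p, n - p - 1) df.
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(f_stat)
print(p_value)
```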
Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("advertising2.svg", width = 11, pointsize = 12, family = "sans")
plot(sales ~ TV, data = Advertising, col = 2, xlab = "$1000", ylab = "Units Sold")
points(sales ~ radio, data = Advertising, col = 3)
points(sales ~ newspaper, data = Advertising, col = 4)
# Each line uses the multiple-regression intercept and one medium's coefficient,
# i.e. the fitted sales when the other two budgets are zero.
abline(a = 2.938889, b = 0.045765, col = 2)
abline(a = 2.938889, b = 0.188530, col = 3)
abline(a = 2.938889, b = -0.001037, col = 4)
legend("bottomright", legend = c("TV", "Radio", "Newspaper"), fill = c(2,3,4))
dev.off()
For comparison, the simple linear regressions of sales on each medium alone:
> summary(lm(sales ~ TV, data = Advertising))
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
TV          0.047537   0.002691   17.67   <2e-16 ***
Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
> summary(lm(sales ~ radio, data = Advertising))
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.31164    0.56290  16.542   <2e-16 ***
radio        0.20250    0.02041   9.921   <2e-16 ***
Residual standard error: 4.275 on 198 degrees of freedom
Multiple R-squared: 0.332, Adjusted R-squared: 0.3287
F-statistic: 98.42 on 1 and 198 DF, p-value: < 2.2e-16
> summary(lm(sales ~ newspaper, data = Advertising))
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
newspaper    0.05469    0.01658    3.30  0.00115 **
Residual standard error: 5.092 on 198 degrees of freedom
Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148
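Two relationships tie these outputs together: each t value is the estimate divided by its standard error, and in a regression with a single predictor the F-statistic equals the square of the slope's t value. The sketch below checks both against the printed figures; small discrepancies are expected because the printed estimates and standard errors are rounded.

```r
# Slope estimates and standard errors copied from the three outputs above.
est <- c(TV = 0.047537, radio = 0.20250, newspaper = 0.05469)
se  <- c(TV = 0.002691, radio = 0.02041, newspaper = 0.01658)

t_value <- est / se      # t statistic for each slope
f_stat  <- t_value^2     # equals the F-statistic when there is one predictor

print(round(t_value, 2))
print(round(f_stat, 2))
```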
Correlation Matrix
> cor(Advertising$sales, Advertising$TV)
[1] 0.7822244
> cor(Advertising$sales, Advertising$radio)
[1] 0.5762226
> cor(Advertising$sales, Advertising$newspaper)
[1] 0.228299
> cor(Advertising$TV, Advertising$radio)
[1] 0.05480866
> cor(Advertising$TV, Advertising$newspaper)
[1] 0.05664787
> cor(Advertising$radio, Advertising$newspaper)
[1] 0.3541038
> cor(Advertising)
                    X         TV       radio   newspaper       sales
X          1.00000000 0.01771469 -0.11068044 -0.15494414 -0.05161625
TV         0.01771469 1.00000000  0.05480866  0.05664787  0.78222442
radio     -0.11068044 0.05480866  1.00000000  0.35410375  0.57622257
newspaper -0.15494414 0.05664787  0.35410375  1.00000000  0.22829903
sales     -0.05161625 0.78222442  0.57622257  0.22829903  1.00000000
To get rid of the index column:
> cor(Advertising[,c("sales", "TV", "radio", "newspaper")])
              sales         TV      radio  newspaper
sales     1.0000000 0.78222442 0.57622257 0.22829903
TV        0.7822244 1.00000000 0.05480866 0.05664787
radio     0.5762226 0.05480866 1.00000000 0.35410375
newspaper 0.2282990 0.05664787 0.35410375 1.00000000
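So newspaper advertising is a surrogate for radio advertising: the two budgets are correlated (0.35), so in the simple regression newspaper gets “credit” for the association between radio and sales, while in the multiple regression its coefficient is no longer significant (p = 0.86).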
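The surrogate effect can be reproduced on synthetic data. In the sketch below every number is a simulated assumption, not the Advertising data: radio drives sales, while newspaper has no effect of its own but is correlated with radio at roughly 0.35. Regressed alone, newspaper picks up a positive slope; once radio is in the model, its coefficient shrinks towards zero.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 200

radio <- rnorm(n)
# newspaper is correlated with radio (about 0.35) but has no effect on sales.
newspaper <- 0.35 * radio + sqrt(1 - 0.35^2) * rnorm(n)
sales <- 2 * radio + rnorm(n)

simple   <- lm(sales ~ newspaper)
multiple <- lm(sales ~ radio + newspaper)

print(coef(simple)["newspaper"])    # slope when newspaper is the only predictor
print(coef(multiple)["newspaper"])  # slope after adjusting for radio
```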
The correlation matrix can be visualised with heatmap:
Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("heatmap1.svg", width = 11, height = 12, pointsize = 12, family = "sans")
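# symm = TRUE treats the matrix as symmetric, which also turns off the default row scaling.
heatmap(cor(Advertising[,c("sales", "TV", "radio", "newspaper")]), symm = TRUE)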
dev.off()
Measuring the Quality of a fit
mean squared error (MSE)
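Mean squared error is the average of the squared differences between observed and predicted responses, MSE = (1/n) * sum((y_i - yhat_i)^2). A minimal sketch, using only the six rows printed by head(Advertising) and the coefficients of the sales on TV regression reported above; on the full data set the value would differ.

```r
# Six rows from head(Advertising); not the full data set.
tv    <- c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7)
sales <- c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)

# Predictions from the simple regression of sales on TV.
predicted <- 7.032594 + 0.047537 * tv

# Mean squared error: the average squared residual.
mse <- mean((sales - predicted)^2)
print(mse)
```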
- K-nearest neighbours: Chapters 2 and 4
- Parametric Methods: Chapter 6
- Non-Parametric Methods: Chapters 5 and 7
- Subset Selection, Lasso: Chapter 6
- Least Squares: Chapter 3
- Generalized Additive Models: Chapter 7
- Trees, Bagging, Boosting: Chapter 8
- Support Vector Machines: Chapter 9
- Deep Learning: Chapter 10
Jargon
- Cross-validation
Excluding the index column
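Index columns can be identified using table, e.g. table(Advertising$X), which shows that every entry occurs exactly once. Their values can then be moved into the row names and the column itself dropped, so that it no longer takes part in calculations such as cor(Advertising).
Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
rownames(Advertising) <- Advertising[,c("X")]
Advertising$X <- NULL  # drop the index column now that its values are the row names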
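An alternative, sketched here under the assumption that the index is the first column of the file, is to let read.csv consume it as row names at load time via row.names = 1. The example reads an inline copy of the six rows shown by head(Advertising) rather than Advertising.csv, so it runs without the file; the header name X for the index column is taken from that output and may differ in the actual file.

```r
# Inline copy of the six rows from head(Advertising); not the full data set.
csv_text <- "X,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9
6,8.7,48.9,75.0,7.2"

# row.names = 1 uses the first column as row names instead of a data column.
adv <- read.csv(text = csv_text, header = TRUE, row.names = 1)

print(names(adv))          # the index column is no longer among the variables
print(round(cor(adv), 3))  # correlations without the index column
```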