Advertising
This is the first linear regression example from Statistical Learning, covered in Chapters 2 and 3.
Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
head(Advertising)
  X    TV radio newspaper sales
1 1 230.1  37.8      69.2  22.1
2 2  44.5  39.3      45.1  10.4
3 3  17.2  45.9      69.3   9.3
4 4 151.5  41.3      58.5  18.5
5 5 180.8  10.8      58.4  12.9
6 6   8.7  48.9      75.0   7.2
Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("advertising1.svg", width = 11, pointsize = 12, family = "sans")
plot(sales ~ TV, data = Advertising, col = 2, xlab = "$1000", ylab = "Units Sold")
abline(lm(formula = sales ~ TV, data = Advertising), col = 2)
text(x = 10, y = 7.032594, "7.032594")
points(sales ~ radio, data = Advertising, col = 3)
abline(lm(formula = sales ~ radio, data = Advertising), col = 3)
text(x = 10, y = 9.31164, "9.31164")
points(sales ~ newspaper, data = Advertising, col = 4)
abline(lm(formula = sales ~ newspaper, data = Advertising), col = 4)
text(x = 10, y = 12.35141, "12.35141")
legend("bottomright", legend = c("TV", "Radio", "Newspaper"), fill = c(2,3,4))
dev.off()
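The text() calls above label each line with an intercept pasted in from a separate model summary. A minimal sketch of reading the intercept from the fitted model with coef() instead, so the label cannot drift from the line it annotates. The data frame first6 is an illustrative stand-in holding only the six rows printed by head(), so its coefficients differ from the full-data values, and pdf(NULL) is used only so the sketch runs without writing a file.

```r
# Only the six rows shown by head(Advertising); not the full data set.
first6 <- data.frame(
  TV    = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  sales = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)

fit <- lm(sales ~ TV, data = first6)
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["TV"]]

pdf(NULL)  # null device: nothing is written to disk
plot(sales ~ TV, data = first6, col = 2, xlab = "$1000", ylab = "Units Sold")
abline(fit, col = 2)
# Label the line with its own intercept rather than a pasted number.
text(x = 10, y = intercept, labels = signif(intercept, 7))
dev.off()
```

In the full script the same pattern would apply to each of the three lm() fits, with svg() in place of the null device.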
Multiple Linear Regression
Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
model = lm(sales ~ TV + radio + newspaper, data = Advertising)
summary(model)
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
radio        0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86
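As a worked reading of the coefficient table, the fitted equation is sales = 2.938889 + 0.045765 * TV + 0.188530 * radio - 0.001037 * newspaper. The sketch below plugs in the budgets from the first row of head(Advertising). The names beta and budgets are illustrative, and the coefficients are typed in from the summary rather than taken from a fitted model.

```r
# Coefficients copied from the summary(model) output above.
beta <- c(intercept = 2.938889, TV = 0.045765, radio = 0.188530, newspaper = -0.001037)

# Budgets from row 1 of head(Advertising); the leading 1 multiplies the intercept.
budgets <- c(1, TV = 230.1, radio = 37.8, newspaper = 69.2)

predicted <- sum(beta * budgets)
observed  <- 22.1                 # sales in row 1
residual  <- observed - predicted

round(c(predicted = predicted, residual = residual), 3)
```

The prediction comes out at about 20.5 against an observed 22.1. With the fitted model object available, predict(model, newdata = ...) would do the same arithmetic without retyping the coefficients.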
Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("advertising2.svg", width = 11, pointsize = 12, family = "sans")
plot(sales ~ TV, data = Advertising, col = 2, xlab = "$1000", ylab = "Units Sold")
points(sales ~ radio, data = Advertising, col = 3)
points(sales ~ newspaper, data = Advertising, col = 4)
abline(a = 2.938889, b = 0.045765, col = 2)
abline(a = 2.938889, b = 0.188530, col = 3)
abline(a = 2.938889, b = -0.001037, col = 4)
legend("bottomright", legend = c("TV", "Radio", "Newspaper"), fill = c(2,3,4))
dev.off()
For comparison, the simple regressions of sales on each medium separately. Newspaper is significant on its own (p = 0.00115) but not in the multiple regression above (p = 0.86).
summary(lm(sales ~ TV, data = Advertising))
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
TV          0.047537   0.002691   17.67   <2e-16 ***
Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
summary(lm(sales ~ radio, data = Advertising))
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.31164    0.56290  16.542   <2e-16 ***
radio        0.20250    0.02041   9.921   <2e-16 ***
Residual standard error: 4.275 on 198 degrees of freedom
Multiple R-squared: 0.332, Adjusted R-squared: 0.3287
F-statistic: 98.42 on 1 and 198 DF, p-value: < 2.2e-16
summary(lm(sales ~ newspaper, data = Advertising))
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
newspaper    0.05469    0.01658    3.30  0.00115 **
Residual standard error: 5.092 on 198 degrees of freedom
Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148
Correlation Matrix
> cor(Advertising$sales, Advertising$TV)
[1] 0.7822244
> cor(Advertising$sales, Advertising$radio)
[1] 0.5762226
> cor(Advertising$sales, Advertising$newspaper)
[1] 0.228299
> cor(Advertising$TV, Advertising$radio)
[1] 0.05480866
> cor(Advertising$TV, Advertising$newspaper)
[1] 0.05664787
> cor(Advertising$radio, Advertising$newspaper)
[1] 0.3541038
> cor(Advertising)
                    X         TV       radio   newspaper       sales
X          1.00000000 0.01771469 -0.11068044 -0.15494414 -0.05161625
TV         0.01771469 1.00000000  0.05480866  0.05664787  0.78222442
radio     -0.11068044 0.05480866  1.00000000  0.35410375  0.57622257
newspaper -0.15494414 0.05664787  0.35410375  1.00000000  0.22829903
sales     -0.05161625 0.78222442  0.57622257  0.22829903  1.00000000
To get rid of the index column:
> cor(Advertising[,c("sales", "TV", "radio", "newspaper")])
              sales         TV      radio  newspaper
sales     1.0000000 0.78222442 0.57622257 0.22829903
TV        0.7822244 1.00000000 0.05480866 0.05664787
radio     0.5762226 0.05480866 1.00000000 0.35410375
newspaper 0.2282990 0.05664787 0.35410375 1.00000000
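These correlations tie back to the single-predictor regression summaries earlier: for a least-squares fit with one predictor and an intercept, the Multiple R-squared is the square of the correlation between predictor and response. A short check using only numbers printed in these notes (the vector names r and reported are illustrative):

```r
# Pairwise correlations with sales, copied from the cor() output above.
r <- c(TV = 0.7822244, radio = 0.5762226, newspaper = 0.228299)

# For a one-predictor least-squares fit with an intercept, R-squared equals r^2.
r_squared <- r^2
round(r_squared, 4)

# Multiple R-squared values reported by the three simple regressions.
reported <- c(TV = 0.6119, radio = 0.332, newspaper = 0.05212)
max(abs(r_squared - reported))   # differences come only from rounding
```

The identity does not carry over to the multiple regression, where R-squared depends on all three predictors jointly.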
The correlation matrix can be visualised with heatmap:
Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
svg("heatmap1.svg", width = 11, pointsize = 12, family = "sans")
heatmap(cor(Advertising[,c("sales", "TV", "radio", "newspaper")]))
dev.off()
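One caveat worth checking against the heatmap() help page: by default the function reorders rows and columns by hierarchical clustering and, unless symm = TRUE, scales each row, so the colours in the plot above may not map directly to the raw correlations. A sketch that switches both behaviours off, assuming that is the picture wanted. The matrix cm is typed in from the cor() output above, and pdf(NULL) is used only so the sketch runs without writing a file.

```r
# Correlations copied from the cor() output above (index column excluded).
vars <- c("sales", "TV", "radio", "newspaper")
cm <- matrix(
  c(1.00000000, 0.78222442, 0.57622257, 0.22829903,
    0.78222442, 1.00000000, 0.05480866, 0.05664787,
    0.57622257, 0.05480866, 1.00000000, 0.35410375,
    0.22829903, 0.05664787, 0.35410375, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

pdf(NULL)  # null device so the sketch runs without writing a file
# Rowv = NA and Colv = NA keep the variable order (no dendrogram reordering);
# scale = "none" colours the cells by the raw correlation values.
heatmap(cm, Rowv = NA, Colv = NA, scale = "none", margins = c(8, 8))
dev.off()
```

In the full script, cor(Advertising[,c("sales", "TV", "radio", "newspaper")]) would replace the typed-in matrix and svg() the null device.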
Measuring the Quality of a Fit
Mean squared error (MSE)
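For reference, MSE is the average of the squared differences between observed and predicted values: MSE = (1/n) * sum((y_i - fhat(x_i))^2). A minimal sketch follows. The helper mse() and the data frame first6 are illustrative, first6 holds only the six rows printed by head(Advertising), and both figures are training MSE (computed on the data used for fitting), not test MSE.

```r
# MSE: mean of squared differences between observed and predicted values.
mse <- function(observed, predicted) mean((observed - predicted)^2)

# Only the six rows shown by head(Advertising); not the full data set.
first6 <- data.frame(
  TV    = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7),
  sales = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2)
)
fit <- lm(sales ~ TV, data = first6)
train_mse <- mse(first6$sales, fitted(fit))

# Cross-check on the full data using the printed summary for sales ~ TV:
# residual standard error 3.259 on 198 degrees of freedom, so n = 198 + 2 = 200.
rss_full <- 3.259^2 * 198          # RSE^2 = RSS / (n - 2)
train_mse_full <- rss_full / 200   # about 10.5
```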
K-nearest neighbours
Chapters 2 and 4
Parametric Methods
Chapter 6
Non-Parametric Methods
Chapters 5 and 7
Subset Selection
Lasso
Chapter 6
Least Squares
Chapter 3
Generalized Additive Models
Chapter 7
Trees
Bagging, Boosting
Chapter 8
Support Vector Machines
Chapter 9
Deep Learning
Chapter 10
Jargon
- Cross-validation
Excluding the index column
Index columns can be identified using table(), e.g. table(Advertising$X), which shows that every entry occurs exactly once. rownames() copies the index into the row names; the column itself then has to be dropped to keep it out of calculations such as cor().
Advertising = read.csv("Advertising.csv", header = T, na.strings = "?", stringsAsFactors = T)
rownames(Advertising) <- Advertising[,c("X")]
Advertising$X <- NULL
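An alternative sketch: read.csv() accepts row.names = 1, which takes the first column as row names at read time so the index never appears as a variable. The inline csv_text below is a stand-in for Advertising.csv built from the first three rows of head(Advertising); the header name X for the index column is an assumption based on that output.

```r
# Stand-in for Advertising.csv: the first three rows from head(Advertising),
# with the index column assumed to be named X as in the output above.
csv_text <- "X,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3"

# row.names = 1 uses the first column as row names instead of as a variable.
adv <- read.csv(text = csv_text, header = TRUE, row.names = 1)

names(adv)       # the index column is no longer among the variables
dim(cor(adv))    # so cor() returns a 4 x 4 matrix without any subsetting
```

With the real file, the text = argument would be replaced by the file name and the remaining arguments kept as in the scripts above.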