---
title: "Scatterplot examples"
output: html_notebook
runtime: shiny
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(shiny)
```
This demonstration uses random data, so let's start by generating it. Enter a [random seed][seed] and pick a sample size.
```{r seed, echo=FALSE}
inputPanel(
selectInput("N", label = "Sample Size:",
choices = c(25, 50, 100, 250, 500, 1000), selected = 100),
numericInput("seed", label = "Random number Seed (integer)",
min = 0, max = .Machine$integer.max,
value = floor(runif(1)*.Machine$integer.max),
step = 157)
)
renderText({
N <<- as.numeric(input$N)
set.seed(input$seed)
X <<- rnorm(N)
Err <<- rnorm(N)
paste("Sample Size =",N,"Random Seed = ",input$seed,"\n")
})
```
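The reason for asking for a seed is reproducibility: the same seed always gives the same "random" draws, so you can come back to a plot you liked. A minimal sketch (the seed value `12345` and the variable names are arbitrary choices for illustration, not part of the app above):

```r
# Same seed, same draws
set.seed(12345)
first <- rnorm(5)

set.seed(12345)
second <- rnorm(5)

identical(first, second)  # TRUE

# A different seed gives a different sample
set.seed(54321)
third <- rnorm(5)
```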
# Linear Relationships
All three of these examples indicate that linear regression is a reasonable way to summarize the relationship between $X$ and $Y$.
## Mostly linear
This happens when we have a moderately high to strong correlation.
```{r highCorrelation, echo=FALSE}
inputPanel(
sliderInput("rho", label = "Correlation Coefficient:",
min = .75, max = 1, value = .85, step = 0.05),
checkboxInput("sign","Negative Correlation",FALSE)
)
renderPlot({
rho <<- input$rho*ifelse(input$sign,-1,1)
Y <<- rho*X + sqrt(1-rho*rho)*Err
plot(X,Y,main=paste("Correlation =",rho))
abline(a=0,b=rho,col="red")
},width=288,height=288)
```
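The plot builds $Y$ as `rho*X + sqrt(1-rho*rho)*Err`. When $X$ and the error are independent standard normals, this gives $Y$ a variance of $\rho^2 + (1-\rho^2) = 1$ and a covariance with $X$ of $\rho$, so the population correlation is exactly $\rho$. The sketch below checks this with a large sample; the sample size, seed, and lowercase variable names are illustrative assumptions.

```r
set.seed(2024)
n   <- 1e5
rho <- 0.85

x   <- rnorm(n)                          # standard normal predictor
err <- rnorm(n)                          # independent standard normal error
y   <- rho * x + sqrt(1 - rho^2) * err   # same construction as the plot

cor(x, y)  # should be near rho
sd(y)      # should be near 1
```

In a small sample the observed correlation will wander around the target value, which is why the scatterplots change when you change the seed.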
## Blobby Ellipse
As the correlation coefficient gets lower, the scatterplot looks more blobby, but you can still tell that there is a slope. This is a weak to moderate correlation.
```{r lowCorrelation, echo=FALSE}
inputPanel(
sliderInput("rho1", label = "Correlation Coefficient:",
min = .25, max = .75, value = .5, step = 0.05),
checkboxInput("sign1","Negative Correlation",FALSE)
)
renderPlot({
rho <<- input$rho1*ifelse(input$sign1,-1,1)
Y <<- rho*X + sqrt(1-rho*rho)*Err
plot(X,Y,main=paste("Correlation =",rho))
abline(a=0,b=rho,col="red")
},width=288,height=288)
```
## No Relationship
Not much is going on here. One thing that confuses people is the idea that linear regression doesn't work here. Actually, it gives a quite accurate picture: it tells you that not much is going on, which is what is actually happening. The prediction from the regression will be that $\bar Y$ is the best predicted value for $Y$.
```{r noCorrelation, echo=FALSE}
inputPanel(
sliderInput("rho0", label = "Correlation Coefficient:",
min = -.25, max = .25, value = .0, step = 0.05),
checkboxInput("sign0","Negative Correlation",FALSE)
)
renderPlot({
rho <<- input$rho0*ifelse(input$sign0,-1,1)
Y <<- rho*X + sqrt(1-rho*rho)*Err
plot(X,Y,main=paste("Correlation =",rho))
abline(a=0,b=rho,col="red")
},width=288,height=288)
```
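To see that the regression "predicts $\bar Y$" when nothing is going on, we can fit a line to data where $Y$ does not depend on $X$ at all. This is a sketch with made-up values (a mean of 3 for $Y$, 10,000 points); the fitted slope should come out close to zero and the intercept close to the mean of $Y$.

```r
set.seed(7)
n <- 10000
x <- rnorm(n)
y <- 3 + rnorm(n)   # y is generated without using x

fit <- lm(y ~ x)
coef(fit)           # slope near 0, intercept near mean(y)
mean(y)
```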
# Signs that the linear model doesn't work
The challenge in using regression (and correlation) to summarize the relationship between $X$ and $Y$ arises when the relationship is non-linear. Here the correlation/regression will tell us about the linear part of the relationship, but miss the non-linear part. If the non-linear part is small, this might not be too bad. But if it is big, then _linear_ regression could be misleading. (There are various types of non-linear regression that are covered in more advanced classes.)
## Curve
A curved relationship doesn't look like a line.
Consider a quadratic relationship:
$$ Y = b_2 X^2 + b_1 X + b_0 + \epsilon$$
This is a multiple (or quadratic) regression. You can adjust the coefficients in the plot below.
```{r curve, echo=FALSE}
inputPanel(
sliderInput("b2", label = "Quadratic Term Slope:",
min = -1, max = 1, value = .5, step = 0.05),
sliderInput("b1", label = "Linear Term Slope:",
min = -1, max = 1, value = 0, step = 0.05),
sliderInput("b0", label = "Intercept:",
min = -1, max = 1, value = 0, step = 0.05),
sliderInput("tau", label = "Error Standard Deviation:",
min = 0, max = 1, value = .5, step = 0.05)
)
renderPlot({
Y <<- input$b2*X*X + input$b1*X + input$b0 + input$tau*Err
rho <<- cor(X,Y)
plot(X,Y,main=paste("Correlation =",rho))
# Red line: least-squares fit of Y on X
abline(lm(Y ~ X), col="red")
lines(lowess(X,Y),col="blue",lty=2)
},width=288,height=288)
```
If we try to run a _linear_ regression when the relationship is curved, it will only tell us part of the story. The story it will tell is the red line, and not the blue curve.
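One way to see how much of the story the straight line misses is to fit both a linear and a quadratic model to the same curved data and compare them. The sketch below uses `lm()` with an `I(x^2)` term; the coefficient values match the slider defaults above, while the seed, sample size, and variable names are assumptions made for illustration.

```r
set.seed(11)
n <- 500
x <- rnorm(n)
y <- 0.5 * x^2 + 0.5 * rnorm(n)   # b2 = 0.5, b1 = 0, b0 = 0, error SD = 0.5

lin  <- lm(y ~ x)                 # straight line only
quad <- lm(y ~ x + I(x^2))        # adds the squared term

summary(lin)$r.squared            # small: the line explains little
summary(quad)$r.squared           # noticeably larger
coef(quad)                        # squared-term coefficient near 0.5
```

Adding a term can never lower $R^2$, so the interesting part is the size of the jump, not its direction.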
## Broken Lines
Sometimes the relationship changes somewhere through the range of the data. Often this is a ceiling effect: the effect of $X$ on $Y$ hits a ceiling. For example, in the first couple of years of teaching, the ability of new teachers rises very rapidly as they gain experience. But after 3--5 years, the effect levels out and the teachers grow much more slowly.
Ideally we would fit two linear regressions to these data, splitting at a certain value of $X$, $x_0$. So,
$$ Y = \begin{cases}
b_{11} X + b_{01} + \epsilon & \text{when } X \leq x_0 \\
b_{12} X + b_{02} + \epsilon & \text{when } X > x_0
\end{cases}
$$
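If the split point $x_0$ is known, the two regressions can be fit separately with the `subset` argument of `lm()`. The sketch below is only an illustration: the data are invented (slope 0.5 up to $x_0 = 0$, flat afterwards, with the two pieces joined at $x_0$), and none of the variable names come from the app.

```r
set.seed(1)
n  <- 2000
x0 <- 0                      # assumed known split point
x  <- rnorm(n)

# Hypothetical ceiling effect: rises with slope 0.5, then levels off at x0
y  <- ifelse(x <= x0, 0.5 * x, 0.5 * x0) + 0.25 * rnorm(n)

lower <- lm(y ~ x, subset = x <= x0)   # regression for X <= x0
upper <- lm(y ~ x, subset = x >  x0)   # regression for X >  x0

coef(lower)["x"]   # near 0.5
coef(upper)["x"]   # near 0
```

In practice $x_0$ is usually not known in advance and has to be estimated too, which is a harder problem than this sketch covers.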
```{r ceiling, echo=FALSE}
inputPanel(
sliderInput("b11", label = "First Slope:",
min = -1, max = 1, value = .5, step = 0.05),
sliderInput("b12", label = "Second Slope:",
min = -1, max = 1, value = 0, step = 0.05),
sliderInput("x0", label = "Crossover Point (x[0])",
min = -1, max = 1, value = 0, step = 0.05),
sliderInput("tau1", label = "Error Standard Deviation:",
min = 0, max = 1, value = .5, step = 0.05)
)
renderPlot({
b11 <<- input$b11
b12 <<- input$b12
x0 <<- input$x0
b02 <<- (b11-b12)*x0
Y <<- ifelse(X