---
title: "Scatterplot examples"
output: html_notebook
runtime: shiny
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(shiny)
```

This demonstration will use some random data. Let's start by generating the random data: give a [random seed][seed] and pick a sample size for your sample.

```{r seed, echo=FALSE}
inputPanel(
  selectInput("N", label = "Sample Size:",
              choices = c(25, 50, 100, 250, 500, 1000), selected = 100),
  numericInput("seed", label = "Random Number Seed (integer)",
               min = 0, max = .Machine$integer.max,
               value = floor(runif(1) * .Machine$integer.max), step = 157)
)
renderText({
  N <<- as.numeric(input$N)
  set.seed(input$seed)
  X <<- rnorm(N)
  Err <<- rnorm(N)
  paste("Sample Size =", N, "Random Seed =", input$seed, "\n")
})
```

# Linear Relationships

All three of these examples are indications that linear regression is a reasonable way to summarize the relationship between $X$ and $Y$.

## Mostly linear

This happens when we have a moderately high to strong correlation.

```{r highCorrelation, echo=FALSE}
inputPanel(
  sliderInput("rho", label = "Correlation Coefficient:",
              min = .75, max = 1, value = .85, step = 0.05),
  checkboxInput("sign", "Negative Correlation", FALSE)
)
renderPlot({
  rho <<- input$rho * ifelse(input$sign, -1, 1)
  Y <<- rho * X + sqrt(1 - rho * rho) * Err
  plot(X, Y, main = paste("Correlation =", rho))
  abline(a = 0, b = rho, col = "red")
}, width = 288, height = 288)
```

## Blobby Ellipse

As the correlation coefficient gets lower, the scatterplot looks more blobby, but you can still tell that there is a slope. This is a weak to moderate correlation.

```{r lowCorrelation, echo=FALSE}
inputPanel(
  sliderInput("rho1", label = "Correlation Coefficient:",
              min = .25, max = .75, value = .5, step = 0.05),
  checkboxInput("sign1", "Negative Correlation", FALSE)
)
renderPlot({
  rho <<- input$rho1 * ifelse(input$sign1, -1, 1)
  Y <<- rho * X + sqrt(1 - rho * rho) * Err
  plot(X, Y, main = paste("Correlation =", rho))
  abline(a = 0, b = rho, col = "red")
}, width = 288, height = 288)
```

## No Relationship

Not much is going on here.
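The "regression still works" point below can be checked outside the Shiny demo. In this sketch (plain R with an arbitrary seed and sample size, not part of the notebook's reactive code), the fitted line on unrelated data has a slope near zero and an intercept near $\bar Y$, so its prediction is essentially the mean:

```r
# Simulate X and Y with no real relationship, then fit a line.
set.seed(42)                 # arbitrary fixed seed
N <- 1000                    # arbitrary sample size
X <- rnorm(N)
Y <- rnorm(N)                # independent of X

fit <- lm(Y ~ X)
coef(fit)                    # slope near 0, intercept near mean(Y)
mean(Y)
```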
One thing that confuses people is the idea that linear regression doesn't work here. Actually, it gives a quite accurate picture: it tells you that not much is going on, which is exactly what is happening. The prediction from the regression will be that $\bar Y$ is the best predicted value for $Y$.

```{r noCorrelation, echo=FALSE}
inputPanel(
  sliderInput("rho0", label = "Correlation Coefficient:",
              min = -.25, max = .25, value = .0, step = 0.05),
  checkboxInput("sign0", "Negative Correlation", FALSE)
)
renderPlot({
  rho <<- input$rho0 * ifelse(input$sign0, -1, 1)
  Y <<- rho * X + sqrt(1 - rho * rho) * Err
  plot(X, Y, main = paste("Correlation =", rho))
  abline(a = 0, b = rho, col = "red")
}, width = 288, height = 288)
```

# Signs that the linear model doesn't work

The challenge in using regression (and correlation) to summarize the relationship between $X$ and $Y$ comes when the relationship is non-linear. Here the correlation/regression will describe the linear part of the relationship but miss the non-linear part. If the non-linear part is small, this might not be too bad; but if it is big, then _linear_ regression could be misleading. (There are various types of non-linear regression that are covered in more advanced classes.)

## Curve

A curved relationship doesn't look like a line. Consider a quadratic relationship:

$$Y = b_2 X^2 + b_1 X + b_0 + \epsilon$$

This is a multiple (or quadratic) regression. You can adjust the coefficients in the plot below.
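Outside the interactive plot, the same quadratic model can be fit with `lm()`, using `I()` to protect the squared term in the formula. A minimal non-interactive sketch (plain R, arbitrary seed; the coefficients are made-up illustrative values):

```r
# Simulate Y = b2*X^2 + b1*X + b0 + error, then compare a straight-line
# fit with a quadratic fit; the quadratic captures the curvature.
set.seed(42)                       # arbitrary fixed seed
X <- rnorm(500)                    # arbitrary sample size
Y <- 0.5 * X^2 + 0 * X + 0 + 0.5 * rnorm(500)

straight  <- lm(Y ~ X)             # linear part only (the red line)
quadratic <- lm(Y ~ X + I(X^2))    # recovers b2, b1, and b0

summary(straight)$r.squared        # small: the line misses the curve
summary(quadratic)$r.squared       # much larger
```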
```{r curve, echo=FALSE}
inputPanel(
  sliderInput("b2", label = "Quadratic Term Slope:",
              min = -1, max = 1, value = .5, step = 0.05),
  sliderInput("b1", label = "Linear Term Slope:",
              min = -1, max = 1, value = 0, step = 0.05),
  sliderInput("b0", label = "Intercept:",
              min = -1, max = 1, value = 0, step = 0.05),
  sliderInput("tau", label = "Error Standard Deviation:",
              min = 0, max = 1, value = .5, step = 0.05)
)
renderPlot({
  Y <<- input$b2 * X * X + input$b1 * X + input$b0 + input$tau * Err
  rho <<- cor(X, Y)
  plot(X, Y, main = paste("Correlation =", rho))
  abline(a = input$b0, b = rho, col = "red")
  lines(lowess(X, Y), col = "blue", lty = 2)
}, width = 288, height = 288)
```

If we try to run a _linear_ regression when the relationship is curved, it will only tell us part of the story. The story it tells is the red line, not the blue curve.

## Broken Lines

Sometimes the relationship changes somewhere in the range of the data. Often this is a ceiling effect: the effect of $X$ on $Y$ hits a ceiling. For example, in the first couple of years of teaching, the ability of new teachers rises very rapidly as they gain experience. But after 3--5 years, the effect levels out and the teachers grow much more slowly.

Ideally we would fit two linear regressions to these data, splitting at a certain value of $X$, $x_0$. So,

$$Y = \begin{cases} b_{11} X + b_{01} + \epsilon & \text{when } X \leq x_0 \\ b_{12} X + b_{02} + \epsilon & \text{when } X > x_0 \end{cases}$$

```{r ceiling, echo=FALSE}
inputPanel(
  sliderInput("b11", label = "First Slope:",
              min = -1, max = 1, value = .5, step = 0.05),
  sliderInput("b12", label = "Second Slope:",
              min = -1, max = 1, value = 0, step = 0.05),
  sliderInput("x0", label = "Crossover Point (x)",
              min = -1, max = 1, value = 0, step = 0.05),
  sliderInput("tau1", label = "Error Standard Deviation:",
              min = 0, max = 1, value = .5, step = 0.05)
)
renderPlot({
  b11 <<- input$b11
  b12 <<- input$b12
  x0 <<- input$x0
  b02 <<- (b11 - b12) * x0   # chosen so the two segments meet at x0
  Y <<- ifelse(X <= x0, b11 * X, b12 * X + b02) + input$tau1 * Err
  rho <<- cor(X, Y)
  plot(X, Y, main = paste("Correlation =", rho))
  abline(a = 0, b = rho, col = "red")
  lines(lowess(X, Y), col = "blue", lty = 2)
}, width = 288, height = 288)
```
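If the crossover point is known (or guessed), the broken-line model can also be fit in a single `lm()` call using a hinge term, `pmax(X - x0, 0)`, which is zero to the left of $x_0$. A sketch under that assumption (plain R, arbitrary seed and coefficients; `x0` is treated as known rather than estimated):

```r
# Broken-stick regression with a known crossover x0: the coefficient
# on the hinge term is the change in slope at x0.
set.seed(42)                           # arbitrary fixed seed
x0  <- 0                               # assumed-known crossover
b11 <- 0.5                             # slope before x0
b12 <- 0                               # slope after x0
X <- rnorm(500)
Y <- ifelse(X <= x0, b11 * X, b12 * X) + 0.25 * rnorm(500)

fit <- lm(Y ~ X + pmax(X - x0, 0))
coef(fit)   # X coefficient ~ b11; hinge coefficient ~ b12 - b11
```

Estimating $x_0$ itself, rather than assuming it, is one of the non-linear problems left to more advanced classes.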