
Linear regression mark ^0.5.1

The linear regression mark draws linear regression lines with confidence bands, representing the estimated linear relation of a dependent variable (typically y) on an independent variable (typically x). Below we can see that, in this example dataset at least, the weight of a car is a good linear predictor of its power.

[Chart: scatterplot of power (hp) against weight (lb), with a red linear regression line and confidence band.]
js
Plot.plot({
  marks: [
    Plot.dot(cars, {x: "weight (lb)", y: "power (hp)"}),
    Plot.linearRegressionY(cars, {x: "weight (lb)", y: "power (hp)", stroke: "red"})
  ]
})

A linear model posits that y is determined by an underlying affine function y = a + bx, where a is a constant (the intercept of the line on the y-axis, where x = 0) and b is the slope. Given a set of points in x and y, the linear regression method returns the most likely parameters a and b for the linear model, as well as a confidence band showing the range in which the fitted line lies with a given probability; this probability is controlled by the ci option (for confidence interval), which defaults to 0.95.

INFO

The regression line is fit using the least squares approach. See Torben Jansen’s “Linear regression with confidence bands” and this StackExchange question for details.
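
To make the fit concrete, here is a minimal sketch of the ordinary least-squares solution in closed form; leastSquares is a hypothetical helper for illustration, not part of Plot’s API.

js
// Given arrays xs and ys of equal length, return the intercept a and
// slope b of the line y = a + bx that minimizes the sum of squared
// residuals. (Hypothetical helper, for illustration only.)
function leastSquares(xs, ys) {
  const n = xs.length;
  const mx = xs.reduce((sum, v) => sum + v, 0) / n; // mean of x
  const my = ys.reduce((sum, v) => sum + v, 0) / n; // mean of y
  let sxy = 0, sxx = 0;
  for (let i = 0; i < n; ++i) {
    sxy += (xs[i] - mx) * (ys[i] - my);
    sxx += (xs[i] - mx) ** 2;
  }
  const b = sxy / sxx; // slope
  const a = my - b * mx; // intercept
  return {a, b};
}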

Use the slider below to build a linear model from a subset of the data with m points. As you can see, the model gives a line as soon as two points are available, and gets more refined and stable as the size of the subset increases.

[Chart: power (hp) against weight (lb); the regression is fit to the first m points, shown in black over the full, faded dataset.]
js
Plot.plot({
  marks: [
    Plot.dot(cars, {x: "weight (lb)", y: "power (hp)", fill: "currentColor", fillOpacity: 0.2}),
    Plot.dot(cars.slice(0, m), {x: "weight (lb)", y: "power (hp)"}),
    Plot.linearRegressionY(cars.slice(0, m), {x: "weight (lb)", y: "power (hp)", stroke: "red"})
  ]
})
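
In an Observable notebook or Observable Framework page, the slider driving m above might be defined with a range input; this is an assumption about the surrounding environment, not part of the Plot snippet itself.

js
// Hypothetical slider: m ranges over the subset size, starting at 2
// (the minimum needed to fit a line). Inputs.range is from
// @observablehq/inputs; view() is Observable Framework’s helper.
const m = view(Inputs.range([2, cars.length], {step: 1, value: 2, label: "m"}));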

TIP

When operating on a subset of the data (the “training dataset”, in machine learning parlance), randomly shuffling the data can reduce bias.
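
For example, d3-array’s d3.shuffle can randomize the row order before the subset is taken; note that it shuffles in place, hence the copy. (This sketch assumes d3 is in scope.)

js
// Shuffle a copy of the data so the first m rows form an unbiased sample.
const shuffled = d3.shuffle(cars.slice());
Plot.linearRegressionY(shuffled.slice(0, m), {x: "weight (lb)", y: "power (hp)", stroke: "red"})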

This type of model is regularly criticized for pushing people toward wrong conclusions about their data when the actual underlying structure or process is nonlinear. For example, if you measure the relationship between culmen depth and length in a mixed population of penguins, the two are positively correlated within each of the three species (bigger penguins, having longer culmens, also tend to have deeper ones); however, the Gentoo population has a smaller ratio of culmen depth to length, and the overall correlation across the three species is negative. This is called Simpson’s paradox, and it applies to any data that contains underlying populations with different properties or outcomes.

[Chart: culmen_depth_mm against culmen_length_mm, colored by species (Adelie, Chinstrap, Gentoo), with per-species regression lines and an overall regression line.]
js
Plot.plot({
  grid: true,
  color: {legend: true},
  marks: [
    Plot.dot(penguins, {x: "culmen_length_mm", y: "culmen_depth_mm", fill: "species"}),
    Plot.linearRegressionY(penguins, {x: "culmen_length_mm", y: "culmen_depth_mm", stroke: "species"}),
    Plot.linearRegressionY(penguins, {x: "culmen_length_mm", y: "culmen_depth_mm"})
  ]
})

Finally, note that regression is not a symmetric method: the model computed to express y as a function of x (linearRegressionY) doesn’t give the same result as the regression of x as a function of y (linearRegressionX) unless the points are all perfectly aligned. In the worst case, where the two variables are statistically independent, the linear regression of y against x is a horizontal line, whereas the linear regression of x against y is a vertical line.

[Chart: power (hp) against weight (lb), comparing linearRegressionY (steelblue) with linearRegressionX (orange).]
js
Plot.plot({
  marks: [
    Plot.dot(cars, {x: "weight (lb)", y: "power (hp)", strokeOpacity: 0.5, r: 2}),
    Plot.linearRegressionY(cars, {x: "weight (lb)", y: "power (hp)", stroke: "steelblue"}),
    Plot.linearRegressionX(cars, {x: "weight (lb)", y: "power (hp)", stroke: "orange"})
  ]
})

Linear regression options

The given options are passed through to the underlying marks that render the regression line and its confidence band, with the exception of the following options:

  • stroke - the stroke color of the regression line; defaults to currentColor
  • fill - the fill color of the confidence band; defaults to the line’s stroke
  • fillOpacity - the fill opacity of the confidence band; defaults to 0.1
  • ci - the confidence interval in [0, 1), or 0 to hide bands; defaults to 0.95
  • precision - the distance (in pixels) between samples of the confidence band; defaults to 4

Multiple regressions can be defined by specifying z, fill, or stroke.
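
For example, setting ci to 0 suppresses the confidence band, and a categorical stroke (as in the penguins chart above) fits one regression per series:

js
Plot.linearRegressionY(penguins, {x: "culmen_length_mm", y: "culmen_depth_mm", stroke: "species", ci: 0})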

linearRegressionX(data, options)

js
Plot.linearRegressionX(mtcars, {y: "wt", x: "hp"})

Returns a linear regression mark where x is the dependent variable and y is the independent variable. (This is the uncommon orientation.)

linearRegressionY(data, options)

js
Plot.linearRegressionY(mtcars, {x: "wt", y: "hp"})

Returns a linear regression mark where y is the dependent variable and x is the independent variable. (This is the common orientation.)