Bin transform
TIP
The bin transform is for aggregating quantitative or temporal data. For ordinal or nominal data, use the group transform. See also the hexbin transform.
The bin transform groups quantitative or temporal data — continuous measurements such as heights, weights, or temperatures — into discrete bins. You can then compute summary statistics for each bin, such as a count, sum, or proportion. The bin transform is most often used to make histograms or heatmaps with the rect mark.
For example, here is a histogram showing the distribution of weights of Olympic athletes.
ForkPlot.plot({
y: {grid: true},
marks: [
Plot.rectY(olympians, Plot.binX({y: "count"}, {x: "weight"})),
Plot.ruleY([0])
]
})
The binX transform takes x as input and outputs x1 and x2 representing the extent of each bin in x. The outputs argument (here {y: "count"}
) declares additional output channels (y) and the associated reducer (count). Hence the height of each rect above represents the number of athletes in the corresponding bin, i.e., the number of athletes with a similar weight.
While the binX transform is often used to generate y, it can output any channel. Below, the fill channel represents count per bin, resulting in a one-dimensional heatmap.
ForkPlot
.rect(olympians, Plot.binX({fill: "count"}, {x: "weight"}))
.plot({color: {scheme: "YlGnBu"}})
You can partition bins using z. If z is undefined, it defaults to fill or stroke, if any. In conjunction with the rectY mark’s implicit stackY transform, this will produce a stacked histogram.
ForkPlot.plot({
y: {grid: true},
color: {legend: true},
marks: [
Plot.rectY(olympians, Plot.binX({y: "count"}, {x: "weight", fill: "sex"})),
Plot.ruleY([0])
]
})
TIP
You can invoke the stack transform explicitly as Plot.stackY(Plot.binX({y: "count"}, {x: "weight", fill: "sex"}))
to produce an identical chart.
You can opt-out of the implicit stackY transform by having binX generate y1 or y2 instead of y (and similarly x1 or x2 for stackX and binY). When overlapping marks, use either opacity or blending to make the overlap visible.
ForkPlot.plot({
y: {grid: true},
marks: [
Plot.rectY(olympians, Plot.binX({y2: "count"}, {x: "weight", fill: "sex", mixBlendMode: "multiply"})),
Plot.ruleY([0])
]
})
CAUTION
While the mixBlendMode option is useful for mitigating occlusion, it can be slow to render if there are many elements. More than two overlapping histograms may also be hard to read.
The bin transform comes in three orientations:
- binX bins on x, and often outputs y as in a histogram with vertical↑ rects;
- binY bins on y, and often outputs x as in a histogram with horizontal→ rects; and
- bin bins on both x and y, and often outputs to fill or r as in a heatmap.
As you might guess, the binY transform with the rectX mark produces a histogram with horizontal→ rects.
ForkPlot.plot({
x: {grid: true},
marks: [
Plot.rectX(olympians, Plot.binY({x: "count"}, {y: "weight"})),
Plot.ruleX([0])
]
})
You can produce a two-dimensional heatmap with bin transform and a rect mark by generating a fill output channel. Below, color encodes the number of athletes in each bin (of similar height and weight).
ForkPlot
.rect(olympians, Plot.bin({fill: "count"}, {x: "weight", y: "height"}))
.plot({color: {scheme: "YlGnBu"}})
The bin transform also outputs x and y channels representing bin centers. These can be used to place a dot mark whose size again represents the number of athletes in each bin.
ForkPlot.plot({
r: {range: [0, 6]}, // generate slightly smaller dots
marks: [
Plot.dot(olympians, Plot.bin({r: "count"}, {x: "weight", y: "height"}))
]
})
We can add the stroke channel to show overlapping distributions by sex.
ForkPlot.plot({
r: {range: [0, 6]},
marks: [
Plot.dot(olympians, Plot.bin({r: "count"}, {x: "weight", y: "height", stroke: "sex"}))
]
})
Similarly the binX and binY transforms generate respective x and y channels for one-dimensional binning.
ForkPlot.plot({
r: {range: [0, 14]},
marks: [
Plot.dot(olympians, Plot.binX({r: "count"}, {x: "weight"}))
]
})
In addition to rect and dot, you can even use continuous marks such as area and line. In this case you should set the bin transform’s filter option to null so that empty bins are included in the output; otherwise, the area or line would mislead by interpolating over missing bins.
ForkPlot.plot({
y: {grid: true},
marks: [
Plot.areaY(olympians, Plot.binX({y: "count", filter: null}, {x: "weight", fillOpacity: 0.2})),
Plot.lineY(olympians, Plot.binX({y: "count", filter: null}, {x: "weight"})),
Plot.ruleY([0])
]
})
The cumulative option produces a cumulative distribution. Below, each bin represents the number of athletes with the given weight or less. To have each bin represent the number of athletes with the given weight or more, set cumulative to −1.
Cumulative:
Plot.plot({
marginLeft: 60,
y: {grid: true},
marks: [
Plot.rectY(olympians, Plot.binX({y: "count"}, {x: "weight", cumulative})),
Plot.ruleY([0])
]
})
The bin transform works with Plot’s faceting system, partitioning bins by facet. Below, we compare the weight distributions of athletes within each sport using the proportion-facet reducer. Sports are sorted by median weight: gymnasts tend to be the lightest, and basketball players the heaviest.
Plot.plot({
marginLeft: 100,
padding: 0,
x: {grid: true},
fy: {domain: d3.groupSort(olympians.filter((d) => d.weight), (g) => d3.median(g, (d) => d.weight), (d) => d.sport)},
color: {scheme: "YlGnBu"},
marks: [Plot.rect(olympians, Plot.binX({fill: "proportion-facet"}, {x: "weight", fy: "sport", inset: 0.5}))]
})
The bin transform sets default insets for a one-pixel gap between rects. You can set explicit insets if you prefer, say if you want the rects to touch. In this case we recommend rounding on the x scale to avoid antialiasing artifacts between rects.
Plot.plot({
x: {round: true},
y: {grid: true},
marks: [
Plot.rectY(olympians, Plot.binX({y: "count"}, {x: "weight", inset: 0})),
Plot.ruleY([0])
]
})
Bin options
Given input data = [d₀, d₁, d₂, …], by default the resulting binned data is an array of arrays where each inner array is a subset of the input data [[d₁, d₂, …], [d₀, …], …]. Each inner array is in input order. The outer array is in ascending order according to the associated dimension (x then y).
By specifying a reducer for the data output, as described below, you can change how the binned data is computed. The outputs may also include filter and sort options specified as reducers, and a reverse option to reverse the order of generated bins. By default, empty bins are omitted, and non-empty bins are generated in ascending threshold order.
In addition to data, the following channels are automatically output:
- x1 - the starting horizontal position of the bin
- x2 - the ending horizontal position of the bin
- x - the horizontal center of the bin
- y1 - the starting vertical position of the bin
- y2 - the ending vertical position of the bin
- y - the vertical center of the bin
- z - the first value of the z channel, if any
- fill - the first value of the fill channel, if any
- stroke - the first value of the stroke channel, if any
The x1, x2, and x output channels are only computed by the binX and bin transform; similarly the y1, y2, and y output channels are only computed by the binY and bin transform. The x and y output channels are lazy: they are only computed if needed by a downstream mark or transform. Conversely, the x1 and x2 outputs default to undefined if x is explicitly defined; and the y1 and y2 outputs default to undefined if y is explicitly defined.
You can declare additional output channels by specifying the channel name and desired reducer in the outputs object which is the first argument to the transform. For example, to use binX to generate a y channel of bin counts as in a frequency histogram:
Plot.binX({y: "count"}, {x: "culmen_length_mm"})
The following named reducers are supported:
- first - the first value, in input order
- last - the last value, in input order
- count - the number of elements (frequency)
- distinct - the number of distinct values
- sum - the sum of values
- proportion - the sum proportional to the overall total (weighted frequency)
- proportion-facet - the sum proportional to the facet total
- min - the minimum value
- min-index - the zero-based index of the minimum value
- max - the maximum value
- max-index - the zero-based index of the maximum value
- mean - the mean value (average)
- median - the median value
- mode - the value with the most occurrences
- pXX - the percentile value, where XX is a number in [00,99]
- deviation - the standard deviation
- variance - the variance per Welford’s algorithm
- identity - the array of values
- x - the middle of the bin’s x extent (when binning on x)
- x1 - the lower bound of the bin’s x extent (when binning on x)
- x2 - the upper bound of the bin’s x extent (when binning on x)
- y - the middle of the bin’s y extent (when binning on y)
- y1 - the lower bound of the bin’s y extent (when binning on y)
- y2 - the upper bound of the bin’s y extent (when binning on y)
- z ^0.6.14 - the bin’s z value (z, fill, or stroke)
In addition, a reducer may be specified as:
- a function to be passed the array of values for each bin and the extent of the bin
- an object with a reduceIndex method, and optionally a scope
In the last case, the reduceIndex method is repeatedly passed three arguments: the index for each bin (an array of integers), the input channel’s array of values, and the extent of the bin (an object {data, x1, x2, y1, y2}); it must then return the corresponding aggregate value for the bin.
If the reducer object’s scope is data, then the reduceIndex method is first invoked for the full data; the return value of the reduceIndex method is then made available as a third argument (making the extent the fourth argument). Similarly if the scope is facet, then the reduceIndex method is invoked for each facet, and the resulting reduce value is made available while reducing the facet’s bins. (This optional scope is used by the proportion and proportion-facet reducers.)
Most reducers require binding the output channel to an input channel; for example, if you want the y output channel to be a sum (not merely a count), there should be a corresponding y input channel specifying which values to sum. If there is not, sum will be equivalent to count.
Plot.binX({y: "sum"}, {x: "culmen_length_mm", y: "body_mass_g"})
You can control whether a channel is computed before or after binning. If a channel is declared only in options (and it is not a special group-eligible channel such as x, y, z, fill, or stroke), it will be computed after binning and be passed the binned data: each datum is the array of input data corresponding to the current bin.
Plot.binX({y: "count"}, {x: "economy (mpg)", title: (data) => data.map((d) => d.name).join("\n")})
This is equivalent to declaring the channel only in outputs.
Plot.binX({y: "count", title: (data) => data.map((d) => d.name).join("\n")}, {x: "economy (mpg)"})
However, if a channel is declared in both outputs and options, then the channel in options is computed before binning and can then be aggregated using any built-in reducer (or a custom reducer function) during the bin transform.
Plot.binX({y: "count", title: (names) => names.join("\n")}, {x: "economy (mpg)", title: "name"})
To control how the quantitative dimensions x and y are divided into bins, the following options are supported:
- thresholds - the threshold values; see below
- interval - an alternative method of specifying thresholds
- domain - values outside the domain will be omitted
- cumulative - if positive, each bin will contain all lesser bins
These options may be specified either on the options or outputs object. If the domain option is not specified, it defaults to the minimum and maximum of the corresponding dimension (x or y), possibly niced to match the threshold interval to ensure that the first and last bin have the same width as other bins. If cumulative is negative (-1 by convention), each bin will contain all greater bins rather than all lesser bins, representing the complementary cumulative distribution.
To pass separate binning options for x and y, use an object with the options above and a value option to specify the input channel values.
Plot.binX({y: "count"}, {x: {thresholds: 20, value: "culmen_length_mm"}})
The thresholds option may be specified as a named method or a variety of other ways:
- auto (default) - Scott’s rule, capped at 200
- freedman-diaconis - the Freedman–Diaconis rule
- scott - Scott’s normal reference rule
- sturges - Sturges’ formula
- a count (hint) representing the desired number of bins
- an array of n threshold values for n - 1 bins
- an interval or time interval (for temporal binning; see below)
- a function that returns an array, count, or time interval
If the thresholds option is specified as a function, it is passed three arguments: the array of input values, the domain minimum, and the domain maximum. If a number, d3.ticks or d3.utcTicks is used to choose suitable nice thresholds. If an interval, it must expose an interval.floor(value), interval.ceil(value), interval.offset(value), and interval.range(start, stop) methods. If the interval is a time interval such as "day" (equivalently, d3.utcDay), or if the thresholds are specified as an array of dates, then the binned values are implicitly coerced to dates. Time intervals are intervals that are also functions that return a Date instance when called with no arguments.
If the interval option is used instead of thresholds, it may be either an interval, a time interval, or a number. If a number n, threshold values are consecutive multiples of n that span the domain; otherwise, the interval option is equivalent to the thresholds option. When the thresholds are specified as an interval, and the default domain is used, the domain will automatically be extended to start and end to align with the interval.
The bin transform supports grouping in addition to binning: you can subdivide bins by up to two additional ordinal or categorical dimensions (not including faceting). If any of z, fill, or stroke is a channel, the first of these channels will be used to subdivide bins. Similarly, binX will group on y if y is not an output channel, and binY will group on x if x is not an output channel. For example, for a stacked histogram:
Plot.binX({y: "count"}, {x: "body_mass_g", fill: "species"})
Lastly, the bin transform changes the default mark insets: binX changes the defaults for insetLeft and insetRight; binY changes the defaults for insetTop and insetBottom; bin changes all four.
bin(outputs, options)
Plot.rect(olympians, Plot.bin({fill: "count"}, {x: "weight", y: "height"}))
Bins on x and y. Also groups on the first channel of z, fill, or stroke, if any.
binX(outputs, options)
Plot.rectY(olympians, Plot.binX({y: "count"}, {x: "weight"}))
Bins on x. Also groups on y and the first channel of z, fill, or stroke, if any.
binY(outputs, options)
Plot.rectX(olympians, Plot.binY({x: "count"}, {y: "weight"}))
Bins on y. Also groups on x and first channel of z, fill, or stroke, if any.