
Apache Arrow

Apache Arrow “defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations.” You will probably not consume it directly, but it is used by Arquero, DuckDB, and other libraries to handle data efficiently.

To load an Arrow IPC file, use FileAttachment.

const flights = FileAttachment("flights-200k.arrow").arrow();

This returns a promise to an Arrow table.


This table records flights. It’s easier to inspect as an array of rows:

[...flights]

Or using Inputs.table:

Inputs.table(flights)

We can visualize the distribution of flight delays with a Plot rect mark and bin transform:

Plot.plot({
  y: {
    transform: (d) => d / 1000,
    label: "Flights (thousands)"
  },
  marks: [
    Plot.rectY(flights, Plot.binX({y: "count"}, {x: "delay", interval: 5, fill: "var(--theme-blue)"}))
  ]
})

You can also work directly with the Apache Arrow API to create in-memory tables. Apache Arrow is available by default as Arrow in Markdown, but you can import it explicitly like so:

import * as Arrow from "npm:apache-arrow";

For example, to create a table representing a year-long random walk:

const date = d3.utcDay.range(new Date("2023-01-01"), new Date("2024-01-02"));
const random = d3.randomNormal.source(d3.randomLcg(42))(); // seeded random
const value = d3.cumsum(date, random);
const table = Arrow.tableFromArrays({date, value});

Visualized with Plot’s difference mark:

Plot.plot({
  marks: [
    Plot.differenceY(table, {x: "date", y: "value"})
  ]
})

Apache Parquet

The Apache Parquet format is optimized for storage and transfer. To load a Parquet file — such as this sample of 250,000 stars from the Gaia Star Catalog — use FileAttachment. This is implemented using Kyle Barron’s parquet-wasm library.

const gaia = FileAttachment("gaia-sample.parquet").parquet();

Like file.arrow, this returns a promise to an Apache Arrow table.


We can plot these stars binned by intervals of 2° to reveal the Milky Way.

  aspectRatio: 1,
  marks: [
    Plot.frame({fill: 0}),
    Plot.rect(gaia, Plot.bin({fill: "count"}, {x: "ra", y: "dec", interval: 2, inset: 0}))

Parquet files work especially well with DuckDB for in-process SQL queries. The Parquet format is optimized for this use case: data is compressed in a columnar format, allowing DuckDB to load only the subset of data needed (via range requests) to execute the current query. This can give a huge performance boost when working with larger datasets.