Elias Fano Encoding / Yuri Vishnevsky

Yuri Vishnevsky

Software engineer and visualization enthusiast. Lives in Cambridge, Massachusetts, with no dogs.

Workspace

Unlisted

Edited

Sep 30, 2023

Say we have an array of *n* sorted integers, with maximum integer *u* and repetitions allowed.

For example, for the array below we have _n_ = ${n} elements and the largest one is _u_ = ${u}

```

${JSON.stringify(a)}

```

In a language with integer types these would usually be represented in a fixed-width encoding such as `u8` or `u32` that represents each number in base 2. If you know your numbers to span the entire range of the encoding (ie. for `u8` your maximum number is indeed 2^8-1 = 255) this can be a reasonable choice, since on average it takes 8 bits to communicate the value of an 8-bit integer chosen uniformly at random. The amount of information in a random 8 bit integer is 8 bits, so the amount of information in an array of random 8-bit integers is _n_ * 8 bits.

But in our case we know for a fact that our integers are sorted, and in addition our _u_ value might not be a power of 2. both of these special cases suggest that we might be able to do better than a fixed-width encoding.

That's where Elias Fano comes in – it's an encoding that allows us to represent our sorted array more compactly.

The basic idea is that the sorted nature of our array lets us take advantage of _bucketing_ –

The Elias-Fano exploits the sorted nature of our sequence by *bucketing* – it groups the sequence into contiguous buckets and stores the counts for each bucket, instead of storing the numbers themselves.

It stores the count (high bits) and an offset (low bits) for each entry in the bucket.

Draw an analogy to the same idea in base 10

// todo: check how this lines up with what I was doing

image = FileAttachment("image.png").image()

FileAttachment("ma2021.pdf").url() // image is from this paper

data = a.flatMap((d, i) =>

Array.from(d.toString(), (c, ci) => ({ d, i, c, ci }))

)

Plot.plot({

marks: [

Plot.text(data, { text: "c", x: "ci", y: "i", fontSize: 14 })

width: 3z0,

margin: 15,

y: { reverse: true },

axis: null

})

a = [0, 2, 5, 5, 10, 11, 15, 17]

2 ** 4

n = a.length

u = a[n - 1]

lo = floor(log2(u / n))

hi = ub - lo

ub = ceil(log2(u)) // u bits

n * ub

nb = n + (u >>> lo) + n * lo // number of bits for the entire sequence

bpe = nb / n

floor = Math.floor

ceil = Math.ceil

log2 = Math.log2

- https://rossanoventurini.github.io/papers/CPM17.pdf

- https://en.wikipedia.org/wiki/Stars_and_bars_%28combinatorics%29

- https://vigna.di.unimi.it/ftp/papers/QuasiSuccinctIndices.pdf

- https://shonan.nii.ac.jp/archives/seminar/029/wp-content/uploads/sites/12/2013/07/Sebastiano_Shonan.pdf

Purpose-built for displays of data

Observable is your go-to platform for exploring data and creating expressive data visualizations. Use reactive JavaScript notebooks for prototyping and a collaborative canvas for visual data exploration and dashboard creation.

Learn more