Dummy data for visualization projects / Romain Vuillemot

Edited

Dec 31, 2021

# Dummy data for visualization projects

_Using dummy (or mock, or fake) data helps to bootstrap a visualization project and to focus on the design and visual parts rather than on data loading and parsing. It also helps to understand the shape of the expected data if that part is not settled yet. Here I report on some ideas I have used to generate such dummy datasets. Please keep in mind that there is no intention to harm or counterfeit anything, but rather to provide a technical heuristic._

Working on a data(-driven) visualization project requires a plethora of skills: user studies, design, JavaScript, D3, deployment, etc. And, most importantly, it requires data!

But data storage, collection, preparation and accessibility (API, remote server) are as many distractions that keep you from focusing on early design decisions. These decisions matter because they usually stick for the rest of your project.

From my experience building and mentoring a couple of visualization projects, you may need to create a dummy version of the dataset you work on (unless, of course, it is already available, in which case I suggest copy/pasting a couple of lines).
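When the data already exists, the copy/paste approach can be as simple as pasting a few lines into a string and parsing them in place. The sketch below uses a few lines of the Iris CSV; `parseSimpleCsv` is a hypothetical helper that assumes plain comma-separated values with no quoted fields. For anything less regular, `d3.csvParse` combined with `d3.autoType` is likely the safer choice.

```javascript
// A few lines pasted from the actual CSV file.
const irisSample = `sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa`;

// Naive parser (assumption: no quotes, no commas inside values).
// Cells that look like numbers are converted, others stay strings.
function parseSimpleCsv(text) {
  const [headerLine, ...lines] = text.trim().split("\n");
  const columns = headerLine.split(",");
  return lines.map((line) => {
    const cells = line.split(",");
    return Object.fromEntries(
      columns.map((name, i) => {
        const cell = cells[i];
        const asNumber = Number(cell);
        const isNumeric = cell !== "" && !Number.isNaN(asNumber);
        return [name, isNumeric ? asNumber : cell];
      })
    );
  });
}

const irisRows = parseSimpleCsv(irisSample);
// irisRows[0] is { sepal_length: 5.1, sepal_width: 3.5, petal_length: 1.4, petal_width: 0.2, species: "setosa" }
```

The result is an array of row objects, which is the shape most D3 examples expect.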

What are the benefits of faking data for an early project?

- It gets you started _right now_, without waiting for the actual data.

- It helps you shape the data schema, types and distribution.

- No privacy issues! You can share dummy data without any concern regarding data copyright or anonymity.

- It is useful for testing and benchmarking with larger datasets, different distributions, data types and, eventually, missing data.
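As an illustration of the last point, here is a minimal sketch of a generator whose size and share of missing values can be tuned. Everything in it is an assumption made for the example: the helper names `seededRandom` and `makeDummyRows`, the column names and the category labels. The seeded generator (a small routine commonly known as mulberry32) is only there so that the same seed gives the same dataset.

```javascript
// Small seeded pseudo-random generator returning numbers in [0, 1).
function seededRandom(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Build n rows; each value is replaced by null with probability missingRate.
function makeDummyRows(n, { missingRate = 0, seed = 1 } = {}) {
  const random = seededRandom(seed);
  const categories = ["a", "b", "c"]; // hypothetical category labels
  return Array.from({ length: n }, (_, id) => {
    const category = categories[Math.floor(random() * categories.length)];
    const value = random() * 100;
    const isMissing = random() < missingRate;
    return { id, category, value: isMissing ? null : value };
  });
}

const small = makeDummyRows(10); // quick design iteration
const large = makeDummyRows(100000, { missingRate: 0.1 }); // stress test with gaps
```

Switching between `small` and `large` is then a one-line change, which makes it easy to check how a design behaves at scale or with gaps in the data.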

Before getting started, please make sure you stick to the _[tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)_ principles:

<pre>

1. Each variable forms a column.

2. Each observation forms a row.

3. Each type of observational unit forms a table.

</pre>
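To make these principles concrete, the sketch below reshapes a wide table (one column per month) into a tidy one (one row per observation). The helper `toTidy` and its parameter names are hypothetical; the values are borrowed from the stocks sample.

```javascript
// Wide table: the "date" variable is spread over several columns, so it is not tidy.
const wide = [
  { symbol: "MSFT", "Jan 2000": 39.81, "Feb 2000": 36.35, "Mar 2000": 43.22 },
];

// Turn every non-id column into its own row with three fields:
// the id, the former column name, and the cell value.
function toTidy(rows, idKey, keyName, valueName) {
  return rows.flatMap((row) =>
    Object.entries(row)
      .filter(([column]) => column !== idKey)
      .map(([column, value]) => ({
        [idKey]: row[idKey],
        [keyName]: column,
        [valueName]: value,
      }))
  );
}

const tidy = toTidy(wide, "symbol", "date", "price");
// tidy[0] is { symbol: "MSFT", date: "Jan 2000", price: 39.81 }
```

Each row of `tidy` is now one observation and each key one variable, which is the shape that filtering, grouping and most D3 scales are easiest to apply to.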

iris = d3.csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")

Array.from({length: 10}).map((d, i) => i)

Array.from({length: 10}).map(d3.randomNormal(0.4, 0.1))

d3 = require("d3")
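The cells above return bare arrays of numbers. A possible next step is to wrap the generated values into row objects, so that the result has the same shape as a parsed CSV. The sketch below is dependency-free: `randomNormal` is a hand-written stand-in for `d3.randomNormal(mu, sigma)` based on the Box-Muller transform, and the `id` and `value` field names are assumptions.

```javascript
// Stand-in for d3.randomNormal(mu, sigma), using the Box-Muller transform.
// The uniform source can be swapped (for example with a seeded generator).
function randomNormal(mu = 0, sigma = 1, random = Math.random) {
  return function () {
    const u1 = 1 - random(); // in (0, 1], avoids Math.log(0)
    const u2 = random();
    const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
    return mu + sigma * z;
  };
}

// Ten rows with an index and a normally distributed value.
const nextValue = randomNormal(0.4, 0.1);
const dummyRows = Array.from({ length: 10 }, (_, i) => ({
  id: i,
  value: nextValue(),
}));
```

Passing the mapping function as the second argument of `Array.from` avoids the intermediate array of `undefined` values that `Array.from({length: 10}).map(...)` creates.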

**Sample datasets**

_Such datasets are very common in the data science community._

- [Iris Flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) ([csv](https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv))

<pre>

sepal_length,sepal_width,petal_length,petal_width,species

5.1,3.5,1.4,0.2,setosa

4.9,3,1.4,0.2,setosa

4.7,3.2,1.3,0.2,setosa

4.6,3.1,1.5,0.2,setosa

5,3.6,1.4,0.2,setosa

</pre>

- [Cars](https://ai.stanford.edu/~jkrause/cars/car_dataset.html.) ([csv](https://gist.githubusercontent.com/noamross/e5d3e859aa0c794be10b/raw/b999fb4425b54c63cab088c0ce2c0d6ce961a563/cars.csv))

<pre>

"","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"

"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4

"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4

"Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1

"Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1

"Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2

"Valiant",18.1,6,225,105,2.76,3.46,20.22,1,0,3,1

"Duster 360",14.3,8,360,245,3.21,3.57,15.84,0,0,3,4

</pre>

- Individuals, generated with [Mockaroo](https://www.mockaroo.com/)

<pre>

id,first_name,last_name,email,gender,ip_address

1,Gale,Bernardini,gbernardini0@flickr.com,Female,236.165.167.229

2,Ravid,Magnar,rmagnar1@indiegogo.com,Male,127.20.137.234

3,Courtney,Simcox,csimcox2@hp.com,Male,25.250.201.67

4,Farlay,Killeley,fkilleley3@behance.net,Male,211.236.251.254

5,Ambros,Godier,agodier4@i2i.jp,Male,198.226.197.211

6,Melicent,Ahren,mahren5@thetimes.co.uk,Female,128.171.235.98

7,Freedman,Paullin,fpaullin6@posterous.com,Male,10.133.34.122

8,Jabez,Jonsson,jjonsson7@comsenz.com,Male,173.77.112.108

</pre>

- [Stocks data](https://raw.githubusercontent.com/LyonDataViz/MOS5.5-Dataviz/master/data/stocks.csv)

<pre>

symbol,date,price

MSFT,Jan 2000,39.81

MSFT,Feb 2000,36.35

MSFT,Mar 2000,43.22

MSFT,Apr 2000,28.37

MSFT,May 2000,25.45

MSFT,Jun 2000,32.54

MSFT,Jul 2000,28.4

MSFT,Aug 2000,28.4

MSFT,Sep 2000,24.53

</pre>
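A dataset with the same `symbol,date,price` shape as the stocks sample can also be generated rather than downloaded. The sketch below uses a simple random walk. The function name `makeStockSeries`, its options and the `"DUMMY"` symbol are all hypothetical, and the walk is not meant to be a realistic price model.

```javascript
const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Generate one row per month, starting in January of startYear.
// Each month the price moves by a random factor within +/- volatility.
function makeStockSeries(symbol, months, {
  startYear = 2000,
  startPrice = 40,
  volatility = 0.1,
  random = Math.random,
} = {}) {
  const rows = [];
  let price = startPrice;
  for (let i = 0; i < months; i++) {
    const date = `${MONTHS[i % 12]} ${startYear + Math.floor(i / 12)}`;
    rows.push({ symbol, date, price: Math.round(price * 100) / 100 });
    const step = (random() * 2 - 1) * volatility;
    price = Math.max(0.01, price * (1 + step)); // keep the price positive
  }
  return rows;
}

const dummyStocks = makeStockSeries("DUMMY", 9);
```

The dates are kept as strings such as `"Jan 2000"` to match the sample file, so the same time parsing step would apply to both the dummy and the actual data.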

**What not to do**

- Stick with the same dummy dataset for too long: the visualization may become too specific to it.

- Forget to explore the actual dataset to find interesting patterns, properties, semantics, etc.

Note that a generated dataset can then be aggregated, filtered, reduced, etc. like a regular dataset.
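As a small illustration of aggregating and filtering, the sketch below works on a few rows taken from the stocks sample, using only plain array methods. `meanBy` is a hypothetical helper; with D3 loaded, `d3.rollup(stocks, v => d3.mean(v, d => d.price), d => d.symbol)` should give an equivalent result.

```javascript
const stocks = [
  { symbol: "MSFT", date: "Jan 2000", price: 39.81 },
  { symbol: "MSFT", date: "Feb 2000", price: 36.35 },
  { symbol: "MSFT", date: "Mar 2000", price: 43.22 },
  { symbol: "MSFT", date: "Apr 2000", price: 28.37 },
];

// Filter: keep only the months where the price is above 30.
const above30 = stocks.filter((d) => d.price > 30);

// Aggregate: mean of a numeric field for each distinct value of a key.
function meanBy(rows, key, value) {
  const sums = new Map();
  for (const row of rows) {
    const acc = sums.get(row[key]) || { total: 0, count: 0 };
    acc.total += row[value];
    acc.count += 1;
    sums.set(row[key], acc);
  }
  return new Map(
    [...sums].map(([group, { total, count }]) => [group, total / count])
  );
}

const meanPrice = meanBy(stocks, "symbol", "price");
```

The same calls work unchanged on a generated dataset, as long as it follows the same row shape.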

**Other**

- https://www.kelp.nyc/

- https://tuftsvalt.github.io/snowcat/

- https://amnesia.openaire.eu/

- https://github.com/jiananlu/faked_csv

- API testing https://reqres.in/

**TODO**

- Add Les miserables for node/link example

- Nested datasets

- Aggregated datasets

- Temporal data

- De-aggregate data

- Int32Array.of(1, 2, 3, 4)

- Float64Array.of(5, 6, 7, 8)

- d3.group

- d3.rollup

- Time parse/format
