Most companies store their data in databases that house a large number of related tables. To answer business questions using that data, data analysts combine records from different tables using joins to consolidate variables needed for their analyses. That makes joins one of the most essential and frequently used operations in all of data wrangling. 

Given joins’ prevalence in data work, we prioritized joins as one of the first operations needed in Observable Canvases — our browser-based collaborative whiteboards for data analysis and visualization. Building out the UI for such a common data transformation should be straightforward…right? 

Not quite. It turns out that implementing UI-based joins is challenging. 

To learn why joins are surprisingly hard to get right and how we handle them in canvases, I spoke with Observable Software Engineer Tony Sullivan. Read on to learn why the engineering work needed to deliver fast, friendly joins in canvases was less than straightforward, and how we made it happen! 

Avoiding SELECT *

Observable Canvases work by sending optimized queries to a connected database. Those queries are constructed based on user selections and changes to the canvas. So, you might imagine a UI where a user simply selects two tables (items and prices), indicates a column to join on, clicks a “Join” button, then the following query is run to return all fields from both tables using a full join: 

SELECT *
FROM items i
FULL JOIN prices p ON i.id = p.item_id

That may seem simple enough. But unfortunately, this query doesn’t work well as a general approach to UI-based joins for several reasons. 

First, for company databases where analysts have access to some but not all fields (also called column-level security, or column-level access control), using  SELECT * would return an error because the query requests columns that a user does not have permission to access. 

Second, problems can emerge when joining tables that contain identically-named columns. This can result in an error, or in some cases, with auto-generated column names (e.g. state and state_1 for a variable named “state” returned from a left and right table in the join, respectively). 

For these reasons, we avoid using SELECT * in all canvas queries. We take the harder (but more robust) approach of constructing queries that explicitly point to columns in both tables by name. 

In canvases, when a user selects two tables to join, we need to know two things: (1) which column names to include in the SELECT statement, and (2) in the case of duplicate column names, how to alias them. In other (SQL) words, we want to build a query that includes a SELECT statement looking more like this: 

SELECT 
  i.id,
  i.name, 
  i.category as category_left,
  p.price, 
  p.discount,
  p.category as category_right
FROM items i
JOIN prices p ON i.id = p.item_id
...

Aliasing is built into the query to de-duplicate matching column names (here, to return category_left and category_right, since a column named category exists in both tables). 

That query doesn’t look too complex. So why does building it from a user’s UI selections pose a challenge in canvases? 

The answer, in a nutshell, is because joins might be done at any point in the data analysis process. By the time a user chooses to join tables in a canvas, they may have wrangled the data — selecting or deselecting columns, renaming things, deriving new fields, etc. — in ways that complicate the process of building queries that run successfully against the original database. 

Read on to learn how we construct queries to join tables amidst other data transformations and on a changing, interactive canvas. 

Handling changes in a reactive canvas

Canvases are reactive. Changes you make to data anywhere on the canvas — like filtering rows, selecting columns, grouping values by category, etc. — are automatically propagated through to downstream nodes so your work stays in sync. 

When building canvas nodes, we considered how to keep reactivity efficient. “To keep things fast,” Tony shares, “we want the canvas to be synchronous. When something changes upstream in a canvas, we want every node to know what columns to request from the database so the query isn’t delayed. That way, you don't see canvas updates walk their way through downstream nodes one at a time.”

For that to happen, each node needs to update automatically when any node upstream is changed. Each node receives its parent node’s query, wrapped in a common table expression, or CTE. As canvases grow in size and complexity, the queries constructed from input CTEs and the current node operations become quite involved — even when a node has a single input! 

Joins compound this further because they have to watch and respond to two different inputs. Here's Observable Software Engineer Tony Sullivan on why this poses a challenge: 

With joins, we have two inputs — the two different tables that are coming into the join node. That node has to look at all of the column names and column types that were computed up the chain, through every single operation in the canvas. Then inside of join, we have to specify which tables are left and right, join them based on a field you picked, then figure out the full list of columns and de-duplicate before sending the query to the database.

Tony Sullivan

One caveat here: if upstream changes invalidate any of the join inputs (the two connected tables, and the variable they’re joined on), we want to show users a helpful error message before the misconfigured query is ever sent to the database.

For example, in the video below, the CATEGORY column used in the Join node downstream is renamed to TYPE. Instead of assuming that the name change should automatically carry through to the Join node, a warning icon appears and the join query isn’t attempted, since CATEGORY no longer exists in the left table. After correcting the join variable to the new variable name (TYPE), the query is successfully re-run.

Tony describes our thinking behind this approach: 

If upstream you change the name of a column used in the join, the operation should recognize that the column is no longer there and show you the list of the new columns (including the one you renamed). It shouldn’t try to rerun the query because it knows it's not valid. In general, we are trying not to magically fix things for users. We just try to make it really clear if the join (or our other operations) experience an issue that requires changes from the user.

Tony Sullivan

Extra design considerations for the Join node

Joins are the only data wrangling operation in canvases that expect two inputs, and that came with some unique design challenges. For example, we needed to decide if the precise connection port implied which input table would be the left and right table in the join. In other words, should connecting an upstream table to the top port of the Join node make it the left table?  

We landed on adding two ambiguous input ports to the Join node. A user connects two tables to the node without needing to specify at that point which is left or right. Within the join node a user can more easily see controls showing which is which, with a button to swap the two.

The result is a Join node that doesn’t require precise connector positioning or forethought about table order before connecting the two input tables, and that still makes it clear how tables are understood by the join:

Go behind the scenes with Observable Canvases

For one of the most common methods in all of data wrangling, UI-based joins are surprisingly hard to implement well. By constructing explicit queries that avoid issues with SELECT *, tailoring joins to react to upstream changes, and designing for simplicity and clarity, we’re making joins straightforward and robust in canvases. 

We continue to share behind the scenes views into how we’re designing and building Observable Canvases as we add new features, optimize user experience, and improve performance.

Check out recent pieces in the series: 

Canvases enable fast, flexible AI-driven data analysis, and put state-of-the art interactive data visualizations and dashboards at your fingertips. Learn more and request a demo today.