SQL Practice - Baby Names / Alex Wein

Alex Wein

Workspace

Published

Edited

Apr 30, 2020

import {query} from '@visnup/baby-names-by-birth-year'

import {printTable} from '@uwdata/data-utilities'

table = "\`bigquery-public-data.usa_names.usa_1910_current\`"

q = `select count(*) as cnt from ${table}`

result = query(q) // do I really want reactive execution for queries?

query(`select * from ${table} limit 1`)

printTable(result)

md`## What names saw the biggest one year drop-off?

1. Aggregate by state level to countrywide (missed this step at first)

1. use a window to get the lag of number,

1. calc absolute and % drop

1. get top 10 by gender

https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts

OVER (PARTITION BY division ORDER BY finish_time ASC)

Possible iterations:

1. names with biggest 5 year drop.

1. names with biggest 1 year rise

Other good query questions:

1. What name was most popular with both genders in the most recent year of data? (define "most popular with both genders meaningfully")

1. Is the the percentage of girls names ending in vowels increasing or decreasing over the last 30 years?

q1 = `

WITH group_by_gender as (

select gender, year, name, sum(number) as number

from ${table}

group by gender, year, name

)

---

select * from

(select *

, lag(number) over w as lag_number

, number-lag(number) over w as net_change_yoy

, -1+number/lag(number) over w as yoy_pct_change

from group_by_gender

window w as (partition by gender,name order by year)) q

where net_change_yoy is not null

-- and year >= 1980 and year <=1989 -- Let's restrict the analysis to my fellow 80's babies

and number >= 1000

order by yoy_pct_change

limit 50

printTable(res1)

res1 = query(q1)

q2 = `

WITH q_filter as (

select

, REGEXP_CONTAINS(name, r'[aeiouy]$') as ends_with_vowel

FROM ${table}

where gender = 'F' and year >= extract(year from current_date()) - 50

)

select year, sum(cast(ends_with_vowel as int64) * number) / sum(number) as pct_ends_with_e

from q_filter

group by year

order by year desc

printTable(res2)

res2 = query(q2)

import {vl} from '@vega/vega-lite-api'

vl.markLine()

.data(res2)

.encode(vl.x().fieldQ('year'),

vl.y({scale:{domain:[.5, 1]}}).fieldQ('pct_ends_with_e'))

.width(width)

.render()

md`# Names that are popular with both genders

I think a useful metric is the max of the ranks by gender. So the we'll find:

1. The most popular names for boys that are more popular for girls

1. The most popular names for girls that are more popular for boys

Not sure converting to ranks will change things much, but it'll facilitate comparison across decades and other groupings where the number of people born is different in each grouping.

---

I ended up using the sum of the boy and girls ranks. This seems to get results that make sense, but it seems sketchy.

q3 = `

WITH

q as (

select floor(year/10)*10 as decade, name, gender, sum(number) as number

from ${table}

group by decade, name, gender

w_rank as (

select *

, rank() over (partition by cast(decade as string), gender order by number desc) as rank

from q

boys as (select * from w_rank where gender = 'M'),

girls as (select * from w_rank where gender = 'F')

select

boys.decade

, boys.name as name

, boys.rank as rank_boy

, girls.rank as rank_girl

, girls.rank+boys.rank as sum_rank

from boys

inner join girls

on boys.name = girls.name and boys.decade = girls.decade

order by sum_rank

limit 100

printTable(res3)

res3 = query(q3)

query("EXPLAIN " + q3)

One platform to build and deploy the best data apps

Experiment and prototype by building visualizations in live JavaScript notebooks. Collaborate with your team and decide which concepts to build out.

Use Observable Framework to build data apps locally. Use data loaders to build in any language or library, including Python, SQL, and R.

Seamlessly deploy to Observable. Test before you ship, use automatic deploy-on-commit, and ensure your projects are always up-to-date.

Learn more