Published
Edited
Apr 17, 2020
md`# The informality of InDel Genome Variation in SARS-CoV-2`
md`### Figures

#### (Figure 1)

Shows the distribution of whole genome lengths (in bp) among the SARS-CoV-2 sequences submitted to GISAID. The dataset comprises XXX whole genomes pulled from the GISAID repository on **04-14-2020**. Sequences were additionally filtered to keep only whole genomes (> 29,000 bp) sampled from human hosts, in order to examine intrahost evolution. This dataset contains **n=8,064** whole genome sequences.`
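The filtering described for Figure 1 is not shown in this notebook, so the following is only a minimal sketch of how it might look in plain JavaScript. The `Host` field name, the helper name `filterWholeHumanGenomes`, and the sample rows are assumptions made for illustration; only the 29,000 bp threshold and the `WG_Length` column come from the notebook.

```javascript
// Threshold stated in the Figure 1 text: whole genomes are > 29,000 bp.
const MIN_WHOLE_GENOME_BP = 29000;

// Hypothetical helper: keep records that are long enough and come from a human host.
// The "Host" field name is an assumption; the real metadata may label it differently.
function filterWholeHumanGenomes(records, minLength = MIN_WHOLE_GENOME_BP) {
  return records.filter(
    d => d.WG_Length > minLength &&
         String(d.Host).trim().toLowerCase() === "human"
  );
}

// Made-up rows, for illustration only.
const sample = [
  { ID: "a", Host: "Human", WG_Length: 29903 },
  { ID: "b", Host: "Human", WG_Length: 28500 }, // shorter than the threshold
  { ID: "c", Host: "bat",   WG_Length: 29855 }  // non-human host
];

const kept = filterWholeHumanGenomes(sample);
console.log(kept.map(d => d.ID)); // [ 'a' ]
```

Comparing the host label case-insensitively is a defensive choice, since submitter-entered metadata is often inconsistently capitalized.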
md`#### (Figure 2)
Whole genome sequence length by collection date, colored by submitting country.
`
vl.markPoint().data(flt_data2).width(width - 200).encode(
  vl.x().fieldT('Date'),
  // Restrict the y-axis to 29,000-30,400 bp
  vl.y().fieldQ('WG_Length').scale({domain: [29000, 30400]}),
  vl.color().fieldN('Locale'),
  vl.tooltip(['ID', 'Locale', 'Date', 'WG_Length']) // show these fields in a tooltip
).render()
md`#### (Figure 3)
What does whole genome length variability mean in the context of an individual gene?

To explore this, we examine a single SARS-CoV-2 gene, the spike (S) gene, as an example.

The figure below shows spike gene length (in bp) plotted against the date of submission to GISAID.
`
//Additionally, whole genomes display wide variability but genes, such as spike, have relatively low sequence length variance. Question is what are the hot spots?
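viewof layeredTrend = vl.data(spikeData).width(width - 200)
  .encode(
    // Field names carry a leading space, as used throughout this cell
    vl.x().fieldT(" Date"),
    vl.y().fieldQ(" Nucleotides").scale({domain: [3780, 3840]}),
    vl.tooltip([' ID', ' Date', ' Nucleotides'])
  )
  .layer(
    // One point per sequence
    vl.markCircle(),
    // Shaded band spanning the interquartile range at each date
    vl.markErrorband({extent: "iqr", interpolate: "basis"}),
    // Line through the mean spike length at each date
    vl.markLine().encode(
      vl.y().mean(" Nucleotides") // same field as the point layer (note the leading space)
    )
  )
  .render()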
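As a rough illustration of what the mean line and the IQR band summarize, the per-date statistics can be sketched in plain JavaScript. This is not how Vega-Lite computes them internally; the quantile rule below (linear interpolation between order statistics) is an assumption, and the helper names and sample rows are made up. Dates are kept as strings here so they can serve as `Map` keys.

```javascript
// Quantile of an ascending-sorted array, using linear interpolation (an assumed rule).
function quantileSorted(sorted, p) {
  const h = (sorted.length - 1) * p;
  const lo = Math.floor(h);
  const hi = Math.min(lo + 1, sorted.length - 1);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (h - lo);
}

// Group rows by date and report the mean, first quartile, and third quartile per group.
function summarizeByDate(rows, dateKey, valueKey) {
  const groups = new Map();
  for (const row of rows) {
    const key = row[dateKey];
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(row[valueKey]);
  }
  return Array.from(groups, ([date, values]) => {
    const sorted = values.slice().sort((a, b) => a - b);
    const mean = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
    return {
      date,
      mean,
      q1: quantileSorted(sorted, 0.25),
      q3: quantileSorted(sorted, 0.75)
    };
  });
}

// Made-up spike gene lengths, for illustration only.
const spikeRows = [
  { Date: "2020-03-01", Nucleotides: 3813 },
  { Date: "2020-03-01", Nucleotides: 3819 },
  { Date: "2020-03-01", Nucleotides: 3822 },
  { Date: "2020-03-01", Nucleotides: 3822 },
  { Date: "2020-03-01", Nucleotides: 3834 },
  { Date: "2020-03-02", Nucleotides: 3822 },
  { Date: "2020-03-02", Nucleotides: 3822 }
];

const summary = summarizeByDate(spikeRows, "Date", "Nucleotides");
console.log(summary);
```

A narrow band between `q1` and `q3` alongside scattered outlying points is the pattern one would expect if most spike sequences share one length and only a few carry indels.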
md`#### (Figure 4)
Inferred Indel Sites from protein alignments.

Note: Site 265 was filtered out of this alignment. It appears to result from an insertion in a single sequence and warrants further investigation. That insertion forces every other sequence to take a gap at that particular site in the alignment, which overstates the counts reported in the table below.

Table of Sites and Frequencies.
`

//Can point to alignment errors
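To illustrate why a single-sequence insertion inflates the gap count at one site, here is a minimal sketch that counts gap characters per alignment column. The function name `gapCountsBySite` and the toy alignment are made up for illustration; the notebook's actual indel table was produced upstream and its method is not shown here.

```javascript
// Count "-" characters in each column of an alignment (sites are 1-based),
// returning only the sites that contain at least one gap.
function gapCountsBySite(alignedSeqs) {
  if (alignedSeqs.length === 0) return [];
  const length = alignedSeqs[0].length;
  const counts = new Array(length).fill(0);
  for (const seq of alignedSeqs) {
    if (seq.length !== length) {
      throw new Error("aligned sequences must all have the same length");
    }
    for (let i = 0; i < length; i++) {
      if (seq[i] === "-") counts[i] += 1;
    }
  }
  return counts
    .map((gaps, i) => ({ site: i + 1, gaps }))
    .filter(d => d.gaps > 0);
}

// Toy protein alignment: the second sequence has an extra residue (K) at site 4,
// so every other sequence is padded with a gap there. The third sequence also
// has a deletion at site 6.
const toyAlignment = [
  "MFV-FLVL",
  "MFVKFLVL",
  "MFV-F-VL",
  "MFV-FLVL"
];

const gapped = gapCountsBySite(toyAlignment);
console.log(gapped); // [ { site: 4, gaps: 3 }, { site: 6, gaps: 1 } ]
```

Site 4 reports three gaps even though only one sequence differs from the rest, which mirrors the Site 265 situation described in the note.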
table(flt_spikeInDels, {
  rank: true,
  style: 'compact',
  // Per-column formatters, keyed by column name
  columns: {
    Track: {
      formatter(val, i) {
        return html`<strong>${val[0]}</strong> by ${val[1]}`;
      }
    },
    Streams: {
      formatter: d3.format(',')
    }
  }
})
// md`### Figure 5. InDel rate and Genome coverage. maybe it is an issue with long read seqs or low coverage? `
md`### Data `
data = Object.assign(
  d3.csvParse(
    await FileAttachment("test_wg_variation_daybyday_coloredbycountry.csv").text(),
    ({WG_Length}) => +WG_Length // keep only the numeric genome length of each row
  ),
  {x: "WG Length (bp)", y: "Occurrences (#)"} // axis labels read by xAxis and yAxis
)
data2 = d3.csvParse(await FileAttachment("test_wg_variation_daybyday_coloredbycountry.csv").text(), d3.autoType)
//flt_data2 = data2.filter(d => d.Date > 2018)
flt_data2 = data2.filter(d => d.Date > new Date(2018, 0, 1))
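The commented-out filter compared each date with the bare number 2018. A short sketch of why that comparison does not filter anything: in a relational comparison JavaScript coerces a `Date` to its millisecond timestamp, so any date after early 1970 is "greater than 2018". The variable names and the example date below are made up for illustration.

```javascript
// A date that should be excluded by a "later than 2018" filter.
// Months are zero-based, so 5 means June.
const oldDate = new Date(2017, 5, 1);

// Comparing with the number 2018 compares the millisecond timestamp with 2018.
const passesNumberCutoff = oldDate > 2018;

// Comparing with a Date cutoff compares two timestamps, as intended.
const passesDateCutoff = oldDate > new Date(2018, 0, 1);

console.log(passesNumberCutoff); // true
console.log(passesDateCutoff);   // false
```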

//Nucleotide length variability
spikeData = d3.csvParse(await FileAttachment("gisaid_cov2020_sequences.fasta.S_nuc.fas@3.csv").text(), d3.autoType)
spikeInDels = d3.csvParse(
await FileAttachment("gisaid_cov2020_sequences.fasta.S.protein_all_withref.fas_table.csv").text()
)

//Filter SpikeIndels
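// spikeInDels is parsed without d3.autoType, so SiteDeletion is a string; coerce it before comparing
flt_spikeInDels = spikeInDels.filter(d => +d.SiteDeletion !== 265)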
md`### Plotting `
bins = d3.histogram()
    .domain(x.domain())
    .thresholds(x.ticks(40))
  (data)
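For readers unfamiliar with histogram binning, the idea behind the `bins` cell can be sketched without D3. This is a simplified stand-in, not `d3.histogram` itself: it assumes equal-width bins with a fixed `step`, whereas the notebook derives its thresholds from `x.ticks(40)`. The function name `binCounts` and the sample lengths are made up.

```javascript
// Count values into equal-width bins over [lo, hi].
// Each bin covers [x0, x1); the last bin also includes the upper bound hi.
function binCounts(values, lo, hi, step) {
  const n = Math.ceil((hi - lo) / step);
  const counts = new Array(n).fill(0);
  for (const v of values) {
    if (v < lo || v > hi) continue; // ignore values outside the domain
    const i = Math.min(n - 1, Math.floor((v - lo) / step));
    counts[i] += 1;
  }
  return counts.map((count, i) => ({
    x0: lo + i * step,
    x1: Math.min(hi, lo + (i + 1) * step),
    count
  }));
}

// Made-up whole genome lengths (bp), for illustration only.
const lengths = [29782, 29847, 29903, 29903, 29891, 29860];

const histogram = binCounts(lengths, 29750, 29950, 50);
console.log(histogram.map(b => b.count)); // [ 1, 1, 2, 2 ]
```

The bar heights in Figure 1 correspond to `count`, which is why the `y` scale in this notebook takes its maximum from the bin sizes.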
x = d3.scaleLinear()
    .domain(d3.extent(data)).nice()
    .range([margin.left, width - margin.right])
y = d3.scaleLinear()
    .domain([0, d3.max(bins, d => d.length)]).nice()
    .range([height - margin.bottom, margin.top])
md`### Declares `
margin = ({top: 25, right: 20, bottom: 35, left: 40})
height = 600
grid = g => g
    .attr("stroke", "currentColor")
    .attr("stroke-opacity", 0.1)
    // Vertical grid lines, one per x tick
    .call(g => g.append("g")
      .selectAll("line")
      .data(x.ticks())
      .join("line")
        .attr("x1", d => 0.5 + x(d))
        .attr("x2", d => 0.5 + x(d))
        .attr("y1", margin.top)
        .attr("y2", height - margin.bottom))
    // Horizontal grid lines, one per y tick
    .call(g => g.append("g")
      .selectAll("line")
      .data(y.ticks())
      .join("line")
        .attr("y1", d => 0.5 + y(d))
        .attr("y2", d => 0.5 + y(d))
        .attr("x1", margin.left)
        .attr("x2", width - margin.right));
xAxis = g => g
    .attr("transform", `translate(0,${height - margin.bottom})`)
    .call(d3.axisBottom(x).ticks(width / 80).tickSizeOuter(0))
    // Axis title, right-aligned just above the axis line
    .call(g => g.append("text")
        .attr("x", width - margin.right)
        .attr("y", -4)
        .attr("fill", "currentColor")
        .attr("font-weight", "bold")
        .attr("text-anchor", "end")
        .text(data.x))
yAxis = g => g
    .attr("transform", `translate(${margin.left},0)`)
    .call(d3.axisLeft(y).ticks(height / 40))
    .call(g => g.select(".domain").remove())
    // Axis title, cloned from the topmost tick label
    .call(g => g.select(".tick:last-of-type text").clone()
        .attr("x", 4)
        .attr("text-anchor", "start")
        .attr("font-weight", "bold")
        .text(data.y))
md`### Dependencies `
import {table} from "@tmcw/tables@513"
import {vl} from '@vega/vega-lite-api'
import {printTable} from '@uwdata/data-utilities'
d3 = require("d3@5")
embed = require("vega-embed@3")