Published
Edited
Apr 17, 2020
md`# The informality of InDel Genome Variation in SARS-CoV-2`
md`### Figures

#### (Figure 1)

Shows the distribution of whole genome lengths (in bp) observed in SARS-CoV-2 sequences submitted to GISAID. This dataset comprises XXX whole genomes, which were pulled from the GISAID repository on **04-14-2020**. Sequences were then filtered to include only whole genomes (> 29,000 bp) from human hosts, in order to examine intrahost evolution. The filtered dataset comprises **n=8,064** whole genome sequences.`
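The filtering described for Figure 1 could be sketched in plain JavaScript as follows. The helper name `filterWholeGenomes` and the field names `length` and `host` are hypothetical; the notebook does not show which metadata columns were used for this step.

```javascript
// Hypothetical sketch: keep whole genomes (> 29,000 bp) sampled from human hosts.
// `length` and `host` are assumed field names, not columns confirmed by the source.
function filterWholeGenomes(records, minLength = 29000) {
  return records.filter(r => r.length > minLength && r.host === "Human");
}

// Toy records for illustration only (not GISAID data).
const sample = [
  {id: "a", length: 29903, host: "Human"},
  {id: "b", length: 28500, host: "Human"},
  {id: "c", length: 29870, host: "Bat"}
];
const kept = filterWholeGenomes(sample);
```

Only record `a` passes both conditions here; `b` is too short and `c` is not from a human host.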
md`#### (Figure 2)
Whole genome sequence length by collection date, colored by submitting country.
`
vl.markPoint().data(flt_data2).width(width - 200).encode(
  vl.x().fieldT('Date'),
  vl.y().fieldQ('WG_Length').scale({domain: [29000, 30400]}),
  vl.color().fieldN('Locale'),
  vl.tooltip(['ID', 'Locale', 'Date', 'WG_Length']) // show ID, locale, date, and genome length in a tooltip
).render()
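To read the same breakdown as numbers rather than points, the rows can be summarized per submitting locale. This is a minimal sketch: the helper name `lengthRangeByLocale` is hypothetical, and it assumes each row carries the `Locale` and `WG_Length` fields that the chart above encodes.

```javascript
// Summarize the genome length range per submitting locale.
// Assumes rows with `Locale` (string) and `WG_Length` (number) fields.
function lengthRangeByLocale(rows) {
  const summary = new Map();
  for (const {Locale, WG_Length} of rows) {
    const s = summary.get(Locale);
    if (s === undefined) {
      // First row seen for this locale
      summary.set(Locale, {count: 1, min: WG_Length, max: WG_Length});
    } else {
      s.count += 1;
      s.min = Math.min(s.min, WG_Length);
      s.max = Math.max(s.max, WG_Length);
    }
  }
  return summary;
}

// Toy rows for illustration only (not GISAID data).
const toyRows = [
  {Locale: "X", WG_Length: 29903},
  {Locale: "X", WG_Length: 29782},
  {Locale: "Y", WG_Length: 29867}
];
const ranges = lengthRangeByLocale(toyRows);
```

In the notebook, the same function could be called as `lengthRangeByLocale(flt_data2)`.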
md`#### (Figure 3)
What does whole genome length variability mean in the context of an individual gene?

To examine this, we take a single SARS-CoV-2 gene, the one encoding the spike protein, as an example.

The figure below shows spike gene size (in bp) plotted against the date of submission to GISAID.
`
// Additionally, whole genomes display wide length variability, but individual genes such as spike have relatively low sequence length variance. The question is: where are the hot spots?
viewof layeredTrend = vl.data(spikeData).width(width - 200)
  .encode(
    vl.x().fieldT(" Date"),
    vl.y().fieldQ(" Nucleotides").scale({domain: [3780, 3840]}),
    vl.tooltip([' ID', ' Date', ' Nucleotides'])
  )
  .layer(
    vl.markCircle(),
    vl.markErrorband({extent: "iqr", interpolate: "basis"}),
    vl.markLine().encode(
      // use the same field name (with its leading space) as the shared y encoding
      vl.y().mean(" Nucleotides")
    )
  )
  .render()
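The error band in the chart above is drawn with `extent: "iqr"`, that is, between the first and third quartiles of the values. A minimal plain-JavaScript sketch of that summary is shown below. The helper names `quantile` and `iqrSummary` are hypothetical, the interpolation rule is one common choice, and Vega-Lite's exact quantile method may differ.

```javascript
// Quantile with linear interpolation between neighbouring sorted values.
function quantile(values, p) {
  // Copy, drop non-finite entries, and sort ascending
  const sorted = values.filter(v => Number.isFinite(v)).sort((a, b) => a - b);
  if (sorted.length === 0) return undefined;
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// First quartile, median, and third quartile of a list of gene lengths.
function iqrSummary(values) {
  return {
    q1: quantile(values, 0.25),
    median: quantile(values, 0.5),
    q3: quantile(values, 0.75)
  };
}

// Toy lengths for illustration only (not GISAID data).
const toyLengths = [3822, 3819, 3822, 3822, 3813];
const spikeSummary = iqrSummary(toyLengths);
```

A narrow gap between `q1` and `q3` is what the low length variance of a single gene looks like numerically.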
md`#### (Figure 4)
Inferred indel sites from protein alignments.

Note: Site 265 was filtered out of this alignment. It appears to result from an insertion in a single sequence and warrants further investigation. That insertion forces every other sequence to incur a gap at that particular site in the alignment, which overstates the counts reported in the table below.

Table of sites and frequencies.
`

//Can point to alignment errors
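The idea of inferring indel sites from an alignment can be sketched as a per-column gap count. This is a minimal illustration, not the pipeline that produced the table: the function name `gapSites`, the `-` gap character, and the toy sequences are all assumptions.

```javascript
// Count gap characters per alignment column and report the columns that contain gaps.
// Assumes all sequences are already aligned to the same length and gaps are written as "-".
function gapSites(alignedSeqs) {
  if (alignedSeqs.length === 0) return [];
  const width = alignedSeqs[0].length;
  const counts = new Array(width).fill(0);
  for (const seq of alignedSeqs) {
    if (seq.length !== width) {
      throw new Error("sequences must be aligned to equal length");
    }
    for (let i = 0; i < width; i++) {
      if (seq[i] === "-") counts[i] += 1;
    }
  }
  return counts
    .map((count, i) => ({
      site: i + 1, // 1-based alignment column
      count, // number of sequences with a gap here
      frequency: count / alignedSeqs.length
    }))
    .filter(d => d.count > 0);
}

// Toy alignment for illustration only.
const toyAlignment = ["MFV-L", "MFVFL", "MF--L"];
const sites = gapSites(toyAlignment);
```

A column where nearly every sequence carries a gap, as described for site 265, shows up with a frequency close to 1 and points to an insertion in a few sequences rather than a widespread deletion.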
table(flt_spikeInDels, {
  rank: true,
  style: 'compact',
  columns: {
    Track: {
      formatter(val, i) {
        return html`<strong>${val[0]}</strong> by ${val[1]}`;
      }
    },
    Streams: {
      formatter: d3.format(',')
    }
  }
})
// md`### Figure 5. InDel rate and Genome coverage. maybe it is an issue with long read seqs or low coverage? `
md`### Data `
data = Object.assign(
  d3.csvParse(
    await FileAttachment("test_wg_variation_daybyday_coloredbycountry.csv").text(),
    ({WG_Length}) => +WG_Length // keep only the numeric genome length of each row
  ),
  {x: "WG Length (bp)", y: "Occurrences (#)"} // axis labels read by xAxis and yAxis
)
data2 = d3.csvParse(await FileAttachment("test_wg_variation_daybyday_coloredbycountry.csv").text(), d3.autoType)
//flt_data2 = data2.filter(d => d.Date > 2018)
flt_data2 = data2.filter(d => d.Date > new Date(2018, 0, 1))

//Nucleotide length variability
spikeData = d3.csvParse(await FileAttachment("gisaid_cov2020_sequences.fasta.S_nuc.fas@3.csv").text(), d3.autoType)
spikeInDels = d3.csvParse(
await FileAttachment("gisaid_cov2020_sequences.fasta.S.protein_all_withref.fas_table.csv").text()
)

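// Filter spikeInDels: drop site 265 (see the note under Figure 4)
// csvParse without d3.autoType yields strings, so coerce before comparing
flt_spikeInDels = spikeInDels.filter(d => +d.SiteDeletion !== 265)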
md`### Plotting `
bins = d3.histogram()
    .domain(x.domain())
    .thresholds(x.ticks(40))
  (data)
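For readers unfamiliar with what a binning step produces, here is a simplified plain-JavaScript sketch. It is not `d3.histogram`: it uses equal-width bins instead of tick-based thresholds, and it returns plain objects with a `count` field instead of arrays of values. The helper name `fixedWidthBins` and the toy lengths are hypothetical.

```javascript
// Count values into `binCount` equal-width bins covering [min, max].
// Each bin is [x0, x1); a value equal to `max` is placed in the last bin.
function fixedWidthBins(values, min, max, binCount) {
  const width = (max - min) / binCount;
  const bins = Array.from({length: binCount}, (_, i) => ({
    x0: min + i * width,
    x1: min + (i + 1) * width,
    count: 0
  }));
  for (const v of values) {
    if (v < min || v > max) continue; // ignore values outside the domain
    const i = Math.min(binCount - 1, Math.floor((v - min) / width));
    bins[i].count += 1;
  }
  return bins;
}

// Toy genome lengths for illustration only (not GISAID data).
const toyGenomeLengths = [29000, 29050, 29903, 30400, 31000];
const toyBins = fixedWidthBins(toyGenomeLengths, 29000, 30400, 14);
```

The tallest bin count is what sets the upper end of the `y` scale in a histogram of this kind.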
x = d3.scaleLinear()
    .domain(d3.extent(data)).nice()
    .range([margin.left, width - margin.right])
y = d3.scaleLinear()
    .domain([0, d3.max(bins, d => d.length)]).nice()
    .range([height - margin.bottom, margin.top])
md`### Declares `
margin = ({top: 25, right: 20, bottom: 35, left: 40})
height = 600
grid = g => g
    .attr("stroke", "currentColor")
    .attr("stroke-opacity", 0.1)
    .call(g => g.append("g")
      .selectAll("line")
      .data(x.ticks())
      .join("line")
        .attr("x1", d => 0.5 + x(d))
        .attr("x2", d => 0.5 + x(d))
        .attr("y1", margin.top)
        .attr("y2", height - margin.bottom))
    .call(g => g.append("g")
      .selectAll("line")
      .data(y.ticks())
      .join("line")
        .attr("y1", d => 0.5 + y(d))
        .attr("y2", d => 0.5 + y(d))
        .attr("x1", margin.left)
        .attr("x2", width - margin.right));
xAxis = g => g
    .attr("transform", `translate(0,${height - margin.bottom})`)
    .call(d3.axisBottom(x).ticks(width / 80).tickSizeOuter(0))
    .call(g => g.append("text")
        .attr("x", width - margin.right)
        .attr("y", -4)
        .attr("fill", "currentColor")
        .attr("font-weight", "bold")
        .attr("text-anchor", "end")
        .text(data.x))
yAxis = g => g
    .attr("transform", `translate(${margin.left},0)`)
    .call(d3.axisLeft(y).ticks(height / 40))
    .call(g => g.select(".domain").remove())
    .call(g => g.select(".tick:last-of-type text").clone()
        .attr("x", 4)
        .attr("text-anchor", "start")
        .attr("font-weight", "bold")
        .text(data.y))
md`### Dependencies `
import {table} from "@tmcw/tables@513"
import {vl} from '@vega/vega-lite-api'
import {printTable} from '@uwdata/data-utilities'
d3 = require("d3@5")
embed = require("vega-embed@3")
