The informality of InDel Genome Variation in SARS-CoV-2 / Alex Lucaci

Published

Edited

Apr 17, 2020

md`# The informality of InDel Genome Variation in SARS-CoV-2`

md`### Figures

#### (Figure 1)

Shows the distribution of whole genome lengths (in bp) across the SARS-CoV-2 sequences submitted to GISAID. This dataset comprises XXX whole genomes, which were downloaded from the GISAID repository on **04-14-2020**. Sequences were filtered to include only whole genomes (> 29,000 bp) sampled from human hosts, in order to examine intrahost evolution. The resulting dataset comprises **n=8,064** whole genome sequences.`
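md`The filtering described for Figure 1 can be sketched in plain JavaScript. This is a minimal illustration under assumptions, not the pipeline used to build the dataset: the record shape, the Host field, and the function name filterWholeGenomes are hypothetical. Only WG_Length and the 29,000 bp cutoff come from this notebook.`

```javascript
// Sketch only: `Host` and the record layout are assumed, not taken from a GISAID export.
const MIN_WG_LENGTH = 29000; // whole-genome cutoff stated in the Figure 1 text

// Keep records longer than the cutoff whose host is recorded as human.
function filterWholeGenomes(records) {
  return records.filter(
    (r) =>
      r.WG_Length > MIN_WG_LENGTH &&
      String(r.Host).trim().toLowerCase() === "human"
  );
}

// Toy records for illustration (not real data).
const exampleRecords = [
  { ID: "a", Host: "Human", WG_Length: 29903 },
  { ID: "b", Host: "Human", WG_Length: 28500 }, // too short
  { ID: "c", Host: "Bat", WG_Length: 29800 },   // non-human host
];
const keptRecords = filterWholeGenomes(exampleRecords);
console.log(keptRecords.map((r) => r.ID)); // [ 'a' ]
```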

md`#### (Figure 2)

Whole genome sequence lengths by collection date, colored by submitting country.`

vl.markPoint().data(flt_data2).width(width - 200).encode(

vl.x().fieldT('Date'),

vl.y().fieldQ('WG_Length')

.scale({domain: [29000, 30400]}),

vl.color().fieldN('Locale'),

vl.tooltip(['ID', 'Locale', 'Date', 'WG_Length']) // show the ID, Locale, Date, and WG_Length fields in a tooltip

).render()

md`#### (Figure 3)

What does whole genome length variability mean in the context of an individual gene?

To explore this, we examine a single gene within SARS-CoV-2, using the spike gene as an example.

The figure below shows spike gene size (in bp) plotted against the submission date to GISAID.`

// Whole genomes display wide length variability, but individual genes such as spike have relatively low sequence length variance. The open question is where the hot spots are.

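// Field names keep the leading space used by the other encodings in this cell
// (assumed to match the CSV header as parsed by d3.csvParse).
viewof layeredTrend = vl.data(spikeData).width(width - 200)
  .encode(
    vl.x().fieldT(" Date"),
    vl.y().fieldQ(" Nucleotides").scale({domain: [3780, 3840]}),
    vl.tooltip([' ID', ' Date', ' Nucleotides'])
  )
  .layer(
    // one point per sequence
    vl.markCircle(),
    // interquartile range band
    vl.markErrorband({ extent: "iqr", interpolate: "basis" }),
    // mean spike gene length
    vl.markLine().encode(vl.y().mean(" Nucleotides"))
  )
  .render()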
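md`The layered chart above summarizes spike gene length with a mean line and an interquartile band. As a rough cross-check, the same summary statistics can be computed by hand. The sketch below is an illustration under assumptions: the field names date and length, the helper names, and the toy values are all hypothetical, and the quantile rule (linear interpolation between order statistics) may not match the one Vega-Lite applies.`

```javascript
// Sketch only: `date` and `length` are placeholder field names; values are toy data.

// Quantile by linear interpolation between order statistics (one common definition).
function quantile(sorted, p) {
  if (sorted.length === 0) return NaN;
  const i = (sorted.length - 1) * p;
  const lo = Math.floor(i);
  const hi = Math.ceil(i);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (i - lo);
}

// Group rows by date and report count, mean, and quartiles of gene length.
function summarizeByDate(rows) {
  const groups = new Map();
  for (const { date, length } of rows) {
    if (!groups.has(date)) groups.set(date, []);
    groups.get(date).push(length);
  }
  return Array.from(groups, ([date, lengths]) => {
    const sorted = lengths.slice().sort((a, b) => a - b);
    const mean = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
    return {
      date,
      n: sorted.length,
      mean,
      q1: quantile(sorted, 0.25),
      q3: quantile(sorted, 0.75),
    };
  });
}

const spikeSummary = summarizeByDate([
  { date: "2020-03-01", length: 3822 },
  { date: "2020-03-01", length: 3822 },
  { date: "2020-03-01", length: 3819 },
  { date: "2020-03-02", length: 3822 },
]);
console.log(spikeSummary);
```

md`A narrow gap between q1 and q3 on most dates would be consistent with the low length variance noted for spike. Dates where the band widens are candidates for closer inspection.`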

md`#### (Figure 4)

Inferred Indel Sites from protein alignments.

Note: Site 265 was filtered out of this alignment. It appears to result from an insertion in a single sequence and warrants further investigation. It also forces all other sequences to carry a gap at that particular site in the alignment, which overstates the counts reported in the table below.

Table of Sites and Frequencies.`

//Can point to alignment errors

table(flt_spikeInDels, {
  rank: true,
  style: 'compact',
  columns: {
    Track: {
      formatter(val, i) {
        return html`<strong>${val[0]}</strong> by ${val[1]}`;
      }
    },
    Streams: {
      formatter: d3.format(',')
    }
  }
})
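md`The Site 265 note describes an alignment column in which one sequence carries a residue and every other sequence is forced to carry a gap. One rough way to flag such columns is sketched below. This is not the method used to build the table above: the function name singleSequenceInsertions and the toy alignment are hypothetical, and gaps are assumed to be written as a hyphen.`

```javascript
// Sketch only: sequences are toy aligned strings of equal length; '-' marks a gap.

// Return 1-based site numbers where exactly one sequence has a residue.
function singleSequenceInsertions(alignedSeqs) {
  if (alignedSeqs.length < 2) return [];
  const alignmentWidth = alignedSeqs[0].length;
  const sites = [];
  for (let col = 0; col < alignmentWidth; col++) {
    let nonGap = 0;
    for (const seq of alignedSeqs) {
      if (seq[col] !== "-") nonGap++;
    }
    // One residue, gaps everywhere else: an insertion private to one sequence.
    if (nonGap === 1) sites.push(col + 1);
  }
  return sites;
}

const toyAlignment = ["MFV-FLV", "MFV-FLV", "MFVKFLV", "MF--FLV"];
console.log(singleSequenceInsertions(toyAlignment)); // [ 4 ]
```

md`Sites flagged this way inflate per-site gap counts, because the gap is recorded once for each of the remaining sequences. Reviewing them before tabulating frequencies is one option.`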

// md`### Figure 5. InDel rate and Genome coverage. maybe it is an issue with long read seqs or low coverage? `

md`### Data `

data = Object.assign(d3.csvParse(await FileAttachment("test_wg_variation_daybyday_coloredbycountry.csv").text(), ({WG_Length}) => +WG_Length), {x: "WG Length (bp)", y: "Occurrences (#)"})

data2 = d3.csvParse(await FileAttachment("test_wg_variation_daybyday_coloredbycountry.csv").text(), d3.autoType)

//flt_data2 = data2.filter(d => d.Date > 2018)

flt_data2 = data2.filter(d => d.Date > new Date(2018, 0, 1))

//Nucleotide length variability

spikeData = d3.csvParse(await FileAttachment("gisaid_cov2020_sequences.fasta.S_nuc.fas@3.csv").text(), d3.autoType)

spikeInDels = d3.csvParse(

await FileAttachment("gisaid_cov2020_sequences.fasta.S.protein_all_withref.fas_table.csv").text()

)

//Filter SpikeIndels

flt_spikeInDels = spikeInDels.filter(d => +d.SiteDeletion !== 265) // csvParse without autoType yields strings, so coerce before comparing

md`### Plotting `

bins = d3.histogram()

.domain(x.domain())

.thresholds(x.ticks(40))

(data)

x = d3.scaleLinear()

.domain(d3.extent(data)).nice()

.range([margin.left, width - margin.right])

y = d3.scaleLinear()

.domain([0, d3.max(bins, d => d.length)]).nice()

.range([height - margin.bottom, margin.top])
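md`The bins cell above delegates the counting to d3.histogram. To make the idea concrete, here is a plain JavaScript sketch of fixed-width binning. It is an illustration under assumptions and does not reproduce d3's tick-based thresholds: the function name binFixedWidth, the domain, the bin count, and the toy lengths are all hypothetical.`

```javascript
// Sketch only: equal-width bins over [min, max]; toy values, not real genome lengths.
function binFixedWidth(values, min, max, binCount) {
  const step = (max - min) / binCount;
  const bins = Array.from({ length: binCount }, (_, i) => ({
    x0: min + i * step,       // lower bound (inclusive)
    x1: min + (i + 1) * step, // upper bound (exclusive, except for the last bin)
    count: 0,
  }));
  for (const v of values) {
    if (v < min || v > max) continue; // values outside the domain are ignored
    // Clamp so that v === max falls into the last bin.
    const i = Math.min(Math.floor((v - min) / step), binCount - 1);
    bins[i].count++;
  }
  return bins;
}

const toyLengths = [29850, 29870, 29903, 29903, 29782, 29900];
const toyBins = binFixedWidth(toyLengths, 29700, 30000, 3);
console.log(toyBins.map((b) => b.count)); // [ 1, 2, 3 ]
```

md`The y scale above then only needs the largest count across bins, which is what d3.max over the bin lengths provides in the notebook.`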

md`### Declarations `

margin = ({top: 25, right: 20, bottom: 35, left: 40})

height = 600

grid = g => g

.attr("stroke", "currentColor")

.attr("stroke-opacity", 0.1)

.call(g => g.append("g")

.selectAll("line")

.data(x.ticks())

.join("line")

.attr("x1", d => 0.5 + x(d))

.attr("x2", d => 0.5 + x(d))

.attr("y1", margin.top)

.attr("y2", height - margin.bottom))

.call(g => g.append("g")

.selectAll("line")

.data(y.ticks())

.join("line")

.attr("y1", d => 0.5 + y(d))

.attr("y2", d => 0.5 + y(d))

.attr("x1", margin.left)

.attr("x2", width - margin.right));

xAxis = g => g

.attr("transform", `translate(0,${height - margin.bottom})`)

.call(d3.axisBottom(x).ticks(width / 80 ).tickSizeOuter(0))

.call(g => g.append("text")

.attr("x", width - margin.right)

.attr("y", -4)

.attr("fill", "currentColor")

.attr("font-weight", "bold")

.attr("text-anchor", "end")

.text(data.x))

yAxis = g => g

.attr("transform", `translate(${margin.left},0)`)

.call(d3.axisLeft(y).ticks(height / 40))

.call(g => g.select(".domain").remove())

.call(g => g.select(".tick:last-of-type text").clone()

.attr("x", 4)

.attr("text-anchor", "start")

.attr("font-weight", "bold")

.text(data.y))

md`### Dependencies `

import {table} from "@tmcw/tables@513"

import {vl} from '@vega/vega-lite-api'

import {printTable} from '@uwdata/data-utilities'

d3 = require("d3@5")

embed = require("vega-embed@3")
