COVID-19 PubSeq: Public SARS-CoV-2 Sequence Resource

public sequences ready for download!

May 2021 update: we are now at 86,377 sequences with normalized metadata on AWS OpenData!

All sequences project

All sequences (FASTA) relabeled and deduplicated

Metadata (RDF) for all sequences

SPARQL endpoint - Sample query for accessions

Download

1 Workflow runs

The latest runs can be viewed here. Clicking on a run shows the workflows that ran under Processes. Output, including intermediate output, is listed under Data collections. All current data is listed here. Note that it takes time for a run to complete and appear in the listing.

2 FASTA files

The public sequence resource provides all uploaded sequences as FASTA files. They can be referred to from metadata individually. We also provide a single file FASTA download.
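Once downloaded, the single-file FASTA can be read with a few lines of code. The following is a minimal sketch using only the Python standard library; the function name read_fasta and the two example records are made up for illustration and do not come from the resource.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input, for illustration only.
example = [">seq1 sample", "ACGT", "TTGA", ">seq2", "GGCC"]
records = list(read_fasta(example))
```

Because read_fasta accepts any iterable of lines, an open file handle can be passed directly, so a large download is processed one record at a time rather than loaded into memory.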

3 Metadata

Metadata can be downloaded in Turtle RDF format as mergedmetadat.ttl, which can be loaded into any RDF triple store. We also provide a Virtuoso SPARQL endpoint, which can be queried at http://sparql.genenetwork.org/sparql/. Query examples can be found in the DOCS.
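As a rough illustration of querying the endpoint programmatically, the sketch below builds a standard SPARQL-protocol GET request with the Python standard library. The query is deliberately schema-agnostic (it lists five arbitrary triples) because the metadata vocabulary is not described here; the helper names build_request and run_query are hypothetical, and it is assumed that the endpoint honours the standard JSON results media type.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://sparql.genenetwork.org/sparql/"

# Schema-agnostic query: makes no assumption about the metadata vocabulary.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL-protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint: str, query: str, timeout: float = 30.0) -> list:
    """Send the query and return the list of result bindings."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = json.load(response)
    return data["results"]["bindings"]


# Building the request does not touch the network; run_query does.
request = build_request(ENDPOINT, QUERY)
```

Calling run_query(ENDPOINT, QUERY) would send the request and return one dictionary per result row. For real use, replace QUERY with one of the examples from the DOCS.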

The Swiss Institute of Bioinformatics has included this data in https://covid-19-sparql.expasy.org/ and made it part of UniProt.

An RDF file that includes the sequences themselves in a variation graph can be downloaded from the Pangenome RDF format section below.

4 Pangenome

Pangenome data is made available in multiple guises. Variation graphs (VG) provide a succinct encoding of the sequences of many genomes.

4.1 Pangenome GFA format

GFA (Graphical Fragment Assembly) is a standard format for sequence graphs and is consumed by tools such as vgtools.
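To give a feel for the format, here is a minimal sketch of reading segments and links from a GFA file. It assumes GFA version 1, where each line is tab-separated and starts with a record type (S for segment, L for link); which GFA version the download uses is not stated here, and the function name parse_gfa and the tiny example graph are illustrative only.

```python
from typing import Dict, Iterable, List, Tuple

Link = Tuple[str, str, str, str]  # (from, from_orient, to, to_orient)


def parse_gfa(lines: Iterable[str]) -> Tuple[Dict[str, str], List[Link]]:
    """Collect segment sequences and links from GFA 1 lines."""
    segments: Dict[str, str] = {}
    links: List[Link] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            # S <name> <sequence>
            segments[fields[1]] = fields[2]
        elif fields[0] == "L" and len(fields) >= 5:
            # L <from> <from_orient> <to> <to_orient> <overlap>
            links.append((fields[1], fields[2], fields[3], fields[4]))
        # Other record types (header, paths, ...) are ignored in this sketch.
    return segments, links


# Hypothetical two-node graph, for illustration only.
example = ["H\tVN:Z:1.0", "S\t1\tACGT", "S\t2\tTT", "L\t1\t+\t2\t+\t0M"]
segments, links = parse_gfa(example)
```

This sketch skips path records, which a pangenome graph typically uses to describe how each genome walks through the segments. A dedicated graph toolkit is the better choice for anything beyond inspection.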

4.2 Pangenome in ODGI format

ODGI is a format that supports an optimised dynamic genome/graph implementation.

4.3 Pangenome RDF format

An RDF file that includes the sequences themselves in a variation graph can be downloaded from relabeledSeqs-dedup-relabeledSeqs-dedup.ttl.xz.
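The .ttl.xz extension indicates xz-compressed Turtle. Such a file can be read as a stream without decompressing it to disk first, as in the minimal standard-library sketch below. The helper name iter_xz_lines and the two-line sample are made up for illustration; the sample is compressed in memory so the example runs without the real download.

```python
import io
import lzma
from typing import Iterator


def iter_xz_lines(source) -> Iterator[str]:
    """Yield text lines from an xz-compressed path or binary file object."""
    with lzma.open(source, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Hypothetical Turtle content, compressed in memory for illustration.
sample = lzma.compress(
    b"@prefix ex: <http://example.org/> .\nex:a ex:b ex:c .\n"
)
lines = list(iter_xz_lines(io.BytesIO(sample)))
```

With the real download, passing the file path to iter_xz_lines in place of the BytesIO object should behave the same way. An RDF library is still needed to interpret the triples.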

4.4 Pangenome Browser format

The many JSON files with names such as results/1/chunk001200.bin1.schematic.json are consumed by the Pangenome browser.

5 Log of workflow output

The link below contains a log file of the latest workflow runs.

7 Planned

We are planning to add the following output (see also

7.1 Raw sequence data

7.2 Multiple Sequence Alignment (MSA)

7.3 Phylogenetic tree

7.4 Protein prediction

We aim to make protein predictions available.

8 Source code

All source code for this website and tooling is available from https://github.com/arvados/bh20-seq-resource

9 Citing PubSeq

See the FAQ.
