COVID-19 PubSeq: Public SARS-CoV-2 Sequence Resource

public sequences ready for download!

May 2021 update: we are now at 86,377 sequences with normalized metadata on AWS OpenData!

In Nature, scientists call for fully open sharing of coronavirus genome data. PubSeq aims to provide federated open data with permanent links, in the form of uniform resource identifiers (URIs), under Creative Commons data-sharing licenses. These links can be cited in research publications and used in reproducible workflows built on free and open source software. Federated data means that there is no central authority. Free software means that anyone can run the PubSeq website and replicate PubSeq workflows.

The Institute of Environmental Science and Research (Māori: Te Whare Manaaki Tangata, Taiao hoki) actively participates in PubSeq because our SARS-CoV-2 sequencing and accompanying results should be available to everyone on the planet. Likewise New Zealand benefits from sequences uploaded by individual labs world-wide and we contribute to PubSeq metadata efforts and online sequence uploading using the Oxford Nanopore sequencer, with the goal of enabling sequencing everywhere — Joep de Ligt PhD, ESR, Wellington NZ.

PubSeq exists because we believe that (anonymised) pandemic viral data should be out in the open for everyone, with sufficient metadata to trace strains across countries.

At the University of Tennessee Health Science Center we support FAIR data initiatives and we actively use PubSeq to build phylogenetic trees. In the context of PubSeq we also contribute to new pangenome methods that benefit from open data exchange — Pjotr Prins PhD, Memphis TN, USA

PubSeq is also a public online bioinformatics resource with unique metadata. It provides on-the-fly analysis of sequenced SARS-CoV-2 samples and allows a quick turnaround in the identification of new viral strains. Anyone can upload sequence material in the form of FASTA or FASTQ files, with accompanying metadata, through the web interface or the REST API. For more information see the FAQ.

In recent months, leading genomic researchers have called for the submission of SARS-CoV-2 genomic sequence to open genomic data repositories as a key factor to winning the battle against COVID-19. PubSeq, an open access sequence repository, currently contains over 33,000 SARS-CoV-2 viral genomes with rich associated metadata (sample location, type, submitting lab information) that can be queried using Amazon Simple Storage Service (Amazon S3) Select or Amazon Athena. PubSeq also integrates seamlessly with Arvados for on-the-fly analysis of sequenced SARS-CoV-2 samples and rapid identification of novel viral strains. It joins the Coronavirus Genome Sequence Dataset provided by the National Center for Biotechnology Information in the Registry of Open Data on AWS, making this one of the richest sources of coronavirus genome sequence data freely available to the public. (From the AWS blog.)

Make your sequence data FAIR. Upload your SARS-CoV-2 sequence (FASTA or FASTQ format) with simple metadata (JSON-LD) to the public sequence resource. The upload triggers a recompute of all available sequences into a pangenome that is available for download!

Your uploaded sequence will automatically be processed and incorporated into the public pangenome, with its metadata, using workflows from the High Performance Open Biology Lab defined here. All data is published under a Creative Commons license. You can take the published (GFA/RDF/FASTA) data and store it in a triple store for further processing. Clinical data can be stored securely at REDCap.

Data can be uploaded from any sequencing platform in FASTA format. We give special attention to workflows for the Oxford Nanopore - see also pubmed - because it offers an affordable platform that is great for SARS-CoV-2 sequencing and identification. In New Zealand the Oxford Nanopore is used for all tracing.
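
Before uploading, it can help to sanity-check a FASTA file locally. The following is a minimal sketch in Python using only the standard library. It is not part of the PubSeq code base: the function names `read_fasta` and `summarize` and the sample record are hypothetical, and the checks PubSeq itself runs on upload may differ.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


def summarize(records: Dict[str, str]) -> Dict[str, dict]:
    """Report the length and the fraction of ambiguous 'N' bases per record."""
    summary = {}
    for name, seq in records.items():
        n_fraction = seq.count("N") / len(seq) if seq else 0.0
        summary[name] = {"length": len(seq), "n_fraction": n_fraction}
    return summary


# Hypothetical sample record, for illustration only.
example = """>sample_1 hypothetical
ACGTNN
ACGT
"""
records = read_fasta(example.splitlines())
summary = summarize(records)
print(summary)
```

To check a real file, pass an open file handle instead of the example lines, for instance `read_fasta(open("my_sample.fasta"))`, where the file name is a placeholder.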

Note that form fields contain web ontology URIs for disambiguation and machine-readable metadata. For examples of use, see the docs.
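
To illustrate what ontology URIs in metadata look like, here is a minimal sketch that builds a small JSON-LD style record in Python. The field names, the `ex:` prefix, and the `http://example.org/pubseq-sketch#` vocabulary are hypothetical placeholders, not the PubSeq schema; refer to the docs for the real field definitions. The two NCBI Taxonomy identifiers (9606 for Homo sapiens, 2697049 for SARS-CoV-2) are written as OBO PURLs.

```python
import json

# Base of OBO PURLs for NCBI Taxonomy terms.
NCBITAXON = "http://purl.obolibrary.org/obo/NCBITaxon_"


def build_metadata(sample_id: str, collection_date: str) -> dict:
    """Build an illustrative JSON-LD record; all 'ex:' names are hypothetical."""
    return {
        # Hypothetical vocabulary, used only for this sketch.
        "@context": {"ex": "http://example.org/pubseq-sketch#"},
        "@id": "ex:" + sample_id,
        "ex:collection_date": collection_date,
        # Ontology URIs remove ambiguity: the host is Homo sapiens (9606)
        # and the virus is SARS-CoV-2 (2697049), whatever label a form shows.
        "ex:host_species": {"@id": NCBITAXON + "9606"},
        "ex:virus_species": {"@id": NCBITAXON + "2697049"},
    }


doc = build_metadata("sample_1", "2021-05-01")
print(json.dumps(doc, indent=2))
```

Because the values are URIs rather than free text, two labs that describe the host as "human" and "Homo sapiens" still produce records that a machine can match.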