COVID-19 PubSeq: Public SARS-CoV-2 Sequence Resource

public sequences ready for download!

May 2021 update: we are now at 86,377 sequences with normalized metadata on AWS OpenData!


COVID-19 PubSeq - Arvados

1 The Arvados Web Server

We use Arvados to run Common Workflow Language (CWL) pipelines. The most recent output is displayed on a web page (with a time stamp), and a full output list is generated here.

Arvados has a web front end that allows navigation through input and output data, workflows, and the output of analysis pipelines (here, CWL workflows).

2 The Arvados file interface

Besides the web server, Arvados has a REST API and associated command line tools. We already use the API to upload data. If you follow the pip or GNU Guix instructions for installing the Arvados API, you'll find the following command line tools (also documented here):

Command   Description
arv-ls    list files in Arvados
arv-put   upload a file to Arvados
arv-get   get a textual representation of Arvados objects from the command line. The output can be limited to a subset of the object’s fields. This command can be used with only the knowledge of an object’s UUID

Now, this is a public instance so we can use the tokens from the uploader.

export ARVADOS_API_TOKEN='2fbebpmbo3rw3x05ueu2i6nx70zhrsb1p22ycu3ry34m4x4462'
arv-ls lugli-4zz18-z513nlpqm03hpca

This lists all files in the collection (we got the UUID from the Arvados results page). To get the UUIDs of the individual files:

curl | jq .Users.AnonymousUserToken
env ARVADOS_API_TOKEN=5o42qdxpxp5cj15jqjf7vnxx5xduhm4ret703suuoa3ivfglfh \
  arv-get lugli-4zz18-z513nlpqm03hpca

and fetch one listed JSON file chunk001_bin4000.schematic.json with its listed UUID:

arv-get 2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839de18290843a@5ed7f3c5
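The two identifiers used above have a regular structure, and it can help to take them apart when debugging. The following is a minimal sketch, not part of the PubSeq code: the function names are our own, and the type table lists only a few well-known Arvados object types. It assumes an object UUID has the form cluster-type-id and that a Keep locator is an MD5 hash, a size in bytes, and an optional signed permission hint with a hexadecimal expiry time.

```python
import re
from datetime import datetime, timezone

# Arvados object UUID: <cluster id>-<object type>-<unique part>
UUID_RE = re.compile(r"^([a-z0-9]{5})-([a-z0-9]{5})-([a-z0-9]{15})$")

# A few well-known object type infixes (not an exhaustive list).
OBJECT_TYPES = {"4zz18": "collection", "j7d0g": "group", "tpzed": "user"}


def parse_uuid(uuid):
    """Split an Arvados UUID into cluster, object type and unique id."""
    match = UUID_RE.match(uuid)
    if match is None:
        raise ValueError(f"not an Arvados UUID: {uuid!r}")
    cluster, infix, unique = match.groups()
    return {"cluster": cluster, "type": OBJECT_TYPES.get(infix, infix), "id": unique}


def parse_locator(locator):
    """Split a Keep block locator into hash, size and permission hint."""
    digest, size, *hints = locator.split("+")
    info = {"md5": digest, "size": int(size), "signature": None, "expires": None}
    for hint in hints:
        # A permission hint looks like A<signature>@<expiry as hex Unix time>.
        if hint.startswith("A") and "@" in hint:
            signature, expiry_hex = hint[1:].split("@", 1)
            info["signature"] = signature
            info["expires"] = datetime.fromtimestamp(int(expiry_hex, 16), tz=timezone.utc)
    return info


if __name__ == "__main__":
    print(parse_uuid("lugli-4zz18-z513nlpqm03hpca"))
    print(parse_locator(
        "2be6af7b4741f2a5c5f8ff2bc6152d73+1955623"
        "+Ab9ad65d7fe958a053b3a57d545839de18290843a@5ed7f3c5"
    ))
```

Note that the signed hint carries an expiry time, so a locator copied from an old listing may no longer be accepted; in that case list the collection again to get a fresh one.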

3 The PubSeq Arvados shell

When you log in to Arvados (you can request permission from us) it is possible to upload an SSH key in your profile and get a shell prompt:

Linux ip-10-255-0-202 4.19.0-9-cloud-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64

It is a small Debian VM hosted on AWS. The PubSeq material is mounted on /data/pubseq, and the log is in nohup.out. You can update or edit the code (the bh20-seq-resource git checkout) and restart the service (the run script). The log says

you should have permission to read the log (nohup.out), update/edit the code (bh20-seq-resource git checkout), and restart the service (the run script)

which means it will trigger the run on upload. The service runs from a Python virtualenv:

/data/pubseq/bh20-seq-resource/venv3/bin/python3 /data/pubseq/bh20-seq-resource/venv3/bin/bh20-seq-analyzer --no-start-analysis

and is restarted by a run script:

/data/pubseq/run [options]

The run script kills the old process, sets up the API tokens, pulls the git repo, and starts a new run by calling /data/pubseq/bh20-seq-resource/venv3/bin/bh20-seq-analyzer, which essentially monitors for uploads.

Running run --help prints:

optional arguments:
  -h, --help            show this help message and exit
  --uploader-project UPLOADER_PROJECT
  --pangenome-analysis-project PANGENOME_ANALYSIS_PROJECT
  --fastq-project FASTQ_PROJECT
  --validated-project VALIDATED_PROJECT
  --workflow-def-project WORKFLOW_DEF_PROJECT
  --pangenome-workflow-uuid PANGENOME_WORKFLOW_UUID
  --fastq-workflow-uuid FASTQ_WORKFLOW_UUID
  --exclude-list EXCLUDE_LIST
  --latest-result-collection LATEST_RESULT_COLLECTION
  --print-status PRINT_STATUS
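The help text above has the shape that Python's argparse produces, so the option handling can be sketched as follows. This is an illustration only: we assume argparse is used, the program name is a guess, and the real defaults in bh20-seq-analyzer are not shown in the help output, so every option defaults to None here.

```python
import argparse

# Option names copied from the help output above; defaults are unknown,
# so this sketch leaves them all unset (None).
OPTIONS = (
    "--uploader-project",
    "--pangenome-analysis-project",
    "--fastq-project",
    "--validated-project",
    "--workflow-def-project",
    "--pangenome-workflow-uuid",
    "--fastq-workflow-uuid",
    "--exclude-list",
    "--latest-result-collection",
    "--print-status",
)


def build_parser():
    """Build a parser that accepts the same options as the help text."""
    parser = argparse.ArgumentParser(prog="bh20-seq-analyzer")
    for flag in OPTIONS:
        parser.add_argument(flag)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse turns --fastq-project into the attribute fastq_project.
    for name, value in sorted(vars(args).items()):
        print(f"{name} = {value}")
```

Each option takes one value, which is why the help output shows an upper-case placeholder after every flag, including --print-status.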

4 Wiring up CWL

In the bh20-seq-analyzer script above you can see where the Common Workflow Language (CWL) gets triggered; for example fastq2fasta, which is part of the main repo. The actual workflow is in fastq2fasta.cwl and runs the following tools in sequence: bwa-mem, samtools-view, samtools-sort, and bam2fasta.

It probably pays to familiarize yourself with CWL and its concepts. We believe it has a lot going for it, though CWL is some steps removed from traditional shell scripts for running workflows. The main thing to understand is that CWL is a separation of concerns, i.e.,

  1. Data
  2. Tools
  3. Flow

and each of these is described separately. This contrasts sharply with shell scripts (though you can invoke shell scripts from CWL). Also, CWL is written in JSON/YAML, which means everything can be parsed as a tree and you can easily generate visualisations of a workflow.

For more see Creating a reproducible workflow with CWL by Pjotr Prins.
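Because a CWL workflow is plain JSON/YAML, it can be loaded as a nested structure and inspected with ordinary code. The sketch below is not the real fastq2fasta.cwl: the step names come from the text above, but the port names (reads, sam, bam, and so on) are made up for illustration. It derives the execution order from the step inputs and emits a Graphviz DOT description of the flow.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical, simplified stand-in for a parsed CWL workflow. A step input
# of the form "other-step/port" means "use the output of other-step".
workflow = {
    "class": "Workflow",
    "inputs": {"fastq": "File", "reference": "File"},
    "steps": {
        "bam2fasta": {"in": {"bam": "samtools-sort/sorted"}, "out": ["fasta"]},
        "samtools-sort": {"in": {"bam": "samtools-view/bam"}, "out": ["sorted"]},
        "samtools-view": {"in": {"sam": "bwa-mem/sam"}, "out": ["bam"]},
        "bwa-mem": {"in": {"reads": "fastq", "ref": "reference"}, "out": ["sam"]},
    },
}


def step_dependencies(wf):
    """Map each step to the set of steps whose outputs it consumes."""
    deps = {}
    for name, step in wf["steps"].items():
        deps[name] = {src.split("/")[0] for src in step["in"].values() if "/" in src}
    return deps


def execution_order(wf):
    """Return the steps in an order that respects the data flow."""
    return list(TopologicalSorter(step_dependencies(wf)).static_order())


def to_dot(wf):
    """Render the step graph as Graphviz DOT text."""
    lines = ["digraph workflow {"]
    for step, upstream in sorted(step_dependencies(wf).items()):
        for parent in sorted(upstream):
            lines.append(f'  "{parent}" -> "{step}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(" -> ".join(execution_order(workflow)))
    print(to_dot(workflow))
```

The steps are deliberately listed out of order in the dictionary: the order of execution follows from the declared data flow, not from the order in which the steps are written down, which is one practical consequence of separating data, tools, and flow.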

5 Using the Arvados API

Arvados provides a rich API for accessing internals of the Cloud infrastructure.

The bh20-seq-analyzer script above contains examples of querying the Arvados API using the Python Arvados client and libraries, for example to get a list of projects in Arvados. The main thing is to get the ARVADOS_API_HOST and ARVADOS_API_TOKEN environment variables right, as shown above.
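As a minimal sketch of such a query (our own helper functions, not code taken from bh20-seq-analyzer), the following checks that both environment variables are set and lists projects. It assumes the arvados Python package is installed and that projects are groups whose group_class is "project". The client object is passed in as a parameter so that the helper can be tried without a live cluster.

```python
import os

# The two settings the Arvados client reads from the environment.
REQUIRED = ("ARVADOS_API_HOST", "ARVADOS_API_TOKEN")


def missing_settings(env=None):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


def list_projects(api, limit=20):
    """Return (uuid, name) pairs for projects visible to the current token."""
    response = api.groups().list(
        filters=[["group_class", "=", "project"]],
        limit=limit,
    ).execute()
    return [(group["uuid"], group["name"]) for group in response["items"]]


# Usage against a real cluster (needs the arvados package and both
# variables set):
#
#   import arvados
#   if not missing_settings():
#       for uuid, name in list_projects(arvados.api("v1")):
#           print(uuid, name)
```

Checking the environment first gives a clearer error than a failed request when one of the variables is misspelled or missing.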

6 Troubleshooting

When workflows have errors we should check the logs in Arvados.

Go to the project page for 'COVID-19-BH20 Shared Project' -> 'Public Sequence Resource'. Click on analysis runs and 'Subprojects'. Click one of the runs and then on 'Processes' and you'll see what parts failed.

Other documents

We fetch sequence data and metadata. We query the metadata in multiple ways using SPARQL and ontologies.
We submit a sequence to the database. In this BLOG we fetch a sequence from GenBank and add it to the database.
We modify a workflow to get new output
We modify metadata for all to use! In this BLOG we add a field for a creative commons license.
Dealing with PubSeq localisation data
We explore the Arvados command line and API
Generate the files needed for uploading to EBI/ENA
Documentation for PubSeq REST API