COVID-19 PubSeq: Public SARS-CoV-2 Sequence Resource

public sequences ready for download!

May 2021 update: we are now at 86,377 sequences with normalized metadata on AWS OpenData!

COVID-19 PubSeq (part 4)

1 Modify Metadata

The public sequence resource uses multiple data formats listed on the download page. One of the most exciting features is the full support for RDF and semantic web/linked data ontologies. This technology allows for querying data in unprescribed ways - that is, you can formulate your own queries without dealing with a preset model of that data (which is how one has to approach CSV files and SQL tables). Examples of exploring data are listed here.

In this BLOG we are going to look at the metadata entered on the COVID-19 PubSeq website (or command line client). It is important to understand that anyone, including you, can change that information!

2 What is the schema?

The default metadata schema is listed here.

3 How is the website generated?

From the schema we use pyshex ShEx expressions and schema salad to generate the input form, validate the user input, and build RDF! All from that one metadata schema.

4 Changing the license field

4.1 Modifying the schema

One of the first things we want to do is to add a field for the data license. Initially we only supported CC-BY-4.0 as a license, but we wanted to give uploaders the option to use the even more liberal CC0 license. The first step is to find a good ontology term for the field. Searching for 'creative commons cc0 rdf' rendered this useful page. We also find an overview where CC0 is represented as the URI https://creativecommons.org/publicdomain/zero/1.0/. Meanwhile the attribution license is https://creativecommons.org/licenses/by/4.0/. According to this document we should really also add fields for attributionName and attributionURL.

A minimal triple should be

id  xhtml:license  <http://creativecommons.org/licenses/by/4.0/> .

Other suggestions are

id  dc:title "Description" .
id  cc:attributionName "Your Name" .
id  cc:attributionURL <http://resource.org/id> .

and 'dc:source' which indicates the original source of any modified work, specified as a URI. The prefix 'cc:' is an abbreviation for http://creativecommons.org/ns#.

Going back to the schema, where does it fit? Under the host, sample, virus, technology or submitter block? It could fit under sample, but actually the license concerns the whole metadata block and sequence, so I think we can fit it under its own license tag. For example

id: placeholder

license:
    license_type: http://creativecommons.org/licenses/by/4.0/
    attribution_title: "Sample ID"
    attribution_name: "John doe, Joe Boe, Jonny Oe"
    attribution_url: http://covid19.genenetwork.org/id
    attribution_source: https://www.ncbi.nlm.nih.gov/pubmed/323088888

So, let's update the example. Notice the license info is optional - if it is missing we just assume the default CC-BY-4.0.
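The fallback rule can be sketched as follows. This is a minimal sketch and not the actual PubSeq code: it assumes the metadata has already been parsed into a Python dict shaped like the example above, and the names DEFAULT_LICENSE and effective_license are made up for illustration.

```python
# Assumed default, written the same way as license_type in the example metadata.
DEFAULT_LICENSE = "http://creativecommons.org/licenses/by/4.0/"


def effective_license(metadata):
    """Return the license URI of a metadata record, or the default if none is given."""
    # A missing or empty license block is treated the same as no license_type.
    license_block = metadata.get("license") or {}
    return license_block.get("license_type") or DEFAULT_LICENSE


without_license = {"id": "placeholder"}
with_cc0 = {
    "id": "placeholder",
    "license": {"license_type": "https://creativecommons.org/publicdomain/zero/1.0/"},
}

print(effective_license(without_license))
print(effective_license(with_cc0))
```

Existing records without a license block keep working this way, while new uploads can opt in to CC0.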

One thing that is interesting is that in the namespace https://creativecommons.org/ns there is no mention of a title. I think it is useful, however, because we have no such field. So, we'll add it simply as a title field. Now the draft schema is

- name: licenseSchema
  type: record
  fields:
    license_type:
      doc: License types as defined in https://wiki.creativecommons.org/images/d/d6/Ccrel-1.0.pdf
      type: string?
      jsonldPredicate:
          _id: https://creativecommons.org/ns#License
    title:
      doc: Attribution title related to license
      type: string?
      jsonldPredicate:
          _id: http://semanticscience.org/resource/SIO_001167
    attribution_url:
      doc: Attribution URL related to license
      type: string?
      jsonldPredicate:
          _id: https://creativecommons.org/ns#Work
    attribution_source:
      doc: Attribution source URL
      type: string?
      jsonldPredicate:
          _id: https://creativecommons.org/ns#Work

Now, we are no ontology experts, right? So, next we submit a patch to our source tree and ask for feedback before wiring it up in the data entry form. The pull request was submitted here and reviewed on the gitter channel, after which I merged it.

4.2 Adding fields to the form

To add the new fields to the form we have to modify it a little. If we go to the upload form we need to add the license box. The schema is loaded in main.py in the 'generate-form' function.

With this patch the website adds the license input fields on the form.

Finally, to make RDF output work we need to add expressions to bh20seq-shex.rdf. This was done with this patch. In the end we decided to use the Dublin core title, http://purl.org/metadata/dublin_core_elements#Title:

:licenseShape{
    cc:License xsd:string;
    dc:Title xsd:string ?;
    cc:attributionName xsd:string ?;
    cc:attributionURL xsd:string ?;
    cc:attributionSource xsd:string ?;
}

Note that cc:attributionSource is not really defined in the cc standard.

When pushing the license info we discovered the workflow broke because the existing data had no licensing info. So we changed the license field to be optional - a missing license assumes it is CC-BY-4.0.

4.3 TODO Testing the license fields

5 Changing GEO or location field

When fetching information from GenBank and EBI/ENA we also translate the location into an unambiguous identifier. We opted for the wikidata tag. E.g. for New York City it is https://www.wikidata.org/wiki/Q60 and for New York state it is https://www.wikidata.org/wiki/Q1384. If everyone uses these metadata URIs it is easy to group when making queries. Note that we should be using http://www.wikidata.org/entity/Q60 in the dataset (http instead of https and entity instead of wiki).
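Rewriting a wikidata page URL into the entity form can be sketched like this. It is an illustration under assumptions rather than the importer's actual code; the function name wikidata_entity_uri and the regular expression are hypothetical.

```python
import re

# Matches both the page form (/wiki/Q60) and the entity form (/entity/Q60),
# with either http or https.
_WIKIDATA_URL = re.compile(r"^https?://www\.wikidata\.org/(?:wiki|entity)/(Q[1-9][0-9]*)$")


def wikidata_entity_uri(url):
    """Return the http://www.wikidata.org/entity/Q... form of a wikidata URL."""
    match = _WIKIDATA_URL.match(url.strip())
    if match is None:
        raise ValueError(f"not a wikidata item URL: {url!r}")
    return f"http://www.wikidata.org/entity/{match.group(1)}"


print(wikidata_entity_uri("https://www.wikidata.org/wiki/Q60"))
print(wikidata_entity_uri("https://www.wikidata.org/wiki/Q1384"))
```

A URI that is already in the entity form comes back unchanged, so the function can be applied to every location value without first checking which form it has.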

Unfortunately the main repositories of SARS-CoV-2 have variable strings of text for location and/or GPS coordinates. For us to support our schema we had to translate all options, and this proved expensive.

5.1 Relaxing the shex constraint

So we decided to relax the enforcement of this type of metadata and to allow for a free form string.

The schema already used http://purl.obolibrary.org/obo/GAZ_00000448 which states:

Class: geographic location
Term IRI: http://purl.obolibrary.org/obo/GAZ_00000448
Definition: A reference to a place on the Earth, by its name or by its geographical location.

and when you check count by location in the DEMO it lists a free format.

So, why does the validation step balk when importing GenBank? The problem was in the shex check for RDF generation. Removing the wikidata requirement relaxed the imports with this patch.


Other documents

We fetch sequence data and metadata. We query the metadata in multiple ways using SPARQL and ontologies.
We submit a sequence to the database. In this BLOG we fetch a sequence from GenBank and add it to the database.
We modify a workflow to get new output
We modify metadata for all to use! In this BLOG we add a field for a creative commons license.
Dealing with PubSeq localisation data
We explore the Arvados command line and API
Generate the files needed for uploading to EBI/ENA
Documentation for PubSeq REST API