COVID-19 PubSeq: Public SARS-CoV-2 Sequence Resource


May 2021 update: we are now at 86,377 sequences with normalized metadata on AWS OpenData!


COVID-19 PubSeq Location Data

1 Introduction

Normalized location metadata is one of the most valuable features of PubSeq. In this document we explain how we create that metadata and why it is not so easy!

2 Location data

When data gets uploaded to central repositories some metadata needs to be provided about sampling location and uploader (lab) location. This should be straightforward. Yeah, right.

The main problem, probably, is that central repositories cannot be strict about these parameters - otherwise uploads fail and uploaders get annoyed (and may stop uploading altogether). At bottom this is a problem of human expectations.

The other problem is that not every sample is tied to a fixed site. What to make of the location `USA: Cruise_Ship_1, California' in MT810507? From the full record the best we can do is place the sample in San Diego.

There exist many opportunities for ambiguous entries, such as Memphis TN and Memphis IN. We decided to settle on Wikidata URIs because (1) they are unambiguous and (2) they allow fetching more information, such as population size, density, county, country, and GPS coordinates. Click on the Wikidata entries for Memphis Tennessee (not the song Memphis Tennessee), Memphis Indiana, Memphis Texas, Memphis Michigan and Memphis Nebraska. And then there is the original Memphis Egypt, ancient capital of Aneb-Hetch. All listed on Wikidata. Wikidata URIs are great at disambiguating localities while grouping data in a sensible way for research!

2.1 Why use Wikidata URIs instead of GPS coordinates?

A Wikidata URI gives access to a lot of extra information that can be queried about an entry. It is fairly straightforward, for example, to find all entries that relate to one county or country - and these are typical queries for researchers. Wikidata URIs also allow fetching GPS coordinates - we do that to display samples on the map. For example

select ?gps  where {
  <http://www.wikidata.org/entity/Q16563> wdt:P625 ?gps .
}

run query

returns

Point(-89.971111111 35.1175)

To get a list of all properties and values directly attached to Memphis TN:

select ?p ?s  where {
  <http://www.wikidata.org/entity/Q16563> ?p ?s .
}

[run query]

The other way round - querying samples and deriving relevant information directly from GPS coordinates - is much harder. Wikidata URIs are simply very useful.
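For illustration, this kind of lookup can also be scripted against the public SPARQL endpoint. Below is a minimal sketch in Python using the requests library (not the actual PubSeq code); the User-Agent string is just a placeholder.

import requests

WDQS = "https://query.wikidata.org/sparql"

def fetch_gps(entity_uri):
    # Return the wdt:P625 (coordinate location) literal for a Wikidata entity
    query = "SELECT ?gps WHERE { <%s> wdt:P625 ?gps . }" % entity_uri
    resp = requests.get(WDQS,
                        params={"query": query, "format": "json"},
                        headers={"User-Agent": "pubseq-example/0.1"},
                        timeout=60)
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["gps"]["value"] if bindings else None

print(fetch_gps("http://www.wikidata.org/entity/Q16563"))
# => Point(-89.971111111 35.1175)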

3 Normalizing location data

Wikidata is the stable and trusted backend for Wikipedia, a massive online encyclopedia. It is also one of the largest RDF-based databases on the planet and won't disappear any time soon. Wikidata contains URIs for almost all geographical entities of interest and is therefore a good choice for normalizing geographical location data.

PubSeq gets data from different sources and needs to convert that information to Wikidata URIs. So, how would we go about fixing the location `USA: Cruise_Ship_1, California' in MT810507? We can tokenize the address information into `USA' and `California'. Also, `San Diego' is listed as an address in the record. We should aim for the smallest geo description, `San Diego', and relate that back to Wikidata to find the URI. Now, Wikidata won't be pleased if we hit it with a hundred thousand queries, so it is best to fetch the location-related info once and turn it into a small local database using Python Pandas, or similar.
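As a rough sketch of that idea (not the PubSeq code): load the place dump that section 3.1.2 below saves as data/wikidata/places.tsv into Pandas and look up tokens against it. The column names and the label clean-up are assumptions here; the raw TSV from Wikidata still carries quotes and language tags on labels.

import re
import pandas as pd

# assumed columns: placename, place, country, coor, population
places = pd.read_csv("data/wikidata/places.tsv", sep="\t")
# strip surrounding quotes and @en language tags from labels (assumption)
places["placename"] = places["placename"].str.replace(r'^"|"@en$', "", regex=True)

def tokenize(location):
    # split a free-text location such as 'USA: Cruise_Ship_1, California'
    return [t.strip() for t in re.split(r"[:,_]", location) if t.strip()]

def lookup(token):
    # return Wikidata URIs whose place name matches the token exactly
    hits = places[places["placename"].str.lower() == token.lower()]
    return list(hits["place"])

for token in tokenize("USA: Cruise_Ship_1, California") + ["San Diego"]:
    print(token, lookup(token))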

We do not want to find all these links by hand. Initially, when we started out, we did just that, but at 100K entries it is not feasible. So we ended up with a stepped approach:

  1. Search a database with all known Wikidata entries
  2. Search a separate database with 'vetted' items including Regex
  3. If GPS coordinates are provided we can use them for location
  4. Tokenize the record and use the Wikidata or Wikipedia search engine

Hopefully that covers over 90% of cases correctly. We may still end up with misses and wrong data. E.g., it is easy to imagine a sample taken in Antarctica that got sequenced in the USA and gets placed in the USA. In Wiki-spirit the best solution is to leave it to end-users to re-place those entries - end-users scale, unlike us. For this we will create a simple API.

Note that normalization has to be fully automated to execute regular updates from the upstream data repositories. There is no way we can annotate these by hand ourselves.

Also, data fetched from Wikidata is not stored in the GitHub repository, but online on IPFS.

3.1 Simple search Wikidata

We can dump all known countries/states/cities from Wikidata into files.

3.1.1 Fetch all 180-odd countries with aliases

SELECT DISTINCT * WHERE {
  ?place wdt:P17 ?country .
  ?place (a|wdt:P31|wdt:P279) wd:Q6256 .
  minus { ?place wdt:P31 wd:Q3024240 } .
  ?country rdfs:label ?countryname .
  FILTER( LANG(?countryname)="en") .
  OPTIONAL {
      ?country wdt:P30 ?continent.
      ?continent rdfs:label ?continent_label
      FILTER (lang(?continent_label)='en')
    }
  }

run query

Save the file as data/countries.tsv.

Just adding ?place skos:altLabel ?alias for all country aliases gets 15K results:

SELECT DISTINCT * WHERE {
  ?place wdt:P17 ?country ;
    (a|wdt:P31|wdt:P279) wd:Q6256 .
  minus { ?place wdt:P31 wd:Q3024240 } .
  ?place skos:altLabel ?alias .
}

run query

Save the file as data/wikidata/country_aliases.tsv.

3.1.2 Fetch all places with coordinates (no aliases)

select DISTINCT * where {
  ?place wdt:P17 ?country ;
         wdt:P625 ?coor ;
         wdt:P1082 ?population .
}

518076 results in 20247 ms

Adding rdfs:label for the place name:

select DISTINCT * where {
  ?place wdt:P17 ?country ;
       wdt:P625 ?coor ;
       wdt:P1082 ?population ;
       rdfs:label ?placename .
}

run query — 520255 results in 32151 ms

I ended up chunking the query using OFFSET in the script wikidata-fetch-places.rb.

select DISTINCT ?place ?placename ?country ?coor ?population where {
  ?place wdt:P17 ?country ;
         wdt:P625 ?coor ;
         wdt:P1082 ?population .
  FILTER (?population > 9999)
  # ?place rdfs:label ?placename .
  # FILTER (lang(?placename)='en')
  ?place rdfs:label ?placename filter (lang(?placename) = "en").
} limit 10000

run query — 10000 results in 6253 ms

The query can also be run from the command line with curl, requesting tab-separated values:

curl -G https://query.wikidata.org/sparql -H "Accept: text/tab-separated-values; charset=utf-8" --data-urlencode query="
SELECT DISTINCT ?placename ?place ?country ?coor ?population where {
  ?place wdt:P17 ?country ;
         wdt:P625 ?coor ;
         wdt:P1082 ?population .
  FILTER (?population > 9999)
  ?place rdfs:label ?placename .
  FILTER (lang(?placename)='en')
}
"

Save file as data/wikidata/places.tsv.
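The OFFSET chunking in wikidata-fetch-places.rb can be sketched roughly as follows (in Python rather than Ruby, and simplified; the endpoint and query are as above, the rest is an assumption):

import requests

WDQS = "https://query.wikidata.org/sparql"
QUERY = """
select DISTINCT ?place ?placename ?country ?coor ?population where {
  ?place wdt:P17 ?country ;
         wdt:P625 ?coor ;
         wdt:P1082 ?population .
  FILTER (?population > 9999)
  ?place rdfs:label ?placename filter (lang(?placename) = "en").
} LIMIT %d OFFSET %d
"""

def fetch_places(chunk=10000):
    # page through the result set in chunks to stay within query time limits
    offset = 0
    while True:
        resp = requests.get(WDQS,
                            params={"query": QUERY % (chunk, offset), "format": "json"},
                            headers={"User-Agent": "pubseq-example/0.1"},
                            timeout=300)
        resp.raise_for_status()
        rows = resp.json()["results"]["bindings"]
        if not rows:
            break
        yield from rows
        offset += chunk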

3.1.3 Fetch all states

This query lists all USA states:

select DISTINCT ?place ?placename ?country ?coor ?population where {
  ?place wdt:P17 ?country ;
         wdt:P31 wd:Q35657 ;
         wdt:P625 ?coor ;
         wdt:P1082 ?population .
  ?place rdfs:label ?placename .
  FILTER (lang(?placename)='en')

}

Alternatively, list any area that has a capital:

select DISTINCT ?place ?placename ?country ?coor ?population where {
  ?place wdt:P17 ?country ;
         wdt:P36 ?capital ;
         wdt:P625 ?coor ;
         wdt:P1082 ?population .
  ?place rdfs:label ?placename .
  FILTER (lang(?placename)='en')

} limit 1000

But I think this is the best option: it lists all larger states and cities that are members of a subclass of 'federal state' or 'region':

select distinct ?placename ?place ?country ?population ?coor where {
  VALUES ?v { wd:Q82794 wd:Q107390 wd:Q34876 wd:Q9316670 wd:Q515 }
  ?statetype wdt:P279+ ?v .
  ?place wdt:P31 ?statetype ;
         wdt:P17 ?country ;
         wdt:P625 ?coor;
         wdt:P1082 ?population .
  FILTER (?population > 99999)
  ?place rdfs:label ?placename .
  FILTER (lang(?placename)='en')
}

The script fetch-regions.sh runs this query with curl; it takes over a minute and returns about 20K results.

Save file as data/wikidata/regions.csv.gz

3.2 Start normalizing geo-location

3.2.1 Normalize by country

Now that we have these data files from Wikidata with place information, we can start normalizing the location data in the records coming from other sources.

The first step is to normalize by country. We take the above country_aliases file, which consists of four fields

place country countryname alias

and we search "collection_location": "USA: MD" and "submitter_address": "Maryland Department of Health, 1, 1, MD 1, USA" fields for matches. That, likely, will get the right country and we plug in the Wikidata URI, in this case http://www.wikidata.org/entity/Q30. If that fails we look for country aliases from the second wikidata file.

"collection_location" gets updated to the new country URI. We also set the "country" to the same entity at this stage. Arguably that is not necessary - as the refined location can be queried in wikidata to provide the country - but we are adding it as a convenience. This also allows quick visual checks of mismatches in the JSON records.

Of course, we immediately hit snags like "original_collection_location"=>"USA: Utah" -> "country"=>"SA", which is South Africa instead, and that meant we had to add some logic. The logic can be found in normalize_geo.rb.
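A sketch of the kind of logic needed (hypothetical, the real code is in normalize_geo.rb): match country names and aliases as whole words only, and try longer names before shorter aliases, so that "USA: Utah" is not claimed by the alias "SA" for South Africa.

import re

def match_country(text, countries):
    # countries: list of (name_or_alias, wikidata_uri) pairs
    for name, uri in sorted(countries, key=lambda c: -len(c[0])):
        if re.search(r"\b%s\b" % re.escape(name), text):
            return uri
    return None

countries = [("USA", "http://www.wikidata.org/entity/Q30"),
             ("SA", "http://www.wikidata.org/entity/Q258")]   # Q258 = South Africa
print(match_country("USA: Utah", countries))
# => http://www.wikidata.org/entity/Q30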

3.2.2 Normalize by state

3.2.3 Refine location

The above step should have normalized country information to a Wikidata URI. Next we try to refine the location using the places and regions files downloaded earlier (the latest queries are in the ../../workflows/pubseq/wikidata scripts). They contain the fields

placename,place,country,coor,population

The current heuristic simply matches all geo-location strings against place names and resolves ambiguity by length of match. This is not perfect, but it works because we only use places with a population above 100K. Future refinements should resolve by country/state/province/council and distinguish between different locations with the same name.
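A hypothetical sketch of that longest-match heuristic (the place names and populations below are illustrative, not real data from our files):

def refine_location(location, places):
    # places: iterable of (placename, wikidata_uri, population) tuples
    hits = [(name, uri) for name, uri, pop in places
            if pop > 100_000 and name.lower() in location.lower()]
    if not hits:
        return None
    # resolve ambiguity by preferring the longest matching place name
    return max(hits, key=lambda h: len(h[0]))[1]

places = [("York", "http://www.wikidata.org/entity/Q42462", 200_000),
          ("New York City", "http://www.wikidata.org/entity/Q60", 8_000_000)]
print(refine_location("USA: New York City, NY", places))
# => http://www.wikidata.org/entity/Q60, not York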

3.2.4 Track location coordinates

Every location comes with GPS coordinates in Wikidata. We capture those in triples in our RDF store so we don't have to visit Wikidata for every query. The triples for Washington (state) look like

<http://www.wikidata.org/entity/Q1223> rdfs:label "Washington" ;
    wdt:P17 <http://www.wikidata.org/entity/Q30> ;
    wdt:P625 "Point(-120.5 47.5)" .

The intermediate JSON record is stored in a file geo.json which is used to generate RDF triples and can also be used directly by JSON parsers:

{
  "http://www.wikidata.org/entity/Q30": {
    "name": "United States of America",
    "geo": "Point(-98.5795 39.828175)"
  },
  "http://www.wikidata.org/entity/Q110739": {
    "name": "Santa Clara County",
    "geo": "Point(-121.97 37.36)"
  },
  "http://www.wikidata.org/entity/Q99": {
    "name": "California",
    "geo": "Point(-120.0 37.0)"
  },
    ...
}
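A minimal sketch (not the actual PubSeq generator) of turning geo.json into triples of the shape shown above - the country triple (wdt:P17) would need extra information that geo.json as shown does not carry:

import json

with open("geo.json") as f:
    geo = json.load(f)

with open("places.ttl", "w") as out:
    out.write("@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n")
    out.write("@prefix wdt: <http://www.wikidata.org/prop/direct/> .\n\n")
    for uri, info in geo.items():
        out.write('<%s> rdfs:label "%s" ;\n' % (uri, info["name"]))
        out.write('    wdt:P625 "%s" .\n' % info["geo"])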

3.3 RegEx search known items

A hand-filtered search: a curated list of regular expressions that map known problematic entries to their Wikidata URIs (step 2 in the stepped approach above).
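A hypothetical example of what such a vetted table could look like - a hand-curated regular expression mapped straight to a Wikidata URI for an entry the automatic steps get wrong (here the cruise ship sample from MT810507, placed in San Diego, Q16552):

import re

VETTED = [
    (re.compile(r"Cruise[_ ]?Ship", re.I), "http://www.wikidata.org/entity/Q16552"),  # San Diego
]

def vetted_lookup(location):
    for pattern, uri in VETTED:
        if pattern.search(location):
            return uri
    return None

print(vetted_lookup("USA: Cruise_Ship_1, California"))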

3.4 Fuzzy search Wikidata

Ever wondered how Wikipedia's search box works? MediaWiki, the software that powers Wikipedia, has an amazing `fuzzy' search facility, exposed through MWAPI, which can be used from Wikidata SPARQL. The following query finds all places named Memphis - the wdt:P625 constraint forces results to have map GPS coordinates(!)

SELECT * WHERE {
    SERVICE wikibase:mwapi
            { bd:serviceParam wikibase:api "EntitySearch" .
              bd:serviceParam wikibase:endpoint "www.wikidata.org" .
              bd:serviceParam mwapi:search "memphis" .
              bd:serviceParam mwapi:language "en" .
              ?place wikibase:apiOutputItem mwapi:item .
              ?num wikibase:apiOrdinal true .
            }
    ?place wdt:P625 ?coordinates .
  }

[run query]

The above query returns 15 results. And this one lists Memphis uniquely as a city:

SELECT * WHERE {
  VALUES ?type { wd:Q1093829 } .
  SERVICE wikibase:mwapi {
      bd:serviceParam wikibase:api "Search" .
      bd:serviceParam wikibase:endpoint "www.wikidata.org" .
      bd:serviceParam mwapi:srsearch "memphis te*essee usa" .
      ?place wikibase:apiOutputItem mwapi:title .
  }
  ?place wdt:P625 ?coordinates .
  ?place wdt:P31|wdt:P279 ?type .
} limit 10

But the following query is the most flexible, as long as the city/county/country appears somewhere in the search string. Note that subclassing should be possible, but I have not figured that out yet:

SELECT * WHERE {
  # VALUES ?type { wd:Q1093829 wd:Q52511956 wd:Q515 wd:Q6256 wd:Q35657} .
  SERVICE wikibase:mwapi {
      bd:serviceParam wikibase:api "Search" .
      bd:serviceParam wikibase:endpoint "www.wikidata.org" .
      bd:serviceParam mwapi:srsearch "napels italie" .
      ?place wikibase:apiOutputItem mwapi:title .
  }
  ?place wdt:P625 ?coordinates .
  ?place rdfs:label ?placename .
  FILTER( LANG(?placename)="en") .
  ?place wdt:P17 ?country .
  ?place wdt:P1082 ?population .
  # ?place (a|wdt:P31|wdt:P279) ?type .
}
ORDER by DESC(?population)
LIMIT 10

[run query]
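For completeness: the EntitySearch used in the first query above is the same search the MediaWiki API exposes as wbsearchentities, so it can also be called directly outside SPARQL. A small sketch (the User-Agent is a placeholder):

import requests

def search_wikidata(term):
    resp = requests.get("https://www.wikidata.org/w/api.php",
                        params={"action": "wbsearchentities", "search": term,
                                "language": "en", "format": "json"},
                        headers={"User-Agent": "pubseq-example/0.1"},
                        timeout=60)
    resp.raise_for_status()
    return [(hit["id"], hit.get("label", "")) for hit in resp.json()["search"]]

print(search_wikidata("memphis"))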

4 Shape expressions (ShEx) and validation

PubSeq uses ShEx to validate RDF data (see also the primer). This is particularly useful for data coming from outside: before data gets accepted by PubSeq, it first needs to go through a ShEx validation step.

To be continued

5 Missing data

After normalization, the original data is kept in the field `original_collection_location'. If we can normalize to a Wikidata entry, its URI is stored in `collection_location'. Otherwise the field is left out (RDF favours treating missing data as really missing data).



Other documents

We fetch sequence data and metadata. We query the metadata in multiple ways using SPARQL and ontologies.
We submit a sequence to the database. In this BLOG we fetch a sequence from GenBank and add it to the database.
We modify a workflow to get new output
We modify metadata for all to use! In this BLOG we add a field for a creative commons license.
Dealing with PubSeq localisation data
We explore the Arvados command line and API
Generate the files needed for uploading to EBI/ENA
Documentation for PubSeq REST API