Table of Contents
- 1. Workflow runs
- 2. FASTA files
- 3. Metadata
- 4. Pangenome
- 5. Log of workflow output
- 6. All files
- 7. Planned
- 8. Source code
- 9. Citing PubSeq
1 Workflow runs
2 FASTA files
The public sequence resource provides all uploaded sequences as FASTA files, and the metadata can refer to each of them individually. We also provide a single-file FASTA download.
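As a rough illustration of working with the FASTA download, the sketch below reads FASTA-formatted text into (header, sequence) pairs using only the Python standard library. The function name `read_fasta` and the two example records are hypothetical and are not taken from the resource itself.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example; the real download is much larger.
example = """>sample_1 illustrative header
ACGTACGT
ACGT
>sample_2
TTGACA
"""
records = list(read_fasta(example.splitlines()))
```

For a downloaded file, the same function can be given an open file handle instead of a list of lines, so that the sequences are streamed rather than loaded all at once.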
3 Metadata
Metadata can be downloaded as Turtle RDF in the file mergedmetadat.ttl, which can be loaded into any RDF triple store. We also provide a Virtuoso SPARQL endpoint, which can be queried at http://sparql.genenetwork.org/sparql/. Query examples can be found in the DOCS.
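A minimal sketch of querying the SPARQL endpoint from Python follows. It assumes the endpoint accepts a standard SPARQL protocol GET request with a `query` parameter and returns JSON results when asked via the `Accept` header. The generic triple query and the helper names `build_request` and `run_query` are illustrative only; see the DOCS for queries that match the actual metadata schema.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://sparql.genenetwork.org/sparql/"

# Generic query that makes no assumption about the metadata schema.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send the query and return the list of result bindings."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]
```

Calling `run_query(QUERY)` needs network access and returns one dictionary per result row, keyed by the variable names used in the query.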
An RDF file that includes the sequences themselves in a variation graph can be downloaded from the Pangenome RDF format section below.
4 Pangenome
Pangenome data is made available in multiple formats. Variation graphs (VG) provide a succinct encoding of the sequences of many genomes.
4.1 Pangenome GFA format
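To give a feel for what a GFA file contains, the sketch below counts record types and sums segment lengths. It assumes the tab-separated GFA version 1 layout, where `S` lines are segments, `L` lines are links and `P` lines are paths; the version used by the download is not stated here. The function name `summarize_gfa` and the tiny example graph are hypothetical.

```python
from collections import Counter


def summarize_gfa(lines):
    """Count GFA record types and total the bases stored in segments."""
    counts = Counter()
    total_bases = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        record_type = fields[0]
        counts[record_type] += 1
        # Segment lines are: S <name> <sequence>; "*" means no sequence stored.
        if record_type == "S" and len(fields) >= 3 and fields[2] != "*":
            total_bases += len(fields[2])
    return counts, total_bases


# Hypothetical graph: two segments, one link, one path.
example = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tTT",
    "L\t1\t+\t2\t+\t0M",
    "P\tgenomeA\t1+,2+\t*",
])
counts, total_bases = summarize_gfa(example.splitlines())
```

Passing an open file handle instead of a list keeps memory use low for large graphs.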
4.2 Pangenome in ODGI format
ODGI is a format that supports an optimised dynamic genome/graph implementation.
4.3 Pangenome RDF format
An RDF file that includes the sequences themselves in a variation graph can be downloaded from relabeledSeqs-dedup-relabeledSeqs-dedup.ttl.xz.
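Because the RDF file is xz-compressed, it can be inspected without unpacking it to disk. The sketch below uses Python's standard `lzma` module to read the first few lines as text. The helper name `head_xz` is hypothetical, and the file is assumed to be UTF-8 encoded Turtle.

```python
import lzma
from itertools import islice


def head_xz(path, n=10):
    """Return the first n text lines of an xz-compressed file."""
    # Mode "rt" decompresses on the fly and decodes to text.
    with lzma.open(path, "rt", encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in islice(fh, n)]
```

For example, `head_xz("relabeledSeqs-dedup-relabeledSeqs-dedup.ttl.xz", 5)` would show the first five lines of the downloaded file, typically the Turtle prefix declarations.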
4.4 Pangenome Browser format
The many JSON files with names such as results/1/chunk001200.bin1.schematic.json are consumed by the Pangenome browser.
5 Log of workflow output
The link below includes a log file of the last workflow runs.
7 Planned
We are planning to add the following output (see also
7.2 Multiple Sequence Alignment (MSA)
See MSA tracker.
7.3 Phylogenetic tree
See Phylo tracker.
7.4 Protein prediction
We aim to make protein predictions available.
8 Source code
All source code for this website and tooling is available from https://github.com/arvados/bh20-seq-resource
9 Citing PubSeq
See the FAQ.