These are the required files and metadata to allow a new assembly to be imported into a GenomeHub.
- Scaffold sequence FASTA file
- Gene annotation GFF3 file
- Protein sequence FASTA file
- Assembly metadata TXT file
- Species image PNG file
- Assembly description HTML file
Scaffold sequence FASTA file
This is the primary assembly file. The scaffold names used in this file will be used as the primary sequence identifiers after importing.
If required, scaffold name synonyms can be provided in a tab-delimited file with one row per scaffold to provide alternative scaffold names. If available, an AGP format file can be imported to show the mapping between contigs and scaffolds in the assembly.
Gene annotation GFF3 file
Gene annotations should be provided in a valid GFF3 format file. It is common for GFF3 files to deviate from the formal specification, particularly if individual genes have been manually edited or the results of multiple gene prediction programs have been combined into a single file. While many common problems can be detected and corrected automatically during the import process, it is preferable to use a validator to ensure that your file is correctly formatted prior to submission.
In order to be included in comparative analyses (i.e. to be included in orthology predictions and gene trees), each gene name must be unique across all assemblies within a GenomeHub. In practice, this means choosing a unique prefix for the genes in your assembly prior to submission, which would ideally be an INSDC-registered locus tag. Registering a locus tag is a required step before submission of an assembly to the major public databases (ENA/NCBI/DDBJ) so its use as a unique prefix at this stage helps to avoid a proliferation of synonyms for each gene.
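A quick check before submission can confirm that every gene carries the chosen prefix. The following is a minimal sketch, assuming that the primary gene identifier is stored in the ‘ID’ attribute of features of type ‘gene’; the function name and the prefix ‘ABC_’ are hypothetical and used for illustration only.

```python
def genes_missing_prefix(gff3_lines, prefix):
    """Return IDs of 'gene' features whose ID attribute lacks `prefix`."""
    missing = []
    for line in gff3_lines:
        # Skip directives, comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Column 9 holds semicolon-separated key=value pairs.
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        gene_id = attrs.get("ID", "")
        if not gene_id.startswith(prefix):
            missing.append(gene_id)
    return missing


# Hypothetical records, for illustration only.
example = [
    "##gff-version 3",
    "scaf1\tmaker\tgene\t100\t900\t.\t+\t.\tID=ABC_000010;Name=ABC_000010",
    "scaf1\tmaker\tgene\t1500\t2400\t.\t-\t.\tID=gene2;Name=gene2",
]
print(genes_missing_prefix(example, "ABC_"))  # ['gene2']
```

If the primary identifiers are held in a different attribute (for example ‘Name’), the attribute key in the sketch would need to be changed accordingly.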
The final column in a GFF3 file contains a set of key=value pairs, several of which (such as ‘ID’ and ‘Name’) can logically be used to contain the primary identifiers for genes, transcripts and proteins. If multiple key values are included in the GFF3 file then please make it clear to the curator which set should be used as the primary identifiers for genes, transcripts and proteins and also if it would be desirable for any alternatives to be imported as synonyms.
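To see which attribute keys are available as candidate identifiers, it can help to tally the keys used for each feature type. This is a rough sketch rather than part of the import process; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def attribute_keys_by_type(gff3_lines):
    """Count how often each attribute key occurs for each feature type."""
    keys = defaultdict(Counter)
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        feature_type = cols[2]
        for pair in cols[8].split(";"):
            key, sep, _value = pair.partition("=")
            if sep:
                keys[feature_type][key.strip()] += 1
    return keys


# Hypothetical records, for illustration only.
records = [
    "scaf1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1;Name=ABC_000010",
    "scaf1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=t1;Parent=g1;Name=ABC_000010-RA",
]
for feature_type, counts in attribute_keys_by_type(records).items():
    print(feature_type, dict(counts))
```

A key that is present on every gene, transcript and protein-coding feature is a reasonable candidate to nominate as the primary identifier.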
Protein sequence FASTA file
Protein sequences used throughout a GenomeHub and associated analyses are based on the annotations provided in the GFF3 file. The simplest way to ensure that these sequences match the protein sequences that you may have already used in other analyses is to provide a protein sequences file that the output proteins can be automatically checked against. This is particularly useful to ensure that frame/phase information in the GFF3 is interpreted correctly and to identify predicted sequences in which consecutive exons are out of phase with each other, which is a relatively common artefact of some gene prediction software. The gene names in this file should match the identifiers used in the GFF3 file or be relatable by a simple text substitution, e.g. transcript names ending ‘-RA’ in the GFF3 file corresponding with protein names ending ‘-PA’ in the protein FASTA file.
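The ‘-RA’ to ‘-PA’ example above can be checked with a simple substitution. The sketch below assumes that transcript names end in ‘-R’ followed by one or more capital letters and that the protein name is the first whitespace-delimited token of each FASTA header; both function names and the default pattern are illustrative assumptions.

```python
import re


def fasta_ids(fasta_lines):
    """Collect the first whitespace-delimited token of each FASTA header."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def unmatched_transcripts(transcript_ids, protein_ids,
                          pattern=r"-R([A-Z]+)$", replacement=r"-P\1"):
    """Return transcripts whose substituted name is absent from the FASTA."""
    unmatched = []
    for transcript_id in transcript_ids:
        expected = re.sub(pattern, replacement, transcript_id)
        if expected not in protein_ids:
            unmatched.append(transcript_id)
    return unmatched


# Hypothetical identifiers, for illustration only.
proteins = fasta_ids([">ABC_000010-PA some description", "MKV", ">ABC_000010-PB", "MKL"])
transcripts = ["ABC_000010-RA", "ABC_000010-RB", "ABC_000020-RA"]
print(unmatched_transcripts(transcripts, proteins))  # ['ABC_000020-RA']
```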
Assembly metadata TXT file
A small amount of metadata is required in order to import and display an assembly. Please supply this information in a TXT format file with appropriate values for each of the keys below. Lines beginning with ‘#’ are comments; these and any blank lines may be omitted.
# Binomial species name (optionally including subspecies/strain)
SPECIES.SCIENTIFIC_NAME =
# Species common name (or repeat scientific name if no common name)
SPECIES.COMMON_NAME =
# NCBI taxonomy ID number
SPECIES.TAXONOMY_ID =
# Short assembly name (must be unique across assemblies within a GenomeHub)
ASSEMBLY.NAME =
# Date of assembly in YYYY-MM-DD format
ASSEMBLY.DATE =
# Assembly GCA accession number (if available)
ASSEMBLY.ACCESSION =
# Assembly BioProject accession (if available)
ASSEMBLY.BIOPROJECT =
# INSDC locus tag prefix (if available)
ASSEMBLY.LOCUS_TAG =
# Name/institution name to be credited for the assembly
PROVIDER.NAME =
# Web address of provider (if available)
PROVIDER.URL =
# Date of gene annotation in YYYY-MM-DD format
GENEBUILD.START_DATE =
# Gene annotation version (number, usually 1)
GENEBUILD.VERSION =
If ASSEMBLY.BIOPROJECT and ASSEMBLY.LOCUS_TAG are specified, an EMBL format feature file can be generated once the assembly and gene models have been imported, which can be used as part of the INSDC submission process to add gene models to an assembly.
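The metadata file can be sanity-checked before submission. The sketch below is not the importer's own parser: it assumes one ‘KEY = value’ pair per line, and it treats every key not marked ‘(if available)’ as required, which is an interpretation of the list above rather than a documented rule. All function names are hypothetical.

```python
import re
from datetime import datetime

# Assumption: keys not marked "(if available)" must have a value.
REQUIRED_KEYS = [
    "SPECIES.SCIENTIFIC_NAME", "SPECIES.COMMON_NAME", "SPECIES.TAXONOMY_ID",
    "ASSEMBLY.NAME", "ASSEMBLY.DATE", "PROVIDER.NAME",
    "GENEBUILD.START_DATE", "GENEBUILD.VERSION",
]
DATE_KEYS = ["ASSEMBLY.DATE", "GENEBUILD.START_DATE"]


def parse_metadata(text):
    """Read KEY = value lines, skipping comments and blank lines."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def missing_keys(meta, required=REQUIRED_KEYS):
    """Return required keys that are absent or have an empty value."""
    return [key for key in required if not meta.get(key)]


def bad_dates(meta, date_keys=DATE_KEYS):
    """Return date keys whose value is not a valid YYYY-MM-DD date."""
    bad = []
    for key in date_keys:
        value = meta.get(key, "")
        try:
            if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
                raise ValueError(value)
            datetime.strptime(value, "%Y-%m-%d")
        except ValueError:
            bad.append(key)
    return bad
```

Calling `missing_keys(parse_metadata(text))` on the contents of the file returns an empty list when every key assumed to be required has a value.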
Search terms
In addition to the above metadata, providing some suggested search terms will enhance the accessibility of your data. One or more search terms should be included in the same file as the metadata above.
# A gene name matching one of the gene names in the GFF3 file
SAMPLE.GENE_TEXT =
# A transcript name matching one of the transcript names in the GFF3 file
SAMPLE.TRANSCRIPT_TEXT =
# A location of the format 'scaffold_name:start-end'
SAMPLE.LOCATION_TEXT =
# Search text matching a functional annotation (functional annotations will be added as part of the import process, based on BLAST hits to UniProt and an InterProScan analysis)
SAMPLE.SEARCH_TEXT =
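A value for SAMPLE.LOCATION_TEXT can be checked against the scaffold lengths in the assembly. This is a minimal sketch that assumes 1-based, inclusive coordinates and a dictionary of scaffold lengths prepared separately; the function name and the example scaffold are hypothetical.

```python
import re

# 'scaffold_name:start-end'; the scaffold name itself may contain colons.
LOCATION_RE = re.compile(r"^(?P<scaffold>.+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_location(text, scaffold_lengths):
    """Return (scaffold, start, end) or raise ValueError if invalid."""
    match = LOCATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not in 'scaffold_name:start-end' format: {text!r}")
    scaffold = match["scaffold"]
    start, end = int(match["start"]), int(match["end"])
    if scaffold not in scaffold_lengths:
        raise ValueError(f"unknown scaffold: {scaffold!r}")
    # Assumption: coordinates are 1-based and inclusive.
    if not 1 <= start <= end <= scaffold_lengths[scaffold]:
        raise ValueError(f"coordinates out of range for {scaffold!r}")
    return scaffold, start, end


# Hypothetical scaffold length, for illustration only.
lengths = {"scaffold_1": 50000}
print(parse_location("scaffold_1:1000-2000", lengths))  # ('scaffold_1', 1000, 2000)
```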
Species image PNG file
A species image file can be displayed in various locations in a GenomeHubs Ensembl browser. The image needs to be square (or suitable to be cropped as a square) and its dimensions should ideally be at least 512x512px. Make sure that you hold or have obtained rights to use the image and include a picture credit in the assembly description (see next section).
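The image dimensions can be read directly from the PNG header without additional libraries, since the width and height are stored as big-endian integers in the IHDR chunk that follows the 8-byte PNG signature. The sketch below is illustrative; the function names and the `min_side` parameter are assumptions, and a non-square image is only reported as a note because it may still be suitable for cropping.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data):
    """Return (width, height) from the leading bytes of a PNG file."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # Width and height are two big-endian 32-bit unsigned integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def image_problems(data, min_side=512):
    """List reasons why an image may not meet the guidance above."""
    width, height = png_dimensions(data)
    problems = []
    if width != height:
        problems.append(f"not square ({width}x{height}); check it can be cropped")
    if min(width, height) < min_side:
        problems.append(f"shorter side is below {min_side}px")
    return problems
```

Only the first 24 bytes are needed, so a file opened in binary mode can be checked with `image_problems(handle.read(24))`.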
Assembly description HTML file
An assembly description should be provided as an HTML-format file following the template below. The structure of this document is important as it allows the sections to be rendered in the appropriate area of the assembly homepage in the Ensembl browser. Edit the document by replacing the text enclosed by ‘p’ tags with relevant details for your assembly. If you are unsure how to edit an HTML file, please supply the information for the four sections (about, assembly, annotation, references) in a text file.
<!-- {about} -->
<p>
A paragraph about the species/subspecies and why it is interesting.
</p>
<p>
Picture credit (including a link to the original source URL, if applicable).
</p>
<!-- {about} -->
<!-- {assembly} -->
<p>
Details of the assembly process.
</p>
<!-- {assembly} -->
<!-- {annotation} -->
<p>
Details of the annotation process.
</p>
<!-- {annotation} -->
<!-- {references} -->
<p>
Any relevant references. Please use a list for multiple references, e.g.:
<ol>
<li>
Reference 1.
</li>
<li>
Reference 2.
</li>
</ol>
</p>
<!-- {references} -->
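Because the section markers determine where each part of the description is rendered, it is worth confirming that all four sections are present and correctly delimited after editing. This is a rough sketch that assumes each section is wrapped in an identical pair of comment markers, as in the template above; it is not the parser used by the browser and the function name is hypothetical.

```python
import re

SECTIONS = ("about", "assembly", "annotation", "references")


def extract_sections(html):
    """Return the text found between each pair of section markers."""
    sections = {}
    for name in SECTIONS:
        marker = re.escape("<!-- {%s} -->" % name)
        # Non-greedy match between the opening and closing marker.
        match = re.search(marker + r"(.*?)" + marker, html, flags=re.DOTALL)
        if match:
            sections[name] = match.group(1).strip()
    return sections


def missing_sections(html):
    """Return section names that are absent, unpaired or empty."""
    found = extract_sections(html)
    return [name for name in SECTIONS if not found.get(name)]
```

An empty list from `missing_sections` indicates that each of the four sections has both markers and some content between them.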