Brainstorming for better eukaryotic transcriptome annotation


I shared a workshop with Tessa Pierce. She mentioned an interesting approach to this exact problem: a text-mining approach that picks the gene name that shows up most frequently among, let us say, the top 10 hits.
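The majority-vote idea can be sketched in a few lines. This is my own toy illustration, not Tessa's actual code; it assumes UniProt-style `GN=` gene-name tokens in the hit descriptions and falls back to the first word otherwise.

```python
from collections import Counter
import re

def consensus_gene_name(hit_descriptions, top_n=10):
    """Pick the most frequent gene name among the top BLAST hit descriptions.

    Toy heuristic: extract a UniProt-style 'GN=...' token if present,
    otherwise fall back to the first word of the description.
    """
    names = []
    for desc in hit_descriptions[:top_n]:
        m = re.search(r"GN=(\S+)", desc)
        names.append(m.group(1) if m else desc.split()[0])
    if not names:
        return None
    # most_common(1) returns the name with the highest count
    return Counter(names).most_common(1)[0][0]
```

For example, if two of the top three hits carry `GN=ACTB` and one carries an uncharacterized protein's name, the consensus is `ACTB` rather than whatever the single best hit happened to be.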

She also presented an interesting talk about annotating transcripts with KEGG.


This problem is not limited to non-model species; it affects all species, including human. I remember gene names derived from mutant phenotypes in Drosophila or from diseases in several animals. A quick search shows many links for that:


Those genes are usually annotated by sequence conservation and given these qualifiers to indicate that no functional studies were done to confirm the function.


Some quick thoughts, motivated by my experiences. They may be completely banal to everybody, but here they are:

Weak genomes in, weak annotations out. If the genome assembly’s badly fragmented or missing significant amounts of the euchromatic genome, one’s gene predictions and annotations will suffer, and analyses downstream of them will be degraded. This can have significant false-negative effects on all sorts of things, e.g., identifying signaling pathways of interest in non-model organisms.

For protein-coding gene predictions, there is a trade-off between being too conservative and too permissive about what one predicts; biologically, it is probably better to err on the permissive side, because that makes rare important positives easier to detect.

The “valley of death” is getting one’s genome fully into GenBank, with its gene annotations – many nematode genomes, even after being published, have failed to make it through that valley. This hampers their use.

Unpublished genomes can also pile up in public databases (e.g., WormBase ParaSite has ~70! such genomes), in a state where they are difficult to use because they are intellectually encumbered for large-scale analysis.

ncRNA and small (<= 99-residue) protein-coding gene predictions tend to be neglected. This can prevent the discovery and analysis of potentially important genes.

Even if one tries very hard to get good gene predictions, reconfirmation of gene predictions with cDNA sequencing remains crucial (we learned this recently when making proteins as vaccine candidates from a hookworm genome); if PacBio / Oxford Nanopore et al. can get high-throughput cDNA to work, that should help a lot.

A lot of value can be mined out of analyses that are themselves rather turn-crank. Five programs (Phobius; hmmsearch/PFAM; InterProScan; OrthoMCL or OrthoFinder; and Blast2GO) can make a proteome really informative. In particular, OrthoMCL/OrthoFinder can identify new gene families that aren’t already in PFAM or InterPro.

Good biological analysis of a genome has two halves. The first half is blue-collar labor, really: getting the best possible assembly, the least unreliable possible gene predictions, and the most industrious possible protein annotations (PFAM etc.). In that first half, the best work is the most ploddingly diligent and thorough work. But the second half stands on the shoulders of the first half; it is made possible by that labor, but does not itself have the same nature. The second half starts when you ask, “OK, what about this particular genome do I really want to learn, biologically?” I think any two intelligent and curious biologists, given the same genome and annotations, will end up asking that question in two different ways and getting different answers from the same starting material.


[I wrote this stuff before but am finally getting onto the Brainstorm track…]


This is just a description of the process; the actual pipeline is not shared. I had a discussion once with NCBI folks about sharing their scripts, but they said they can't afford to do "customer service" for those who would like to try the code.


This is where de novo transcriptome assembly performs better than reference-based assembly. Unfortunately, the genome-guided mode of Trinity does not perform well with bad genomes. I believe there is space here for a new tool, or at least an upgrade to Trinity, to solve the problem.


I think you are talking about something like limiting ORFs to a minimum length of 100 amino acids. I would agree if we could find a way to confirm these hits (e.g., high conservation), or if we were looking for a specific gene family where we have ways to confirm those specific rare important positives. But I think relaxing such criteria at the level of whole-genome annotation would inflate the gene numbers significantly. This is actually one major difference between the NCBI and Ensembl annotation pipelines: NCBI is on the conservative side, keeping only highly likely good hits, while Ensembl is much more relaxed, catching those rare events but keeping a lot of unsupported predicted genes.
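To make the length cutoff concrete, here is a minimal forward-strand ORF scanner (my own sketch, not any pipeline's actual code). Real annotation pipelines scan both strands and weigh homology and expression evidence, not just length; the point here is only that the `min_aa` threshold directly controls how many candidate genes survive.

```python
def find_orfs(seq, min_aa=100):
    """Return lengths (in amino acids, excluding the stop codon) of ORFs
    on the forward strand that meet the minimum-length cutoff.

    A sketch only: real annotators also scan the reverse strand and use
    homology/expression evidence rather than a bare length filter.
    """
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for frame in range(3):          # three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i           # open an ORF at the first ATG
            elif start is not None and codon in stops:
                aa_len = (i - start) // 3   # codons before the stop
                if aa_len >= min_aa:
                    orfs.append(aa_len)
                start = None        # close the ORF at the stop codon
    return orfs
```

Running the same scanner with `min_aa=100` versus a lower cutoff on a genome-scale input is a quick way to see how much the "small protein" class inflates the totals.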


Totally agree. It is one of multiple limitations/shortcomings of the current scheme used by gene repositories. Other limitations include:

  1. One cannot upload an annotation for a genome or transcriptome whose sequence one did not generate.
  2. Annotations made by the main repositories (e.g. NCBI) cover only a small proportion of species, and even the annotated species usually go through a very long update cycle (years). One way to solve this problem is to make these annotation pipelines open and easily usable (e.g. as ready-made virtual machines or Docker images) so that each community can run or update its own annotation.
  3. Annotation file formats (GTF/GFF) are loose, error-prone formats. For example, with automatic annotation pipelines you can easily get multiple genes with the same name in one genome, even on the same chromosome, and the files do not enforce unique feature IDs. Another example can be seen in NCBI's usage of these files, where many transcripts carry several changes relative to the genome. This introduces errors that do not surface until you start using the files in downstream analyses.
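The duplicate-name problem in point 3 is easy to screen for. Below is a minimal sanity-check sketch (my own, not a validator); it assumes GFF3-style `key=value` attributes in column 9 and that gene features carry a `Name` attribute, which real files do not always guarantee.

```python
from collections import defaultdict

def duplicate_gene_names(gff_lines):
    """Flag gene names that appear more than once in GFF3-like input.

    Assumes column 9 holds ';'-separated 'key=value' attributes and that
    gene features carry a Name attribute; real files vary, so treat this
    as a quick screen rather than a full validator.
    """
    seen = defaultdict(int)
    for line in gff_lines:
        if line.startswith("#"):
            continue                      # skip comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue                      # only inspect gene features
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "Name" in attrs:
            seen[attrs["Name"]] += 1
    return sorted(name for name, n in seen.items() if n > 1)
```

The same pattern extends to checking `ID` uniqueness, which the GFF3 spec requires but automated pipelines often violate.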


This can also be mitigated by the new super-reads assembly step used before HISAT2.
Tamer, I think this is more your field, but are you going to comment/write up variant calling with GATK and then fixing these variants in the transcriptome?
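The "call variants, then patch the reference-derived transcriptome" idea can be sketched as a simple substitution step. This is my own illustration, not a GATK workflow: it handles only SNPs (1-based position, REF, ALT) and ignores indels and phasing entirely.

```python
def apply_snps(seq, snps):
    """Apply simple SNP substitutions to a transcript sequence.

    `snps` is an iterable of (pos, ref, alt) with 1-based positions, as in
    a VCF. A sketch of patching a reference-derived transcriptome with
    called variants; indels and phasing are deliberately ignored.
    """
    bases = list(seq)
    for pos, ref, alt in snps:
        i = pos - 1                      # VCF positions are 1-based
        if bases[i] != ref:
            raise ValueError(f"REF mismatch at {pos}: expected {ref}, saw {bases[i]}")
        bases[i] = alt
    return "".join(bases)
```

The REF check is the important part: it catches coordinate mix-ups between the VCF and the transcript before they silently corrupt the sequence.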


I think the real backlog is functionally characterizing lncRNAs. They are harder to characterize due to their lack of conservation, so is it feasible for every species to functionally characterize all 20,000+ of its lncRNAs? Or perhaps one species, presumably mouse or human, needs better general characterization of the lncRNAs it is confident in, to allow other species to begin characterizing their lncRNAs properly?


I think not only do they give high false-positive rates, they also occasionally annotate pathways so general (e.g. cellular processes or metabolic processes) that they are not useful.
I think some alternatives could be performing the clustering analysis yourself to parse out robust pathways, or perhaps having the option to better define your input in terms of tissue type, species, or age to further refine the database search.


Discussion: What to do with mitochondrial gene expression detected by RNA-seq? Transcript assemblers have a difficult time with mitochondrial genes due to their polycistronic nature; however, in papers analyzing the human transcriptome, and as we saw in the horse transcriptome, a decent proportion of the RNA-seq reads map to the mitochondrial genome. One solution is essentially to throw them out of the transcriptome, but that seems like a lot of wasted sequencing resources. Are there any other solutions for mitochondrial gene expression?
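Before deciding what to do with those reads, it helps to quantify how large the mitochondrial fraction actually is. A dependency-free sketch (in practice the reference names would come from a BAM via something like pysam; here they are plain strings):

```python
from collections import Counter

def mito_fraction(reference_names, mito_names=("MT", "chrM")):
    """Fraction of mapped reads whose reference is the mitochondrial genome.

    `reference_names` is one reference name per mapped read. The
    mitochondrial contig names vary by assembly ('MT' in Ensembl-style
    references, 'chrM' in UCSC-style ones), hence both defaults.
    """
    counts = Counter(reference_names)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    mito = sum(counts[name] for name in mito_names)
    return mito / total
```

Rather than discarding these reads, one option is to quantify them directly against the (usually well-known) mitochondrial gene coordinates, sidestepping the assembler entirely.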




There is also a new pipeline, HISAT2 + StringTie.


I would also add the importance of strand information. E.g., Cufflinks produces many transcripts in both directions when non-stranded data are used.
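One quick way to spot this artifact is to look for transcript pairs that occupy the same locus but differ only in strand. A minimal sketch (mine, with exact-coordinate matching as a simplifying assumption; real mirrored models usually only overlap):

```python
def mirrored_transcripts(transcripts):
    """Find transcript pairs sharing chrom/start/end but differing in strand,
    a common artifact when assembling non-stranded data.

    `transcripts` is an iterable of (id, chrom, start, end, strand).
    Exact coordinate matching is a simplification; real mirrored models
    usually just overlap heavily rather than match exactly.
    """
    by_locus = {}
    pairs = []
    for tid, chrom, start, end, strand in transcripts:
        key = (chrom, start, end)
        if key in by_locus and by_locus[key][1] != strand:
            pairs.append((by_locus[key][0], tid))   # opposite-strand twin
        else:
            by_locus[key] = (tid, strand)
    return pairs
```

With stranded libraries these mirrored pairs largely disappear, which is one concrete argument for paying the extra cost of a stranded protocol.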


I would also mention the impact of unplaced chromosomes/scaffolds.


There is the FEELnc pipeline, but it is based on Cufflinks…


Introduction/Discussion: Improving the gene structure with CAGE data

Many non-model organisms lack proper annotation of the first exon; CAGE-seq should help resolve this issue.


We have to remember that these pathway databases are also usually biased toward human/mouse data, and toward "hot topic" pathways, e.g. cancer (so cell cycle, etc.).

And again, the gene names problem…