Brainstorming for better eukaryotic transcriptome annotation


This topic is for sharing ideas and brainstorming about any interesting question related to transcriptome annotation.
a) If you want to post a new idea, please use the button at the bottom of the topic. Please make a new reply for each idea. Ideas could be:

    1. Literature review points: Topics that we should cover by comprehensive literature search
    2. Discussion topics: Experiences, frustrations, doubts and suggestions

b) If you want to respond to any idea in the topic, please use the button on that individual post. You can also quote specific text by highlighting it with your mouse in a post. A button should appear above the highlighted text; click it, then begin typing. (You can do this many times to multi-quote a bunch of posts.)


Literature: Reference-based transcriptome annotation by the major genome annotation databases (e.g. RefSeq and Ensembl): compare and contrast


Literature: Genome annotation pipelines: e.g. Maker & PASA


Discussion: Effect of genome assembly errors on reference-based transcriptome assembly using RNAseq: Genome errors might prevent accurate mapping of RNAseq reads, causing fragmented or chimeric transcripts. More frequently, they prevent proper prediction of ORFs. The latter problem happens because current assemblers output their transcripts in GTF/GFF format, so downstream annotation programs have to use the reference genome to retrieve the transcript sequences, and any error in the genome is carried into those sequences. One way to improve transcriptome assembly from erroneous genomes is to use genome-guided Trinity de novo transcriptome assembly, which uses the genome only to group the mapped reads and reconstructs the transcripts from the reads themselves.
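To make the ORF problem concrete, here is a minimal sketch of how a downstream tool rebuilds a transcript sequence from GTF-style exon coordinates plus the reference genome. The genome string, coordinates, and function names are all made up for illustration; real tools work from FASTA and GTF files, but the principle should be the same: whatever base the genome has at that position ends up in the transcript.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]


def spliced_sequence(genome, chrom, exons, strand):
    """Join exon slices from the genome into a transcript sequence.

    exons holds (start, end) pairs, 1-based and inclusive as in GTF/GFF.
    """
    seq = "".join(genome[chrom][start - 1:end] for start, end in sorted(exons))
    return reverse_complement(seq) if strand == "-" else seq


# Toy data (hypothetical): two exons separated by a GT...AG intron.
genome = {"chr1": "GGATGAAACCCGTAGGTTTTAGGGGTGATT"}
exons = [(3, 11), (23, 28)]
correct = spliced_sequence(genome, "chr1", exons, "+")

# Simulate a single-base assembly error (A -> T at position 6 of chr1).
bad_genome = {"chr1": genome["chr1"][:5] + "T" + genome["chr1"][6:]}
erroneous = spliced_sequence(bad_genome, "chr1", exons, "+")

print(correct)    # ATGAAACCCGGGTGA
print(erroneous)  # ATGTAACCCGGGTGA
```

In this toy case the single wrong base turns the second codon into TAA, so an ORF predictor would see a premature stop even if every RNAseq read carried the correct base.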


Discussion: Effect of nearby paralogs on reference-based transcriptome assembly using RNAseq: They can cause chimeric transcripts. This might happen because of PCR chimeric artifacts or because of mapping errors due to sequence similarity, especially if the RNAseq reads do not perfectly match the reference genome (reference errors, sequencing errors, or biological variants).


Discussion: Effect of different library preps on the quality of reference-based transcriptomes using RNAseq: I noticed that ribosomal depletion libraries are likely to inflate the number of isoforms per gene locus because of intron retention. On the other hand, they are useful for assembling non-polyadenylated transcripts, e.g. lncRNA. Do we have literature supporting or contradicting this idea? Does anybody know of a publicly available dataset where the same biological samples were sequenced using different library preps?


Discussion: Filtration of RNAseq transcriptomes: Transcriptomes assembled from RNAseq (either reference-based or de novo) are noisy. In our article on the horse transcriptome, we excluded single-exon transcripts falling completely within the introns of multi-exonic transcripts because they are likely to be leftover fragments of primary transcripts. We also removed isoforms that contribute only a minimal share of their locus's expression because they seemed to be artifactual (e.g. isoforms with intron retention or chimeric transcripts).
I think judging any isoform to be correct or artifactual should consider several factors at the same time. Expression level, sequence conservation, codon bias, canonical or non-canonical splice sites, position in relation to other transcripts, and maybe other factors should be used in one model to predict whether a given transcript is good or bad. I think this is very doable in reference-based transcriptome assembly and maybe also in de novo assembly.
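The two filters described above could be sketched roughly as follows. This is not the code from the horse transcriptome article; the record layout (`id`, `locus`, `chrom`, `exons`, `tpm`), the 5% cutoff, and the choice to ignore strand are all assumptions made for illustration.

```python
from collections import defaultdict


def introns(exons):
    """Return intron intervals (1-based, inclusive) between sorted exons."""
    exons = sorted(exons)
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(exons, exons[1:])]


def filter_transcripts(transcripts, min_share=0.05):
    """Drop intronic single-exon fragments and very minor isoforms."""
    # Collect introns of all multi-exon transcripts, per chromosome.
    intron_index = defaultdict(list)
    for t in transcripts:
        if len(t["exons"]) > 1:
            intron_index[t["chrom"]].extend(introns(t["exons"]))

    # Total expression per locus, used to compute each isoform's share.
    locus_total = defaultdict(float)
    for t in transcripts:
        locus_total[t["locus"]] += t["tpm"]

    kept = []
    for t in transcripts:
        if len(t["exons"]) == 1:
            start, end = t["exons"][0]
            if any(i_start <= start and end <= i_end
                   for i_start, i_end in intron_index[t["chrom"]]):
                continue  # lies entirely inside an intron
        total = locus_total[t["locus"]]
        if total > 0 and t["tpm"] / total < min_share:
            continue  # minimal share of the locus expression
        kept.append(t)
    return kept


# Hypothetical toy records.
transcripts = [
    {"id": "T1", "locus": "G1", "chrom": "chr1", "exons": [(100, 200), (500, 600)], "tpm": 95.0},
    {"id": "T2", "locus": "G1", "chrom": "chr1", "exons": [(100, 200), (450, 600)], "tpm": 3.0},
    {"id": "T3", "locus": "G2", "chrom": "chr1", "exons": [(250, 400)], "tpm": 10.0},
    {"id": "T4", "locus": "G3", "chrom": "chr1", "exons": [(900, 1200)], "tpm": 10.0},
]
kept_ids = [t["id"] for t in filter_transcripts(transcripts)]
print(kept_ids)  # ['T1', 'T4']
```

Here T3 is dropped because it sits inside the intron of T1, and T2 is dropped because it carries about 3% of the expression of locus G1. A combined model, as suggested above, would replace these hard cutoffs with a single score over several features.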


Discussion: Functional annotation of transcripts relies on ORF prediction and conservation. Current pipelines do not make enough use, if any, of genomic context or additional transcriptional evidence. One exception to this was the R package made by Kreutz et al., Bioinformatics, 2012.


Discussion: Lack of reliable evidence in the available pipelines for predicting non-coding RNA (lncRNA, miRNA, …)


Literature: Updated list of software/pipelines used in transcriptome assembly and annotation


The dammit pipeline by Camille Scott (in prep) allows the user to specify a custom protein database (translated amino acids of the coding sequences from the reference genome) as evidence in addition to homology-based comparisons with OrthoDB, Pfam-A, and Rfam databases.


Discussion: Following de novo transcriptome assembly, annotation results end up assigning multiple gene names per contig. Differential expression analysis requires quantification of reads for each contig. Ideally, we want one name per contig. What is the best way to make a systematic decision about which gene name to accept as the top hit? Sorting and then taking the lowest E-value, keeping only matches to the most trusted reference genome, or taking the longest match? Manually going through and making decisions about each contig is tedious and low-throughput.


Transcript_19050	HMMER	protein_hmm_match	124	711	3.80E-12	.	.	ID=homology:61984	Name=His_Phos_1	Target=His_Phos_1 1 184 +	Note=Histidine phosphatase superfamily (branch 1)	accuracy=0.77	env_coords=124 732	Dbxref="Pfam:PF00300.18"
Transcript_19050	HMMER	protein_hmm_match	1753	1869	2.60E+03	.	.	ID=homology:61986	Name=His_Phos_1	Target=His_Phos_1 112 152 +	Note=Histidine phosphatase superfamily (branch 1)	accuracy=0.52	env_coords=1687 1908	Dbxref="Pfam:PF00300.18"
Transcript_19050	HMMER	protein_hmm_match	274	561	2.00E-22	.	.	ID=homology:61987	Name=Thioredoxin	Target=Thioredoxin 5 100 +	Note=Thioredoxin	accuracy=0.85	env_coords=259 573	Dbxref="Pfam:PF00085.16"
Transcript_19050	HMMER	protein_hmm_match	889	975	1.40E+02	.	.	ID=homology:61985	Name=His_Phos_1	Target=His_Phos_1 110 138 +	Note=Histidine phosphatase superfamily (branch 1)	accuracy=0.77	env_coords=820 984	Dbxref="Pfam:PF00300.18"
Transcript_19050	LAST	translated_nucleotide_match	336	566	1.80E-22	+	.	ID=homology:152759	Name=F4NYE1_BATDJ	Target=F4NYE1_BATDJ 17 95 +	database=OrthoDB			
Transcript_19050	transdecoder	CDS	3	584	.	+	.	ID=cds.Transcript_19050|m.62710	Parent=Transcript_19050|m.62710					
Transcript_19050	transdecoder	CDS	748	2667	.	-	.	ID=cds.Transcript_19050|m.62709	Parent=Transcript_19050|m.62709					
Transcript_19050	transdecoder	exon	1	3320	.	-	.	ID=Transcript_19050|m.62709.exon1	Parent=Transcript_19050|m.62709					
Transcript_19050	transdecoder	exon	1	3320	.	+	.	ID=Transcript_19050|m.62710.exon1	Parent=Transcript_19050|m.62710					
Transcript_19050	transdecoder	five_prime_UTR	1	2	.	+	.	ID=Transcript_19050|m.62710.utr5p1	Parent=Transcript_19050|m.62710					
Transcript_19050	transdecoder	five_prime_UTR	2668	3320	.	-	.	ID=Transcript_19050|m.62709.utr5p1	Parent=Transcript_19050|m.62709					
Transcript_19050	transdecoder	gene	1	3320	.	-	.	ID=Transcript_19050|g.62709	Name=ORF%20Transcript_19050%7Cg.62709%20Transcript_19050%7Cm.62709%20type%3Acomplete%20len%3A640%20(-)					
Transcript_19050	transdecoder	gene	1	3320	.	+	.	ID=Transcript_19050|g.62710	Name=ORF%20Transcript_19050%7Cg.62710%20Transcript_19050%7Cm.62710%20type%3A5prime_partial%20len%3A194%20(%2B)					
Transcript_19050	transdecoder	mRNA	1	3320	.	-	.	ID=Transcript_19050|m.62709	Parent=Transcript_19050|g.62709	Name=ORF%20Transcript_19050%7Cg.62709%20Transcript_19050%7Cm.62709%20type%3Acomplete%20len%3A640%20(-)				
Transcript_19050	transdecoder	mRNA	1	3320	.	+	.	ID=Transcript_19050|m.62710	Parent=Transcript_19050|g.62710	Name=ORF%20Transcript_19050%7Cg.62710%20Transcript_19050%7Cm.62710%20type%3A5prime_partial%20len%3A194%20(%2B)				
Transcript_19050	transdecoder	three_prime_UTR	1	747	.	-	.	ID=Transcript_19050|m.62709.utr3p1	Parent=Transcript_19050|m.62709					
Transcript_19050	transdecoder	three_prime_UTR	585	3320	.	+	.	ID=Transcript_19050|m.62710.utr3p1	Parent=Transcript_19050|m.62710					


Transcript_10000	DUF1754
Transcript_10000	gi|831566245|ref|XP_012733304.1| PREDICTED: protein FAM32A [Fundulus heteroclitus]
Transcript_10000	ORF%20Transcript_10000%7Cg.12203%20Transcript_10000%7Cm.12203%20type%3Acomplete%20len%3A116%20%28-%29
Transcript_10000	ORF%20Transcript_10000%7Cg.12203%20Transcript_10000%7Cm.12203%20type%3Acomplete%20len%3A116%20%28-%29
Transcript_10000	DUF1754
Transcript_10000	ENSXMAP00000003504
Transcript_100000	Myosin-VI_CBD
Transcript_100000	gi|831555656|ref|XP_012729791.1| PREDICTED: unconventional myosin-VI-like isoform X1 [Fundulus heteroclitus]
Transcript_100000	gi|831537298|ref|XP_012723266.1| PREDICTED: unconventional myosin-VI-like isoform X4 [Fundulus heteroclitus]
Transcript_100000	gi|831537292|ref|XP_012723263.1| PREDICTED: unconventional myosin-VI-like isoform X2 [Fundulus heteroclitus]
Transcript_100000	gi|831537289|ref|XP_012723262.1| PREDICTED: unconventional myosin-VI-like isoform X1 [Fundulus heteroclitus]
Transcript_100000	gi|831537295|ref|XP_012723264.1| PREDICTED: unconventional myosin-VI-like isoform X3 [Fundulus heteroclitus]
Transcript_100000	gi|831555680|ref|XP_012729800.1| PREDICTED: unconventional myosin-VI-like isoform X9 [Fundulus heteroclitus]
Transcript_100000	gi|831555677|ref|XP_012729799.1| PREDICTED: unconventional myosin-VI-like isoform X8 [Fundulus heteroclitus]
Transcript_100000	gi|831555662|ref|XP_012729794.1| PREDICTED: unconventional myosin-VI-like isoform X3 [Fundulus heteroclitus]
Transcript_100000	gi|831555674|ref|XP_012729798.1| PREDICTED: unconventional myosin-VI-like isoform X7 [Fundulus heteroclitus]
Transcript_100000	ENSXMAP00000008020
Transcript_100000	gi|831555671|ref|XP_012729797.1| PREDICTED: unconventional myosin-VI-like isoform X6 [Fundulus heteroclitus]
Transcript_100000	gi|831555665|ref|XP_012729795.1| PREDICTED: unconventional myosin-VI-like isoform X4 [Fundulus heteroclitus]
Transcript_100000	gi|831555668|ref|XP_012729796.1| PREDICTED: unconventional myosin-VI-like isoform X5 [Fundulus heteroclitus]
Transcript_100000	ORF%20Transcript_100000%7Cg.119190%20Transcript_100000%7Cm.119190%20type%3A5prime_partial%20len%3A131%20%28-%29
Transcript_100000	ORF%20Transcript_100000%7Cg.119190%20Transcript_100000%7Cm.119190%20type%3A5prime_partial%20len%3A131%20%28-%29
Transcript_100000	gi|831555659|ref|XP_012729793.1| PREDICTED: unconventional myosin-VI-like isoform X2 [Fundulus heteroclitus]
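As one possible starting point for the "lowest E-value" option, here is a minimal sketch that picks the best-scoring hit per contig from rows laid out like the GFF3 excerpt above. It assumes the E-value sits in column 6 and that attributes may be separated by tabs (as in the pasted rows) or by semicolons (as in standard GFF3). The `max_evalue` cutoff is an arbitrary assumption. Hits are kept separately per source program, because E-values from different programs and databases are not directly comparable.

```python
import re


def parse_hits(lines):
    """Parse scored homology rows; rows without an E-value are skipped."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[5] == ".":
            continue
        attrs = {}
        for item in re.split(r"[;\t]", "\t".join(fields[8:])):
            if "=" in item:
                key, value = item.split("=", 1)
                attrs[key] = value
        hits.append({
            "contig": fields[0],
            "source": fields[1],
            "evalue": float(fields[5]),
            "name": attrs.get("Name"),
        })
    return hits


def best_hits(hits, max_evalue=1e-5):
    """Keep the lowest-E-value hit for each (contig, source) pair."""
    best = {}
    for hit in hits:
        if hit["evalue"] > max_evalue:
            continue
        key = (hit["contig"], hit["source"])
        if key not in best or hit["evalue"] < best[key]["evalue"]:
            best[key] = hit
    return best


# Rows abbreviated from the Transcript_19050 excerpt in this post.
rows = [
    "Transcript_19050\tHMMER\tprotein_hmm_match\t124\t711\t3.80E-12\t.\t.\tID=homology:61984\tName=His_Phos_1",
    "Transcript_19050\tHMMER\tprotein_hmm_match\t1753\t1869\t2.60E+03\t.\t.\tID=homology:61986\tName=His_Phos_1",
    "Transcript_19050\tHMMER\tprotein_hmm_match\t274\t561\t2.00E-22\t.\t.\tID=homology:61987\tName=Thioredoxin",
    "Transcript_19050\tLAST\ttranslated_nucleotide_match\t336\t566\t1.80E-22\t+\t.\tID=homology:152759\tName=F4NYE1_BATDJ",
    "Transcript_19050\ttransdecoder\tCDS\t3\t584\t.\t+\t.\tID=cds.Transcript_19050|m.62710",
]
best = best_hits(parse_hits(rows))
for (contig, source), hit in sorted(best.items()):
    print(contig, source, hit["name"], hit["evalue"])
```

For this contig the sketch would report Thioredoxin as the best Pfam domain and F4NYE1_BATDJ as the best OrthoDB match. It does not settle which of the two should become the contig's single name; that still needs a rule about which database to trust most.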


Discussion: For nonmodel species, gene names from homology-based annotations sometimes do not make sense in an evolutionary context. For example:

breast cancer genes in corals,

autism susceptibility genes in fish:

In these cases, describing functions within the context of the organism may require digging into the primary papers to describe the protein’s functions and/or collecting additional experimental evidence. This is not a high-throughput approach.


Literature: Homology-based annotation “the most important factor in the evolution of function is not amino acid sequence, but rather the cellular context in which proteins act”

“ortholog conjecture”: orthologs are more likely to be functionally similar than paralogs

“paralogous genes from the same species are often a much better predictor of functional divergence than are orthologs or paralogs from different species, even at lower sequence identities”

  • PASA is quite old, developed for EST technology. RNA-Seq has replaced all EST sequencing as far as I can tell, but I don’t know how much tools like Cufflinks are used for genome annotation. Cufflinks was definitely too noisy to use in our wasp genome annotation.
  • Should also mention EVidenceModeler, which is more in the spirit of “annotation integrator” tools like Maker than PASA is.
  • Besides Maker and EVM, I’m not aware of any other annotators that are in frequent use by the community.
    • GNOMON is good, but is only accessible to the NCBI annotation team(s).
    • JIGSAW is pretty old and I never hear about it.
    • GLEAN has been used for many insect annotations, and apparently fish, but I don’t know how widely it is used.



In general, the practice of naming genes seems to be a mess. Downstream functional annotation tools, e.g. DAVID and IPA, rely heavily on gene names and accessions as input. Some annotation pipelines add a “-like” and/or “PREDICTED” to the name, making it difficult to search databases such as UniProt to find homologues.

What is it about these genes that makes the names “PREDICTED”, “probable” or “-like”?

gi|831481386|ref|XP_012727344.1| PREDICTED: chemokine-like receptor 1 [Fundulus heteroclitus]
gi|768954940|ref|XP_011616165.1| PREDICTED: probable serine/threonine-protein kinase kinX isoform X1 [Takifugu rubripes]
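One practical workaround is to split such hit descriptions into the accession and the decorated name, so that the accession rather than the name is passed downstream. The sketch below is written only against the definition-line layout seen in this thread (`gi|...|ref|...| description [organism]`); the regular expression and field names are my own assumptions and would need adjusting for other formats.

```python
import re

DEFLINE = re.compile(
    r"^gi\|(?P<gi>\d+)\|ref\|(?P<accession>[^|]+)\|\s*"
    r"(?P<description>.*?)\s*\[(?P<organism>[^\]]+)\]$"
)


def parse_refseq_hit(defline):
    """Split a hit description into accession, name, isoform and organism."""
    match = DEFLINE.match(defline.strip())
    if match is None:
        return None  # not in the expected layout, e.g. an Ensembl ID
    name = match.group("description")
    predicted = name.startswith("PREDICTED: ")
    if predicted:
        name = name[len("PREDICTED: "):]
    isoform = None
    iso = re.search(r"\s+isoform\s+(\S+)$", name)
    if iso:
        isoform = iso.group(1)
        name = name[:iso.start()]
    return {
        "accession": match.group("accession"),
        "name": name,
        "isoform": isoform,
        "predicted": predicted,
        "organism": match.group("organism"),
    }


hit = parse_refseq_hit(
    "gi|768954940|ref|XP_011616165.1| PREDICTED: probable "
    "serine/threonine-protein kinase kinX isoform X1 [Takifugu rubripes]"
)
print(hit["accession"], "|", hit["name"], "|", hit["isoform"])
```

Qualifiers such as "probable" and "-like" are deliberately left in the name, since they may carry information about how the name was assigned.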

It seems more reliable to use sequence-based identification than to rely on names, e.g. Ankyrin-2, Ank, Ank_2, ANK2, Ank2_human, ANK-2. By "sequence-based identification", I mean using the nucleic acid sequence or translated amino acid sequence to align and confirm homology, rather than relying on the names of genes. In effect, relying on names is like another translation that has to occur, and there is the potential to lose information.

See Genecards
And biogps


Discussion: Functional annotation, pipeline tools and GO terms:

Pathway analysis:

high false positives:


How does one mRNA get assigned two different “isoform” names?

1238	Transcript_100193	gi|831575638|ref|XP_012736486.1| PREDICTED: ubiquitin carboxyl-terminal hydrolase 40 isoform X3 [Fundulus heteroclitus]
1250	Transcript_100193	ORF%20Transcript_100193%7Cg.119534%20Transcript_100193%7Cm.119534%20type%3Acomplete%20len%3A243%20%28%2B%29
1239	Transcript_100193	gi|831575631|ref|XP_012736483.1| PREDICTED: ubiquitin carboxyl-terminal hydrolase 40 isoform X1 [Fundulus heteroclitus]


Info on NCBI Eukaryotic Genome Annotation Pipeline (because I’ve been trying to figure this out):


Exactly, all this evidence is based on direct testing of sequence conservation in some way. Other ideas include, for example, selecting the ORF that best fits the codon bias of a given species.
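A very rough sketch of the codon-bias idea: estimate codon frequencies from a set of trusted coding sequences of the species, then score each candidate ORF by its mean log-odds per codon against uniform codon usage. This is a simplification chosen for illustration (it does not separate amino acid composition from synonymous codon choice, as a CAI-style score would), and the reference sequences and function names are made up.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codons(seq):
    """Split a sequence into in-frame codons, dropping a trailing partial one."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def codon_frequencies(reference_cds, pseudocount=1.0):
    """Estimate codon frequencies from trusted CDS, with a pseudocount."""
    counts = Counter(c for cds in reference_cds for c in codons(cds))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def codon_bias_score(orf, freqs):
    """Mean log-odds per codon against uniform usage (1/64 per codon)."""
    scored = [c for c in codons(orf) if c in freqs]  # skips codons with N
    if not scored:
        return float("-inf")
    return sum(math.log(freqs[c] * 64) for c in scored) / len(scored)


def pick_orf(candidates, freqs):
    """Return the candidate ORF that best matches the reference codon usage."""
    return max(candidates, key=lambda orf: codon_bias_score(orf, freqs))


# Hypothetical reference CDS and two candidate ORFs from one contig.
reference_cds = ["ATGGCCGCCGCCCTGCTGTAA", "ATGGCCCTGGCCCTGTGA"]
freqs = codon_frequencies(reference_cds)
candidates = ["ATGGCTTTAGCTTAA", "ATGGCCCTGGCCTAA"]
chosen = pick_orf(candidates, freqs)
print(chosen)
```

With real data the reference set would need to be large (thousands of CDS) for the frequencies to be meaningful, and the score would presumably be combined with ORF length and homology evidence rather than used alone.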