Some quick thoughts, motivated by my experiences. They may be completely banal to everybody, but here they are:
Weak genomes in, weak annotations out. If the genome assembly’s badly fragmented or missing significant amounts of the euchromatic genome, one’s gene predictions and annotations will suffer, and analyses downstream of them will be degraded. This can have significant false-negative effects on all sorts of things, e.g., identifying signaling pathways of interest in non-model organisms.
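One quick way to put a number on "badly fragmented" is the N50 of the assembly: the length of the contig or scaffold at which, counting from the longest downward, half of the total assembled sequence has been covered. Below is a minimal sketch; the function name `n50` and the two example assemblies are made up for illustration, and N50 says nothing about how much of the euchromatic genome is missing, so it only covers half of the problem.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Two hypothetical assemblies, each totalling 10 Mb.
contiguous = [4_000_000, 3_000_000, 2_000_000, 1_000_000]
fragmented = [50_000] * 200

print(n50(contiguous))
print(n50(fragmented))
```

Same total size, very different contiguity; genes that span contig breaks in the second assembly are likely to be predicted in pieces or not at all.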
For protein-coding gene predictions, there is a trade-off between being too conservative and too permissive about what one predicts; biologically, it is probably better to err on the side of permissiveness, because that makes rare but important true positives easier to detect.
The “valley of death” is getting one’s genome fully into GenBank, with its gene annotations – many nematode genomes, even after being published, have failed to make it through that valley. This hampers their use.
Unpublished genomes can also pile up in public databases (e.g., WormBase ParaSite has ~70 such genomes!), in a state where they are difficult to use because they’re intellectually encumbered for large-scale analysis.
ncRNA and small (<= 99-residue) protein-coding gene predictions tend to be neglected. This can prevent the discovery and analysis of potentially important genes.
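To see how well small proteins are represented in a given gene set, one can simply count predicted proteins at or below the 99-residue cutoff. This is a rough sketch, assuming the proteome is available as FASTA text; `read_fasta`, `small_proteins`, and the three example records are hypothetical names and data invented for illustration.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first word of the header as the sequence name.
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def small_proteins(lines, max_residues=99):
    """Return (name, length) for proteins of at most max_residues residues."""
    found = []
    for name, seq in read_fasta(lines):
        seq = seq.rstrip("*")  # drop a trailing stop-codon marker, if any
        if len(seq) <= max_residues:
            found.append((name, len(seq)))
    return found


# Hypothetical three-protein proteome.
example = [
    ">geneA some description",
    "M" + "A" * 119,
    ">geneB",
    "MKT",
    "LLV*",
    ">geneC",
    "M" * 99,
]

print(small_proteins(example))
```

With a real proteome one would pass an open file handle instead of the `example` list. A count near zero is a hint that the gene predictor was run with a minimum-length filter, not that the genome lacks small proteins.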
Even if one tries very hard to get good gene predictions, reconfirmation of gene predictions with cDNA sequencing remains crucial (we learned this recently when making proteins as vaccine candidates from a hookworm genome); if PacBio / Oxford Nanopore et al. can get high-throughput cDNA to work, that should help a lot.
A lot of value can be mined out of analyses that are themselves rather turn-crank. Five programs (Phobius; hmmsearch/PFAM; InterProScan; OrthoMCL or OrthoFinder; and Blast2GO) can make a proteome really informative. In particular, OrthoMCL/OrthoFinder can identify new gene families that aren’t already in PFAM or InterPro.
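The last point can be illustrated with a small sketch: once orthology groups and domain hits have been computed, candidate new families are just the groups in which no member has a known domain. This assumes the outputs of the orthology and domain-search steps have already been parsed into a dictionary and a set; `unannotated_families` and all gene and group names below are hypothetical, and this is not meant as the way any of those programs reports its results.

```python
def unannotated_families(orthogroups, genes_with_domains, min_size=2):
    """Return orthogroups in which no member has any known domain hit.

    orthogroups: dict mapping a group ID to a list of gene names.
    genes_with_domains: set of gene names with at least one domain hit.
    min_size: ignore groups smaller than this (singletons are not families).
    """
    novel = {}
    for group_id, members in orthogroups.items():
        if len(members) < min_size:
            continue
        if not set(members) & genes_with_domains:
            novel[group_id] = sorted(members)
    return novel


# Hypothetical parsed results for two species.
orthogroups = {
    "OG1": ["sp1_g1", "sp2_g7"],            # one member has a domain
    "OG2": ["sp1_g2", "sp1_g3", "sp2_g9"],  # no member has a domain
    "OG3": ["sp2_g4"],                      # singleton, ignored
}
genes_with_domains = {"sp1_g1", "sp2_g5"}

print(unannotated_families(orthogroups, genes_with_domains))
```

Groups that come out of a filter like this still need a sanity check (e.g., against repeats or mispredicted genes) before being called new families.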
Good biological analysis of a genome has two halves. The first half is blue-collar labor, really: getting the best possible assembly, the least unreliable possible gene predictions, and the most industrious possible protein annotations (PFAM etc.). In that first half, the best work is the most ploddingly diligent and thorough work. But the second half stands on the shoulders of the first half; it is made possible by that labor, but does not itself have the same nature. The second half starts when you ask, “OK, what about this particular genome do I really want to learn, biologically?” I think any two intelligent and curious biologists, given the same genome and annotations, will end up asking that question in two different ways and getting different answers from the same starting material.