Question

50% Missed exons when trying to replicate annotation reference

2.9 years ago

Mike ▴ 20

Hello, I have an annotation file which I'm trying to replicate.

So far, the results are horrible:

[Image: comparison statistics of the merged GTF against the reference annotation]

I'm intentionally not feeding the reference annotation I have into the pipeline, so that I get there "without help".

I build the index with STAR, clean my FASTQ files, and then align with:

STARlong --runThreadN 4 --genomeDir {args.dest} --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat --readFilesIn {fastqs} --outFileNamePrefix {args.org} --sjdbOverhang {max_len-1} --twopassMode Basic --outSAMattributes All

I then sort the BAM with samtools and assemble transcripts with StringTie:

stringtie -p 4 {sorted_bam} -o stringOutput{itr}.gtf

Finally, I merge all the GTFs with stringtie --merge.

I compare the final merged GTF file to the reference annotation I have, and the results are in the image above.

I don't have the SRA data that was used to build the reference annotation, so I downloaded two different projects from NCBI instead, but both gave roughly the same poor results.

What am I doing wrong? Is the main problem that I don't have the SRA inputs used to make the annotation I'm trying to replicate?
I'm planning to add or change parameters, but I suspect that won't have a significant effect. Am I correct?
I'd be happy to receive any information you can share on this subject. The main goal is to build a gene prediction and annotation pipeline.

Thanks a lot!

star gffcompare stringtie annotation pipeline

score 1 · Answer 1 · 2022-01-12

Transcriptome assembly is notoriously difficult to get right.

Some reasons are objective and very straightforward: many of your transcripts are not expressed at sufficient levels in the samples you sequenced.

Other reasons have to do with the complexity of the task at hand. Sometimes it is quite difficult to tell transcripts apart.

Look at some of your missed exons and see if you observe a pattern. Are these missed exons even covered by reads?
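One way to make that check concrete is sketched below in Python. It is only a sketch under stated assumptions: a reference exon is treated as "missed" when no assembled exon overlaps it on the same chromosome and strand (your comparison tool may define this differently), and the aligned blocks are passed in as plain `(chrom, start, end)` tuples that you would have to extract from your BAM yourself. The function names and the toy data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# (chrom, strand, start, end), 1-based inclusive coordinates as in GTF
Exon = Tuple[str, str, int, int]
# (chrom, start, end), 1-based inclusive; strand is ignored for coverage
Block = Tuple[str, int, int]


def parse_gtf_exons(lines: Iterable[str]) -> Set[Exon]:
    """Collect the unique exon intervals from GTF-formatted lines."""
    exons: Set[Exon] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        exons.add((fields[0], fields[6], int(fields[3]), int(fields[4])))
    return exons


def missed_exons(reference: Set[Exon], assembled: Set[Exon]) -> List[Exon]:
    """Reference exons that no assembled exon overlaps (assumed definition)."""
    # Quadratic scan: fine for a sketch, too slow for a whole genome.
    return sorted(
        r for r in reference
        if not any(
            a[0] == r[0] and a[1] == r[1] and a[2] <= r[3] and a[3] >= r[2]
            for a in assembled
        )
    )


def covered_fraction(exon: Exon, blocks: List[Block]) -> float:
    """Fraction of the exon's bases overlapped by at least one aligned block."""
    chrom, _strand, start, end = exon
    # Clip every overlapping block to the exon, then sweep left to right.
    clipped = sorted(
        (max(start, b_start), min(end, b_end))
        for b_chrom, b_start, b_end in blocks
        if b_chrom == chrom and b_start <= end and b_end >= start
    )
    covered = 0
    cursor = start - 1  # last base already counted
    for c_start, c_end in clipped:
        if c_end > cursor:
            covered += c_end - max(c_start, cursor + 1) + 1
            cursor = c_end
    return covered / (end - start + 1)


def gtf_exon(chrom: str, start: int, end: int, strand: str = "+") -> str:
    """Build one toy GTF exon line (illustration only)."""
    attrs = 'gene_id "g1"; transcript_id "t1";'
    return "\t".join([chrom, "toy", "exon", str(start), str(end), ".", strand, ".", attrs])


# Toy data standing in for the reference GTF, the merged GTF and the BAM.
reference = parse_gtf_exons([gtf_exon("chr1", 100, 199),
                             gtf_exon("chr1", 300, 399),
                             gtf_exon("chr1", 500, 599)])
assembled = parse_gtf_exons([gtf_exon("chr1", 100, 199)])
blocks = [("chr1", 90, 199), ("chr1", 290, 349)]

missed = missed_exons(reference, assembled)
for exon in missed:
    print(exon, covered_fraction(exon, blocks))
```

In the toy data, one missed exon is half covered and the other has no reads at all. Missed exons with no coverage point to expression or sample choice rather than to the assembler settings, while well-covered missed exons are worth inspecting in a genome browser.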

By and large, the problem usually is that you get too many false positives.
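Missed exons and false positives are two different failure modes, and it helps to track them separately. The Python sketch below computes exon-level sensitivity and precision from two sets of exon intervals, assuming an exon only counts as correct when its chromosome, strand, start and end all match exactly. The function name and the toy intervals are hypothetical, and the numbers will not necessarily match those reported by a dedicated comparison tool.

```python
from typing import Iterable, Tuple

# (chrom, strand, start, end), 1-based inclusive coordinates as in GTF
Exon = Tuple[str, str, int, int]


def exon_level_stats(reference: Iterable[Exon],
                     assembled: Iterable[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, precision) using exact exon matches."""
    ref, asm = set(reference), set(assembled)
    true_positives = len(ref & asm)
    # Sensitivity: share of reference exons that were recovered.
    sensitivity = true_positives / len(ref) if ref else 0.0
    # Precision: share of assembled exons that exist in the reference.
    precision = true_positives / len(asm) if asm else 0.0
    return sensitivity, precision


# Toy example: 4 reference exons, 5 assembled exons, 2 exact matches.
reference = [("chr1", "+", 100, 199), ("chr1", "+", 300, 399),
             ("chr1", "+", 500, 599), ("chr2", "-", 50, 150)]
assembled = [("chr1", "+", 100, 199), ("chr1", "+", 300, 399),
             ("chr1", "+", 480, 599),   # wrong start: counts as a false positive
             ("chr1", "+", 700, 750), ("chr2", "+", 50, 150)]

sensitivity, precision = exon_level_stats(reference, assembled)
print(f"sensitivity={sensitivity:.2f} precision={precision:.2f}")
```

Low sensitivity means reference exons are being missed, and low precision means the assembly contains many exons the reference does not support. Changing assembler parameters usually trades one for the other.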

score 1 · Answer 2 · 2022-01-12

Obtaining a high quality annotation usually requires multiple types of evidence, not just RNA-seq, e.g.:

Protein and/or full length transcript sequences from closely related species or accessions
Ab-initio gene predictions
Reference gene lift-over

Results from multiple alignment and prediction tools are then combined to produce gene models. I suggest you take a look at an annotation pipeline such as MAKER, or at EVidenceModeler, which combines results from several annotation sources.

Having said that, when comparing to the reference annotation you should keep in mind that:
a. The reference annotation is not necessarily correct, and most gene models in it have never been validated.
b. Reference annotations usually undergo some stage of manual curation, which improves the quality at the cost of much hard work.
Therefore, I wouldn't expect you to obtain an annotation that is highly similar to the reference annotation, especially since you are not using the same evidence.