Question

Question about Trinity assembly QC?

0

2.1 years ago

pearl2070 ▴ 10

I have a question about some of the Trinity QC information. I'll use a tutorial dataset (found here: https://github.com/trinityrnaseq/KrumlovTrinityWorkshopJan2018/wiki/Home/1a23eb56a8857c3ed9595f9224367e25129f8f4b) as an example to keep the question straightforward.

When TrinityStats.pl is run on the tutorial dataset, it reports 683 'genes' and 687 transcripts. Then, in the tutorial, under "Assess number of full-length coding transcripts," after BLASTing the transcripts and running analyze_blastPlus_topHit_coverage.pl on the results, a table is generated showing bins of percent length coverage of the best-matching protein sequence, the count of proteins in each bin, and a running total of proteins across all bins. It seems there are only 324 proteins in total. What happened to the rest? Why is there a discrepancy between the number of proteins that have BLAST hits and the number of genes in the assembly?

QC Trinity RNA-seq transcriptomics metatranscriptomics • 711 views

updated 2.1 years ago by h.mon 35k • written 2.1 years ago by pearl2070 ▴ 10

score 2 · Accepted Answer · 2022-11-18

2.1 years ago

h.mon 35k

Not all genes / transcripts will have BLAST hits, and with -outfmt 6 BLAST only writes a line for each hit it finds, so query sequences without any hit are simply absent from the output. They would belong in a 0% coverage line, but that line is not shown because BLAST never outputs those sequences.
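If you want to see how many transcripts fall into that missing "no hit" group, one rough way is to compare the IDs in the assembly FASTA against the query IDs (column 1) of the -outfmt 6 file. The sketch below is not part of Trinity; the function names and the toy records are made up for illustration, and it assumes the FASTA ID is the first word after `>` and matches the BLAST query ID exactly.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first word after '>') from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def blast_query_ids(lines):
    """Collect query IDs (column 1) from tab-separated BLAST -outfmt 6 lines."""
    ids = set()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines (e.g. from -outfmt 7).
        if not line or line.startswith("#"):
            continue
        ids.add(line.split("\t")[0])
    return ids


def queries_without_hits(fasta_lines, blast_lines):
    """Return IDs present in the FASTA but absent from the BLAST output."""
    return fasta_ids(fasta_lines) - blast_query_ids(blast_lines)


# Toy data (hypothetical IDs, not from the tutorial dataset).
toy_fasta = [
    ">tx1 len=300\n", "ACGTACGT\n",
    ">tx2 len=250\n", "GGCCGGCC\n",
    ">tx3 len=410\n", "TTAATTAA\n",
]
toy_blast = [
    "tx1\tPROT_A\t98.5\t100\t1\t0\t1\t300\t1\t100\t1e-50\t200\n",
    "tx1\tPROT_B\t80.0\t90\t18\t0\t1\t270\t5\t94\t1e-20\t110\n",
]

missing = queries_without_hits(toy_fasta, toy_blast)
print(sorted(missing))  # prints ['tx2', 'tx3']
```

For real files you would pass open file handles instead of the toy lists. Keep in mind this counts transcripts, while the coverage table counts proteins, so the two numbers are not expected to add up to the transcript total exactly.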
