Hello,
I am a PhD candidate in bioinformatics with (almost) zero guidance, so I am seeking help here. I was asked to do a de novo RNA transcriptome assembly from total RNA sequencing. After FastQC I trimmed my original FASTQ files and then ran Trinity, which gave me trinity_trimmed.fasta. Some of the things I was asked to do are:
1) Fill out a table like this one:
| | Total number | Total length (nt) | Mean length (nt) | N50 | Total consensus sequences | Distinct clusters | Distinct singletons |
|---|---|---|---|---|---|---|---|
| Contig | | | | | | | |
| Unigene | | | | | | | |
I used TrinityStats.pl and got this:
Counts of transcripts, etc.
Total trinity 'genes': 87177
Total trinity transcripts: 169974
Percent GC: 40.18
Stats based on ALL transcript contigs:
Contig N10: 3290
Contig N20: 2503
Contig N30: 2049
Contig N40: 1713
Contig N50: 1413
Median contig length: 529
Average contig: 869.67
Total assembled bases: 147821426
Stats based on ONLY LONGEST ISOFORM per 'GENE':
Contig N10: 3087
Contig N20: 2301
Contig N30: 1816
Contig N40: 1414
Contig N50: 1029
Median contig length: 348
Average contig: 632.11
Total assembled bases: 55105774
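For reference, the numeric columns of the table (total number, total length, mean length, N50) can also be computed directly from any FASTA file, which helps when cross-checking the TrinityStats.pl numbers. This is only a minimal sketch: the function names are made up for illustration, and N50 here is taken as the length of the contig at which the running total of lengths, sorted longest first, reaches half of all assembled bases.

```python
def read_fasta_lengths(lines):
    """Return the length of every sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def assembly_stats(lengths):
    """Collect the table columns that depend only on sequence lengths."""
    total = sum(lengths)
    return {
        "total_number": len(lengths),
        "total_length": total,
        "mean_length": total / len(lengths) if lengths else 0.0,
        "N50": n50(lengths),
    }


# Example usage (the file name is the one from this post):
# with open("trinity_trimmed.fasta") as handle:
#     print(assembly_stats(read_fasta_lengths(handle)))
```

The same functions can be pointed at the CAP3 contig and singleton files, so all rows of the table are computed the same way.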
My question has two parts: a) Can I fill out this table with this information? b) Some people use the CAP3 assembly tool. I have already run that too, in case I need it. Is that the way to go? Do I need to check the quality of trinity_trimmed.fasta?
For the CAP3 output I also ran TrinityStats.pl and got this:
For contigs:
Total trinity 'genes': 23017
Total trinity transcripts: 23017
Percent GC: 40.42
Stats based on ALL transcript contigs:
Contig N10: 3885
Contig N20: 3082
Contig N30: 2598
Contig N40: 2254
Contig N50: 1971
Median contig length: 1318
Average contig: 1522.23
Total assembled bases: 35037102
- note: not reporting gene-based longest isoform info since couldn't parse Trinity accession info.
For singletons:
Counts of transcripts, etc.
Total trinity 'genes': 67695
Total trinity transcripts: 81478
Percent GC: 38.77
Stats based on ALL transcript contigs:
Contig N10: 1906
Contig N20: 1347
Contig N30: 1007
Contig N40: 751
Contig N50: 572
Median contig length: 333
Average contig: 490.70
Total assembled bases: 39981353
Stats based on ONLY LONGEST ISOFORM per 'GENE':
Contig N10: 1853
Contig N20: 1284
Contig N30: 917
Contig N40: 671
Contig N50: 508
Median contig length: 317
Average contig: 461.01
Total assembled bases: 31207973
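The "ONLY LONGEST ISOFORM per 'GENE'" blocks above depend on Trinity-style transcript IDs, where the gene ID is the transcript ID without its trailing `_i<number>` isoform suffix (for example `TRINITY_DN1_c0_g1_i2` belongs to gene `TRINITY_DN1_c0_g1`). A rough sketch of that grouping is below. The function names are hypothetical, and treating the longest isoform per gene as the "Unigene" row is an assumption that would need to be confirmed with whoever designed the table.

```python
import re


def gene_id(transcript_id):
    """Strip the trailing isoform suffix (_i1, _i2, ...) from a Trinity transcript ID."""
    return re.sub(r"_i\d+$", "", transcript_id)


def longest_isoform_per_gene(records):
    """Map each gene ID to its longest (transcript_id, sequence) pair.

    `records` is any iterable of (transcript_id, sequence) tuples.
    Ties keep the isoform that was seen first.
    """
    best = {}
    for transcript_id, sequence in records:
        gid = gene_id(transcript_id)
        if gid not in best or len(sequence) > len(best[gid][1]):
            best[gid] = (transcript_id, sequence)
    return best
```

IDs without the `_i<number>` suffix, such as CAP3 contig names, are left unchanged by `gene_id`, so each of them ends up as its own "gene". That matches the note above about the gene-based stats not being reported for the CAP3 contigs.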
2) Provide the blastp/blastx results as Excel files.
Should I use -outfmt 16?
(Is hmmscan/Pfam also needed for the KEGG / GO terms?)
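On the Excel part: `-outfmt 16` produces a single-file XML2 report, while `-outfmt 6` produces a tab-separated table with 12 default columns that spreadsheets open much more easily. The sketch below assumes the BLAST run used plain `-outfmt 6` with no custom column list, and only adds a header row while converting to CSV. The function name is made up for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def blast6_to_csv(tab_lines, out_handle):
    """Write BLAST tabular lines to out_handle as CSV with a header row.

    Returns the number of hits written. Blank lines and '#' comment
    lines (as produced by -outfmt 7) are skipped.
    """
    writer = csv.writer(out_handle)
    writer.writerow(BLAST6_COLUMNS)
    written = 0
    for line in tab_lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(BLAST6_COLUMNS):
            raise ValueError(f"expected 12 columns, got {len(fields)}: {line!r}")
        writer.writerow(fields)
        written += 1
    return written


# Example usage (file names are placeholders):
# with open("blastx.outfmt6") as src, open("blastx.csv", "w", newline="") as dst:
#     blast6_to_csv(src, dst)
```

If a custom column list was passed to `-outfmt`, `BLAST6_COLUMNS` would have to be changed to match it.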
3) Do a KEGG and GO analysis. Should I annotate the assembly (but which one, trinity_trimmed.fasta or the CAP3 one?) using Trinotate and then use GOseq for GO? Or could I use Blast2GO (7-day trial version) with the blastx/blastp files produced with -outfmt 16? Can KEGG also be done in Blast2GO, or could I use something like this: https://www.kegg.jp/blastkoala/ ?
I know this was long, sorry about that.
Please use the formatting bar (especially the code option) to present your post better. As you can see above, the Biostars parser does not understand standard HTML code.
Yes that's the one. What kind of sequences do you want to extract? Any specific sequences or based on ID?
Please use ADD COMMENT/ADD REPLY when responding to existing threads to keep them logically organized. SUBMIT ANSWER is only for new answers to the original question.
Which organism are you analyzing?