Hi guys,
We have RNA-seq data from an insect, sequenced in 2012 and assembled at the time with one of the 2011 Trinity versions (which gave us the trinity.fasta). I have now analyzed the sequence length distribution in this file and got the following result:
kurban@kurban-X550VC:~/Downloads/bbmap$ sh stats.sh in=~/Downloads/gene.fa
stats.sh: 52: stats.sh: Bad substitution
stats.sh: 59: stats.sh: [[: not found
stats.sh: 59: stats.sh: [[: not found
stats.sh: 65: stats.sh: source: not found
stats.sh: 66: stats.sh: parseXmx: not found
A C G T N IUPAC Other GC GC_stdev
0.2875 0.2118 0.2067 0.2940 0.0000 0.0000 0.0000 0.4186 0.0894
Main genome scaffold total: 144777
Main genome contig total: 144777
Main genome scaffold sequence total: 67.067 MB
Main genome contig sequence total: 67.067 MB 0.000% gap
Main genome scaffold N/L50: 15033/1.075 KB
Main genome contig N/L50: 15033/1.075 KB
Max scaffold length: 24.081 KB
Max contig length: 24.081 KB
Number of scaffolds > 50 KB: 0
% main genome in scaffolds > 50 KB: 0.00%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 144,777 144,777 67,066,997 67,066,997 100.00%
100 144,777 144,777 67,066,997 67,066,997 100.00%
250 56,929 56,929 53,670,774 53,670,774 100.00%
500 30,137 30,137 44,518,044 44,518,044 100.00%
1 KB 16,207 16,207 34,757,505 34,757,505 100.00%
2.5 KB 4,183 4,183 15,894,549 15,894,549 100.00%
5 KB 588 588 3,942,668 3,942,668 100.00%
10 KB 28 28 353,549 353,549 100.00%
In this file the minimum sequence length is 101 and the longest sequence is 22181.
Over the past several days I used the latest Trinity version, trinityrnaseq-2.0.6, to assemble the same raw data again (after trimming low-quality reads, of course). This time the length distribution of the file is as follows:
kurban@kurban-X550VC:~/Downloads/bbmap$ sh stats.sh in=~/Desktop/data_from_server/2015_6_04_assembled_CD_and_CK/Trinity.fasta
stats.sh: 52: stats.sh: Bad substitution
stats.sh: 59: stats.sh: [[: not found
stats.sh: 59: stats.sh: [[: not found
stats.sh: 65: stats.sh: source: not found
stats.sh: 66: stats.sh: parseXmx: not found
A C G T N IUPAC Other GC GC_stdev
0.2932 0.2083 0.2114 0.2871 0.0000 0.0000 0.0000 0.4197 0.0823
Main genome scaffold total: 56130
Main genome contig total: 56130
Main genome scaffold sequence total: 57.963 MB
Main genome contig sequence total: 57.963 MB 0.000% gap
Main genome scaffold N/L50: 9036/1.861 KB
Main genome contig N/L50: 9036/1.861 KB
Max scaffold length: 30.733 KB
Max contig length: 30.733 KB
Number of scaffolds > 50 KB: 0
% main genome in scaffolds > 50 KB: 0.00%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 56,130 56,130 57,962,594 57,962,594 100.00%
100 56,130 56,130 57,962,594 57,962,594 100.00%
250 50,921 50,921 56,731,956 56,731,956 100.00%
500 29,025 29,025 49,248,962 49,248,962 100.00%
1 KB 18,003 18,003 41,494,038 41,494,038 100.00%
2.5 KB 5,541 5,541 21,499,015 21,499,015 100.00%
In this second trinity.fasta file the minimum sequence length is 224 and the longest sequence is 30733.
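For anyone who wants to double-check the minimum and maximum lengths independently of stats.sh, a minimal Python sketch along these lines could be used. The function names are made up for illustration, and it assumes a plain, uncompressed FASTA file passed as the first argument:

```python
import sys


def fasta_lengths(lines):
    """Yield the length of each sequence in an iterable of FASTA lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the length L such that sequences >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for value in ordered:
        running += value
        if running >= half:
            return value
    return 0


def summarize(lines):
    """Return count, min, max, total and N50 length for FASTA lines."""
    lengths = list(fasta_lengths(lines))
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "total": 0, "n50": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "total": sum(lengths),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            print(summarize(handle))
```

Running it on both trinity.fasta files would give numbers that can be compared directly with the tables above.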
My questions are:
- Why are the two assembly results different? For example, the older version of Trinity assembled lots of sequences in the length range from 101 to ~200, but the minimum length of the sequences assembled by the latest version of Trinity is 224.
- Which trinity.fasta file should I use in the downstream analysis, and why?
Could you please give a reasonably detailed explanation?
Thanks