Hi everyone,
I have a .fasta file that I have created by running trinity (so it is an transcript assembly). I want to know the CG% for each contig and keep those with low % of CG. My data belong to a bird infected by Plasmodium and and only want contig with low CG% (because Plasmodium spp have low %).
I have used BBMap for calculating the CG content for all the contigs:
stats.sh trinity_gmap.fa
A C G T N IUPAC Other GC GC_stdev
0.3719 0.1285 0.1326 0.3670 0.0000 0.0000 0.0000 0.2611 0.1979
Main genome scaffold total: 37974
Main genome contig total: 37974
Main genome scaffold sequence total: 32.330 MB
Main genome contig sequence total: 32.330 MB 0.000% gap
Main genome scaffold N/L50: 7241/1.313 KB
Main genome contig N/L50: 7241/1.313 KB
Main genome scaffold N/L90: 26130/354
Main genome contig N/L90: 26130/354
Max scaffold length: 14.632 KB
Max contig length: 14.632 KB
Number of scaffolds > 50 KB: 0
% main genome in scaffolds > 50 KB: 0.00%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 37,974 37,974 32,329,550 32,329,550 100.00%
100 37,974 37,974 32,329,550 32,329,550 100.00%
250 33,820 33,820 31,398,555 31,398,555 100.00%
500 19,529 19,529 26,343,945 26,343,945 100.00%
1 KB 10,520 10,520 19,945,249 19,945,249 100.00%
2.5 KB 1,873 1,873 6,692,233 6,692,233 100.00%
5 KB 197 197 1,290,084 1,290,084 100.00%
10 KB 5 5 57,511 57,511 100.00%
So, the toal % of CG is 26% which is not too bad :)
however, I would like to know any way of calculating the CG content for each contig.
Any idea?
Thank you so much!
Thank you so much!. Your comment has been very helpful.
I just read that when I run
I get :
I have been reading about what means each column :
ts transition ie. adacent A<=>G or C<=>T tv transversion - the other possible [AGTC]<=>[AGTC] ajdacent pairs CpG CG pair (revcom aware) CpG-ts CG pair (revcom aware) but allowing transitions in 1st (and/or 2nd) base 2 number of ambiguous IUPAC bases with 2 possibile values 3 ditto with 3 4 ditto with 4 (I assume this means "N")
I think that the one I am interesting in is CpG. Or may be I should just focus on C and G columns?
If you want CG content, focus on G and C column and divide the sum of these two by the sum of all A, C, T, G.