Hi everyone,
I have some confusing results regarding the GC content of a transcriptome. I made a de novo transcriptome assembly with Trinity and used one of the prebuilt scripts in the Trinity package (TrinityStats) to calculate basic statistics of the assembled transcriptome; it returned GC% = 46.64%.
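For reference, I ran the bundled script roughly like this (the util/ path may differ between Trinity versions):

```
# basic assembly statistics, including overall GC%
$TRINITY_HOME/util/TrinityStats.pl Trinity.fasta
```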
Meanwhile, I tried Prinseq to get some summaries (although it is designed for reads) and got the following:
GC Content Distribution
Mean GC content: 46.04 ± 6.75 %
Minimum GC content: 19 %
Maximum GC content: 74 %
GC content range: 56 %
Mode GC content: 45 % with 4,275 sequences
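The Prinseq call was along these lines (a sketch; I believe prinseq-lite.pl's -stats_all prints these summaries, but the flags may vary by version):

```
# summary statistics on the assembled transcripts
# (Prinseq is meant for reads, but it accepts any FASTA)
prinseq-lite.pl -fasta Trinity.fasta -stats_all
```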
On the other hand, when I count the number of G and C bases in the multi-FASTA transcriptome (using simple grep while omitting every FASTA header) and then calculate GC content with the formula (G+C)/(G+C+A+T), I get 49.96%.
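Concretely, I counted along these lines (Trinity.fasta stands in for my actual assembly file):

```
# skip FASTA headers, print each matching base on its own line, count them
grep -v '^>' Trinity.fasta | grep -o '[GCgc]' | wc -l   # G+C count
grep -v '^>' Trinity.fasta | grep -o '[ATat]' | wc -l   # A+T count
```

GC% is then the first count divided by the sum of the two.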
Could anyone help me clarify these differences and suggest which GC content result is accurate? Thank you very much.
Maybe low-quality nucleotides or reads are not included (or some very short sequences).
Do the sequences contain T's and not U's? Are the TrinityStats based on all contigs or just the longest isoforms?
Thanks. It doesn't contain U. Below is the full output of TrinityStats; I think it is based on all contigs for the GC content calculation:
Some GC% programs are stupid and do
(G+C)/(G+C+A+T+N)
which essentially calls all Ns C or T. They often don't mention N specifically; they just divide by the length of the sequence, which is the same thing. You can usually spot this because it gives you slightly lower GC percentages, which appears to be what you're seeing.
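If you want a quick way to check which formula a tool matches, something like this computes both on the same file (a rough sketch, assuming plain A/C/G/T/N sequence):

```
# compare GC% with and without Ns in the denominator;
# gsub() deletes the matched bases and returns how many it removed
awk '!/^>/ {
    gc += gsub(/[GCgc]/, "")
    at += gsub(/[ATat]/, "")
    n  += gsub(/[Nn]/, "")
} END {
    printf "GC/(GC+AT)   = %.2f%%\n", 100 * gc / (gc + at)
    printf "GC/(GC+AT+N) = %.2f%%\n", 100 * gc / (gc + at + n)
}' Trinity.fasta
```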
The difference between 46.64 and 46.04 is probably the sum of some rounding errors, but obviously I don't know for sure. The only way to know for sure is to read the code. Can you read the code?
Thanks for your suggestion, I will check the code to be sure.
John, could you elaborate on what you mean by the formula "essentially calls all Ns C or T", please?
Thank you, but I was referring to what you meant by "essentially calls all Ns C or T". Above you stated that the formula "essentially calls all Ns C or T", which I understand as "those programs treat Ns as C or T", and that doesn't make sense to me given the formula you provided.
Also, what would be the correct way to compute GC when ambiguity codes are involved? Assign probabilities of a G/C observation (N, M, K, R, Y = 0.5; W = 0; S = 1; B, V = 2/3; D, H = 1/3)? Or, as in your example, (1) treat ambiguities as non-GC, or (2) just calculate GC as a fraction of the unambiguous bases?
I just calculate GC from the unambiguous bases. Normal sequence doesn't usually contain ambiguous bases other than N; those tend to just be used to describe motifs and primers. When an N is present in normal sequence, it's best to ignore it because treating it as 0.5 will bias the calculation toward 50%.
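A made-up example of that bias: in a 1,000 bp contig with 300 G+C, 450 A+T and 250 N, ignoring the Ns gives 300/750 = 40% GC, while counting each N as 0.5 G/C gives (300 + 125)/1,000 = 42.5%, dragged toward 50%.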
Ah, you're right, I meant to say A or T, not C. Good spot. I won't fix it, so your comment still makes sense.
Regarding GC% with ambiguity codes, I think it's a recipe for trouble. I suppose if it's the C+G (S) or A+T (W) ambiguity code then it's OK, but it's a mistake to think "A or T or C" means "there is an equal probability of A/T/C". It doesn't. It means N. It's better to throw bad data in the bin than to try to make allowances for it. If you try to squeeze all the milk out of a cow, you end up with bad milk and a sore cow.