GC content calculation
1
1
Entering edit mode
8.1 years ago
pbigbig ▴ 250

Hi everyone,

I have some confused results regarding GC content in transcriptome, I have made a transcriptome de novo assembly by Trinity, and exploited some prebuild scripts (i.e TrinityStats) in Trinity package to calculate basic statistics of assembled transcriptome, it returned GC% = 46.64 %.

Meanwhile, I tried to use Prinseq to get some summaries (although it is designed to use with reads), and get the result like: GC Content Distribution Mean GC content: 46.04 ± 6.75 %

Minimum GC content: 19 %

Maximum GC content: 74 %

GC content range: 56 %

Mode GC content: 45 % with 4,275 sequences

On the other hands, when I count number of G and C base in multifasta transcriptome (using simple 'grep' while omitting every fasta header), then calculate GC content by fomula (G+C)/(G+C+A+T), I got 49.96%.

Could anyone help me clarify these differences and suggest which GC content result is accurate? Thank you very much.

GC content trinity transcriptome • 13k views
ADD COMMENT
0
Entering edit mode

maybe low quality nucleotides or reads are not included (or some very short reads)

ADD REPLY
0
Entering edit mode

Do the sequences contain T's and not U's? Are the TrinityStats based on all contigs or just longest isoforms?

ADD REPLY
0
Entering edit mode

Thanks, It doesn't contain U, below is the full output of TrinityStats, I think it based on all contigs for GC content calculation:

Counts of transcripts, etc.
Total trinity 'genes':  68518
Total trinity transcripts:  72229
Percent GC: 46.64

Stats based on ALL transcript contigs:

    Contig N10: 5470
    Contig N20: 4094
    Contig N30: 3301
    Contig N40: 2700
    Contig N50: 2201

    Median contig length: 606
    Average contig: 1175.68
    Total assembled bases: 84918026

Stats based on ONLY LONGEST ISOFORM per 'GENE':

    Contig N10: 5265
    Contig N20: 3946
    Contig N30: 3187
    Contig N40: 2593
    Contig N50: 2114

    Median contig length: 574
    Average contig: 1124.04
    Total assembled bases: 77017207
ADD REPLY
0
Entering edit mode

Some GC% programs are stupid and do (G+C)/(G+C+A+T+N) which is to essentially call all Ns C or T. They often don't mention N specifically, they just divide by the length of the sequence, which is the same thing.

You can usually spot this because it gives you slightly lower GC percentages, which appears to be what you're seeing.

The 46.64 and 46.06 is probably the sum of some rounding errors, but I don't know for sure obviously. The only way to know for sure is to read the code. Can you read the code or?

ADD REPLY
1
Entering edit mode

Thanks for your suggestion, I would check the code to be sure

ADD REPLY
1
Entering edit mode

John, could you elaborate what you mean with this formula "essentially call[s] all Ns C or T", please?

ADD REPLY
1
Entering edit mode
seq = 'ATCGTGATGANAACT'
a = seq.count('A')
c = seq.count('C')
g = seq.count('G')
t = seq.count('T')
float(c + g) / len(seq) # returns 0.3333333 - note we didn't have to count a or t
float(c + g) / (a+c+g+t) # returns 0.3571428
ADD REPLY
1
Entering edit mode

Thank you, however I was referring to what you mean with "to essentially call all Ns C or T"?

Above you stated

Some GC% programs are stupid and do (G+C)/(G+C+A+T+N) which is to essentially call all Ns C or T.

, which I understand as "those programs treat Ns as C or T" and that doesn't make sense to me regarding the formula you provided.

Also, which would be the correct way to compute GC when ambiguity codes are involved? Assign probabilities of G/C observations (N,M,K,R,Y=0.5, W=0, S=1, B,V=2/3, D,H=1/3)? Or, like in your example 1) treat ambiguities as non-GC or 2) just calculate GC as fraction of non-ambiguous bases?

ADD REPLY
3
Entering edit mode

I just calculate GC from the unambiguous bases. Normal sequence doesn't usually contain ambiguous bases other than N; those tend to just be used to describe motifs and primers. When an N is present in normal sequence, it's best to ignore it because treating it as 0.5 will bias the calculation toward 50%.

ADD REPLY
1
Entering edit mode

Ah, you're right, i meant to say A or t, not C. Good spot. I won't fix it so your comment makes sense.

Regarding GC% with ambiguity codes, I think it's a recipe for troubles. I suppose if it's the C+G or A+T ambiguity code then it's OK, but it's a mistake to think "A or T or C" means "there is an equal probability of A/T/C". It doesn't. It means N. It's better to throw bad data in the bin than try and make allowances for it. If you try and squeeze all the milk out of a cow you end up with bad milk and a sore cow.

ADD REPLY
4
Entering edit mode
8.1 years ago

The BBMap package has a tool called "stats" which will accurately calculate GC content. You can use that as a tie-breaker. Usage:

stats.sh assembly.fasta
ADD COMMENT
1
Entering edit mode

Thanks, It is really a tie-breaker, here's what I have got:

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2773  0.2281  0.2383  0.2563  0.0000  0.0000  0.0000  0.4664  0.0675
ADD REPLY
2
Entering edit mode

Looks like Trinity was correct, then!

ADD REPLY
0
Entering edit mode

Looking forward to it generating a fourth GC% value :P heheh

ADD REPLY
4
Entering edit mode

4 would be bad. With 3 or 5 he could use the median.

ADD REPLY
0
Entering edit mode

Hahahahaha xD

ADD REPLY

Login before adding your answer.

Traffic: 4097 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6