Calculate CG content in a fastq file
3
0
Entering edit mode
6.0 years ago
luzglongoria ▴ 50

Hi there,

I would like to know the CG content of some fastq file. I have this script:

This gives character count for all characters except As, Ts and Ns (and new lines):

cat file.fasta | grep -v ">" | tr -d aAtTnN"\n" | wc -c
24642235

This gives character count for all letters except Ns:

cat file.fasta | grep -v ">" | tr -d nN"\n" | wc -c
49100855

Mean overall GC content is therefore = 24642235 / 49100855 = 50%

But it is for fasta file exclusively.

how can I modified for .fastq files?

Thank you

RNA-Seq bash • 5.0k views
ADD COMMENT
0
Entering edit mode

Hi, In case you don't know it already, I can recommend the tool FASTQC to you, which gives you the distribution of GC contents in your reads, distribution of read lengths, average Phred score per base and much more!

Edit: In case you are looking at multiple FASTQ files at once, give the tool multiQC a try, it summarized FASTQC reports.

ADD REPLY
0
Entering edit mode

thanks! I had run FASTQC in my files before.

I just wanted to know a way of counting the % by using commands in bash :)

ADD REPLY
0
Entering edit mode

Ok, that's what I thought :) I'll move my answer to the comment section

ADD REPLY
0
Entering edit mode

Simply convert fq to fa. Plenty of posts both on Biostars and the web on that available.

ADD REPLY
0
Entering edit mode

Yes. I have been looking for this too but I only got answer for phyton but nothing for bash :(

ADD REPLY
1
Entering edit mode
6.0 years ago
awk '(NR%4==2) {gsub(/[ATnNat]/,"");N+=length($0);}END{print N;}' in.fastq
ADD COMMENT
0
Entering edit mode

Thanks Pierre. This command will calculate directly the CG content?

So, in my case, if my fastq file is called myfile.fastq the command will be:

awk '(NR%4==2) {gsub(/[ATnNat]/,"");N+=length($0);}END{print N;}' myfile.fastq

Is it that correct?

ADD REPLY
1
Entering edit mode

no, it only print the number of GC. to get the GC% that would be

 awk '(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}' in.fq
ADD REPLY
0
Entering edit mode
6.0 years ago
thackl ★ 3.0k

sed -n '2~4p' file.fastq | tr -d aAtTnN"\n" | wc -c should do if you want to stick as close as possible to your fasta example. Probably not the best way though.

ADD COMMENT
0
Entering edit mode
6.0 years ago

The fastQC program will do this for you.

fastqc file.fastq

ADD COMMENT
0
Entering edit mode

Do you know if fastqc does that for the entire dataset or just the subset of data it samples.

ADD REPLY
0
Entering edit mode

It is using all reads, so to speed things up you could potentially create a subset with 1 million reads, but I prefer running it using all availible reads.

ADD REPLY

Login before adding your answer.

Traffic: 1654 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6