Question

Calculate CG content in a fastq file

0

Entering edit mode

6.7 years ago

luzglongoria ▴ 50

Hi there,

I would like to know the CG content of some fastq file. I have this script:

This gives character count for all characters except As, Ts and Ns (and new lines):

cat file.fasta | grep -v ">" | tr -d aAtTnN"\n" | wc -c
24642235

This gives character count for all letters except Ns:

cat file.fasta | grep -v ">" | tr -d nN"\n" | wc -c
49100855

Mean overall GC content is therefore = 24642235 / 49100855 = 50%

But it is for fasta file exclusively.

how can I modified for .fastq files?

Thank you

RNA-Seq bash • 5.7k views

ADD COMMENT • link updated 6.7 years ago by finswimmer 16k • written 6.7 years ago by luzglongoria ▴ 50

0

Entering edit mode

Hi, In case you don't know it already, I can recommend the tool FASTQC to you, which gives you the distribution of GC contents in your reads, distribution of read lengths, average Phred score per base and much more!

Edit: In case you are looking at multiple FASTQ files at once, give the tool multiQC a try, it summarized FASTQC reports.

ADD REPLY • link 6.7 years ago by caggtaagtat ★ 1.9k

0

Entering edit mode

thanks! I had run FASTQC in my files before.

I just wanted to know a way of counting the % by using commands in bash :)

ADD REPLY • link 6.7 years ago by luzglongoria ▴ 50

0

Entering edit mode

Ok, that's what I thought :) I'll move my answer to the comment section

ADD REPLY • link 6.7 years ago by caggtaagtat ★ 1.9k

0

Entering edit mode

Simply convert fq to fa. Plenty of posts both on Biostars and the web on that available.

ADD REPLY • link 6.7 years ago by ATpoint 89k

0

Entering edit mode

Yes. I have been looking for this too but I only got answer for phyton but nothing for bash :(

ADD REPLY • link 6.7 years ago by luzglongoria ▴ 50

score 1 · Answer 1 · 2018-11-20

1

Entering edit mode

6.7 years ago

Pierre Lindenbaum 166k

awk '(NR%4==2) {gsub(/[ATnNat]/,"");N+=length($0);}END{print N;}' in.fastq

ADD COMMENT • link 6.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thanks Pierre. This command will calculate directly the CG content?

So, in my case, if my fastq file is called myfile.fastq the command will be:

awk '(NR%4==2) {gsub(/[ATnNat]/,"");N+=length($0);}END{print N;}' myfile.fastq

Is it that correct?

ADD REPLY • link 6.7 years ago by luzglongoria ▴ 50

1

Entering edit mode

no, it only print the number of GC. to get the GC% that would be

 awk '(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}' in.fq

ADD REPLY • link 6.7 years ago by Pierre Lindenbaum 166k

score 0 · Answer 2 · 2018-11-20

0

Entering edit mode

6.7 years ago

thackl ★ 3.0k

sed -n '2~4p' file.fastq | tr -d aAtTnN"\n" | wc -c should do if you want to stick as close as possible to your fasta example. Probably not the best way though.

ADD COMMENT • link 6.7 years ago by thackl ★ 3.0k

score 0 · Answer 3 · 2018-11-20

0

Entering edit mode

6.7 years ago

i.sudbery 22k

The fastQC program will do this for you.

fastqc file.fastq

ADD COMMENT • link 6.7 years ago by i.sudbery 22k

0

Entering edit mode

Do you know if fastqc does that for the entire dataset or just the subset of data it samples.

ADD REPLY • link 6.7 years ago by GenoMax 153k

0

Entering edit mode

It is using all reads, so to speed things up you could potentially create a subset with 1 million reads, but I prefer running it using all availible reads.

ADD REPLY • link 6.7 years ago by caggtaagtat ★ 1.9k