getting random fastq records with awk
2
0
Entering edit mode
9.8 years ago

Hi folks!

I was wondering if anyone knows how to get through AWK or SED, RANDOM records from a fastq file. Obviously, with record I mean the four lines block of each ID

i.e.

@IRIS:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
+IRIS:7:1:17:394#0/1
aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^

Thanks!

sequence genome awk fastq next-gen • 2.6k views
ADD COMMENT
1
Entering edit mode

awk has a rand() function, so you could use that. What have you tried?

ADD REPLY
0
Entering edit mode

Indeed I was trying with the Shuf command but it give me only random line.

shuf file.fastq | head n-10 > randomlines.txt

I need to get the block and definitely now I will chek for rand() in AWK

ADD REPLY
2
Entering edit mode
9.8 years ago

Following Devon's advice of using rand() in awk:

gunzip -c my.fq.gz \
| paste - - - - \
| awk 'BEGIN{srand(1234)}{if(rand() < 0.3) print $0}' \
| tr '\t' '\n' \
| gzip > rnd.fq.gz

(Replace 0.3 with the proportion of reads to keep)

However, at a first approximation the reads in a fastq file should be random so something like this might be good enough and possibly faster(?):

...
| awk '{if(NR % 3 == 0) print $0}' \
...
ADD COMMENT
0
Entering edit mode

Yep. The only really non-random part in the fastq file is that reads at the beginning/end tend to be a bit worse quality due to being near the edge of the flow cell. But I agree that for many purposes that's really not important so just skipping rand() would be better.

ADD REPLY
0
Entering edit mode

Excuse me! Now I understood! It is not random! thanks!

ADD REPLY
0
Entering edit mode

Thanks you all!

This is just what I wanted. However, I am new to awk so could you explain me which is the command line that takes random blocks of 4 lines?

Thank you very much, really helpful!

ADD REPLY
0
Entering edit mode
9.8 years ago

Another option is to use sample with a --lines-per-offset value of 4.

First, get the number of FASTQ records:

$ gunzip -c records.fq.gz | wc -l | awk '{d=$1; print d/4;}'

Multiply the result (say, 12345 records) by the fraction of records you want (say, 42%):

$ calc "floor(12345*42/100)"
5184

Then sample:

$ gunzip -c records.fq.gz > records.fq
$ sample --lines-per-offset=4 --sample-size=5184 records.fq > sample.fq
ADD COMMENT
0
Entering edit mode

It seems very interesting! I downloaded it but could you please provide some installation instructions?

Thanks you so much!

ADD REPLY
0
Entering edit mode

Try:

$ ./runall
$ ./configure
$ make 
$ make install
ADD REPLY
0
Entering edit mode

After ./runall I get:

aclocal: error: 'configure.ac' is required
autoheader: 'configure.ac' or 'configure.in' is required
automake: error: 'configure.ac' is required
autoconf: error: no input file

How could I fix this?

Thanks

ADD REPLY
1
Entering edit mode

I posted a makefile, so all that is needed is:

$ make
$ sudo cp sample /usr/local/bin
ADD REPLY
0
Entering edit mode

It works! thank you so much!

ADD REPLY

Login before adding your answer.

Traffic: 1520 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6