Question

getting random fastq records with awk

0

Entering edit mode

10.4 years ago

damiano.bassani ▴ 10

Hi folks!

I was wondering if anyone knows how to get through AWK or SED, RANDOM records from a fastq file. Obviously, with record I mean the four lines block of each ID

i.e.

@IRIS:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
+IRIS:7:1:17:394#0/1
aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^

Thanks!

sequence genome awk fastq next-gen • 3.1k views

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by damiano.bassani ▴ 10

1

Entering edit mode

awk has a rand() function, so you could use that. What have you tried?

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Devon Ryan 105k

0

Entering edit mode

Indeed I was trying with the Shuf command but it give me only random line.

shuf file.fastq | head n-10 > randomlines.txt

I need to get the block and definitely now I will chek for rand() in AWK

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by damiano.bassani ▴ 10

Ram · Answer 1 · 2015-02-20

2

Entering edit mode

10.4 years ago

dariober 15k

Following Devon's advice of using rand() in awk:

gunzip -c my.fq.gz \
| paste - - - - \
| awk 'BEGIN{srand(1234)}{if(rand() < 0.3) print $0}' \
| tr '\t' '\n' \
| gzip > rnd.fq.gz

(Replace 0.3 with the proportion of reads to keep)

However, at a first approximation the reads in a fastq file should be random so something like this might be good enough and possibly faster(?):

...
| awk '{if(NR % 3 == 0) print $0}' \
...

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by dariober 15k

0

Entering edit mode

Yep. The only really non-random part in the fastq file is that reads at the beginning/end tend to be a bit worse quality due to being near the edge of the flow cell. But I agree that for many purposes that's really not important so just skipping rand() would be better.

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Devon Ryan 105k

0

Entering edit mode

Excuse me! Now I understood! It is not random! thanks!

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by damiano.bassani ▴ 10

0

Entering edit mode

Thanks you all!

This is just what I wanted. However, I am new to awk so could you explain me which is the command line that takes random blocks of 4 lines?

Thank you very much, really helpful!

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by damiano.bassani ▴ 10

Ram · Answer 2 · 2015-02-20

0

Entering edit mode

10.4 years ago

Alex Reynolds 36k

Another option is to use sample with a --lines-per-offset value of 4.

First, get the number of FASTQ records:

$ gunzip -c records.fq.gz | wc -l | awk '{d=$1; print d/4;}'

Multiply the result (say, 12345 records) by the fraction of records you want (say, 42%):

$ calc "floor(12345*42/100)"
5184

Then sample:

$ gunzip -c records.fq.gz > records.fq
$ sample --lines-per-offset=4 --sample-size=5184 records.fq > sample.fq

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Alex Reynolds 36k

0

Entering edit mode

It seems very interesting! I downloaded it but could you please provide some installation instructions?

Thanks you so much!

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by damiano.bassani ▴ 10

0

Entering edit mode

Try:

$ ./runall
$ ./configure
$ make 
$ make install

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Alex Reynolds 36k

0

Entering edit mode

After ./runall I get:

aclocal: error: 'configure.ac' is required
autoheader: 'configure.ac' or 'configure.in' is required
automake: error: 'configure.ac' is required
autoconf: error: no input file

How could I fix this?

Thanks

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by damiano.bassani ▴ 10

1

Entering edit mode

I posted a makefile, so all that is needed is:

$ make
$ sudo cp sample /usr/local/bin

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Alex Reynolds 36k

0

Entering edit mode

It works! thank you so much!

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by damiano.bassani ▴ 10