Question

Subset Fastq file from list of read numbers

0

Entering edit mode

5.1 years ago

johnsonn573 ▴ 10

I have a txt file each line of which is a number corresponding to a specific read in a fastq file. I would like to make a subsetted fastq file from my larger fastq file with just the reads corresponding to the numbers in the txt file. Is there a simple way to do this? Thank you!

fastq subset reads • 2.2k views

ADD COMMENT • link updated 5.1 years ago by Pierre Lindenbaum 164k • written 5.1 years ago by johnsonn573 ▴ 10

0

Entering edit mode

not clear: what is that number ? the line number starting from 0 ? from 1 ? the fastq record in the file ? starting from 0 ? from 1 ?

ADD REPLY • link 5.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Please post few records from input files: fastq and text. In the absense of them, i would suggest to use seqkit grep/range function @ johnsonn573

ADD REPLY • link 5.1 years ago by cpad0112 21k

0

Entering edit mode

Problem is OP here only has record numbers (odd) and not fastq headers, as far as I see.

ADD REPLY • link 5.1 years ago by GenoMax 148k

0

Entering edit mode

some thing like this? @ genomax @ johnsonn573. Example is with .fasta file. Same code works for fastq file and user needs to replace input fasta with input fastq. For fastq code would be: parallel seqkit range -r {}:{} test.fq :::: test.txt

input:

$ cat test.fa
>abc
atgc
>cdg
atgc
>def
atgc

$ cat test.txt 
2
3

output:

$ parallel seqkit range -r {}:{} test.fa :::: test.txt
>cdg
atgc
>def
atgc

ADD REPLY • link 5.1 years ago by cpad0112 21k

2

Entering edit mode

5.1 years ago

Pierre Lindenbaum 164k

assuming the number are the 1-based index of each record in the fastq.

join -t $'\t' -1 1 -2 1 \
   <(gunzip -c input.fq.gz | paste - - - - | awk '{printf("%d\t%s\n",NR,$0);}' | sort -T . -t $'\t' -k1,1 ) \
   <(sort -T . numbers.txt) | \
sort -t $'\t' -k1,1n |\
cut -f 2- |\
tr "\t" "\n" > subset.fastq

ADD COMMENT • link 5.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Yes, the numbers start from 1. Thank you!

ADD REPLY • link 5.1 years ago by johnsonn573 ▴ 10

0

Entering edit mode

My fastq file is not gzipped. Is there a way to modify this script so it works on an unzipped input.fq file?

ADD REPLY • link 5.1 years ago by johnsonn573 ▴ 10

2

Entering edit mode

At this point, I'm afraid you'll have to learn it by yourself.

ADD REPLY • link 5.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

That's fine. I tried gzipping the fastq file, and running as input.fq.gz, but the script still wouldn't run. So there must be another problem. I will post when I have a functioning script.

ADD REPLY • link 5.1 years ago by johnsonn573 ▴ 10

0

Entering edit mode

but the script still wouldn't run.

You'll have to be more precise if you expect any help.

ADD REPLY • link 5.1 years ago by WouterDeCoster 47k

score 2 · Accepted Answer · 2019-11-22

2

Entering edit mode

5.1 years ago

GenoMax 148k

Doing this by just (record?) number is going to be tricky. If you have read headers it would be much simpler to use filterbyname.sh from BBMap suite.

filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

names=          A list of strings or files.  The files can have one name per line, or
                be a standard read file (fasta, fastq, or sam).

Run filterbyname.sh without any options to see in-line help.

ADD COMMENT • link 5.1 years ago by GenoMax 148k

0

Entering edit mode

Thank you for letting me know about this. I changed my script so that I could filter using filterbyname.sh.

ADD REPLY • link 5.1 years ago by johnsonn573 ▴ 10