Question

How To Extract Set Of Reads From Fastq (Or Eventually Fasta And Qual) Based On List Of Ids?

8

Entering edit mode

12.9 years ago

Biomonika (Noolean) 3.2k

I have a list of ids and two files (male and female). I want to find sequences and quality scores for all ids in my list.

I have:

list of ids

2 fastq files (eventually 2 fasta files and 2 qual files)

and I want to get:

extracted_sequences_based_on_ids.fastq (eventually extracted_sequences_based_on_ids.fasta and extracted_sequences_based_on_ids.qual) (can be two files that I merge afterwards)

Yes, I can write the script that will be for each id looking up the sequence and quality score and writing the result to a new file, but I would not like to reinvent the wheel in case this is something easy to perform with already existing tools:) (and might be handy for other biostars users as well)

Thanks a lot.

extraction read 454 fastq fasta • 40k views

ADD COMMENT • link updated 8 months ago by GenoMax 150k • written 12.9 years ago by Biomonika (Noolean) 3.2k

0

Entering edit mode

Duplicated post: http://www.biostars.org/post/show/10353/how-to-efficiently-parse-a-huge-fastq-file/

ADD REPLY • link 12.9 years ago by Leszek 4.2k

score 15 · Answer 1 · 2012-05-21

15

Entering edit mode

12.9 years ago

Ole Kristian Tørresen ▴ 160

I think Heng Li's seqtk will do what you need: https://github.com/lh3/seqtk

Extract sequences with names in file name.lst, one sequence name per line:

seqtk subseq in.fq name.lst > out.fq

ADD COMMENT • link 12.9 years ago by Ole Kristian Tørresen ▴ 160

1

Entering edit mode

Hi, I try to extract sequences from a fastq file using the "subseq" in seqtk. But the extract file contains only the 1st sequence but no others. I am wondering whether my name.lst file does not fit with what seqtk needs. I have names of each sequence without other symbols each line in the name.lst. But the fastq file starts each sequence name with a @. Should I add @ in front of each sequence name? Or what other problem it can be?

Any suggestion is welcome. Thanks,

Chih-Ming

ADD REPLY • link 12.2 years ago by ymwur1 ▴ 10

0

Entering edit mode

Thank you very much! It worked.

ADD REPLY • link 12.9 years ago by Biomonika (Noolean) 3.2k

0

Entering edit mode

I am using seqtk to extract reads from FASTQ files successfully.

$ cat  name 
SRR1946553.2920144
SRR1946553.9728216
SRR1946553.16735946
SRR1946553.25227565
SRR1946553.11939425
SRR1946553.19470728
$ seqtk subseq    /public1/home/stu_zhangyixing/raw_data/zhang_sir/output/10023_1.clean.fastq.gz name > test_1.fq
$ seqtk subseq    /public1/home/stu_zhangyixing/raw_data/zhang_sir/output/10023_2.clean.fastq.gz name > test_2.fq

Thanks !

ADD REPLY • link 8 months ago by zhang yi xing ▴ 50

score 4 · Answer 2 · 2012-05-21

4

Entering edit mode

12.9 years ago

Martin A Hansen 3.0k

This can be done with Biopieces and is in fact covered in the Howto.

ADD COMMENT • link 12.9 years ago by Martin A Hansen 3.0k

0

Entering edit mode

Thanks a lot, I didnt know Biopieces (looks very interesting though), but I unfortunately wasnt able to install it because of some problem with Perl modul Inline.

ADD REPLY • link 12.9 years ago by Biomonika (Noolean) 3.2k

0

Entering edit mode

Try the Biopieces Google Group for help.

ADD REPLY • link 12.2 years ago by Martin A Hansen 3.0k

score 0 · Answer 3 · 2024-08-12

Since this old thread seems to have got a bump to main page I will add filterbyname.sh from BBMap suite as another option to filter fastq reads.

Second option is using seqkit grep ( https://bioinf.shenwei.me/seqkit/usage/#grep )

seqkit grep -f id.txt seqs.fa.

-n, --by-name               match by full name instead of just id
-i, --ignore-case           ignore case
-r, --use-regexp            patterns are regular expression