How To Extract Set Of Reads From Fastq (Or Eventually Fasta And Qual) Based On List Of Ids?
3
8
Entering edit mode
12.6 years ago

I have a list of ids and two files (male and female). I want to find sequences and quality scores for all ids in my list.

I have:

list of ids

2 fastq files (eventually 2 fasta files and 2 qual files)

and I want to get:

extracted_sequences_based_on_ids.fastq (eventually extracted_sequences_based_on_ids.fasta and extracted_sequences_based_on_ids.qual) (can be two files that I merge afterwards)

Yes, I can write the script that will be for each id looking up the sequence and quality score and writing the result to a new file, but I would not like to reinvent the wheel in case this is something easy to perform with already existing tools:) (and might be handy for other biostars users as well)

Thanks a lot.

extraction read 454 fastq fasta • 39k views
ADD COMMENT
15
Entering edit mode
12.6 years ago

I think Heng Li's seqtk will do what you need: https://github.com/lh3/seqtk

Extract sequences with names in file name.lst, one sequence name per line:

seqtk subseq in.fq name.lst > out.fq
ADD COMMENT
1
Entering edit mode

Hi, I try to extract sequences from a fastq file using the "subseq" in seqtk. But the extract file contains only the 1st sequence but no others. I am wondering whether my name.lst file does not fit with what seqtk needs. I have names of each sequence without other symbols each line in the name.lst. But the fastq file starts each sequence name with a @. Should I add @ in front of each sequence name? Or what other problem it can be?

Any suggestion is welcome. Thanks,

Chih-Ming

ADD REPLY
0
Entering edit mode

Thank you very much! It worked.

ADD REPLY
0
Entering edit mode

I am using seqtk to extract reads from FASTQ files successfully.

$ cat  name 
SRR1946553.2920144
SRR1946553.9728216
SRR1946553.16735946
SRR1946553.25227565
SRR1946553.11939425
SRR1946553.19470728
$ seqtk subseq    /public1/home/stu_zhangyixing/raw_data/zhang_sir/output/10023_1.clean.fastq.gz name > test_1.fq
$ seqtk subseq    /public1/home/stu_zhangyixing/raw_data/zhang_sir/output/10023_2.clean.fastq.gz name > test_2.fq

Thanks !

ADD REPLY
4
Entering edit mode
12.6 years ago

This can be done with Biopieces and is in fact covered in the Howto.

ADD COMMENT
0
Entering edit mode

Thanks a lot, I didnt know Biopieces (looks very interesting though), but I unfortunately wasnt able to install it because of some problem with Perl modul Inline.

ADD REPLY
0
Entering edit mode

Try the Biopieces Google Group for help.

ADD REPLY
0
Entering edit mode
4 months ago
GenoMax 148k

Since this old thread seems to have got a bump to main page I will add filterbyname.sh from BBMap suite as another option to filter fastq reads.

Second option is using seqkit grep ( https://bioinf.shenwei.me/seqkit/usage/#grep )

seqkit grep -f id.txt seqs.fa.

-n, --by-name               match by full name instead of just id
-i, --ignore-case           ignore case
-r, --use-regexp            patterns are regular expression
ADD COMMENT

Login before adding your answer.

Traffic: 2124 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6