Question

How to extract sequences from multiple fastq files based on part of the header?

0

3.7 years ago

leranwangcs ▴ 150

Hi,

I have a .txt file with a list of sequence IDs, looks like this:

A00580:377:HMC2FDSXY:3:2251:27389:24314
A00580:377:HMC2FDSXY:3:1506:13575:27571
A00580:377:HMC2FDSXY:3:1540:25934:5509
A00580:377:HMC2FDSXY:3:1439:18276:25160
A00580:377:HMC2FDSXY:3:1366:3161:27602
A00580:377:HMC2FDSXY:3:1555:21531:3959
A00580:377:HMC2FDSXY:3:2412:24261:33301
A00580:377:HMC2FDSXY:3:2444:9317:12931
A00580:377:HMC2FDSXY:3:2223:28619:24064
A00580:377:HMC2FDSXY:3:1112:23782:17347
A00580:377:HMC2FDSXY:3:1439:17987:33082
A00580:377:HMC2FDSXY:3:1113:22797:26757

And I have multiple .fastq.gz files and each contains sequences like this:

@A00580:377:HMC2FDSXY:3:1101:1154:1016 1:N:0:TCTACCATTT+NACTCTCCCG
CAAGAGGTCTGCGGACGGGTCATTGGCC
+
:FFFFFF:F:FFFF,FF:FFFFFFFFFF
@A00580:377:HMC2FDSXY:3:1101:1280:1016 1:N:0:TCTACCATTT+NACTCTCCCG
GTGCGTGGTAGGTAGCACGTACAGCGTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFF:
@A00580:377:HMC2FDSXY:3:1101:1298:1016 1:N:0:TCTACCATTT+NACTCTCCCG
GAAACCTCATAATGAGCTTCTTGAAACA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00580:377:HMC2FDSXY:3:1101:1371:1016 1:N:0:TCTACCATTT+NACTCTCCCG
GGAGGATCAGGTCCCATTGTTCAATTTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00580:377:HMC2FDSXY:3:1101:1479:1016 1:N:0:TCTACCATTT+NACTCTCCCG
ATACCGAAGTAAACGTGACAAGGATCTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF

I can see that the sequence IDs are the first part of the sequence headers. I want to extract the sequences whose IDs appear in the list, but I cannot figure out how to do that.

Can anyone provide some help?

Thanks so much!!

sequencing • 2.5k views

ADD COMMENT • link updated 3.7 years ago by rpolicastro 13k • written 3.7 years ago by leranwangcs ▴ 150

0

Have you tried zgrep with the option -A3?

FASTQ is structured to have 4 lines for each read, with the ID in the first line. You might need to parse the output (for example, grep prints "--" separator lines between non-adjacent matches) before storing the entries in a new fastq file.
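If the grep output turns out to be awkward to clean up, the same 4-lines-per-read idea can be sketched in plain Python. This is only a minimal sketch, assuming well-formed 4-line records and that the ID is everything between the leading @ and the first whitespace in the header; the function names `load_ids` and `filter_fastq` are made up for illustration.

```python
import gzip


def load_ids(ids_path):
    """Read one ID per line, skipping blank lines and any leading '@' or '>'."""
    with open(ids_path) as handle:
        return {line.strip().lstrip("@>") for line in handle if line.strip()}


def filter_fastq(ids, fastq_gz_path, out_gz_path):
    """Copy the records whose ID is in `ids`; return how many were written."""
    kept = 0
    with gzip.open(fastq_gz_path, "rt") as src, gzip.open(out_gz_path, "wt") as dst:
        while True:
            # Assumption: every record is exactly 4 lines (no wrapped sequences).
            record = [src.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            # ID = header without '@', up to the first whitespace.
            read_id = (record[0][1:].split() or [""])[0]
            if read_id in ids:
                dst.writelines(record)
                kept += 1
    return kept
```

For several files, load the IDs once and call `filter_fastq` once per input, for example over the result of `glob.glob("*.fastq.gz")`. Because the whole ID is compared against a set, an ID such as `...:1:1` does not accidentally match `...:1:10`, which a plain substring grep could do.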

Alternatively, you might find other solutions like here

ADD REPLY • link 3.7 years ago by michael.ante ★ 3.9k

0

Thanks! I have not tried zgrep yet, will try that if the current method doesn't work!

ADD REPLY • link 3.7 years ago by leranwangcs ▴ 150

1

3.7 years ago

rpolicastro 13k

Here's a seqkit answer as well.

seqkit grep -f ids.txt file.fastq.gz -o filtered_file.fastq.gz

ADD COMMENT • link 3.7 years ago by rpolicastro 13k

0

Thanks for your help @rpolicastro! I'm currently trying out the method from @GenoMax. I'll definitely try this if I have issues with that one. Much appreciated!

ADD REPLY • link 3.7 years ago by leranwangcs ▴ 150

score 3 · Accepted Answer · 2021-03-09

3

3.7 years ago

GenoMax 147k

Use filterbyname.sh from the BBMap suite. Run the program without options to get the full list of command-line options. names= can point to a file with names, one per line, that you want to extract.

Description:  Filters reads by name.

Usage:  filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.
Important!  Leading > and @ symbols are NOT part of sequence names;  they are part of
the fasta, fastq, and sam specifications.  Therefore, this is correct:
names=e.coli_K12
And these are incorrect:
names=>e.coli_K12
names=@e.coli_K12
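Since the question involves multiple files, here is one way the per-file calls could be assembled. This is a rough sketch, not part of BBMap: it only uses the in=, out=, names= and include= parameters shown in the usage text above, the `filtered_` output prefix and the helper name `build_commands` are made up, and it assumes include=t is what you want (keep the listed reads rather than discard them).

```python
from pathlib import Path


def build_commands(fastq_paths, names_file, out_dir):
    """Return one filterbyname.sh argument list per input file."""
    commands = []
    for fastq in map(Path, fastq_paths):
        # Hypothetical naming scheme: filtered_<original file name>.
        out = Path(out_dir) / f"filtered_{fastq.name}"
        commands.append([
            "filterbyname.sh",
            f"in={fastq}",
            f"out={out}",
            f"names={names_file}",  # file with one read name per line, no '@'
            "include=t",            # assumption: keep listed reads
        ])
    return commands
```

Each list can then be passed to `subprocess.run(cmd, check=True)`, assuming filterbyname.sh is on your PATH. Building the commands first makes it easy to print and check them before launching anything on the full dataset.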

ADD COMMENT • link 3.7 years ago by GenoMax 147k

0

Thanks @GenoMax! I'm trying this method on a test dataset, and it seems to be working well! I'll post the final result after the entire dataset finishes running! Thanks again!