How to extract sequences from multiple fastq files based on part of the header?
2
0
Entering edit mode
3.7 years ago
leranwangcs ▴ 150

Hi,

I have a .txt file with a list of sequence IDs, looks like this:

A00580:377:HMC2FDSXY:3:2251:27389:24314
A00580:377:HMC2FDSXY:3:1506:13575:27571
A00580:377:HMC2FDSXY:3:1540:25934:5509
A00580:377:HMC2FDSXY:3:1439:18276:25160
A00580:377:HMC2FDSXY:3:1366:3161:27602
A00580:377:HMC2FDSXY:3:1555:21531:3959
A00580:377:HMC2FDSXY:3:2412:24261:33301
A00580:377:HMC2FDSXY:3:2444:9317:12931
A00580:377:HMC2FDSXY:3:2223:28619:24064
A00580:377:HMC2FDSXY:3:1112:23782:17347
A00580:377:HMC2FDSXY:3:1439:17987:33082
A00580:377:HMC2FDSXY:3:1113:22797:26757

And I have multiple .fastq.gz files and each contains sequences like this:

@A00580:377:HMC2FDSXY:3:1101:1154:1016 1:N:0:TCTACCATTT+NACTCTCCCG
CAAGAGGTCTGCGGACGGGTCATTGGCC
+
:FFFFFF:F:FFFF,FF:FFFFFFFFFF
@A00580:377:HMC2FDSXY:3:1101:1280:1016 1:N:0:TCTACCATTT+NACTCTCCCG
GTGCGTGGTAGGTAGCACGTACAGCGTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFF:
@A00580:377:HMC2FDSXY:3:1101:1298:1016 1:N:0:TCTACCATTT+NACTCTCCCG
GAAACCTCATAATGAGCTTCTTGAAACA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00580:377:HMC2FDSXY:3:1101:1371:1016 1:N:0:TCTACCATTT+NACTCTCCCG
GGAGGATCAGGTCCCATTGTTCAATTTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00580:377:HMC2FDSXY:3:1101:1479:1016 1:N:0:TCTACCATTT+NACTCTCCCG
ATACCGAAGTAAACGTGACAAGGATCTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF

I can see that the sequence IDs are first part of the sequence headers. I want to extract the sequences based on the list of sequence IDs, but I cannot figure out how to do that.

Can anyone provide some help?

Thanks so much!!

sequencing • 2.5k views
ADD COMMENT
0
Entering edit mode

Have you tried zgrep with the option - A3?

Fastq is structured to have 4 lines for each read, having the ID in the first line. You might need to parse the output before storing the entry in a new fastq file.

Alternatively, you might find other solutions like here

ADD REPLY
0
Entering edit mode

Thanks! I have not tried zgrep yet, will try that if the current method doesn't work!

ADD REPLY
3
Entering edit mode
3.7 years ago
GenoMax 147k

Use filterbyname.sh from BBMap suite. Run the program without options to get full list of command line options. names= can point to a file with names, one per line, that you want to extract.

Description:  Filters reads by name.

Usage:  filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.
Important!  Leading > and @ symbols are NOT part of sequence names;  they are part of
the fasta, fastq, and sam specifications.  Therefore, this is correct:
names=e.coli_K12
And these are incorrect:
names=>e.coli_K12
names=@e.coli_K12
ADD COMMENT
0
Entering edit mode

Thanks @GenoMax! I'm trying this method on a test dataset, seems working well! I'll post a final result after the entire datasets finish running! Thanks again!

ADD REPLY
0
Entering edit mode

Hi @GenoMax, it worked perfectly! Thanks!!

ADD REPLY
1
Entering edit mode
3.7 years ago

Here's a seqkit answer as well.

seqkit grep -f ids.txt file.fastq.gz > filtered_file.fastq.gz
ADD COMMENT
0
Entering edit mode

Thanks for your help @rpolicastro! Currently I'm trying out the method from @GenoMax. I'll definitely try this if I had issue with that one. Appreciate!

ADD REPLY

Login before adding your answer.

Traffic: 2910 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6