Hi,
I have a .txt file with a list of sequence IDs that looks like this:
A00580:377:HMC2FDSXY:3:2251:27389:24314
A00580:377:HMC2FDSXY:3:1506:13575:27571
A00580:377:HMC2FDSXY:3:1540:25934:5509
A00580:377:HMC2FDSXY:3:1439:18276:25160
A00580:377:HMC2FDSXY:3:1366:3161:27602
A00580:377:HMC2FDSXY:3:1555:21531:3959
A00580:377:HMC2FDSXY:3:2412:24261:33301
A00580:377:HMC2FDSXY:3:2444:9317:12931
A00580:377:HMC2FDSXY:3:2223:28619:24064
A00580:377:HMC2FDSXY:3:1112:23782:17347
A00580:377:HMC2FDSXY:3:1439:17987:33082
A00580:377:HMC2FDSXY:3:1113:22797:26757
I also have multiple .fastq.gz files, each containing sequences like this:
@A00580:377:HMC2FDSXY:3:1101:1154:1016 1:N:0:TCTACCATTT+NACTCTCCCG
CAAGAGGTCTGCGGACGGGTCATTGGCC
+
:FFFFFF:F:FFFF,FF:FFFFFFFFFF
@A00580:377:HMC2FDSXY:3:1101:1280:1016 1:N:0:TCTACCATTT+NACTCTCCCG
GTGCGTGGTAGGTAGCACGTACAGCGTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFF:
@A00580:377:HMC2FDSXY:3:1101:1298:1016 1:N:0:TCTACCATTT+NACTCTCCCG
GAAACCTCATAATGAGCTTCTTGAAACA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00580:377:HMC2FDSXY:3:1101:1371:1016 1:N:0:TCTACCATTT+NACTCTCCCG
GGAGGATCAGGTCCCATTGTTCAATTTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00580:377:HMC2FDSXY:3:1101:1479:1016 1:N:0:TCTACCATTT+NACTCTCCCG
ATACCGAAGTAAACGTGACAAGGATCTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
I can see that the sequence IDs are the first part of the sequence headers. I want to extract the sequences whose IDs are in my list, but I cannot figure out how to do that.
Can anyone provide some help?
Thanks so much!!
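Have you tried zgrep with the option -A 3? FASTQ is structured to have 4 lines for each read, with the ID in the first line, so -A 3 prints each matching header plus the 3 lines that follow it. You might need to parse the output (for example, to drop the "--" separator lines that grep prints between non-adjacent matches) before storing the entries in a new fastq file.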
Alternatively, you might find other solutions like here
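If you would rather avoid post-processing grep output, here is a minimal Python sketch of the same idea. It relies on the 4-lines-per-read layout described above and assumes one ID per line in the .txt file, with no wrapped sequence lines in the FASTQ. The function names load_ids and filter_fastq are made up for illustration, and the file names in the usage note are placeholders. It matches the whole ID exactly (the text between "@" and the first space), so an ID that happens to be a prefix of a longer one will not produce a false hit.

```python
import gzip


def load_ids(path):
    """Read one sequence ID per line into a set, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_fastq(fastq_gz, ids, out_path):
    """Copy every 4-line record whose header ID is in `ids` to `out_path`.

    Returns the number of records written. Assumes strict 4-line FASTQ
    records (header, sequence, '+', qualities).
    """
    kept = 0
    with gzip.open(fastq_gz, "rt") as src, gzip.open(out_path, "wt") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0]:  # empty string means end of file
                break
            # Header looks like "@<ID> <comment>": drop the '@' and keep
            # everything up to the first whitespace.
            fields = record[0][1:].split()
            if fields and fields[0] in ids:
                dst.writelines(record)
                kept += 1
    return kept
```

You would call it once per input file, for example filter_fastq("sample_R1.fastq.gz", load_ids("ids.txt"), "sample_R1.subset.fastq.gz"). Keeping the IDs in a set makes each lookup cheap, which matters when the list is long and the FASTQ files are large.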
Thanks! I have not tried zgrep yet, will try that if the current method doesn't work!