2.1 years ago
Eliveri
Hi. I have a list of read headers, and I want to pull the matching records out of a large fastq file and write them to a new fastq file.
List of headers (list.txt):
@A00111:399:HWLKHDSXX:2:1101:1000:7434
@A00111:399:HWLKHDSXX:2:1101:1000:...
@A00111:399:HWLKHDSXX:2:1101:1000:...
fastq file (largefile.fastq):
@A00111:399:HWLKHDSXX:2:1101:10004:332111
AACAAGTGATAATCAGAGAGTTCTCACAGGTTCTCACTGATAATGATAAAGGTTCTCACAG...
@A00111:399:HW...
I am currently using the following command, but it takes a really long time.
# For each header, rescan the fastq file and append the header line plus the sequence line
while read -r HEADER
do
    grep -A 1 -m 1 "$HEADER" largefile.fastq >> new.fastq
done < list.txt
Is there a faster way of doing this?
Thank you all for the help! In the end I chose to write a short Python script rather than use bash, as the reference fasta is small enough to be read into memory.
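For anyone landing here later, one way such a Python script could look is sketched below. This is not the script I actually used, just a minimal illustration. It assumes the fastq file has standard 4-line records (no wrapped sequences) and that each entry in list.txt is the full read ID including the leading "@". The function names load_headers and filter_fastq are made up for this example. The headers are held in a set, so the fastq file is scanned once instead of once per header.

```python
from itertools import islice


def load_headers(path):
    """Read the wanted headers (one per line) into a set for O(1) lookups."""
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}


def filter_fastq(fastq_path, wanted, out_path):
    """Copy every 4-line record whose read ID is in `wanted` to out_path.

    Assumes unwrapped fastq (exactly 4 lines per record).
    Returns the number of records written.
    """
    kept = 0
    with open(fastq_path) as fq, open(out_path, "w") as out:
        while True:
            record = list(islice(fq, 4))  # header, sequence, '+', quality
            if not record:
                break  # end of file
            # Compare only the read ID, i.e. the header text before any whitespace
            parts = record[0].split()
            if parts and parts[0] in wanted:
                out.writelines(record)
                kept += 1
    return kept
```

With the file names from the question, this would be called as filter_fastq("largefile.fastq", load_headers("list.txt"), "new.fastq"). Unlike the grep -A 1 loop, it writes all four lines of each record, so the output is a valid fastq file.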