I have a large fastq file containing NGS sequences.
Using prior tools I've been able to narrow down the names of some 2,500 reads which are of interest to me, in OutputFile.txt.
In this file the names of the reads are listed in the following pattern:
1123 084b5b69-e819-426f-9ff3-fea4891af330 runid=d20680f0495125cc465d6a96efb49b194cb0777b read=2290 ch=184 start_time=2018-03-27T11:36:09Z
1124 5c621711-9233-40a3-a544-419fe5135d83 runid=d20680f0495125cc465d6a96efb49b194cb0777b read=2382 ch=320 start_time=2018-03-27T11:35:19Z
etc...
I'd like to use this OutputFile.txt to extract only the names of the reads and the sequences from the original fastq file into a new filtered fastq file.
I have Seqtk installed on Windows, running via the command line. I believe I could use Seqtk grep to extract only those 2,500 reads. I'm having trouble, though, with the proper command and the flags.
Could you please advise?
I think BBMap does this job, with the filterbyname.sh script.
From this thread
Edit:
Also possible with seqtk subseq:
From here
Thanks @Bastien. But I have tried it before and it gives an error:
Any idea why this happens?
Maybe it is related to the nanopore data, e.g. spaces in the headers... I have never processed this kind of data. If BBMap and seqtk cannot handle it, I sadly cannot do much more.
You have the best ones on the case, see below, I will not say better :)
There is always Biopython
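For illustration, here is the same filtering idea in plain Python rather than Biopython itself, so it runs without extra installs. It is only a minimal sketch: it assumes standard four-line FASTQ records (no wrapped sequences) and that the read ID is the first token after the `@`. The function name `filter_fastq` and the toy records are made up for the example.

```python
import io


def filter_fastq(fastq_handle, wanted_ids, out_handle):
    """Copy four-line FASTQ records whose read ID is in wanted_ids; return the count kept."""
    kept = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = fastq_handle.readline()
        plus = fastq_handle.readline()
        qual = fastq_handle.readline()
        # The ID is the first whitespace-separated token, without the leading "@".
        read_id = header[1:].split()[0]
        if read_id in wanted_ids:
            out_handle.write(header + seq + plus + qual)
            kept += 1
    return kept


# Toy input: two invented records with nanopore-style headers containing spaces.
demo_fastq = io.StringIO(
    "@084b5b69-e819-426f-9ff3-fea4891af330 read=2290 ch=184\n"
    "ACGT\n+\n!!!!\n"
    "@ffffffff-0000-0000-0000-000000000000 read=1 ch=1\n"
    "TTTT\n+\n####\n"
)
demo_out = io.StringIO()
demo_kept = filter_fastq(
    demo_fastq, {"084b5b69-e819-426f-9ff3-fea4891af330"}, demo_out
)
```

For real files, the two `StringIO` objects would be replaced by `open(...)` handles on the original and the new fastq file, and `wanted_ids` by a set built from the list of read names.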
Is this nanopore data? I wonder if the tools are having problems with the spaces in the names of fastq headers. You could certainly try filterbyname.sh or otherwise change the spaces to _ temporarily and then use the tools mentioned by Bastien.
This is indeed nanopore data. Could these spaces in the header of the nanopore data be the reason for the error I'm getting (see my reply to Bastien's suggestion)?
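The temporary space-to-underscore workaround mentioned above could look roughly like this in Python. This is a sketch under the assumption of four-line FASTQ records; header lines are found by position rather than by a leading `@`, because quality lines may also start with `@`. The name `underscore_headers` is invented for the example.

```python
def underscore_headers(fastq_lines):
    """Yield FASTQ lines with spaces in each header line replaced by underscores."""
    for index, line in enumerate(fastq_lines):
        if index % 4 == 0:
            # First line of every four-line record is the header.
            yield line.replace(" ", "_")
        else:
            yield line


# Toy record; the quality line deliberately starts with "@" and must stay unchanged.
demo_record = ["@read-1 runid=x ch=184\n", "ACGT\n", "+\n", "@!!!\n"]
demo_fixed = list(underscore_headers(demo_record))
```

The read names in the list file would need the same replacement so that they still match the modified headers.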
While I did not spend a lot of time on it, it appears that neither filterbyname.sh nor seqtk subseq works as intended with nanopore fastq data. I have asked @Brian about BBMap.
Also checking with @Wouter deCoster to see if his nanofilt tool can be easily modified to do this task.
The 084b5b69-e819-426f-9ff3-fea4891af330 portion should be a unique identifier. Therefore you don't need the other parts of your txt file/read names for filtering. I'd assume filterbyname.sh should work if you only use those unique parts.
Thanks Wouter, I understand your tip. However, what would be a fast way to filter the file which contains all the reads of interest (that is, the file which I'd like to use to guide the filtering of the FASTQ file), and keep just the identifiers, i.e. 084b5b69-e819-426f-9ff3-fea4891af330?
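One possible way to keep just the identifiers, sketched in Python: since each line of the list has the form "number, identifier, other fields", the identifier is simply the second whitespace-separated field. The function names and the idea of writing one identifier per line are assumptions for illustration, not something prescribed by any of the tools discussed here.

```python
def extract_read_ids(lines):
    """Return the read identifier (second whitespace-separated field) of each line."""
    ids = []
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines that do not have at least two fields.
        if len(fields) >= 2:
            ids.append(fields[1])
    return ids


def write_read_ids(list_path, ids_path):
    """Read the name list at list_path and write one identifier per line to ids_path."""
    with open(list_path) as src, open(ids_path, "w") as dst:
        for read_id in extract_read_ids(src):
            dst.write(read_id + "\n")


# The two example lines quoted in the question.
example_lines = [
    "1123 084b5b69-e819-426f-9ff3-fea4891af330 runid=d20680f0495125cc465d6a96efb49b194cb0777b read=2290 ch=184 start_time=2018-03-27T11:36:09Z",
    "1124 5c621711-9233-40a3-a544-419fe5135d83 runid=d20680f0495125cc465d6a96efb49b194cb0777b read=2382 ch=320 start_time=2018-03-27T11:35:19Z",
]
example_ids = extract_read_ids(example_lines)
```

Calling `write_read_ids("OutputFile.txt", "ids.txt")` (where `ids.txt` is just a placeholder name) would give a file of bare identifiers that can then be tried with the filtering tools.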