Extract sequences from a fastq file by a list of IDs
19 months ago
mhpakdel96 ▴ 10

Hey guys,

I have a FASTQ file that looks like this:

@E100062344L1C001R00100004672/1
TGGCCATTTTCCGAAAGAACGAGTGCTTTTATATTTGAAACGCTCGGATAGTCAGTGTAC
+
6C?;?F@DFD?F:<FB>FEFF3?EFCEEEDFDF;EFE5EFB88C9@0F9EDF9F7;EFED
@E100062344L1C001R00100007908/1
TTATACAACGTTTTCAAAGTATCAAAATACGTATTAACTTATTTTCATTAATATTATGTTGTTGTTTTTTTTTTAAATT
+
EGGFGDF<DFFBFGGFFFGFFF9FFFF3FFFEFFFGFDFDFF?GFFG6GGFDFGCBFFDFGDFE@FGFGFGGBFB@F>F
@E100062344L1C001R00100042396/1
AGCTAGTTAGCAAACTCACATTGGTTTTCAAAATTCCAACACCTTTTGGTAGAAGAAAA
+
FFFFGF@3FFFFF;BDGGFDEGEFBFFCGGFFGFGFCDFFFGC@FECF<FFF?<EFGF?
@E100062344L1C001R00100052634/1
CAGCAGAAGCAGACGCCAGAAGAACGGCCAAGGAAGATAAGATTCGTGGTGAACTCACCG
+
FC.=A0E*B8:<51F66A3<E58/8C<ECE;==0EFC@EA6EFCCA+5@E+1F:A6@CCC

I also have an ID file that looks like this:

E100062344L1C001R00100004672    
E100062344L1C001R00100007908

I am trying to extract from my FASTQ file the sequences whose IDs are listed in my txt file. I am using seqkit for that, but with no success. Do you have any idea how I can extract the sequences for my IDs?

fastq • 2.3k views

I have tried all of them, but none of them work for me.

19 months ago
GenoMax 148k

Using filterbyname.sh from the BBMap suite:

You need to include the /1 from the header in your list file. (Here I am passing the header directly as names=your_header_example, but you can also use names=file_w_names, where the file contains one ID per line.) The missing /1 is likely your problem when you say you tried them all but none work.

$ filterbyname.sh in=test.fq out=stdout.fq names=E100062344L1C001R00100007908/1 include=t

Or use the substring=t option (no need to add /1 at the end):

$ filterbyname.sh in=test.fq out=stdout.fq names=E100062344L1C001R00100007908 substring=t include=t

Both will get you the following output.

Input is being processed as unpaired
@E100062344L1C001R00100007908/1
TTATACAACGTTTTCAAAGTATCAAAATACGTATTAACTTATTTTCATTAATATTATGTTGTTGTTTTTTTTTTAAATT
+
EGGFGDF<DFFBFGGFFFGFFF9FFFF3FFFEFFFGFDFDFF?GFFG6GGFDFGCBFFDFGDFE@FGFGFGGBFB@F>F
Time:                           0.196 seconds.
Reads Processed:           4    0.02k reads/sec
Bases Processed:         258    0.00m bases/sec
Reads Out:          1
Bases Out:          79 

Replace stdout.fq (I was writing to screen in this example) with a filename to get file output. You can use gzipped files directly as input and output.
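
For a whole list of IDs, the first command needs a names file that carries /1 on every line. A minimal sketch of one way to build it, assuming the bare IDs sit one per line in a file (ids.txt and ids_with_suffix.txt are placeholder names, and add_mate_suffix is a hypothetical helper, not part of BBMap). It also strips trailing whitespace, which the first line of the ID file in the question appears to have:

```shell
# Hypothetical helper: print each ID from the given file with "/1" appended.
# Trailing whitespace is stripped and empty lines are dropped first.
add_mate_suffix() {
    # $1: text file with one bare read ID per line; result goes to stdout
    sed -e 's/[[:space:]]*$//' -e '/^$/d' -e 's|$|/1|' "$1"
}

# Example usage (file names are placeholders):
#   add_mate_suffix ids.txt > ids_with_suffix.txt
#   filterbyname.sh in=test.fq out=filtered.fq names=ids_with_suffix.txt include=t
```

This assumes every read in the file is a /1 read, as in the example headers; for /2 reads the suffix would need to change accordingly.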


Thanks a lot, it works

19 months ago

Some people I know have used filter-fastq successfully: https://github.com/Floor-Lab/filter-fastq
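
If installing an extra tool is not an option, the same matching idea (compare the bare ID and ignore the trailing /1) can be sketched with plain awk. This is only an illustration, not how any of the tools in this thread work internally. It assumes strict 4-line FASTQ records (no wrapped sequence lines) and a non-empty ID file; filter_fastq_by_ids is a hypothetical function name:

```shell
# Hypothetical helper: print the FASTQ records whose ID is listed in an ID file.
# Assumes 4 lines per record and at least one line in the ID file.
filter_fastq_by_ids() {
    # $1: file with one bare ID per line, $2: FASTQ file
    awk '
        # First file (IDs): trim surrounding whitespace and remember each ID.
        NR == FNR {
            gsub(/^[ \t\r]+|[ \t\r]+$/, "")
            if ($0 != "") want[$0] = 1
            next
        }
        # Header line of each record: derive the bare ID and decide whether to keep it.
        FNR % 4 == 1 {
            id = substr($1, 2)        # drop the leading "@", ignore any description
            sub(/\/[12]$/, "", id)    # drop a trailing /1 or /2 mate suffix
            keep = (id in want)
        }
        # Print all four lines of a kept record.
        keep
    ' "$1" "$2"
}

# Example usage (file names are placeholders):
#   filter_fastq_by_ids ids.txt test.fq > filtered.fq
```

Counting lines instead of looking for "@" matters here, because quality strings can also start with "@".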
