Question

Extract sequences from a fastq file by a list of IDs

1

19 months ago

mhpakdel96 ▴ 10

Hey guys,

I have a FASTQ file that looks like this:

@E100062344L1C001R00100004672/1
TGGCCATTTTCCGAAAGAACGAGTGCTTTTATATTTGAAACGCTCGGATAGTCAGTGTAC
+
6C?;?F@DFD?F:<FB>FEFF3?EFCEEEDFDF;EFE5EFB88C9@0F9EDF9F7;EFED
@E100062344L1C001R00100007908/1
TTATACAACGTTTTCAAAGTATCAAAATACGTATTAACTTATTTTCATTAATATTATGTTGTTGTTTTTTTTTTAAATT
+
EGGFGDF<DFFBFGGFFFGFFF9FFFF3FFFEFFFGFDFDFF?GFFG6GGFDFGCBFFDFGDFE@FGFGFGGBFB@F>F
@E100062344L1C001R00100042396/1
AGCTAGTTAGCAAACTCACATTGGTTTTCAAAATTCCAACACCTTTTGGTAGAAGAAAA
+
FFFFGF@3FFFFF;BDGGFDEGEFBFFCGGFFGFGFCDFFFGC@FECF<FFF?<EFGF?
@E100062344L1C001R00100052634/1
CAGCAGAAGCAGACGCCAGAAGAACGGCCAAGGAAGATAAGATTCGTGGTGAACTCACCG
+
FC.=A0E*B8:<51F66A3<E58/8C<ECE;==0EFC@EA6EFCCA+5@E+1F:A6@CCC

I also have an ID file that looks like this:

E100062344L1C001R00100004672    
E100062344L1C001R00100007908

I am trying to extract from my FASTQ file the sequences whose IDs are listed in my text file. I am using seqkit for this, but with no success. Do you have any idea how I can extract the sequences for my IDs?

fastq • 2.3k views

ADD COMMENT • link updated 19 months ago by Ram 44k • written 19 months ago by mhpakdel96 ▴ 10

0

Why screenshots when you can just copy and paste the text? See these earlier threads:

How To Extract Set Of Reads From Fastq (Or Eventually Fasta And Qual) Based On List Of Ids?
extracting reads from fastq file based on read_id
How To Extract A Subset Of Reads In Fastq Using An Id List?
Extracting specific sequences from FASTQ using Seqtk
choosing reads from a fastq file based on another fastq file
etc.

ADD REPLY • link 19 months ago by Pierre Lindenbaum 164k

0

I have tried all of them, but none of them work for me.

ADD REPLY • link 19 months ago by mhpakdel96 ▴ 10

0

but none of them work for me

https://meta.stackexchange.com/questions/147616/

ADD REPLY • link 19 months ago by Pierre Lindenbaum 164k

0

19 months ago

colindaven 7.0k

Some people I know have used filter-fastq successfully: https://github.com/Floor-Lab/filter-fastq

ADD COMMENT • link 19 months ago by colindaven 7.0k

score 3 · Accepted Answer · 2023-06-07

Using filterbyname.sh from the BBMap suite:

You need to include the /1 from the header in your list of names. Here I am passing a single header directly as names=your_header_example, but you can also use names=file_w_names, where the file contains one ID per line. This is likely your problem when you say you "tried them all but none work".

$ filterbyname.sh in=test.fq out=stdout.fq names=E100062344L1C001R00100007908/1 include=t

Or use the substring=t option, in which case there is no need to add /1 at the end:

$ filterbyname.sh in=test.fq out=stdout.fq names=E100062344L1C001R00100007908 substring=t include=t

Both will get you the following output.

Input is being processed as unpaired
@E100062344L1C001R00100007908/1
TTATACAACGTTTTCAAAGTATCAAAATACGTATTAACTTATTTTCATTAATATTATGTTGTTGTTTTTTTTTTAAATT
+
EGGFGDF<DFFBFGGFFFGFFF9FFFF3FFFEFFFGFDFDFF?GFFG6GGFDFGCBFFDFGDFE@FGFGFGGBFB@F>F
Time:                           0.196 seconds.
Reads Processed:           4    0.02k reads/sec
Bases Processed:         258    0.00m bases/sec
Reads Out:          1
Bases Out:          79
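For the names=file_w_names form mentioned above, the ID list from the question would need the /1 suffix on every line. The first line of that list, as pasted, also appears to end in spaces, which could break an exact match. A minimal sketch of cleaning the list, assuming the hypothetical file names ids.txt and ids_with_suffix.txt:

```shell
# Assumption: ids.txt is a hypothetical name for the ID list from the question.
# Recreate it here, including the trailing spaces on the first line.
printf 'E100062344L1C001R00100004672    \nE100062344L1C001R00100007908\n' > ids.txt

# Strip carriage returns and surrounding whitespace, skip blank lines,
# and append /1 so each name equals the FASTQ header without the leading @.
awk '{
  gsub(/\r/, "")
  gsub(/^[ \t]+|[ \t]+$/, "")
  if ($0 != "") print $0 "/1"
}' ids.txt > ids_with_suffix.txt

cat ids_with_suffix.txt
```

The cleaned file could then be passed as names=ids_with_suffix.txt together with include=t.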

Replace stdout.fq (I was writing to screen in this example) with a filename to get file output. You can use gzipped files directly as input and output.
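If BBMap is not available, the same idea (match on the ID while ignoring the /1 suffix) can be sketched with plain awk. This is only a rough illustration and not how filterbyname.sh works internally. It assumes strict 4-line FASTQ records, uncompressed input, a non-empty ID list, and the hypothetical file names test.fq, ids.txt and subset.fq:

```shell
# Three of the records from the question, written to a hypothetical test.fq.
cat > test.fq <<'EOF'
@E100062344L1C001R00100004672/1
TGGCCATTTTCCGAAAGAACGAGTGCTTTTATATTTGAAACGCTCGGATAGTCAGTGTAC
+
6C?;?F@DFD?F:<FB>FEFF3?EFCEEEDFDF;EFE5EFB88C9@0F9EDF9F7;EFED
@E100062344L1C001R00100007908/1
TTATACAACGTTTTCAAAGTATCAAAATACGTATTAACTTATTTTCATTAATATTATGTTGTTGTTTTTTTTTTAAATT
+
EGGFGDF<DFFBFGGFFFGFFF9FFFF3FFFEFFFGFDFDFF?GFFG6GGFDFGCBFFDFGDFE@FGFGFGGBFB@F>F
@E100062344L1C001R00100042396/1
AGCTAGTTAGCAAACTCACATTGGTTTTCAAAATTCCAACACCTTTTGGTAGAAGAAAA
+
FFFFGF@3FFFFF;BDGGFDEGEFBFFCGGFFGFGFCDFFFGC@FECF<FFF?<EFGF?
EOF

# The ID list from the question, without the /1 suffix.
printf 'E100062344L1C001R00100004672    \nE100062344L1C001R00100007908\n' > ids.txt

awk '
  # First file: load the wanted IDs, trimming whitespace and carriage returns.
  NR == FNR {
    gsub(/\r/, "")
    gsub(/^[ \t]+|[ \t]+$/, "")
    if ($0 != "") want[$0] = 1
    next
  }
  # Second file: the header is the first line of every 4-line record.
  FNR % 4 == 1 {
    id = $1                  # drop any description after the first space
    sub(/^@/, "", id)        # drop the leading @
    sub(/\/[12]$/, "", id)   # drop a trailing /1 or /2 mate suffix
    keep = (id in want)
  }
  # Print every line of a record whose header matched.
  keep
' ids.txt test.fq > subset.fq

cat subset.fq
```

Counting lines in groups of four, rather than looking for lines that start with @, matters because a quality string can itself begin with @.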