Question

How to split fastq file contains two different header information

0

Entering edit mode

7.0 years ago

goh ▴ 10

Dear all,

I have a 16s NGS fastq file which comprise of two separate NGS run with the heading information as below:-

Run 1 heading : @HISEQ:791:HCGHLBCX2:1:1101:5407:3025 1:N:0:AGTGGTCA

Run 2 heading: @HISEQ:781:HCMFJBCX2:1:2101:15224:51333 1:N:0:ATTGAGGA

I would like to split the sequences based on the heading information into two separate files.

I have tried looking in the forum for any QIIME and sed script but couldn't manage to done it.

Any idea how to do it ?

Thanks.

sequencing next-gen split file • 4.5k views

ADD COMMENT • link updated 7.0 years ago by cpad0112 21k • written 7.0 years ago by goh ▴ 10

0

Entering edit mode

You can grep them out..some thing like this:

grep -A 3 -w "HISEQ:791:HCGHLBCX2" test.fastq > test.out.fastq

But these are read headers. Flow cells and run IDs seem to be different (from OP). You can grep based on them.

ADD REPLY • link 7.0 years ago by cpad0112 21k

0

Entering edit mode

Thanks for the reply. I have tried it. However, it just give me a list without sequences or quality. I think i did not mention my question very clear initially, i will try to clarify below. Thanks.

I have a fastq file which contains a few hundreds of sequences like below. However, there are two separate runs (781 & 791) in one fastq file. I would like to separate them including the sequences and quality into separate fastq files.

@HISEQ:781:HCMFJBCX2:1:1101:16664:3823 1:N:0:ATTGAGGA
TGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTTGTCCGGAAAGAAATCCTTGGCTCTAATACAGTCGGGGGATGACGGTACCGGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAG
+
HEHIIIIIEHIHGHHIIIIIIIHIH<EHIIIIIIIHIIIIIIIHIIGHIIIIHIIIGIIIIFHIIIGIGHHIIIIIII0FHIIIHIEGHHDHHIGGHEHHHHHHIIIIGHIIIIHHHDHDH?GGEHEHIHDHIHHIHCEGEEHHIIIIH<HDFH@G?GHGIHIHCH@.;??DDH<EHHHCBEHHEHHHHA?CDHHHHIFEHHHHIFHHCF?H<HC@-F-7-B@####
@HISEQ:791:HCGHLBCX2:1:2216:12138:99630 1:N:0:AGTGGTCA
TGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTTGTCCGGAAAGAAATCCTTGGCTCTAATACAGTCGGGGGATGACGGTACCGGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAG
+
EH@GHHHIEHHHHHIIIHHHEEDECHHGHHIIIIHHHHCHIIIEEHHCCCHHIII?@FHHIIFEHHGGGHDHIEHHHH11CGHHH?CEFDHHHHC@HHH@GHCHHE@HEFHIFHHII<ECHDHHHIEHHH@?GH-??.FCHIHEA.;:E<-:...B.7@..;-@.-85-E-7?,7BG-6-8B-5@6--6-5,,6@8EH-@E-5-B@@-86-FD<5@@@GD-@#####

ADD REPLY • link updated 7.0 years ago by GenoMax 151k • written 7.0 years ago by goh ▴ 10

0

Entering edit mode

@OP: I guess the sequences are copy/pasted from webpage or word or some other formatted pages. Sequence qualities are affected by this. Please post as they are. Otherwise, tools fail to parse fastq proper.

input (copy/pasted from above and formatted):

cat test.fq
@HISEQ:781:HCMFJBCX2:1:1101:16664:3823 1:N:0:ATTGAGGA 
TGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTTGTCCGGAAAGAAATCCTTGGCTCTAATACAGTCGGGGGATGACGGTACCGGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAG 
+ 
HEHIIIIIEHIHGHHIIIIIIIHIH<ehiiiiiiihiiiiiiihiighiiiihiiigiiiifhiiigighhiiiiiii0fhiiihieghhdhhigghehhhhhhiiiighiiiihhhdhdh?ggehehihdhihhihcegeehhiiiih<hdfh@g?ghgihihch@.;??ddh<ehhhcbehhehhhha?cdhhhhifehhhhifhhcf?h<hc@-f-7-b@#### 
@hiseq:791:hcghlbcx2:1:2216:12138:99630="" 1:n:0:agtggtca
tggggaattttggacaatgggcgaaagcctgatccagcaatgccgcgtgtgtgaagaaggccttcgggttgtaaagcacttttgtccggaaagaaatccttggctctaatacagtcgggggatgacggtaccggaagaataagcaccggctaactacgtgccagcagccgcggtaatacgtagggtgcgagcgttaatcggaattactgggcgtaaagcgtgcgcag 
+
eh@ghhhiehhhhhiiihhheedechhghhiiiihhhhchiiieehhccchhiii?@fhhiifehhggghdhiehhhh11cghhh?cefdhhhhc@hhh@ghchhe@hefhifhhii<echdhhhiehhh@?gh-??.fchihea.;:e

output:

 grep -A 3 -i -w  'HISEQ:781' test.fq
@HISEQ:781:HCMFJBCX2:1:1101:16664:3823 1:N:0:ATTGAGGA 
TGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTTGTCCGGAAAGAAATCCTTGGCTCTAATACAGTCGGGGGATGACGGTACCGGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAG

+ 
HEHIIIIIEHIHGHHIIIIIIIHIH<ehiiiiiiihiiiiiiihiighiiiihiiigiiiifhiiigighhiiiiiii0fhiiihieghhdhhigghehhhhhhiiiighiiiihhhdhdh?ggehehihdhihhihcegeehhiiiih<hdfh@g?ghgihihch@.;??ddh<ehhhcbehhehhhha?cdhhhhifehhhhifhhcf?h<hc@-f-7-b@####

ADD REPLY • link 7.0 years ago by cpad0112 21k

1

Entering edit mode

with seqkit:

$ seqkit grep -irp "HISEQ:781" test.fq

ADD REPLY • link 7.0 years ago by cpad0112 21k

0

Entering edit mode

for ubuntu/parallel users, try (test.fastq - input fastq):

$ parallel --dry-run 'seqkit grep -irp "{1}" {3} -o {1}.fastq.gz; seqkit grep -irp "{2}" {3} -o {2}.fastq.gz' ::: K00193:38 ::: K00193:40 ::: test.fastq

seqkit grep -irp "K00193:38" test.fastq -o K00193:38.fastq.gz; seqkit grep -irp "K00193:40" test.fastq -o K00193:40.fastq.gz

ADD REPLY • link 7.0 years ago by cpad0112 21k

score 2 · Answer 1 · 2018-05-25

2

Entering edit mode

7.0 years ago

Chirag Parsania ★ 2.0k

Here is the R script you can use that.

library(ShortRead)
library(data.table)

## load fastq combined of 791 and 781. I saved your data in the biostar_query.fastq
dd <- readFastq("~/Downloads/biostar_query.fastq")

### get fastq ids
ids <- id(dd)

## get matching index of 791 and 781 
index_791 <- grep("hiseq:791:" , ids)
index_781 <- grep("HISEQ:781:" , ids )

## filter reads by id 
reads_791 <- dd[index_791]
reads_781 <- dd[index_781]

## write files 
writeFastq(object = reads_791 , file = "791.fastq")
writeFastq(object = reads_781 , file = "781.fastq")

ADD COMMENT • link 7.0 years ago by Chirag Parsania ★ 2.0k

0

Entering edit mode

Thanks for the help. However, i'm not a bioinfo based person. This is too advance.

ADD REPLY • link 7.0 years ago by goh ▴ 10

GenoMax · Answer 2 · 2018-05-25

1

Entering edit mode

7.0 years ago

lakhujanivijay 5.9k

One liner using seqkit

seqkit grep -inr -p "HISEQ:791" your.fq  > 791.fq && seqkit grep -inr -p "HISEQ:781" your.fq > 781.fq

ADD COMMENT • link 7.0 years ago by lakhujanivijay 5.9k

0

Entering edit mode

I have tried your approach on cmd administrator but unsuccessful. Below i attach the script and error msg.

seqkit grep -inr -p "HISEQ:791" C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001.fastq > C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001_791.fastq && seqkit -inr -p "HISEQ:781" C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001.fastq > C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001_781.fastq

FYI, i installed seqkit as instructed by the website and after unzip i copy the seqkit into windows/system32. When i type seqkit in the command line, a list of functions of seqkit do come out.

Is it the script i type is wrong ? Thanks.

ADD REPLY • link updated 7.0 years ago by GenoMax 151k • written 7.0 years ago by goh ▴ 10

0

Entering edit mode

try '-ir -p' instead of -inr -p

ADD REPLY • link 7.0 years ago by cpad0112 21k

0

Entering edit mode

Still cant. the same error poped up.

I guess it's easier if i share one of my file with you. click here

ADD REPLY • link 7.0 years ago by goh ▴ 10

1

Entering edit mode

Seems chaining && works in windows CMD only, not in power shell. In powershell, try running them individually.

First:

seqkit grep -inr -p "HISEQ:791" C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001.fastq > C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001_791.fastq

seqkit -inr -p "HISEQ:781" C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001.fastq > C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001_781.fastq

instead of

seqkit grep -inr -p "HISEQ:791" C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001.fastq > C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001_791.fastq && seqkit -inr -p "HISEQ:781" C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001.fastq > C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001_781.fastq

If you want to run above line, try using command line (from run dialog, type "cmd" and press enter). For me following worked (on windows 10 command line, not powershell):

seqkit grep -inr -p "HISEQ:791" C:\Users\genomics\Desktop\CR052T_S1_L001_R1_001.fastq > C:\Users\genomics\Desktop\791.fastq  && seqkit grep -inr -p "HISEQ:781" C:\Users\genomics\Desktop\CR052T_S1_L001_R1_001.fastq > C:\Users\genomics\Desktop\781.fastq