Question

separate read1 and read2 from merged fastq file and align against reference genome

2.2 years ago

bioyas ▴ 20

Hi, I am processing a merged (interleaved) fastq file.

I used the following command to split read 1 and read 2 into separate files for alignment with bwa mem.

paste - - - - - - - - < merged.fq | tee >(cut -f 1-4 | tr "\t" "\n" > read1.fq) | cut -f 5-8 | tr "\t" "\n" > read2.fq

Here are the read 1 records for the first 3 pairs:

@SRR10359518.1.1 1 length=26
TTATGAAATTCCTAGGCAAATGGATG
+SRR10359518.1.1 1 length=26
??????????????????????????
@SRR10359518.2.1 2 length=26
CCCTTATGCAGCTCGAGAAGGCGGAC
+SRR10359518.2.1 2 length=26
??????????????????????????
@SRR10359518.3.1 3 length=26
TCAGTCGTCCCAACATCGGACGCTTC
+SRR10359518.3.1 3 length=26
??????????????????????????

Here are the read 2 records for the same first 3 pairs:

@SRR10359518.1.2 1 length=26
TGGGTATCCTAAGTTTCTGGGCTAAN
+SRR10359518.1.2 1 length=26
??????????????????????????
@SRR10359518.2.2 2 length=26
TAGCAACCACAGATCCAACATGATTC
+SRR10359518.2.2 2 length=26
??????????????????????????
@SRR10359518.3.2 3 length=26
CCTCCAAGCAAACCCCACTGACCCCN
+SRR10359518.3.2 3 length=26
??????????????????????????

When I run the alignment

bwa mem ref.Genome read1.fastq read2.fastq -o my.sam

I get the following error that paired reads have different names:

paired reads have different names: "SRR10359518.1.1", "SRR10359518.1.2"

Do you have any idea how I can fix the issue?

Thanks

fastq alignment regex headers • 753 views

updated 2.2 years ago by seidel 11k • written 2.2 years ago by bioyas ▴ 20

score 1 · Answer 1 · 2022-10-23

Since the mates are in separate files, bwa mem expects each read in file 2 to have the same name as its mate in file 1. Either replace the read name suffix ".2" in the second file with ".1", or remove the suffix from both files altogether, since it is not necessary.
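Before renaming anything, it can help to confirm how many read names actually disagree between the two files. The following is only a rough sketch: fastq_ids and count_name_mismatches are made-up helper names, and it assumes strict 4-line FASTQ records in the same order in both files.

```shell
# fastq_ids and count_name_mismatches are hypothetical helper names for this sketch.
# Assumes strict 4-line FASTQ records (no wrapped sequence lines, no blank lines).

# Print the read ID (first word of each header line, without the leading "@").
fastq_ids() {
    awk 'NR % 4 == 1 { print substr($1, 2) }' "$1"
}

# Print how many record positions carry different IDs in the two files.
count_name_mismatches() {
    ids1=$(mktemp) || return 1
    ids2=$(mktemp) || return 1
    fastq_ids "$1" > "$ids1"
    fastq_ids "$2" > "$ids2"
    # paste puts the two ID lists side by side, separated by a tab.
    # Appending "" forces a string comparison even if an ID looks numeric.
    paste "$ids1" "$ids2" | awk -F '\t' '($1 "") != ($2 "") { n++ } END { print n + 0 }'
    rm -f "$ids1" "$ids2"
}
```

Calling count_name_mismatches read1.fastq read2.fastq on files like the ones in the question should report every pair as a mismatch, since the IDs differ only in the trailing ".1" versus ".2".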

There are many ways to strip the suffix. A perl one-liner that removes the numeric suffix from the read ID is as follows (run it once per file, changing the input and output names for read 2):

perl -lane 'if(/(^\@SRR[\d]+\.[\d]+)/){$F[0] = $1}{print join(" ",@F);}' read1.fastq > r1_ns.fastq

This uses Perl's command-line switches (documented in perlrun) to do the following for each line:

autosplit the line on whitespace into the @F array (the -a switch)
if the line starts with the read ID, capture everything up to but not including the numeric suffix
replace the read ID element of the @F array ($F[0]) with the suffix-less version
print the line by rejoining the @F array with spaces

It's a little messy because the unaltered read ID is still present on the optional third line of each record (the "+" line), but if your read IDs all have the same structure, it should work. Feel free to explore and try your own solution to this sort of brain puzzle! There is likely a simpler way.
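As one possible simpler variant, here is an awk sketch that edits only the header line of each record. strip_mate_suffix is a made-up helper name, and the sketch assumes strict 4-line records in which every read ID ends in ".1" or ".2", as in the question.

```shell
# strip_mate_suffix is a hypothetical helper name, not part of bwa or any toolkit.
# Assumes strict 4-line FASTQ records whose read IDs all end in ".1" or ".2".
strip_mate_suffix() {
    # NR % 4 == 1 selects the header line of each record; only the first
    # whitespace-separated word (the read ID) is edited. Sequence, "+" and
    # quality lines are printed unchanged.
    awk 'NR % 4 == 1 { sub(/\.[12]$/, "", $1) } { print }' "$1"
}
```

Usage would look like strip_mate_suffix read1.fastq > r1_ns.fastq, followed by the same call for read2.fastq with its own output name; the two new files then go to bwa mem. Because headers are picked by line position, a quality line that happens to start with "@" is never touched.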