separate read1 and read2 from merged fastq file and align against reference genome
2.2 years ago
bioyas ▴ 20

Hi, I am processing a merged fastq file.

I used the following command to split the read 1s and read 2s into separate files for alignment with bwa mem:

paste - - - - - - - - < merged.fq | tee >(cut -f 1-4 | tr "\t" "\n" > read1.fq) | cut -f 5-8 | tr "\t" "\n" > read2.fq

Here are the read 1s for the first 3 sequences:

@SRR10359518.1.1 1 length=26
TTATGAAATTCCTAGGCAAATGGATG
+SRR10359518.1.1 1 length=26
??????????????????????????
@SRR10359518.2.1 2 length=26
CCCTTATGCAGCTCGAGAAGGCGGAC
+SRR10359518.2.1 2 length=26
??????????????????????????
@SRR10359518.3.1 3 length=26
TCAGTCGTCCCAACATCGGACGCTTC
+SRR10359518.3.1 3 length=26
??????????????????????????

Here are the read 2s for the same 3 sequences:

@SRR10359518.1.2 1 length=26
TGGGTATCCTAAGTTTCTGGGCTAAN
+SRR10359518.1.2 1 length=26
??????????????????????????
@SRR10359518.2.2 2 length=26
TAGCAACCACAGATCCAACATGATTC
+SRR10359518.2.2 2 length=26
??????????????????????????
@SRR10359518.3.2 3 length=26
CCTCCAAGCAAACCCCACTGACCCCN
+SRR10359518.3.2 3 length=26
??????????????????????????

When I run the alignment:

bwa mem ref.Genome read1.fastq read2.fastq -o my.sam

I get the following error saying that the paired reads have different names:

paired reads have different names: "SRR10359518.1.1", "SRR10359518.1.2"

Do you have any idea how I can fix the issue?

Thanks

fastq alignment regex headers • 753 views
2.2 years ago
seidel 11k

Since the mates are in separate files, make sure each read in file 2 has exactly the same name as its mate in file 1. Either replace the read name suffix ".2" in the second file with ".1", or remove the suffix from both files altogether, since it is not necessary.
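As a rough sketch of the first option (renaming ".2" to ".1" in the second file), something like the awk command below should work. It assumes strict 4-line FASTQ records and that the read name is the first whitespace-separated field on the "@" and "+" lines. The `demo_` file names are made up for illustration, and the sample record is copied from the question.

```shell
# One record copied from the question, standing in for read2.fq.
printf '%s\n' \
  '@SRR10359518.1.2 1 length=26' \
  'TGGGTATCCTAAGTTTCTGGGCTAAN' \
  '+SRR10359518.1.2 1 length=26' \
  '??????????????????????????' > demo_read2.fq

# Lines 1 and 3 of every 4-line record carry the read name.
# Rewrite a trailing ".2" on the first field to ".1"; other lines pass through.
awk 'NR % 4 == 1 || NR % 4 == 3 { sub(/\.2$/, ".1", $1) } { print }' \
  demo_read2.fq > demo_read2_renamed.fq

cat demo_read2_renamed.fq
```

Keying on the line position (NR % 4) rather than on a leading "@" avoids touching a quality line that happens to start with "@".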

There are many ways to accomplish this. A Perl one-liner that removes the numeric suffix from the read ID is:

perl -lane 'if(/(^\@SRR[\d]+\.[\d]+)/){$F[0] = $1}{print join(" ",@F);}' read1.fastq > r1_ns.fastq

This uses Perl's command-line switches (see perlrun) to:

  • read the file line by line and autosplit each line on whitespace into the @F array
  • check whether the line starts with a read ID, capturing everything up to, but not including, the final numeric suffix
  • if it does, replace the read ID element of the @F array ($F[0]) with the captured suffix-less version
  • print every line by rejoining the @F array with spaces

It's a little messy because the unaltered read ID is still present on the optional third line of each FASTQ record (the "+" line), but if your read IDs all have the same structure, it should work. Run the same command on read2.fastq as well, so that both files end up with matching names. You should feel free to explore and try your own solution to this sort of brain puzzle! I'm sure there's a simpler way.
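One possibly simpler route, sketched here with awk rather than Perl, strips the suffix from the header and reduces the third line to a bare "+" (repeating the name there is optional), then counts how many mate names still differ. It assumes strict 4-line records with both files in the same order. All `demo_` file names are invented for the example.

```shell
# One-record stand-ins for read1.fq and read2.fq, copied from the question.
printf '%s\n' \
  '@SRR10359518.1.1 1 length=26' 'TTATGAAATTCCTAGGCAAATGGATG' \
  '+SRR10359518.1.1 1 length=26' '??????????????????????????' > demo_r1.fq
printf '%s\n' \
  '@SRR10359518.1.2 1 length=26' 'TGGGTATCCTAAGTTTCTGGGCTAAN' \
  '+SRR10359518.1.2 1 length=26' '??????????????????????????' > demo_r2.fq

for f in demo_r1 demo_r2; do
  # Line 1 of each record: drop the trailing ".1"/".2" from the read name.
  # Line 3 of each record: keep only "+".
  awk 'NR % 4 == 1 { sub(/\.[12]$/, "", $1) }
       NR % 4 == 3 { $0 = "+" }
       { print }' "$f.fq" > "${f}_ns.fq"
done

# Pull the read names out of both cleaned files and count the mismatches.
awk 'NR % 4 == 1 { print $1 }' demo_r1_ns.fq > demo_r1.names
awk 'NR % 4 == 1 { print $1 }' demo_r2_ns.fq > demo_r2.names
paste demo_r1.names demo_r2.names |
  awk '$1 != $2 { n++ } END { print n + 0 }' > demo_mismatches.txt
cat demo_mismatches.txt
```

A count of 0 means every read in the first file lines up with a same-named mate in the second, which is what bwa mem was complaining about.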
