Hello,
I am attempting to align sequence reads (fastq) to the PhiX genome in order to remove reads that map to the PhiX genome before going into my assembly. I have performed the alignment using BWA, and then extracted the unmapped reads from the alignment by using this command:
samtools view -f4 aln_B51_2phix.sam > aln_B51_2phix.unmapped.sam
I then convert the reads to fastq using this command:
cat aln_B51_2phix.unmapped.sam | grep -v ^@ | awk 'NR%2==1 {print "@"$1"_1\n"$10"\n+\n"$11}' > unmapped/aln_B51_2phix_0.fastq
and
cat aln_B51_2phix.unmapped.sam | grep -v ^@ | awk 'NR%2==0 {print "@"$1"_1\n"$10"\n+\n"$11}' > unmapped/aln_B51_2phix_0.fastq
I have a script which typically adds /1
/2
to the headers, and I added these to the reads before I did the alignment, but after the alignment they are no longer included in the fastq headers. I am unable to add the /1
/2
headers to these fastq files, even though they should be in the same format as the input fastq files for the alignment at this point.
I need the headers to include the /1
/2
designation for forward and reverse reads in order to perform the assembly. Does anyone know if I am doing something incorrectly or if there is a much simpler way to do all this?
Thanks!
As a side note, BBMap can output fastq streams directly, making the process easier:
Though I generally use BBDuk for phiX removal instead, as it's faster:
In each case the original read names are maintained. You can also simplify arguments like this:
...as long as the files are named 1 and 2.