I want to ensure I can convert my BAMs back to FASTQ without any loss of data. However, I have noticed that, when running samtools fastq, the reads that come out look different from the reads I originally aligned. In particular, they seem to have lost the second segment of the read ID that contains the index sequence. For example, let's say the original reads looked like this:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Once converted back, they look more like:
@EAS139:136:FC706VJ:2:2104:15343:197393
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
In addition, when I look at the BAM file, post alignment, I see that this part of the read ID isn't there either. So I have to assume BWA is stripping them out. Why would it do this? Is there any way to make it preserve all data?
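Aligners generally use only the part of the header up to the first whitespace as the read name. If you want to keep the full header, you need to replace the space with some other character (an underscore, for example) before aligning.
But make sure it doesn't get too long for the specification: SAM limits the read name (QNAME) to 254 characters.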
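As a rough illustration of the space-replacement idea, here is a minimal Python sketch. The function names (join_header, rewrite_fastq) and the underscore separator are my own choices for the example, not part of any tool, and it assumes plain four-line FASTQ records with no wrapped sequence lines. The 254-character check reflects the QNAME limit in the SAM specification.

```python
MAX_QNAME = 254  # QNAME length limit from the SAM specification


def join_header(header, sep="_"):
    """Join the read name and the comment of one FASTQ header line.

    Hypothetical helper: replaces the first space with `sep` so that the
    aligner keeps the whole header as the read name.
    """
    name, _, comment = header.rstrip("\n").partition(" ")
    joined = f"{name}{sep}{comment}" if comment else name
    # The leading '@' is not part of the read name, so exclude it.
    if len(joined) - 1 > MAX_QNAME:
        raise ValueError(f"read name longer than {MAX_QNAME} characters: {joined}")
    return joined


def rewrite_fastq(lines, sep="_"):
    """Yield FASTQ lines with every header (1st line of each 4) rewritten.

    Assumes strict four-line records.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield join_header(line, sep) if i % 4 == 0 else line


if __name__ == "__main__":
    import sys

    # Usage (file names are placeholders): python join_headers.py < in.fq > out.fq
    for out_line in rewrite_fastq(sys.stdin):
        print(out_line)
```

One thing to watch with paired-end data: after joining, the R1 and R2 names differ in the read-number field (1:... vs 2:...), and some aligners may complain about mismatched mate names, so it is worth testing on a small subset first.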
reformat.sh in=your.bam out1=R1.fq.gz out2=R2.fq.gz
This command, from the BBMap suite, should preserve the header as is, provided your alignments retain the information about which reads are R1 and which are R2.