I've been working with SRA sequence data, and the bwa-derived alignment stubbornly give me the following error message when assessed using the Picard ValidateSamFile tool:
ERROR: MISMATCH_FLAG_MATE_NEG_STRAND <thousands>
ERROR: MISSING READ GROUP 1
Initially, I thought that it was due to having excessively stringent filters, but when I run bwa on unfiltered fastq files, I get precisely the same error. To make sure that filtering was not the issue, the only processing I applied to the two paired-read fastq files was fastp to remove tags (specifically with no filter for quality or length).
The reads in the fastq files could be out of order. Several colleagues warned me with using SRA files decoded by fastq-dump may result in misordered reads, or at least loss of read headers. I used the "raw" fastq files from EBI-ENA to avoid this issue, but even there the reads are minimally labeled, e.g. the SRA identifier and an indicator of line number for each read:
@SRR8439156.2500001 2500001/1
@SRR8439156.2500002 2500002/1
How can I validate that the paired end reads in files 1 and 2 are in the same order?
Alternatively, there may have been a problem when I split the files. I used the default Linux split command, i.e.
split -d -l 2000000 <my.fastq> <out.fastq>
This is fairly generic and so I fail to see what could have gone wrong unless the read pairs were out of order in the two files to begin with.
I'm at a loss of what else I could be doing wrong that contributes to the Missing Read Group and Mismatched Mate end errors, barring some fundamental error in the raw data itself that leads to problems following the split, since processing is minimal. Is there anything else that I can try, e.g. some way to determine if the read order is inconsistent and correcting it?
Remark: I used an older BWA, with aln rather than mem, in order for my output to be consistent with the alignments already in our research group's data set, using (where ${ref} is the path to the reference genome used). Since bwa aln works fine with other data sets, I doubt this is the issue.
bwa aln -t 8 ${ref}.fasta SRR8439151.1_R1_Block83.fastq > SRR8439151.1_R1_Block83.sai
bwa aln -t 8 ${ref}.fasta SRR8439151.1_R2_Block83.fastq > SRR8439151.1_R2_Block83.sai
bwa sampe -P ${ref}.fasta SRR8439151.1_R1_Block83.sai SRR8439151.1_R2_Block83.sai SRR8439151.1_R1_Block83.fastq SRR8439151.1_R2_Block83.fastq > SRR8439151.1_Block83.sam
Last, if I am unable to fix this, how am I to interpret the Missing Read Group error? Can I still work with those reads that were successfully mapped/aligned for the purposes of variant calling by forcing the .bam through the pipeline with the errors, or would anything derived from alignments with such errors be suspect?
You can use
repair.sh
tool from BBMap suite to do this. There is a guide for that.That may not have been a wise choice. What was the reason you were splitting the files? You can use
reformat.sh
from BBMap suite orseqtk seq
to split files.Those errors are because you did not add read groups when you aligned the data. See this page for an explanation of read groups.
You can try re-dumping your data using
-F
option (which will create original Illumina format read headers). This is not guaranteed to work if submitters did not submit the sequence in original format.Could you please clarify why using the Linux/Unix split command is problematic for splitting fastq files? It seems to me that all there is to it is splitting a file by line counts and breaks since fastq/fasta are text flat files.
I used the AddOrReplaceReadGroups function in Picard to put in a placeholder string for the missing read groups (the headers of the SRA fastq did not contain any information to assign these), but the error message noted above persisted.
If your reads were not in sync to begin with then splitting them could potentially have exacerbated that problem.