BWA-MEM - Improperly Paired Reads
1
0
Entering edit mode
4.2 years ago
mglasena ▴ 40

After aligning paired-end 100bp reads to a reference genome, I am getting very low properly paired percentage:

369208441   0   total (QC-passed reads + QC-failed reads)

8985531 0   secondary

289733341   0   mapped

78.47%  N/A mapped %

360222910   0   paired in sequencing

180111455   0   read1

180111455   0   read2

1393338 0   properly paired

0.39%   N/A properly paired %

280747810   0   with itself and mate mapped

0   0   singletons

0.00%   N/A singletons %

39590468    0   with mate mapped to a different chr

0   0   with mate mapped to a different chr (mapQ>=5)

I followed GATK best practices to align paired-end short-read data to a reference genome. I downloaded the short-read data from NCBI SRA into fastq files using SRA toolkit's fastq-dump, converted the fastq files into unmapped bam using Picard FastqToSam, and marked adapters using Picard MarkIlluminaAdapters. I then piped Picard SamToFastq, bwa mem, and Picard MergeBamAlignment. To get stats on the alignment, I used samtools flagstat. For several of my samples, the alignment went great (90% mapped, 80% properly paired). However, for a couple of my samples, the properly paired percentage was well below 1%. I'm wondering how I could have a normal amount of reads mapping (~78%) but have only .39% of those reads properly paired.

I have double-checked that my fastq files from fastq-dump have identical read counts, and that they are properly interleaved after Picard FastqToSam. I additionally ran Picard ValidateSamFile to troubleshoot the file output by MergeBamAlignment and found no errors.

bwa next-gen genome alignment gatk • 4.8k views
ADD COMMENT
0
Entering edit mode

check some of your improperly paired reads. Usually they get marked like that only if either one of the reads does not map or if both get mapped and the distance between the outermost coordinates is either too large or too little.

it is possible that your reads overlap (a good deal) and get marked as improper pair for that reason.

ADD REPLY
0
Entering edit mode

I found the source of my issue. The data I've downloaded from NCBI is supposed to be paired-end reads of 100 bp length. However, the forward and reverse reads are identical, which must have been a mistake during NBCI SRA submission. Am I just out of luck at this point? Or should I contact the researcher who submitted the data to NCBI SRA?

ADD REPLY
0
Entering edit mode

yes, contact the original submitter and tell NCBI (via the info email address) as well, they may put more pressure on the submitter

ADD REPLY
0
Entering edit mode

The data I've downloaded from NCBI is supposed to be paired-end reads of 100 bp length. However, the forward and reverse reads are identical, which must have been a mistake during NBCI SRA submission.

Which accession number is this so we can take a look and let you know for sure. Did you use --split-files option when you dumped the reads from SRA using fastq-dump?

ADD REPLY
0
Entering edit mode

Also look at these lines:

39590468 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

seems to indicate that your mates map to different chromosomes

ADD REPLY
0
Entering edit mode

Please don't delete posts that have already gotten answers and generated discussion. Biostars is meant to serve as a repository of good questions and answers, so it's a shame to see any of them removed.

ADD REPLY
0
Entering edit mode
4.2 years ago

converted the fastq files into unmapped bam using Picard FastqToSam, and marked adapters using Picard MarkIlluminaAdapters. I then piped Picard SamToFastq,

"Marked" adapters? Did you trim them away? Was this marking carried over into the fastq names? If not, didn't you lose it when you converted back to fastq?

You probably goofed the fastq ordering converting to and from fastq format.

ADD COMMENT
0
Entering edit mode

Picard MarkIlluminaAdapters adds the XT tag to a read record to mark the 5' start position of the specified adapter sequence. Picard SamToFastq changes the quality scores of bases marked with an XT tag to two, which should effectively prevent the adapter portion from contributing to read alignment and metrics. Picard MergeBamAlignment restores the original base qualities. I've been following this gatk tutorial - "(How to) Map and clean up short short read sequence data efficiently."

I will double-check the fastq conversion steps.

ADD REPLY
0
Entering edit mode

to be honest with you, that tutorial that you linked and seem to follow appears to advise an over-engineered approach that seems unnecessarily complicated and probably has no effect whatsoever. these back and forth conversions ... are silly

altering data to manufacture artificial quality scores to avoid alignment then restoring the qualities just feels outlandish. As far as I know, bwa would not take quality scores into account when performing the alignment in the first place, so what would even be the point of altering the qualities?

In fact bwa mem will align the original reads just fine, adapters and all, will soft clip the adapters none of the actions should be needed.

Regardless, this was not your original problem, but I do feel the need to point out that I think the approach described there appears to be unnecessarily complicated.

ADD REPLY
0
Entering edit mode

I appreciate the tips. I'm just getting started with genome analysis.

ADD REPLY

Login before adding your answer.

Traffic: 2674 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6