After aligning paired-end 100 bp reads to a reference genome, I am getting a very low properly paired percentage:
369208441 0 total (QC-passed reads + QC-failed reads)
8985531 0 secondary
289733341 0 mapped
78.47% N/A mapped %
360222910 0 paired in sequencing
180111455 0 read1
180111455 0 read2
1393338 0 properly paired
0.39% N/A properly paired %
280747810 0 with itself and mate mapped
0 0 singletons
0.00% N/A singletons %
39590468 0 with mate mapped to a different chr
0 0 with mate mapped to a different chr (mapQ>=5)
I followed GATK best practices to align paired-end short-read data to a reference genome. I downloaded the short-read data from NCBI SRA into fastq files using SRA Toolkit's fastq-dump, converted the fastq files into unmapped bam using Picard FastqToSam, and marked adapters using Picard MarkIlluminaAdapters. I then piped Picard SamToFastq, bwa mem, and Picard MergeBamAlignment. To get stats on the alignment, I used samtools flagstat. For several of my samples, the alignment went great (90% mapped, 80% properly paired). However, for a couple of my samples, the properly paired percentage was well below 1%. I'm wondering how I could have a normal proportion of reads mapping (~78%) but only 0.39% of paired reads properly paired.
I have double-checked that my fastq files from fastq-dump have identical read counts, and that they are properly interleaved after Picard FastqToSam. I additionally ran Picard ValidateSamFile to troubleshoot the file output by MergeBamAlignment and found no errors.
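For reference, the percentages in the flagstat output above can be reproduced from the raw counts. This is only a quick arithmetic sketch; the variable names are mine, and the choice of denominators (total reads for mapped %, reads paired in sequencing for properly paired %) is inferred from the numbers themselves matching the reported values.

```python
# Counts copied from the flagstat output above (QC-passed column).
total = 369_208_441
mapped = 289_733_341
paired = 360_222_910
properly_paired = 1_393_338
mate_on_other_chr = 39_590_468


def pct(numerator, denominator):
    """Return numerator/denominator as a percentage rounded to 2 decimals."""
    return round(100.0 * numerator / denominator, 2)


mapped_pct = pct(mapped, total)              # share of all reads that mapped
proper_pct = pct(properly_paired, paired)    # share of paired reads flagged proper
other_chr_pct = pct(mate_on_other_chr, paired)

print(f"mapped: {mapped_pct}%")
print(f"properly paired: {proper_pct}%")
print(f"mate on a different chr: {other_chr_pct}%")
```

So roughly one in nine paired reads has its mate on a different chromosome, while almost none are flagged as proper pairs.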
Check some of your improperly paired reads. Usually they get marked like that only if one of the reads does not map, or if both map and the distance between the outermost coordinates is either too large or too small.
It is also possible that your reads overlap (a good deal) and get marked as an improper pair for that reason.
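One way to do that check is to decode the FLAG, RNAME, RNEXT and TLEN fields of a few alignment records. The sketch below is a rough triage helper, not how bwa mem decides pairing: the function name and the `max_insert` cutoff are hypothetical, and the aligner uses its own estimate of the insert size distribution. The flag bits themselves are the standard SAM ones.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
REVERSE = 0x10
MATE_REVERSE = 0x20


def why_not_proper(flag, rname, rnext, tlen, max_insert=1000):
    """Give a rough reason why a read is not flagged as properly paired.

    max_insert is an arbitrary illustrative cutoff (an assumption), not the
    value the aligner actually used.
    """
    if not flag & PAIRED:
        return "not paired in sequencing"
    if flag & PROPER_PAIR:
        return "properly paired"
    if flag & UNMAPPED:
        return "read unmapped"
    if flag & MATE_UNMAPPED:
        return "mate unmapped"
    # In SAM, RNEXT is "=" when the mate is on the same reference as the read.
    if rnext != "=" and rnext != rname:
        return "mate on a different chromosome"
    if bool(flag & REVERSE) == bool(flag & MATE_REVERSE):
        return "both mates on the same strand"
    if abs(tlen) > max_insert:
        return "insert size too large"
    return "other (e.g. unexpected orientation or insert size too small)"


# Example records: (FLAG, RNAME, RNEXT, TLEN)
print(why_not_proper(65, "chr1", "=", 100))      # both forward, same chromosome
print(why_not_proper(97, "chr1", "chr5", 0))     # mate on another chromosome
print(why_not_proper(97, "chr1", "=", 50000))    # far apart on the same chromosome
```

Tallying these reasons over a few thousand records should show quickly whether one category dominates.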
I found the source of my issue. The data I downloaded from NCBI is supposed to be paired-end reads of 100 bp length. However, the forward and reverse reads are identical, which must have been a mistake during NCBI SRA submission. Am I just out of luck at this point? Or should I contact the researcher who submitted the data to NCBI SRA?
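For anyone who wants to test their own files for the same problem, here is a minimal sketch that compares the sequence lines of the two fastq files pair by pair. It assumes plain 4-line fastq records with mates in the same order in both files; the function names and the file names in the usage comment are hypothetical.

```python
import gzip
import itertools


def open_fastq(path):
    """Open a fastq file as text, transparently handling .gz files."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def fastq_sequences(handle):
    """Yield the sequence line (2nd of every 4 lines) of each fastq record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def fraction_identical_mates(handle1, handle2, max_pairs=100_000):
    """Return the fraction of the first max_pairs pairs whose sequences match."""
    pairs = itertools.islice(
        zip(fastq_sequences(handle1), fastq_sequences(handle2)), max_pairs
    )
    total = 0
    identical = 0
    for seq1, seq2 in pairs:
        total += 1
        if seq1 == seq2:
            identical += 1
    return identical / total if total else 0.0


# Usage (file names are placeholders):
# with open_fastq("sample_1.fastq") as r1, open_fastq("sample_2.fastq") as r2:
#     print(fraction_identical_mates(r1, r2))
```

A value close to 1.0 means the two files carry the same reads, as in my case. Genuine mates are read from opposite strands, so even fully overlapping mates would normally be reverse complements of each other rather than identical strings.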
Yes, contact the original submitter, and tell NCBI (via the info email address) as well. They may put more pressure on the submitter.
Which accession number is this, so we can take a look and let you know for sure? Did you use the --split-files option when you dumped the reads from SRA using fastq-dump? Also look at these lines:
seems to indicate that your mates map to different chromosomes
Please don't delete posts that have already gotten answers and generated discussion. Biostars is meant to serve as a repository of good questions and answers, so it's a shame to see any of them removed.