I have been working with some of the genomic BAM files from TCGA downloaded from the GDC. While trying to use some tools, I noticed the BAM files lack any properly paired reads. These files are uploaded as paired-end, so I would assume these would be labeled correctly. I am relatively new to working with BAM files in this capacity, so I am pretty lost as to what to do. Running samtools flagstat on the BAM files shows a large number of QC-passing reads in each, just with 0 properly paired.
I haven't seen any other posts online about this issue so I am not too confident here. Any help would be greatly appreciated!!
Does your flagstat show that there are roughly equal numbers of read1s and read2s, but they are not marked as properly paired, or that all the reads are marked as read1?
These are the flagstat results for the BAM file:
This looks to me like it's not just that they aren't "properly paired", but that they aren't paired at all. No reads are marked "read2", and the number of alignments (28 million) isn't really enough for read2s to have been mislabeled as read1s.
We have seen this: some TCGA projects have a number of single-ended samples in them.
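For reference, the flagstat categories discussed here come from bits in the SAM FLAG field: 0x1 (paired), 0x2 (properly paired), 0x40 (read1) and 0x80 (read2). Below is a minimal Python sketch of that kind of tally. The function name `tally_pairing` and the example flag values are made up for illustration and are not taken from the BAM in question; samtools flagstat itself applies further conditions that this sketch ignores.

```python
# Bit values from the FLAG field of the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
READ1 = 0x40
READ2 = 0x80


def tally_pairing(flags):
    """Count how many FLAG values have each pairing-related bit set.

    Simplified illustration only: each bit is counted independently.
    """
    counts = {"total": 0, "paired": 0, "proper_pair": 0, "read1": 0, "read2": 0}
    for flag in flags:
        counts["total"] += 1
        if flag & PAIRED:
            counts["paired"] += 1
        if flag & PROPER_PAIR:
            counts["proper_pair"] += 1
        if flag & READ1:
            counts["read1"] += 1
        if flag & READ2:
            counts["read2"] += 1
    return counts


# Hypothetical single-end flags: forward (0), reverse (16), unmapped (4).
single_end = tally_pairing([0, 16, 4])

# Hypothetical paired-end flags: a properly paired read1 (99) and read2 (147).
paired_end = tally_pairing([99, 147])
```

With single-end data none of the pairing bits are set, so the paired, read1, read2 and properly paired counts all stay at zero, whereas a paired-end library would show roughly equal read1 and read2 counts.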
That all makes sense! Thank you so much. If there are single-ended samples, that would explain it perfectly. I really appreciate the help.
This shouldn't be happening.
Try checking the bam by reextracting the reads:
Also try samtools stats on your bam, then multiqc on the resulting directory, to get more info.
I can't reextract the reads unfortunately because TCGA only supplies the BAM files.
I have done samtools stats and the initial stat summaries are below; however, I'm not sure what you mean by "resulting directory", as it just outputs the results as text on the command line.
Nothing seems to jump out from the samtools stats run as out of place to me, though again I am new-ish to working with BAM files like this.
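On the "resulting directory" question: samtools stats writes its report to standard output, so one plausible reading is that the output should be saved to a file in some directory, and multiqc then pointed at that directory. A minimal Python sketch under that assumption follows. The file name `sample.bam`, the directory `qc_out`, and both function names are hypothetical, and the sketch assumes samtools and multiqc are installed and on the PATH.

```python
import subprocess
from pathlib import Path


def build_commands(bam, out_dir):
    """Return (stats_file, samtools_cmd, multiqc_cmd) without running anything."""
    out_dir = Path(out_dir)
    # e.g. sample.bam -> qc_out/sample.stats.txt (hypothetical naming scheme)
    stats_file = out_dir / (Path(bam).stem + ".stats.txt")
    samtools_cmd = ["samtools", "stats", str(bam)]
    # multiqc scans the given directory; -o sets where its report is written
    multiqc_cmd = ["multiqc", str(out_dir), "-o", str(out_dir)]
    return stats_file, samtools_cmd, multiqc_cmd


def run_stats_and_multiqc(bam, out_dir):
    """Save the samtools stats report to a file, then run multiqc on the directory."""
    stats_file, samtools_cmd, multiqc_cmd = build_commands(bam, out_dir)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # samtools stats prints to stdout, so redirect it into the stats file
    with open(stats_file, "w") as handle:
        subprocess.run(samtools_cmd, stdout=handle, check=True)
    subprocess.run(multiqc_cmd, check=True)
    return stats_file


# Hypothetical inputs; nothing is executed until run_stats_and_multiqc is called.
stats_file, samtools_cmd, multiqc_cmd = build_commands("sample.bam", "qc_out")
```

Calling `run_stats_and_multiqc("sample.bam", "qc_out")` would then leave both the stats text file and the MultiQC report in `qc_out`.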
You can reextract FASTQ reads from a bam file, I do it frequently with the command I gave you.
Yes, sorry, you are correct. What I meant was that there are no paired reads or reads labeled as R1 or R2, so I don't think separating them into two FASTQ files will work. A response above suggests TCGA contains some single-ended data, which makes sense here. Thanks for the help!
So I tried downloading two more BAM files from two different TCGA projects (the original was from TCGA-READ, the second from TCGA-BRCA, the third from TCGA-HNSC) and they both showed proper pairing...