How to pair TCGA bam files with no properly paired reads?
2.2 years ago

I have been working with some of the genomic BAM files from TCGA downloaded from the GDC. While trying to use some tools, I noticed that the BAM files lack any properly paired reads. These files are listed as paired-end, so I assumed the reads would be flagged accordingly. I am relatively new to working with BAM files in this capacity, so I am pretty lost as to what to do. After running samtools flagstat on the BAM files, each shows a large number of QC-passed reads, but 0 properly paired.

I haven't seen any other posts online about this issue so I am not too confident here. Any help would be greatly appreciated!!

GDC TCGA Paired-Read BAM • 1.9k views
ADD COMMENT
1
Entering edit mode

Does your flagstat show that there are roughly equal numbers of read1s and read2s that are just not marked as properly paired, or that all the reads are marked as read1?

ADD REPLY
0
Entering edit mode

These are the flagstat results for the BAM file:

[Image: samtools flagstat output]

ADD REPLY
1
Entering edit mode

This looks to me like it's not just that they aren't "properly paired", but that they aren't paired at all. No reads are marked "read2", and the number of alignments (28m) isn't really enough for read2s to have been mistaken for read1s.

We have seen this: some TCGA projects have a number of single-ended samples in them.
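
To make the distinction above concrete, here is a minimal Python sketch of how the SAM FLAG bits separate "not paired at all" from "paired but not properly paired". The bit values (0x1, 0x2, 0x40, 0x80) come from the SAM specification; the function names and the example flag lists are made up for illustration, and the counting only loosely mirrors what flagstat reports.

```python
from collections import Counter

# SAM FLAG bits (from the SAM specification)
PAIRED = 0x1        # template has multiple segments (paired in sequencing)
PROPER_PAIR = 0x2   # each segment properly aligned according to the aligner
READ1 = 0x40        # first segment in the template
READ2 = 0x80        # last segment in the template


def summarize_flags(flags):
    """Count pairing-related categories over an iterable of integer FLAG values."""
    counts = Counter(total=0, paired=0, proper_pair=0, read1=0, read2=0)
    for flag in flags:
        counts["total"] += 1
        if flag & PAIRED:
            counts["paired"] += 1
            if flag & PROPER_PAIR:
                counts["proper_pair"] += 1
            if flag & READ1:
                counts["read1"] += 1
            if flag & READ2:
                counts["read2"] += 1
    return counts


def classify_pairing(counts):
    """Rough label for a file based on the counts above (hypothetical helper)."""
    if counts["total"] == 0:
        return "no reads"
    if counts["paired"] == 0:
        return "single-end"
    if counts["proper_pair"] == 0:
        return "paired, none properly paired"
    return "paired"


if __name__ == "__main__":
    # Hypothetical FLAG values, not taken from any real TCGA file.
    print(classify_pairing(summarize_flags([0, 16, 0, 16])))      # unpaired reads only
    print(classify_pairing(summarize_flags([65, 129])))           # paired, bit 0x2 unset
    print(classify_pairing(summarize_flags([99, 147, 83, 163])))  # typical proper pairs
```

A file where the paired count is zero has no reads with bit 0x1 set, which is the single-end situation described here rather than a pairing problem that could be repaired.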

ADD REPLY
0
Entering edit mode

That all makes sense! Thank you so much. If there are single-ended samples, that would explain it perfectly. I really appreciate the help.

ADD REPLY
0
Entering edit mode

This shouldn't be happening.

Try checking the bam by reextracting the reads:

samtools fastq -1 R1.fastq -2 R2.fastq input.bam

Also try samtools stats on your bam, then multiqc on the resulting directory, to get more info.

ADD REPLY
0
Entering edit mode

I can't reextract the reads, unfortunately, because TCGA only supplies the BAM files.

I have run samtools stats, and the initial summary statistics are below. However, I'm not sure what you mean by "resulting directory", as it just prints the results as text on the command line.

Nothing in the samtools stats output jumps out at me as out of place, though again I am new-ish to working with BAM files like this.

[Image: samtools stats summary]

ADD REPLY
0
Entering edit mode

You can reextract FASTQ reads from a BAM file; I do it frequently with the command I gave you.

ADD REPLY
1
Entering edit mode

Yes, sorry, you are correct. What I meant was that there are no paired reads or reads labeled as R1 or R2, so I don't think separating them into those two FASTQ files will work. A response above suggests TCGA contains some single-ended data, which makes sense here. Thanks for the help!

ADD REPLY
1
Entering edit mode

So I tried downloading two more BAM files from two other TCGA projects (the original was from TCGA-READ; the second is from TCGA-BRCA and the third from TCGA-HNSC), and both of the new files showed proper pairing...

ADD REPLY
5
Entering edit mode
2.1 years ago
Zhenyu Zhang ★ 1.2k

These paired-end values were provided by the data submitters, not computed by the GDC. If the data was submitted as individual FASTQs, there are explicit GDC QC steps to validate these values, but if the data was submitted as BAMs (or some old TCGA tarballs), no quick on-the-fly check was performed.

TCGA data is a mixed bag: some very old data/projects tend to come from older Illumina machines, with shorter reads, more read groups, and single-ended sequencing, while newer data tends to have longer, paired-end reads. So my best bet is that you actually have an SE BAM, and the PE labeling you see is wrong. To confirm that, you can go to the GDC legacy archive, look for the input BAM that has the same experimental strategy and comes from the same aliquot, and run flagstat on it.
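
If you end up running flagstat on more than a handful of BAMs, the check can be scripted. Below is a rough Python sketch that parses the plain-text output of samtools flagstat and flags files that look single-ended. It assumes the usual "N + M description" line layout; the function names are hypothetical, and the sample report in the usage section is made up, not output from a real TCGA file.

```python
import re

# Description prefixes in flagstat's text output mapped to short keys.
_WANTED = {
    "in total": "total",
    "paired in sequencing": "paired",
    "read1": "read1",
    "read2": "read2",
    "properly paired": "properly_paired",
}

_LINE = re.compile(r"\s*(\d+) \+ (\d+) (.+)")


def parse_flagstat(text):
    """Return QC-passed counts for the pairing-related lines of a flagstat report."""
    counts = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if not match:
            continue
        qc_passed = int(match.group(1))
        description = match.group(3)
        for prefix, key in _WANTED.items():
            if description.startswith(prefix):
                counts.setdefault(key, qc_passed)
    return counts


def looks_single_end(counts):
    """True when the file has reads but none are flagged as paired in sequencing."""
    return counts.get("total", 0) > 0 and counts.get("paired", 0) == 0


if __name__ == "__main__":
    # Made-up report for illustration only.
    example = """\
28000000 + 0 in total (QC-passed reads + QC-failed reads)
27500000 + 0 mapped (98.21% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
"""
    print(looks_single_end(parse_flagstat(example)))
```

In practice the text would come from something like saving the flagstat output of each BAM to a file and reading it back in; how you collect those reports is up to you.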

ADD COMMENT
