Question

Mutect2 produces 0 results for somatic mutation calling of paired-end reads (BAM)

4

Entering edit mode

2.1 years ago

Chilly ★ 1.3k

I am trying to identify variants of tumour paired-end RNA-seq data using Mutect2. Please ignore more advanced operations such as using tumour and normal matched sample pair or PoN creation. I will use a simple example here.

I used MACS2 to map a pair of fastq files of R1 and R2 on the genome (I used CDS sequences as 'genome', thus MACS2 might able to be used for RNA-seq) and generated a bam file containing paired-end reads. As you can see in the image below it has many variants relative to the genome (e.g. BAM image in IGV: changed from G to A at position 156). But when I try to use mutant2 to identify variants such as position 156 (covering the entire genome)：

java -Xmx1t -jar gatk-package-4.2.6.1-local.jar Mutect2 -R current.genome.fa -I cancer_T1.sorted.bam --native-pair-hmm-threads 50 --max-mnp-distance 0 -O test_cancer_N1.sorted.vcf.gz --bam-output test.bam

The output is 0 variant found:

$ zcat test_cancer_N1.sorted.vcf.gz | head

##fileformat=VCFv4.2
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=FAD,Number=R,Type=Integer,Description="Count of fragments supporting each allele.">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">

$ zcat test_cancer_N1.sorted.vcf.gz | tail

##contig=<ID=CCDS87784.1|Hs109|chrX,length=2697>
##contig=<ID=CCDS87786.1|Hs109|chrX,length=1002>
##contig=<ID=CCDS87789.1|Hs109|chrX,length=471>
##contig=<ID=CCDS87792.1|Hs109|chrX,length=1737>
##contig=<ID=CCDS87793.1|Hs109|chrX,length=1110>
##contig=<ID=CCDS87800.1|Hs109|chrX,length=714>
##filtering_status=Warning: unfiltered Mutect 2 calls.  Please run FilterMutectCalls to remove false positives.
##source=Mutect2
##tumor_sample=cancer_N1
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  cancer_N1

All other paired-end RNA-seq data, regardless of species, produced almost no variant results like the above example. When splitting R1 and R2 into single-end and calling variants separately, a large number of variant results will be produced for each of them. I don't understand what I'm doing wrong with paired-end RNA-seq data, or that Mutect2 was not originally designed for paired-end RNA-seq data.

I know it is wrong to split paired-end into two single-ends and run them separately, it will miss a lot of information. Ask for any valid suggestions to get paired-end RNA-seq data working properly in Mutect2.

Mutect2 RNA-seq mutation paired-end variant • 2.0k views

ADD COMMENT • link 2.1 years ago by Chilly ★ 1.3k

score 0 · Answer 1 · 2022-10-09

0

Entering edit mode

2.1 years ago

Shred ★ 1.5k

Mutect2 isn't designed for RNAseq data at all. You'll get a huge amount of false positives calls and germline mutations identified as somatic.
Despite that, your approach is like shooting in a foot. A paired end sequencing is done to increase reads confidence, and your analysis voluntarily discards every advantage offered by the paired end design. Why not mapping both pair generating a single bam output? I think following GATK Best practices for RNAseq pre-processing (implemented as a NextFlow workflow here ) is a good start point.

ADD COMMENT • link 2.1 years ago by Shred ★ 1.5k

4

Entering edit mode

Hi Shred,

Thanks for your reply! I have some points that I would like to discuss with you further:

Mutect2 isn't designed for RNAseq data at all.

The main reason I favour Mutect2 is that my samples come in pairs, a tumour sample from consenting individuals and a surrounding healthy sample. This may make false-positive results less likely. Do you have any other tool recommendations that can apply matched sample pairs?

A paired end sequencing is done to increase reads confidence, and your analysis voluntarily discards every advantage offered by the paired end design.

I don't understand you. Why do you say my analysis gives up the paired-end advantage? Do you mean switching from paired-end to single-end?

Why not mapping both pair generating a single bam output?

Yes, that's how I did it. But this fails to produce valid variant calling. Suppose we have the following two paired-end seqed mRNAs:

two mRNA

If we treat R1 and R2 separately as single-end RNA-seq, there will be no overlap between the two R1 reads and the two R2 reads. This also means that there would be no potential variant appearing multiple times, which is bad news for variant calling, between these two reads.

and

If keep paired-end data, there will be an overlap between R2a and R1b. This is good news for variant calling between these two reads.

R1R2

So I always think it is necessary to keep the paired-end form instead of splitting it into two single-end data. But I still haven't found the reason why Mutect2 can't detect any variant using a paired-end BAM file, even though I can visualise a lot of variants using IGV.

I think following GATK Best practices for RNAseq pre-processing (implemented as a NextFlow workflow here ) is a good start point

I found some people use GATK Best practices for RNAseq pre-processing before doing Mutect2. But I don't understand the relationship.

In addition, for paired-end RNA-seq data, Can CalliNGS-NF only process R1 and R2 fastq files sequentially?

ADD REPLY • link 2.1 years ago by Chilly ★ 1.3k

0

Entering edit mode

Please ignore more advanced operations such as using tumour and normal matched sample pair or PoN creation

Then later

The main reason I favour Mutect2 is that my samples come in pairs, a tumour sample from consenting individuals and a surrounding healthy sample. This may make false-positive results less likely. Do you have any other tool recommendations that can apply matched sample pairs

It's unclear which is your setup. From your command, you're running Mutect2 using only a tumor sample, without a matched control sample.

If we treat R1 and R2 separately as single-end RNA-seq, there will be no overlap between the two R1 reads and the two R2 reads. This also means that there would be no potential variant appearing multiple times, which is bad news for variant calling, between these two reads

Why this would be a bad news? If you have a paired bam file, each read is linked to its paired one, so they'll be considered together. This is again useful for the later filtering step, where the potential bias between the two pairs in term of detected allele would be corrected (more here).

GATK preprocessing steps are intended to do many things, from deduplication to base quality recalibration (where, again, having a paired end BAM file helps a lot). They wrote a very extended wiki explaining each process in detail and, while the website may appear confusing, their doc isn't.

ADD REPLY • link 2.1 years ago by Shred ★ 1.5k

4

Entering edit mode

Hi Shred,

thank you for your reply. I learned a lot from it. Do you think it's a bad idea to use PEAR or FLASH to concatenate R1 and R2 from the sorted single paired-end BAM file into ONE single read and run Mutect2 as single-end data? I still haven't given up on using Mutect2 and trying to get it working.

ADD REPLY • link 2.1 years ago by Chilly ★ 1.3k

1

Entering edit mode

I’m still far from understanding why you’re scratching your head with such a confused approach. I’d suggest to map fastq pairs into a single bam (STAR is the better choice) do deduplication and base quality recalibration, then run Mutect2.

ADD REPLY • link 2.1 years ago by Shred ★ 1.5k

4

Entering edit mode

Forgive me for laughing so loudly. Yes, I've been scratching my head all day today and complicating matters.

I'll try your suggestion right away. Expecting Mutect2 to give me some result instead of 0.

ADD REPLY • link 2.1 years ago by Chilly ★ 1.3k