I am trying to identify variants of tumour paired-end RNA-seq data using Mutect2. Please ignore more advanced operations such as using tumour and normal matched sample pair or PoN creation. I will use a simple example here.
I used MACS2 to map a pair of fastq files of R1 and R2 on the genome (I used CDS sequences as 'genome', thus MACS2 might able to be used for RNA-seq) and generated a bam file containing paired-end reads. As you can see in the image below it has many variants relative to the genome (e.g. BAM image in IGV: changed from G to A at position 156). But when I try to use mutant2 to identify variants such as position 156 (covering the entire genome):
java -Xmx1t -jar gatk-package-4.2.6.1-local.jar Mutect2 -R current.genome.fa -I cancer_T1.sorted.bam --native-pair-hmm-threads 50 --max-mnp-distance 0 -O test_cancer_N1.sorted.vcf.gz --bam-output test.bam
The output is 0 variant found:
$ zcat test_cancer_N1.sorted.vcf.gz | head
##fileformat=VCFv4.2
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=FAD,Number=R,Type=Integer,Description="Count of fragments supporting each allele.">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
$ zcat test_cancer_N1.sorted.vcf.gz | tail
##contig=<ID=CCDS87784.1|Hs109|chrX,length=2697>
##contig=<ID=CCDS87786.1|Hs109|chrX,length=1002>
##contig=<ID=CCDS87789.1|Hs109|chrX,length=471>
##contig=<ID=CCDS87792.1|Hs109|chrX,length=1737>
##contig=<ID=CCDS87793.1|Hs109|chrX,length=1110>
##contig=<ID=CCDS87800.1|Hs109|chrX,length=714>
##filtering_status=Warning: unfiltered Mutect 2 calls. Please run FilterMutectCalls to remove false positives.
##source=Mutect2
##tumor_sample=cancer_N1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT cancer_N1
All other paired-end RNA-seq data, regardless of species, produced almost no variant results like the above example. When splitting R1 and R2 into single-end and calling variants separately, a large number of variant results will be produced for each of them. I don't understand what I'm doing wrong with paired-end RNA-seq data, or that Mutect2 was not originally designed for paired-end RNA-seq data.
I know it is wrong to split paired-end into two single-ends and run them separately, it will miss a lot of information. Ask for any valid suggestions to get paired-end RNA-seq data working properly in Mutect2.
Hi Shred,
Thanks for your reply! I have some points that I would like to discuss with you further:
The main reason I favour Mutect2 is that my samples come in pairs, a tumour sample from consenting individuals and a surrounding healthy sample. This may make false-positive results less likely. Do you have any other tool recommendations that can apply matched sample pairs?
I don't understand you. Why do you say my analysis gives up the paired-end advantage? Do you mean switching from paired-end to single-end?
Yes, that's how I did it. But this fails to produce valid variant calling. Suppose we have the following two paired-end seqed mRNAs:
If we treat R1 and R2 separately as single-end RNA-seq, there will be no overlap between the two R1 reads and the two R2 reads. This also means that there would be no potential variant appearing multiple times, which is bad news for variant calling, between these two reads.
and
If keep paired-end data, there will be an overlap between R2a and R1b. This is good news for variant calling between these two reads.
So I always think it is necessary to keep the paired-end form instead of splitting it into two single-end data. But I still haven't found the reason why Mutect2 can't detect any variant using a paired-end BAM file, even though I can visualise a lot of variants using IGV.
I found some people use GATK Best practices for RNAseq pre-processing before doing Mutect2. But I don't understand the relationship.
In addition, for paired-end RNA-seq data, Can CalliNGS-NF only process R1 and R2 fastq files sequentially?
Then later
It's unclear which is your setup. From your command, you're running Mutect2 using only a tumor sample, without a matched control sample.
Why this would be a bad news? If you have a paired bam file, each read is linked to its paired one, so they'll be considered together. This is again useful for the later filtering step, where the potential bias between the two pairs in term of detected allele would be corrected (more here).
GATK preprocessing steps are intended to do many things, from deduplication to base quality recalibration (where, again, having a paired end BAM file helps a lot). They wrote a very extended wiki explaining each process in detail and, while the website may appear confusing, their doc isn't.
Hi Shred,
thank you for your reply. I learned a lot from it. Do you think it's a bad idea to use PEAR or FLASH to concatenate R1 and R2 from the sorted single paired-end BAM file into ONE single read and run Mutect2 as single-end data? I still haven't given up on using Mutect2 and trying to get it working.
I’m still far from understanding why you’re scratching your head with such a confused approach. I’d suggest to map fastq pairs into a single bam (STAR is the better choice) do deduplication and base quality recalibration, then run Mutect2.
Forgive me for laughing so loudly. Yes, I've been scratching my head all day today and complicating matters.
I'll try your suggestion right away. Expecting Mutect2 to give me some result instead of 0.