Base Recabiliration Filters out 88% of the reads
1
0
Entering edit mode
6.8 years ago
Sharon ▴ 610

Hi All

When using this command:

java -jar ${GATK}/GenomeAnalysisTK.jar -T SplitNCigarReads -R ${hg38}.fasta -I ${WHERE}/${CURRENT}.dedupped.bam -o ${WHERE}/${CURRENT}.split.bam \
 -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

The output is as follows:

INFO 18:47:27,384 ProgressMeter - done 1.42375677E8
4.4 h 111.0 s 100.0% 4.4 h 0.0 s INFO 18:47:27,384 ProgressMeter - Total runtime 15909.94 secs, 265.17 min, 4.42 hours INFO 18:47:27,394 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 142375677 total reads

(0.00%) INFO 18:47:27,395 MicroScheduler - -> 0 reads (0.00% of

total) failing BadCigarFilter INFO 18:47:27,395 MicroScheduler -

-> 0 reads (0.00% of total) failing MalformedReadFilter INFO 18:47:27,396 MicroScheduler - -> 0 reads (0.00% of total) failing ReassignOneMappingQualityFilter

But when I do base recabliration:

         java -jar ${GATK}/GenomeAnalysisTK.jar \
        -T BaseRecalibrator \
        -R ${hg38}.fasta \
        -I ${WHERE}/${CURRENT}.split.bam \
        -knownSites ${DBSNP} \
        -knownSites ${GOLDINDELS} \
        -o ${WHERE}/${CURRENT}.recal_data.table

I got:

INFO 12:02:24,217 BaseRecalibrator - ...done! INFO 12:02:24,218 BaseRecalibrator - BaseRecalibrator was able to recalibrate 21765617 reads INFO 12:02:24,219 ProgressMeter - done
2.1765706E7 94.9 m 4.4 m 100.0% 94.9 m 0.0 s INFO 12:02:24,220 ProgressMeter - Total runtime 5693.92 secs, 94.90 min, 1.58 hours INFO 12:02:24,220 MicroScheduler - 160165683 reads > were filtered out during the traversal out of approximately 181931389 > total reads (88.04%) INFO 12:02:24,221 MicroScheduler - -> 6632 reads (0.00% of total) failing BadCigarFilter INFO 12:02:24,221 MicroScheduler - -> 72386752 reads (39.79% of total) failing DuplicateReadFilter INFO 12:02:24,223 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 12:02:24,223 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 12:02:24,223 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 12:02:24,223 MicroScheduler - -> 63827082 reads (35.08% of total) failing MappingQualityZeroFilter INFO 12:02:24,224 MicroScheduler - -> 23945217 reads (13.16% of total) failing NotPrimaryAlignmentFilter INFO 12:02:24,225 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

What makes >88% of reads filtered out, if they were not filtered before? and how should I fix this?

Thanks

RNA-Seq Variant Calling • 3.3k views
ADD COMMENT
0
Entering edit mode

lookls like SplitNCigarReads produces malformed reads. what's your version of gatk ?

ADD REPLY
0
Entering edit mode

GATK3. I think we have 0 reads (0.00% of total) failing MalformedReadFilter before and after the SplitNCigarRead.

ADD REPLY
0
Entering edit mode

what the major-minor version of gatk please . GATK3.* ?

ADD REPLY
0
Entering edit mode

I don't know how to check this Pierre Lindenbaum ?

ADD REPLY
1
Entering edit mode
6.8 years ago

This is from an RNA Seq experiment right, and you're attempting to call variants? AFAIK, SplitNCigarReads corrects the CIGAR string when reads span splice junctions, and forces filters for MalformedReadFilter and ReassignOneMappingQualityFilter, of which all of your reads seems to have passed. When you're running BQSR, you're losing a lot of reads from Duplicates (expected as you have a quantitive experiment). The major concern for me here is the MappingQualityUnavailableFilter, how did you perform your alignments?

ADD COMMENT
0
Entering edit mode

I used STAR for alignment, same as GATK pipeline for variant calling. So does this mean there is too many reads that are mapped but with low quality? But if yes, how can I fix. I think the default parameters of STAR are relaxed enough, making them more stringent will lead to more unmapped. Or I am thinking in a wrong direction? Thanks a lot.

ADD REPLY
0
Entering edit mode

The issues are these:

72386752 reads (39.79% of total) failing DuplicateReadFilter

Too many duplicate reads

--------------------------

63827082 reads (35.08% of total) failing MappingQualityZeroFilter

Too many reads with MAPQ of nil / zero. You'll have to see what MAPQ=0 means according to STAR

--------------------------

23945217 reads (13.16% of total) failing NotPrimaryAlignmentFilter

Too many secondary reads

ADD REPLY
0
Entering edit mode

Thanks alot, but is there a way to fix this? How to fix multimapping, duplicates, and secondary reads. Those are related to the orignal reads right?

ADD REPLY
0
Entering edit mode

It may help to explain the nature of the experiment - is it RNA-seq variant calling, like Andrew said?. If it is indeed RNA-seq, then many duplicates are expected.

Are you following step-by-step the pipeline by the GATK: https://gatkforums.broadinstitute.org/gatk/discussion/3892/the-gatk-best-practices-for-variant-calling-on-rnaseq-in-full-detail ?

ADD REPLY
0
Entering edit mode

Modified:mYes, following it step by step. I think those issues are related to the quality of the original reads? Also the link you posted could be a reason. So my question is it safe to ignore the base recalibration step? and move with the bam file after splitncigar? The base recabilration is discarding 88% of the reads !

ADD REPLY
0
Entering edit mode

Do you have stats using FASTQC on your raw FASTQ reads? 5PM here - now going home!

ADD REPLY
0
Entering edit mode

I do apologize Kevin, I feel I pushed you, really did not mean to, so sorry.
Here is a snapshot of the quality, just in case you get a chance :) It is not terrible to the extent to get 88% discarded.

https://ibb.co/iSTNBS

ADD REPLY
1
Entering edit mode

Now safe in the comfort of my overpriced home. Nothing out of the ordinary on the FASTQC image. My own opinion is that the base quality recalibration step is not needed at all. Even on the page where the pipeline flow diagram is located, they say:

For one thing, these recommendations are based on high quality RNA-seq data (30 million 75bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, we have currently worked only on data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.

For me, that basically means that modifications are expected.

ADD REPLY
1
Entering edit mode

I would also look up the default parameter settings for the base recalibration step in order to see what could have been resulting in the filtering out of so many reads.

ADD REPLY
0
Entering edit mode

Thanks Kevin so much. Much appreciated :)

ADD REPLY
0
Entering edit mode

Thanks a lot andrew.j.skelton73 :)

ADD REPLY

Login before adding your answer.

Traffic: 2728 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6