Hello!
I've googled for hours but haven't found a solution that worked for me. I have some raw human DNA paired-end data in fastq format, then I use these programs in this order:
Trimmomatic (trim the reads) BWA (map the reads to hg19) SAMtools (sorting and indexing) Picard (duplicate marking) GATK (indel realignment) GATK (base recalibration) SAMtools (mpileup) VarScan
Everything looks good until the mpileup step. The output files are empty. I did a flagstat on the bam files that I input to mpileup, and they look mostly like this:
2136672 + 0 in total (QC-passed reads + QC-failed reads)
12002 + 0 secondary
0 + 0 supplementary
1370901 + 0 duplicates
2131283 + 0 mapped (99.75% : N/A)
2124670 + 0 paired in sequencing
1062335 + 0 read1
1062335 + 0 read2
2032962 + 0 properly paired (95.68% : N/A)
2115336 + 0 with itself and mate mapped
3945 + 0 singletons (0.19% : N/A)
5880 + 0 with mate mapped to a different chr
3537 + 0 with mate mapped to a different chr (mapQ>=5)
This looks good right? What am I missing? I tried adding the -A flag to mpileup because that should include improperly mated read pairs (but judging by the flagstat output, I should not have to do this), it changes nothing.
Commands in order:
java -jar ${EBROOTTRIMMOMATIC}/trimmomatic-0.38.jar PE "$f1" "$f2" "TRIMMED/${sample}.fp.gz" "TRIMMED/${sample}.fu.gz" "TRIMMED/${sample}.rp.gz" "TRIMMED/${sample}.ru.gz" "$illuclip"
bwa mem -t 8 -M -R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA" "$refGenome" "$read1" "$read2" | samtools view -Sb - > "MAPPED/${sample}.bam"
samtools sort "MAPPED/${sample}.bam" > "SORTED/${sample}.bam"
samtools index "SORTED/${sample}.bam" "SORTED/${sample}.bai"
java -jar "${EBROOTPICARD}/picard.jar" MarkDuplicates I="SORTED/${sample}.bam" O="DUPLICATES/MARKED/${sample}.bam" M="QC/PICARD/${sample}.txt"
samtools index "DUPLICATES/MARKED/${sample}.bam" "DUPLICATES/MARKED/${sample}.bai"
java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T RealignerTargetCreator -R "$refGenome" "${knownI[@]}" -I "$i" --filter_reads_with_N_cigar -o "$t"
java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T IndelRealigner -R "$refGenome" "${knownI[@]}" -I "$i" -targetIntervals "$t" -o "INDELSREALIGNED/${sample}.bam"
samtools index "INDELSREALIGNED/${sample}.bam" "INDELSREALIGNED/${sample}.bai"
java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T BaseRecalibrator -R "$refGenome" "${knownS[@]}" -I "$i" -o "$o"
java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T BaseRecalibrator -R "$refGenome" "${knownS[@]}" -I "$i" -BQSR "$o" -o "$p"
java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T AnalyzeCovariates -R "$refGenome" -before "$o" -after "$p" -plots "BASERECALIBRATED/${sample}.pdf"
java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T PrintReads -R "$refGenome" -I "$i" -BQSR "$o" -o "BASERECALIBRATED/${sample}.bam"
samtools mpileup -A -B -d 1000 -f "$refGenome" -o "PILEDUP/${sample}.mup" "BASERECALIBRATED/${sample}.bam"
java -Xmx3G -jar "${EBROOTVARSCAN}/VarScan.v2.4.1.jar" mpileup2snp "PILEDUP/${sample}.mup" --min-var-freq 0.001 --strand-filter 0 --p-value 0.05 > "VARCALLS/$sample.snp"
I'm newly graduated so don't have much experience. I just like to program, heh.
Big thanks in advance!!
Hi Joel Wallenius, show the relevant command please ( Brief Reminder On How To Ask A Good Question )
Changed the title and added command!
That doesn't look like the (current?) GATK best practices to me.
Changed the title and added command!
It will be good if you are able to paste all your commands as we can see the parameters used. what are the commands for Varscan ?
Also, generally, it is not a great idea to mix and match up these programs that were developed by independent groups. If you want to use GATK, then follow their best practices guide and use HaplotypeCaller to call variants. As far as I am aware, the indel realignment step is neither now necessary as a separate step in the GATK 4 release - you should check.
As an example of where this may go wrong, GATK does base recalibration separately, but SAMtools can also do it during the mpileup command. One should not be doing it twice (?).
The super computer I'm working with does not support the latest GATK so I have to use 3.6. The best practices guide includes the indel realignment and base recalibrations as far as I can see on their web page. Unless SAMtools mpileup performs the recalibration automatically without being asked to, I am surely not doing it twice. Advice noted though, and thank you!
conda
is your friend. https://anaconda.org/bioconda/gatk4Or what do you mean by "does not support"?
They use a module system, you load modules more or less like you'd activate virtual envs. The latest GATK module is 3.7 or something. I could maybe ask the admins to install the latest one but that would take a while, probably. How much do I stand to gain in practice from an update? The problem I'm having is with mpileup, after all (though I guess maybe mpileup misbehaves because of the output from GATK's baserecalibration?).
Yes, modules are common on HPCs but you have a
$HOME
folder, no? There you can install whatever you like locally and link it to your$PATH
.conda
is a package manager to do exactly that.Yes true but this script is supposed to be used by the rest of my research group and they aren't very savvy, if you know what I mean. I'd like to make it as easy as possible for them, and installing a bunch of packages as a first step is... well, it's not a priority right now at least. Perhaps in a few weeks if I don't receive new tasks...
To avoid to install tools locally in your home (and also to avoid that your colleagues install themselves tools in their home) you may try to build a singularity image with all the tools you want. Then share the image with your colleagues. I'm sure your HPC has a module to use singularity images.
The commands for VarScan do not matter as it's a later step. I will paste the commands tomorrow when I'm at work again!
Added all the commands in order!
How was the library done? Is this amplicon based? I'm asking because you have a very high duplication rate and in case of amplicon based data, you cannot mark duplicates.
I'm not sure actually, I only know that the input at the start of my script is in .fastq.gz format, that the reads are paired, and produced by an Illumina machine. FWIW, all the reads and all the samples have the same adapter sequences. Does that supply enough information?
So is it WGS or WES or targeted ?
Your question gave me an idea. I am using a file with targets in a BED-format, and the chromosome names in that file were like 1,2,3... rather than chr1, chr2, and so on... that's why the output was empty. :[ The shame is real. At least the problem is solved now...