Entering edit mode
4.1 years ago
Fid_o
▴
40
Greetings,
I want to do SNP analysis (variant call) using the steps below which I got from https://approachedinthelimit.wordpress.com/2015/10/09/variant-calling-with-gatk/
These steps use short reads. But I am working on assembled whole genome bacterial sequences, how do I go about it using assembled whole genome sequences?
Thank you
#Index reference
bwa index reference.fasta
#Sort reference
samtools faidx reference.fasta
#Create sequence dictionary
java -jar ~/bin/picard-tools-1.8.5/CreateSequenceDictionary.jar REFERENCE=reference.fasta OUTPUT=reference.dict
#Align reads and assign read group
bwa mem -R “@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01” reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam
#Sort sam file
java -jar ~/bin/picard-tools-1.8.5/SortSam.jar I=aln.sam O=sorted.bam SORT_ORDER=coordinate
#Mark duplicates
java -jar ~/bin/picard-tools-version/MarkDuplicates.jar I=sorted.bam O=dedup.bam METRICS_FILE=metrics.txt
#Sort bam file
java -jar ~/bin/picard-tools-version/BuildBamIndex.jar INPUT=dedup.bam
#Create realignment targets
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I dedup.bam -o targetintervals.list
#Indel realignment
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I dedup.bam -targetIntervals targetintervals.list -o realigned.bam
#Call variants (HaplotypeCaller)
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o raw.vcf
The resulting vcf file will contain your variant calls!
Then you can optionally filter the variants:
#Filter variants
~/bin/vcflib/bin/vcffilter -f ‘DP > 9’ -f ‘QUAL > 10’ raw.vcf > filtered.vcf
Or first split the raw.vcf file into SNPs and indels:
#Extract SNPs
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType SNP -o snps.vcf
#Extract Indels
java -jar ~/bin/GATK/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType INDEL -o indels.vcf
Would you like to tally the differences in the different sequences? I.e. multiple alignments and phylogenetic tree construction? Or something else?
Yes, absolutely that - tally the differences. That is what I want to do - multiple alignments and phylogenetic tree construction
GATK3.3, vcffilter , ... this is a very very old pipeline
Thanks Pierre, would you advise what newer pipeline there is?