Hi,
Recently, I have tried to construct a pipeline (as the following lines show) to call snp after reading the workflow of GATK and samtools for a species do not have a reference dbsnp,.
And there are several problem confusing me:
(1) If I have more than one sample with repetition (i.e. sample_1_1, sample_1_2 for sample_1 and sample_2_1, sample_2_2 for sample_2), what should I do?
(2) Because of the absence of reference dbsnp, It's necessary to take the step "quality score recalibration" ?
(3) Which parametrics in the following step are very important?
########################################
############### Mapping ################
########################################
################
#(1) build index
################
bwa index -a bwtsw -p <reference> <reference.fa>
################
#(2) Sort reference
################
samtools faidx reference.fasta
################
#(3) Create sequence dictionary
################
java -jar CreateSequenceDictionary.jar REFERENCE=reference.fa OUTPUT=reference.dict
################
#(4) Align samples to reference genome with bwa mem
################
bwa mem -M -t 12 -R '@RG\tID:uniq_ID\tSM:Sample_name\tLB:Sample_name_lib_num\tPL:ILLUMINA' \
reference left.fq right.fq | \
samtools view -bS - > Sample_name_lib_num.bam
################
#(5) Fix Mate Information
################
samtools fixmate -O bam Sample_name_lib_num.bam Sample_name_lib_num.fixmate.bam
################
#(6) Sort BAM file
################
samtools sort -T /tmp/Sample_name_lib_num.sorted -o Sample_name_lib_num.sorted.bam Sample_name_lib_num.fixmate.bam
########################################
############# Improvement ##############
########################################
###############
#(7)Splits reads with Ns in CIGAR
###############
java –jar GenomeAnalysisTK.jar \
–T SplitNCigarReads \
–R reference.fa \
–I Sample_name_lib_num.sorted.bam \
–o Sample_name_lib_num.sorted.SplitNCigar.bam \
–U ALLOW_N_CIGAR_READS \
–rf ReassignOneMappingQuality \
–RMQF 255 \
–RMQT 60
################
#(8)Identify target regions for realignment
################
java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
-R reference.fa \
-I Sample_name_lib_num.sorted.SplitNCigar.bam \
-ip 50 \
-o Sample_name_lib_num.sorted.SplitNCigar.intervals
################
#(9) Perform realignment
################
java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner \
-R reference.fa \
-I Sample_name_lib_num.sorted.SplitNCigar.bam \
-targetIntervals Sample_name_lib_num.sorted.SplitNCigar.intervals \
-output Sample_name_lib_num.realigned.bam
################
#(10) fixes mate information that may be changed during the realignment process
################
java -Djava.io.tmpdir=/tmp/ \
-jar picard.jar FixMateInformation \
INPUT=Sample_name_lib_num.realigned.bam \
OUTPUT=Sample_name_lib_num.realigned.fixed.bam \
SO=coordinate \
VALIDATION_STRINGENCY=LENIENT \
CREATE_INDEX=true
################
#(11) Build the recalibration model
################
java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
-R reference.fa \
-I Sample_name_lib_num.realigned.fixed.bam \
-ip 50 \
-o Sample_name_lib_num_recalibration.table
################
#(12) Apply the recalibration model
################
java -Xmx4g -jar GenomeAnalysisTK.jar -T PrintReads \
-R reference.fa \
-I Sample_name_lib_num.realigned.bam \
-BQSR Sample_name_lib_num_recalibration.table \
-o Sample_name_lib_num.realigned.recal.bam
################
#(13) Mark Duplicates
################
java -jar $PICARD MarkDuplicates \
I=Sample_name_lib_num.realigned.recal.bam \
O=Sample_name_lib_num.improvement.bam \
M=Sample_name_lib_num_dedup_metrics.txt \
CREATE_INDEX=true
################
#(14) index BAM using samtools
################
samtools index Sample_name_lib_num.improvement.bam
########################################
########## Variant Calling #############
########################################
###################
## (15) Call variants using HaplotypeCaller in GVCF mode.
###################
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fa \
-I Sample_name_lib_num.improvement.bam \
-o Sample_name_lib_num.HaplotypeCaller.vcf \
-ERC GVCF
###################
#(16) Run GenotypeGVCFs in GVCF mode.
###################
java -jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R reference.fa \
-V Sample_name_lib_num.HaplotypeCaller.vcf \
-o Sample_name_lib_num.HaplotypeCaller.gg.vcf
###################
# (17) Call variants with Samtools
###################
samtools mpileup -C 50 -uf reference.fa Sample_name_lib_num.improvement.bam | \
bcftools view -bvcg - > Sample_name_lib_num.samtools_var.bcf
bcftools view Sample_name_lib_num.samtools_var.bcf | vcfutils.pl varFilter -D 100 > Sample_name_lib_num.samtools_var.vcf
###################
# (18) Merge two separate callsets
###################
java -jar GenomeAnalysisTK.jar \
-T CombineVariants \
-R reference.fa \
--variant input1.vcf \
--variant input2.vcf \
-o output.vcf \
-genotypeMergeOptions UNIQUIFY
Did you look at
snakemake
? Its easy to understand and build pipelines. Here is the tutorial.Build bioinformatics pipelines with Snakemake
1) If they are biological replicates you can merge the data (either the fastq or the bam). As an alternative you can call SNPs independently in the replicates and then build a kind of "consensus of the SNPs" (i.e. call a SNP only if it's present in at least 2 replicates out of 3, or something similar)
2) No, you can skip this step
3) I don't know (i.e. we would need to talk for several days to decide the most important parameters)