How to improve the pipeline of call snp in RNA-seq?
1
2
Entering edit mode
8.5 years ago
liuhui ▴ 20

Hi,

Recently, I have tried to construct a pipeline (as the following lines show) to call snp after reading the workflow of GATK and samtools for a species do not have a reference dbsnp,.

And there are several problem confusing me:

(1) If I have more than one sample with repetition (i.e. sample_1_1, sample_1_2 for sample_1 and sample_2_1, sample_2_2 for sample_2), what should I do?

(2) Because of the absence of reference dbsnp, It's necessary to take the step "quality score recalibration" ?

(3) Which parametrics in the following step are very important?

########################################
############### Mapping ################
########################################

################
#(1) build index
################
bwa index -a bwtsw -p <reference> <reference.fa>

################
#(2) Sort reference
################
samtools faidx reference.fasta

################
#(3) Create sequence dictionary
################
java -jar CreateSequenceDictionary.jar REFERENCE=reference.fa OUTPUT=reference.dict

################
#(4) Align samples to reference genome with bwa mem
################
bwa mem -M -t 12 -R '@RG\tID:uniq_ID\tSM:Sample_name\tLB:Sample_name_lib_num\tPL:ILLUMINA' \
reference left.fq right.fq | \
samtools view -bS - > Sample_name_lib_num.bam

################
#(5) Fix Mate Information
################
samtools fixmate -O bam Sample_name_lib_num.bam Sample_name_lib_num.fixmate.bam

################
#(6) Sort BAM file
################
samtools sort -T /tmp/Sample_name_lib_num.sorted -o Sample_name_lib_num.sorted.bam Sample_name_lib_num.fixmate.bam

########################################
############# Improvement ##############
########################################

###############
#(7)Splits reads with Ns in CIGAR
###############
java –jar GenomeAnalysisTK.jar \
–T SplitNCigarReads \
–R reference.fa \
–I Sample_name_lib_num.sorted.bam \
–o Sample_name_lib_num.sorted.SplitNCigar.bam \
–U ALLOW_N_CIGAR_READS \
–rf ReassignOneMappingQuality \
–RMQF 255 \
–RMQT 60

################
#(8)Identify target regions for realignment
################
java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
-R reference.fa \
-I Sample_name_lib_num.sorted.SplitNCigar.bam  \
-ip 50 \
-o Sample_name_lib_num.sorted.SplitNCigar.intervals

################
#(9) Perform realignment
################
java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner \
-R reference.fa \
-I Sample_name_lib_num.sorted.SplitNCigar.bam  \
-targetIntervals Sample_name_lib_num.sorted.SplitNCigar.intervals \
-output Sample_name_lib_num.realigned.bam

################
#(10) fixes mate information that may be changed during the realignment process
################
java -Djava.io.tmpdir=/tmp/ \
-jar picard.jar FixMateInformation \
INPUT=Sample_name_lib_num.realigned.bam \
OUTPUT=Sample_name_lib_num.realigned.fixed.bam \
SO=coordinate \
VALIDATION_STRINGENCY=LENIENT \
CREATE_INDEX=true

################
#(11) Build the recalibration model
################
java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
-R reference.fa \
-I Sample_name_lib_num.realigned.fixed.bam \
-ip 50 \
-o Sample_name_lib_num_recalibration.table 

################
#(12) Apply the recalibration model
################
java -Xmx4g -jar GenomeAnalysisTK.jar -T PrintReads \
-R reference.fa \
-I Sample_name_lib_num.realigned.bam \
-BQSR Sample_name_lib_num_recalibration.table \
-o Sample_name_lib_num.realigned.recal.bam

################
#(13) Mark Duplicates
################
java -jar $PICARD MarkDuplicates \
I=Sample_name_lib_num.realigned.recal.bam \
O=Sample_name_lib_num.improvement.bam \
M=Sample_name_lib_num_dedup_metrics.txt \
CREATE_INDEX=true

################
#(14) index BAM using samtools
################
samtools index Sample_name_lib_num.improvement.bam

########################################
########## Variant Calling #############
########################################

###################
## (15) Call variants using HaplotypeCaller in GVCF mode. 
###################

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fa \
-I Sample_name_lib_num.improvement.bam \
-o Sample_name_lib_num.HaplotypeCaller.vcf \
-ERC GVCF
###################
#(16) Run GenotypeGVCFs in GVCF mode. 
###################
java -jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R reference.fa \
-V Sample_name_lib_num.HaplotypeCaller.vcf \
-o Sample_name_lib_num.HaplotypeCaller.gg.vcf

###################
# (17) Call variants with Samtools
###################
samtools mpileup -C 50 -uf reference.fa Sample_name_lib_num.improvement.bam | \
bcftools view -bvcg - > Sample_name_lib_num.samtools_var.bcf
bcftools view Sample_name_lib_num.samtools_var.bcf | vcfutils.pl varFilter -D 100 > Sample_name_lib_num.samtools_var.vcf


###################
# (18) Merge two separate callsets
###################
java -jar GenomeAnalysisTK.jar \
-T CombineVariants \
-R reference.fa \
--variant input1.vcf \
--variant input2.vcf \
-o output.vcf \
-genotypeMergeOptions UNIQUIFY
SNP samtools GATK RNA-Seq • 4.3k views
ADD COMMENT
1
Entering edit mode

Did you look at snakemake ? Its easy to understand and build pipelines. Here is the tutorial.

Build bioinformatics pipelines with Snakemake

ADD REPLY
0
Entering edit mode

1) If they are biological replicates you can merge the data (either the fastq or the bam). As an alternative you can call SNPs independently in the replicates and then build a kind of "consensus of the SNPs" (i.e. call a SNP only if it's present in at least 2 replicates out of 3, or something similar)

2) No, you can skip this step

3) I don't know (i.e. we would need to talk for several days to decide the most important parameters)

ADD REPLY
2
Entering edit mode
8.4 years ago
A. Domingues ★ 2.7k

This comes a bit late but hopefully still in time to help you.

First of all the recommended mapping strategy by GATK for RNAseq variant calling is a 2-step mapping with STAR. bwa is recommended for DNAseq.

Now for your questions:

  1. Biological replicates can be merged. I prefer to merge the fastqc's.

  2. I believe that is recommend so make sure there really ins't a reference. For zebrafish I found one rsync -av rsync://ftp.ncbi.nih.gov/snp/organisms/zebrafish_7955/VCF/, so maybe there one somewhere for your species.

  3. Unless you have a good reason, and know what you are doing, use the defaults recommended by GATK. Read carefully the best practices for RNAseq because there might be differences from the usual DNAseq pipeline.

ADD COMMENT

Login before adding your answer.

Traffic: 1685 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6