Question

How to improve the pipeline of call snp in RNA-seq?

2

Entering edit mode

8.5 years ago

liuhui ▴ 20

Hi,

Recently, I have tried to construct a pipeline (as the following lines show) to call snp after reading the workflow of GATK and samtools for a species do not have a reference dbsnp,.

And there are several problem confusing me:

(1) If I have more than one sample with repetition (i.e. sample_1_1, sample_1_2 for sample_1 and sample_2_1, sample_2_2 for sample_2), what should I do?

(2) Because of the absence of reference dbsnp, It's necessary to take the step "quality score recalibration" ?

(3) Which parametrics in the following step are very important?

########################################
############### Mapping ################
########################################

################
#(1) build index
################
bwa index -a bwtsw -p <reference> <reference.fa>

################
#(2) Sort reference
################
samtools faidx reference.fasta

################
#(3) Create sequence dictionary
################
java -jar CreateSequenceDictionary.jar REFERENCE=reference.fa OUTPUT=reference.dict

################
#(4) Align samples to reference genome with bwa mem
################
bwa mem -M -t 12 -R '@RG\tID:uniq_ID\tSM:Sample_name\tLB:Sample_name_lib_num\tPL:ILLUMINA' \
reference left.fq right.fq | \
samtools view -bS - > Sample_name_lib_num.bam

################
#(5) Fix Mate Information
################
samtools fixmate -O bam Sample_name_lib_num.bam Sample_name_lib_num.fixmate.bam

################
#(6) Sort BAM file
################
samtools sort -T /tmp/Sample_name_lib_num.sorted -o Sample_name_lib_num.sorted.bam Sample_name_lib_num.fixmate.bam

########################################
############# Improvement ##############
########################################

###############
#(7)Splits reads with Ns in CIGAR
###############
java –jar GenomeAnalysisTK.jar \
–T SplitNCigarReads \
–R reference.fa \
–I Sample_name_lib_num.sorted.bam \
–o Sample_name_lib_num.sorted.SplitNCigar.bam \
–U ALLOW_N_CIGAR_READS \
–rf ReassignOneMappingQuality \
–RMQF 255 \
–RMQT 60

################
#(8)Identify target regions for realignment
################
java -Xmx2g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
-R reference.fa \
-I Sample_name_lib_num.sorted.SplitNCigar.bam  \
-ip 50 \
-o Sample_name_lib_num.sorted.SplitNCigar.intervals

################
#(9) Perform realignment
################
java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner \
-R reference.fa \
-I Sample_name_lib_num.sorted.SplitNCigar.bam  \
-targetIntervals Sample_name_lib_num.sorted.SplitNCigar.intervals \
-output Sample_name_lib_num.realigned.bam

################
#(10) fixes mate information that may be changed during the realignment process
################
java -Djava.io.tmpdir=/tmp/ \
-jar picard.jar FixMateInformation \
INPUT=Sample_name_lib_num.realigned.bam \
OUTPUT=Sample_name_lib_num.realigned.fixed.bam \
SO=coordinate \
VALIDATION_STRINGENCY=LENIENT \
CREATE_INDEX=true

################
#(11) Build the recalibration model
################
java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
-R reference.fa \
-I Sample_name_lib_num.realigned.fixed.bam \
-ip 50 \
-o Sample_name_lib_num_recalibration.table 

################
#(12) Apply the recalibration model
################
java -Xmx4g -jar GenomeAnalysisTK.jar -T PrintReads \
-R reference.fa \
-I Sample_name_lib_num.realigned.bam \
-BQSR Sample_name_lib_num_recalibration.table \
-o Sample_name_lib_num.realigned.recal.bam

################
#(13) Mark Duplicates
################
java -jar $PICARD MarkDuplicates \
I=Sample_name_lib_num.realigned.recal.bam \
O=Sample_name_lib_num.improvement.bam \
M=Sample_name_lib_num_dedup_metrics.txt \
CREATE_INDEX=true

################
#(14) index BAM using samtools
################
samtools index Sample_name_lib_num.improvement.bam

########################################
########## Variant Calling #############
########################################

###################
## (15) Call variants using HaplotypeCaller in GVCF mode. 
###################

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fa \
-I Sample_name_lib_num.improvement.bam \
-o Sample_name_lib_num.HaplotypeCaller.vcf \
-ERC GVCF
###################
#(16) Run GenotypeGVCFs in GVCF mode. 
###################
java -jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R reference.fa \
-V Sample_name_lib_num.HaplotypeCaller.vcf \
-o Sample_name_lib_num.HaplotypeCaller.gg.vcf

###################
# (17) Call variants with Samtools
###################
samtools mpileup -C 50 -uf reference.fa Sample_name_lib_num.improvement.bam | \
bcftools view -bvcg - > Sample_name_lib_num.samtools_var.bcf
bcftools view Sample_name_lib_num.samtools_var.bcf | vcfutils.pl varFilter -D 100 > Sample_name_lib_num.samtools_var.vcf


###################
# (18) Merge two separate callsets
###################
java -jar GenomeAnalysisTK.jar \
-T CombineVariants \
-R reference.fa \
--variant input1.vcf \
--variant input2.vcf \
-o output.vcf \
-genotypeMergeOptions UNIQUIFY

SNP samtools GATK RNA-Seq • 4.3k views

ADD COMMENT • link updated 8.4 years ago by A. Domingues ★ 2.7k • written 8.5 years ago by liuhui ▴ 20

1

Entering edit mode

Did you look at snakemake ? Its easy to understand and build pipelines. Here is the tutorial.

Build bioinformatics pipelines with Snakemake

ADD REPLY • link 8.5 years ago by venu 7.1k

0

Entering edit mode

1) If they are biological replicates you can merge the data (either the fastq or the bam). As an alternative you can call SNPs independently in the replicates and then build a kind of "consensus of the SNPs" (i.e. call a SNP only if it's present in at least 2 replicates out of 3, or something similar)

2) No, you can skip this step

3) I don't know (i.e. we would need to talk for several days to decide the most important parameters)

ADD REPLY • link 8.5 years ago by Fabio Marroni ★ 3.0k

score 2 · Answer 1 · 2016-06-30

This comes a bit late but hopefully still in time to help you.

First of all the recommended mapping strategy by GATK for RNAseq variant calling is a 2-step mapping with STAR. bwa is recommended for DNAseq.

Now for your questions:

Biological replicates can be merged. I prefer to merge the fastqc's.
I believe that is recommend so make sure there really ins't a reference. For zebrafish I found one rsync -av rsync://ftp.ncbi.nih.gov/snp/organisms/zebrafish_7955/VCF/, so maybe there one somewhere for your species.
Unless you have a good reason, and know what you are doing, use the defaults recommended by GATK. Read carefully the best practices for RNAseq because there might be differences from the usual DNAseq pipeline.