Question

Best Pipeline For Doing Variant Analysis On Rna-Seq Data?

4

12.3 years ago

adrianj.randall ▴ 40

Hello everyone,

I inherited some colorspace data of various samples that I am trying to make use of. I don't have access to lifescope/bioscope, and finding open-source tools that handle colorspace data well seems like a challenge. What I have are basically *.bam files of mapped/unmapped reads as well as *.csfasta and *.quals files. What I am trying to do is perform variant analysis to see if I can find SNPs/indels in my data, both across samples as well as in comparison to the reference genome.

I am thinking about using the GATK pipeline, but I wanted to ask if there is anything 'better' for doing what I'd like to do. The bam files were generated using the hg18 reference genome, and common aligners like bwa don't seem to support colorspace anymore. From what I understand, due to differences in how errors are handled, converting the csfasta/quals files to fastq files isn't recommended, although it seems there are some people who analyze colorspace data this way.

Can anyone here recommend a pipeline for me to basically take my RNA-seq data and either 1) re-align using a newer reference genome or 2) use the existing *.bam files to perform variant analysis to find sequence differences?

Thanks, and all the best.

rna-seq variant analysis pipeline • 7.4k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 12.3 years ago by adrianj.randall ▴ 40

1

Keep in mind that most aligners being discussed here are not splice-aware. Using a non-splice-aware aligner is likely to make your job of finding variants in RNA-seq even harder (and it is a hard problem to begin with).

ADD REPLY • link 12.3 years ago by Sean Davis 27k

score 3 · Answer 1 · 2013-01-23

From what I understand, due to differences in how errors are handled, converting the csfasta/quals files to fastq files isn't recommended, although it seems there are some people who analyze colorspace data this way.

Yeah, don't do it this way.
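To illustrate why naive conversion is risky, here is a minimal sketch of colorspace decoding, assuming the usual SOLiD two-base encoding (A=0, C=1, G=2, T=3, with each color the XOR of adjacent bases). The function name `decode_colors` is made up for this example and is not part of any tool mentioned in this thread. Because every base depends on all the colors before it, a single color miscall changes every base downstream of it.

```python
BASES = "ACGT"  # index = 2-bit code: A=0, C=1, G=2, T=3


def decode_colors(read):
    """Translate a colorspace read (primer base + color digits) to bases.

    Hypothetical helper for illustration only.
    """
    prev = BASES.index(read[0])  # primer base, not part of the insert itself
    out = []
    for i, color in enumerate(read[1:]):
        if color not in "0123":
            # A missing color call ('.') makes the rest of the read undecodable.
            out.append("N" * (len(read) - 1 - i))
            break
        prev ^= int(color)  # next base = previous base XOR color
        out.append(BASES[prev])
    return "".join(out)


correct = decode_colors("T01203")
one_color_error = decode_colors("T02203")  # second color miscalled as 2
print(correct, one_color_error)
```

A colorspace-aware aligner can treat that miscall as one color mismatch, whereas an aligner given the translated bases sees a run of base mismatches.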

Can anyone here recommend a pipeline for me to basically take my RNA-seq data and either 1) re-align using a newer reference genome

If I'm not mistaken, colorspace alignment was dropped in bwa 0.6 and later, so you should be able to use the last 0.5.x release (0.5.10) to align your *.csfasta and *.qual files to hg19.

I'm pretty sure that SHRiMP also aligns colorspace data, so you can try that -- it has had a more recent release than the bwa-0.5.x branch.

score 3 · Answer 2 · 2013-01-23

Hey

2nd question) If you plan to work with the existing BAM files, you are better off using GATK. Make sure you do the realignment around indels, then call SNPs and indels with the Unified Genotyper. I think most aligners that can handle SOLiD reads output a BAM file that contains both the nucleotide read (color space is translated to nucleotides during alignment) and the colorspace-coded read. The colorspace-coded read is kept in the BAM file so that downstream SOLiD-specific tools can use it when calling SNPs and indels. I am not aware of any current open-source variant caller that uses color information when calling SNPs and indels. I assume GATK, Samtools and other tools use the nucleotide reads (colors translated to nucleotides) in the BAM file to call SNPs and indels, and that works fairly well.
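As a rough sketch of the steps just described, the helper below only builds the command lines for the GATK 2.x/3.x tools (RealignerTargetCreator, IndelRealigner, UnifiedGenotyper); it does not run them. The function name, the file names and the jar path are placeholders, and these walkers are not available in GATK 4, so check the options against the documentation of the version you actually have. It also assumes the BAM is coordinate-sorted, indexed and carries read groups.

```python
def gatk_commands(ref, bam, prefix, gatk_jar="GenomeAnalysisTK.jar"):
    """Build (not run) GATK 2.x/3.x commands for indel realignment + calling.

    All paths are hypothetical placeholders supplied by the caller.
    """
    base = ["java", "-jar", gatk_jar]
    intervals = prefix + ".intervals"
    realigned = prefix + ".realigned.bam"
    vcf = prefix + ".vcf"
    return [
        # 1. Find regions that look like they need local realignment.
        base + ["-T", "RealignerTargetCreator", "-R", ref, "-I", bam,
                "-o", intervals],
        # 2. Realign reads around indels in those regions.
        base + ["-T", "IndelRealigner", "-R", ref, "-I", bam,
                "-targetIntervals", intervals, "-o", realigned],
        # 3. Call SNPs and indels together (-glm BOTH).
        base + ["-T", "UnifiedGenotyper", "-R", ref, "-I", realigned,
                "-glm", "BOTH", "-o", vcf],
    ]


for cmd in gatk_commands("hg19.fa", "sample.bam", "sample"):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.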

1st question) You can use SHRiMP2 to align your SOLiD data. I think it is the best aligner available for SOLiD reads right now. You will have to convert the colorspace reads to csfastq using a script that comes with the BFAST software; let me know and I can get it for you. This csfastq is different from a regular fastq, as it contains the color-coded read rather than the colors translated to nucleotides. SHRiMP2 expects its input in this format. Then you can use GATK to call variants. The mapping quality Phred scores produced by SHRiMP are different from those produced by BWA, so you may have to tweak your BAM file when using it with GATK. I think GATK has some option to deal with such issues.
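For orientation, here is a minimal sketch of what such a csfasta + quals to csfastq conversion might look like. This is not the BFAST script. It assumes one line per read in each input, reads in the same order in both files, the primer base kept in the sequence line, one quality value per color, and Phred+33 encoding with negative values clamped to 0. Check those conventions against what your aligner actually expects.

```python
def _records(lines):
    """Yield (name, payload) pairs, skipping '#' comment lines and blanks."""
    name = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            name = line[1:]
        elif name is not None:
            yield name, line
            name = None


def to_csfastq(csfasta_lines, qual_lines):
    """Merge csfasta and qual records into csfastq text (colors are kept)."""
    out = []
    for (name, colors), (qname, quals) in zip(_records(csfasta_lines),
                                              _records(qual_lines)):
        if name != qname:
            raise ValueError(f"read order differs: {name} vs {qname}")
        # SOLiD uses -1 for missing calls; clamp into the Phred+33 range.
        scores = [min(max(int(q), 0), 93) for q in quals.split()]
        if len(scores) != len(colors) - 1:  # primer base has no quality
            raise ValueError(f"length mismatch for read {name}")
        qual_str = "".join(chr(s + 33) for s in scores)
        out.append(f"@{name}\n{colors}\n+\n{qual_str}\n")
    return "".join(out)


print(to_csfastq([">r1", "T0120"], [">r1", "20 30 -1 40"]), end="")
```

For real data, pass open file handles instead of lists, e.g. `to_csfastq(open("reads.csfasta"), open("reads.quals"))` with your own file names.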

Let me know if you need any other help. Using RNA-seq data for variant calling may give some weird results.

Thanks

Ram · Answer 3 · 2014-07-15

3

10.8 years ago

Ting-You Wang ▴ 80

Currently, I think you can use the STAR and GATK pipeline for RNA-seq variant calling.

http://gatkforums.broadinstitute.org/discussion/3891/calling-variants-in-rnaseq

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Ting-You Wang ▴ 80

Ram · Answer 4 · 2013-01-24

1

12.3 years ago

decodenomics ▴ 10

For RNA-seq data, you need to worry about SNPs around exon-intron junctions, where an apparent SNP may actually just be a mismatch.
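One simple way to act on this is to flag candidate SNPs that fall within a few bases of an annotated exon boundary and review them separately. The sketch below is only an illustration: the function name, the 3 bp window and the flat list of boundary coordinates are all assumptions, and in practice the boundaries would come from a gene annotation for the same chromosome and reference build.

```python
from bisect import bisect_left


def near_junction(pos, junctions, window=3):
    """Return True if pos lies within `window` bases of any exon boundary.

    `junctions` must be a sorted list of boundary coordinates on the same
    chromosome as `pos`. Hypothetical helper; window size is arbitrary.
    """
    # First boundary at or after the left edge of the window.
    i = bisect_left(junctions, pos - window)
    return i < len(junctions) and junctions[i] <= pos + window


boundaries = [1000, 2000]
for snp_pos in (1002, 1500, 2003, 2004):
    print(snp_pos, near_junction(snp_pos, boundaries))
```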

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 12.3 years ago by decodenomics ▴ 10

Ram · Answer 5 · 2013-01-23

0

12.3 years ago

Sean Davis 27k

I think that novoaligncs can handle colorspace reads and can perform splice-aware alignments. There is a license cost, but the last I checked it was VERY reasonable.

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 12.3 years ago by Sean Davis 27k