Question

How To Interpret Funky Vcf Output?

0

Entering edit mode

12.8 years ago

Dan D 7.4k

A customer with twelve RNASEQ mouse samples sequenced on an Illumina HiSeq 2000 wanted to compare the expression numbers for a single gene. This is outside my realm of expertise at the moment, but after some research I came up with a strategy. I realize the following commands aren't the most efficient way to do things, but I erred on the side of caution:

I downloaded a reference transcriptome from UCSC for mm9 and indexed it with bwa and samtools.

I mapped each sample to the reference transcriptome using BWA:

bwa aln -t 10 [path to transcriptome fasta] [fastq.gz filepath]

I converted the sai to BAM:

bwa samse [path to transcriptome fasta] [path to sai file] [path to fastq] | samtools view -Shub -o [bam file name] -

I sorted the BAM with samtools:

samtools sort [path to bam file] [sorted bam prefix]

I generated a VCF:

samtools mpileup -f [path to transcriptome fasta] [path to sorted bam] -u | bcftools view - > [vcf output path]

When I open the VCF, the header looks like this:

    ##fileformat=VCFv4.1
    ##samtoolsVersion=0.1.18 (r982:295)
CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  [path to sorted bam]

And here are a few sample lines:

gi|126352347|ref|NM_028260.2|   108     .       G       X       0       .       DP=27;I16=13,14,0,0,1040,40426,0,0,520,10400,0,0,314,4688,0,0   PL      0,81,199

gi|126352347|ref|NM_028260.2|   109     .       G       X       0       .       DP=28;I16=13,15,0,0,1086,42418,0,0,540,10800,0,0,323,4795,0,0   PL      0,84,201

gi|126352347|ref|NM_028260.2|   110     .       A       X       0       .       DP=29;I16=14,15,0,0,1110,42864,0,0,560,11200,0,0,333,4957,0,0   PL      0,87,201

I'm having trouble correlating this with the official VCF 4.1 specification. Does anyone have any perspective on what I'm looking at here, or did I do something wrong? Specifically, column 1, column 5, the I16 comma-separated list in column 8, and the values in the last two columns are what I'm having trouble understanding, based on the samtools and VCF documentation.

vcf samtools mpileup bcftools rna-seq • 4.9k views

ADD COMMENT • link updated 12.1 years ago by Biostar 20 • written 12.8 years ago by Dan D 7.4k

score 3 · Answer 1 · 2012-10-16

3

Entering edit mode

12.8 years ago

JC 13k

First, because you are mapping your reads to the reference genome with BWA, you are missing all spliced reads, with long reads (>50b) this is really important. Second, after mapping your reads you don't need to generate a VCF file to infer the expression level, the VCF is for variants not expression.

Please, for RNAseq data consider to change your strategy, there are a lot of literature about how to analyze that, here in BioStar exists a section with Reviews, but in short I can suggest a TopHat + Cufflinks analysis that you can run on your machines or even in Galaxy.

ADD COMMENT • link 12.8 years ago by JC 13k

0

Entering edit mode

JC is absolutely correct. Also, you may want to align to genomic sequence if you are variant calling.

ADD REPLY • link 12.8 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Thanks for the comments. One of the primary sources I drew from was this:

http://bsc2010.bioinformatics.ucdavis.edu/handson/rna_seq_statistics_workshop.html

But what you are saying definitely makes sense. I will try out some other approaches. Thanks!

ADD REPLY • link 12.8 years ago by Dan D 7.4k

1

Entering edit mode

Please forget my first point, I missed that your are mapping to the reference transcriptome, that's correct. After that you need to count how many reads mapped per transcripts and save that as matrix that can be used in R:Bioconductor edgeR or DESeq to normalize and identify DEGs. One tricky thing is the reads that map to more than one transcript, BWA by default select one random hit but TopHat will report more than one hit if they are in different locations in the genome, after that Cufflinks will resolve the proportions.

ADD REPLY • link 12.8 years ago by JC 13k

0

Entering edit mode

Thanks so much for taking the time to follow up!

ADD REPLY • link 12.8 years ago by Dan D 7.4k

score 1 · Answer 2 · 2012-10-16

1

Entering edit mode

12.8 years ago

Zev.Kronenberg 12k

you should also have the column headers:

CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  bamname

I haven't seen those 'Xs' in ALT.

You might want to try running varfilter before investigating your VCF files.

samtools-0.1.18/bcftools/vcfutils.pl varFilter

ADD COMMENT • link 12.8 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Thanks! I didn't include the headers at first, but I just added them in after reading your post.

I'll check out varFilter. Thanks for the tip!

ADD REPLY • link 12.8 years ago by Dan D 7.4k