How To Interpret Funky Vcf Output?
2
0
Entering edit mode
12.1 years ago
Dan D 7.4k

A customer with twelve RNASEQ mouse samples sequenced on an Illumina HiSeq 2000 wanted to compare the expression numbers for a single gene. This is outside my realm of expertise at the moment, but after some research I came up with a strategy. I realize the following commands aren't the most efficient way to do things, but I erred on the side of caution:

I downloaded a reference transcriptome from UCSC for mm9 and indexed it with bwa and samtools.

I mapped each sample to the reference transcriptome using BWA:

bwa aln -t 10 [path to transcriptome fasta] [fastq.gz filepath]

I converted the sai to BAM:

bwa samse [path to transcriptome fasta] [path to sai file] [path to fastq] | samtools view -Shub -o [bam file name] -

I sorted the BAM with samtools:

samtools sort [path to bam file] [sorted bam prefix]

I generated a VCF:

samtools mpileup -f [path to transcriptome fasta] [path to sorted bam] -u | bcftools view - > [vcf output path]

When I open the VCF, the header looks like this:

    ##fileformat=VCFv4.1
    ##samtoolsVersion=0.1.18 (r982:295)
CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  [path to sorted bam]

And here are a few sample lines:

gi|126352347|ref|NM_028260.2|   108     .       G       X       0       .       DP=27;I16=13,14,0,0,1040,40426,0,0,520,10400,0,0,314,4688,0,0   PL      0,81,199

gi|126352347|ref|NM_028260.2|   109     .       G       X       0       .       DP=28;I16=13,15,0,0,1086,42418,0,0,540,10800,0,0,323,4795,0,0   PL      0,84,201

gi|126352347|ref|NM_028260.2|   110     .       A       X       0       .       DP=29;I16=14,15,0,0,1110,42864,0,0,560,11200,0,0,333,4957,0,0   PL      0,87,201

I'm having trouble correlating this with the official VCF 4.1 specification. Does anyone have any perspective on what I'm looking at here, or did I do something wrong? Specifically, column 1, column 5, the I16 comma-separated list in column 8, and the values in the last two columns are what I'm having trouble understanding, based on the samtools and VCF documentation.

vcf samtools mpileup bcftools rna-seq • 4.5k views
ADD COMMENT
3
Entering edit mode
12.1 years ago
JC 13k

First, because you are mapping your reads to the reference genome with BWA, you are missing all spliced reads, with long reads (>50b) this is really important. Second, after mapping your reads you don't need to generate a VCF file to infer the expression level, the VCF is for variants not expression.

Please, for RNAseq data consider to change your strategy, there are a lot of literature about how to analyze that, here in BioStar exists a section with Reviews, but in short I can suggest a TopHat + Cufflinks analysis that you can run on your machines or even in Galaxy.

ADD COMMENT
0
Entering edit mode

JC is absolutely correct. Also, you may want to align to genomic sequence if you are variant calling.

ADD REPLY
0
Entering edit mode

Thanks for the comments. One of the primary sources I drew from was this:

http://bsc2010.bioinformatics.ucdavis.edu/handson/rna_seq_statistics_workshop.html

But what you are saying definitely makes sense. I will try out some other approaches. Thanks!

ADD REPLY
1
Entering edit mode

Please forget my first point, I missed that your are mapping to the reference transcriptome, that's correct. After that you need to count how many reads mapped per transcripts and save that as matrix that can be used in R:Bioconductor edgeR or DESeq to normalize and identify DEGs. One tricky thing is the reads that map to more than one transcript, BWA by default select one random hit but TopHat will report more than one hit if they are in different locations in the genome, after that Cufflinks will resolve the proportions.

ADD REPLY
0
Entering edit mode

Thanks so much for taking the time to follow up!

ADD REPLY
1
Entering edit mode
12.1 years ago

you should also have the column headers:

CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  bamname

I haven't seen those 'Xs' in ALT.

You might want to try running varfilter before investigating your VCF files.

samtools-0.1.18/bcftools/vcfutils.pl varFilter
ADD COMMENT
0
Entering edit mode

Thanks! I didn't include the headers at first, but I just added them in after reading your post.

I'll check out varFilter. Thanks for the tip!

ADD REPLY

Login before adding your answer.

Traffic: 2014 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6