A customer with twelve RNASEQ mouse samples sequenced on an Illumina HiSeq 2000 wanted to compare the expression numbers for a single gene. This is outside my realm of expertise at the moment, but after some research I came up with a strategy. I realize the following commands aren't the most efficient way to do things, but I erred on the side of caution:
I downloaded a reference transcriptome from UCSC for mm9 and indexed it with bwa and samtools.
I mapped each sample to the reference transcriptome using BWA:
bwa aln -t 10 [path to transcriptome fasta] [fastq.gz filepath]
I converted the sai to BAM:
bwa samse [path to transcriptome fasta] [path to sai file] [path to fastq] | samtools view -Shub -o [bam file name] -
I sorted the BAM with samtools:
samtools sort [path to bam file] [sorted bam prefix]
I generated a VCF:
samtools mpileup -f [path to transcriptome fasta] [path to sorted bam] -u | bcftools view - > [vcf output path]
When I open the VCF, the header looks like this:
##fileformat=VCFv4.1
##samtoolsVersion=0.1.18 (r982:295)
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT [path to sorted bam]
And here are a few sample lines:
gi|126352347|ref|NM_028260.2| 108 . G X 0 . DP=27;I16=13,14,0,0,1040,40426,0,0,520,10400,0,0,314,4688,0,0 PL 0,81,199
gi|126352347|ref|NM_028260.2| 109 . G X 0 . DP=28;I16=13,15,0,0,1086,42418,0,0,540,10800,0,0,323,4795,0,0 PL 0,84,201
gi|126352347|ref|NM_028260.2| 110 . A X 0 . DP=29;I16=14,15,0,0,1110,42864,0,0,560,11200,0,0,333,4957,0,0 PL 0,87,201
I'm having trouble correlating this with the official VCF 4.1 specification. Does anyone have any perspective on what I'm looking at here, or did I do something wrong? Specifically, column 1, column 5, the I16 comma-separated list in column 8, and the values in the last two columns are what I'm having trouble understanding, based on the samtools and VCF documentation.
JC is absolutely correct. Also, you may want to align to genomic sequence if you are variant calling.
Thanks for the comments. One of the primary sources I drew from was this:
http://bsc2010.bioinformatics.ucdavis.edu/handson/rna_seq_statistics_workshop.html
But what you are saying definitely makes sense. I will try out some other approaches. Thanks!
Please forget my first point, I missed that your are mapping to the reference transcriptome, that's correct. After that you need to count how many reads mapped per transcripts and save that as matrix that can be used in R:Bioconductor edgeR or DESeq to normalize and identify DEGs. One tricky thing is the reads that map to more than one transcript, BWA by default select one random hit but TopHat will report more than one hit if they are in different locations in the genome, after that Cufflinks will resolve the proportions.
Thanks so much for taking the time to follow up!