I have a phased genome, and I am trying to generate a counts matrix that contains the number of reads that map to both alleles. For example, something like this.
Wild1 Wild2 Wild3 MT1 MT2 MT3
C1_000001 0 0 0 0 0 0
C1_000002 10 9 8 1 1 1
C1_000003 0 0 0 10 12 10
C1_000004 6 5 7 0 0 0
Right now I have used bowtie, tophat, and have an accepted_hits file, aln.bam.
I have used samtools to sort the aln.bam but am having trouble with the next step. Based on what I have read so far, I think my next step is to generate a consensus sequence?
samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq
I want to make sure I am understanding this step, and subsequent steps.
I don't understand this portion of the above line of code, and am having trouble finding documents on it :
vcfutils.pl vcf2fq > cns.fq
- I have read documentation saying that we can use some of these functions to generate a consensus across samples, or alleles. I want to make sure I am doing it by alleles.
- It looks like this will generate a consensus fastq file. Not sure what the next steps would be.
- Any other suggestions besides "Google Allele Specific Pipelines" would be helpful. Perhaps a link of one in particular that you would recommend.
Update: I also don't need the whole matrix. Figuring out a way just to do it for one gene would be sufficient.
Thanks in advance!
Point 4 is quite demanding. People help whichever way they can.
I'm not trying to be rude, I just need something a little more than that at this point.