Hi all. This is a very over simplified example but hopefully someone will be able to help.
I have a file amp.fa (e.g. >amp\nGAACATAAGCC
) which I've run bwa index and samtools faidx on.
I've then aligned all my fastq read files (${f}) to this index (${a}):
bwa mem ${a} ${f}_L001_R1_001.fastq ${f}_L001_R2_001.fastq > ${f}.sam
samtools view -bS ${f}.sam > ${f}.bam
samtools sort ${f}.bam > ${f}.sorted.bam
samtools index ${f}.sorted.bam
samtools mpileup -f ${a} ${f}.sorted.bam > ${f}.mpileup
and basically all I want to do now is, with respect to my reference (GAACATAAGCC) calculate the percentages of different reads mapping the the middle section (CATAA) when that central T is either a T or a C. There are other changes and I want to see if they co occur with the T->C change and what % do.
From visual inspection in IGV I can see when reads are still wildtype (T) then there's a good chance there's a insertion just before the CATAA section. When the reads are mutant (C) the C and A of CATAA are often, but not always, also mutated.
Ideally I'd like to read a string aligned with regards to the read's cigar into R and work with it in there. If I could convert the sequence to a matrix of ncol=maximum read length and nrow=number of reads, where position 6 always means that base aligned to GAACATAAGCC then I could definitely figure out the rest myself.
I've been looking at the GenomicAlignments
package in R and hoped that stackStringsFromBam
or sequenceLayer
functions would help to do this, but I think I must've misunderstood the documentation. Does anyone know how to do this or have suggestions of how to do it in a different way? Perhaps MSA using BAM?
TLDR: I'm interested as to whether a particular SNP cooccurs or occurs independently of various other genomic changes (two further SNPs, two indels) for each read in my fastq files
Thank you Matt. I'm not massively comfortable with python but I'll take a look at simplesam
Regarding pileup, that would be absolutely perfect if the read results column were ordered by unique read. In that case I could split the string into a matrix with rows=unique positions and columns=unique reads. However since the read results entries all have different lengths and there's no character to mean "not covered" I don't think this can be the case. But please let me know if I'm wrong :)