Hi everyone,
I would like to extract sequence of reads from a bam file in such a way that indels are also incoproored based on information obtained from flags/penalty information. Basically I want something like pairwise sequence alignment of all my reads ..1,2,3...N against my refence sequence of interest . (Just imagine each row is a read from Sam/bam file obtained after alignment of data) and showing wherever a deletion was present as '-'. Insertion as read adjusted according to reference.
Do you know any Unix or R based tools?
Thanks in advance
samtools tview
?This is what you probably need: https://github.com/mlafave/sam2pairwise
Thank you for the response. I tried sam2pairwise. But I am not getting the required output.
May be I should explain it better.
Please see this image.
I want to focus only on the output. So if an insertion or deletion occur the reference + dashes (adjusted to insertion in reads) will acts as a superset and reads will be the variables of different combinations (the biological sequence). So the dimension of the matrix will depends on reference +dashes (due to insertion in reads). Lets says 100 reads (rows) X 100 bases (columns) (adjusted by the dashes). I want to take out only the reads not the reference. To add on here I want to give an example of samtools fasta function, which converts a BAM to a FASTA. The issue is, it is not giving me the indels information (DASHES) and this basically shift the alignment of base position and also changes the dimension of matrix.
I hope my query is clear to you. If I can get only the sequence of reads as in image attached or even the fasta (same dimension), that will do the job.
Please suggest if you know some way to do that.
such a tool does not exist - or shall I say is not a mainstream and common need.
you could write one yourself, and the way it would work is that you read and parse the CIGAR string with a library such as simplesam (if you use Python) and then apply the transformations to both the reads and reference.
I will say that the problem could end up being more difficult in the sense that I am not sure you would want to add all insertions into the reference, only the ones that are valid insertions supported by the majority of the reads. Otherwise, you risk creating a wonky reference with lots of spurious inserts.
Basically you would want some sort of consensus to be built as well.
Is that requirement not being satisfied by
sam2pairwise
?sam2pairwise will do the alignments pairwise, and represents that single alignment
it will not insert bases into the reference based on another overlapping read, that is what the OP seems to want