Hi,
I am fairly new to bioinformatics and currently have issues with a task that seems like it should be fairly simple. The situation is the following:
I have a library of DNA molecules with a variable region surrounded by fixed sequences. What I want is to a) get statistics regarding the distribution of nucleotides in the variable sequence (e.g. calculate a sequence logo) b) possibly find families of sequences in the variable region via clustering.
The library was sequenced with Nanopore sequencing. Aligning the reads was straightforward and gave me a sam file. What I am stuck with is how to extract sequence information from the aligned reads for the specific part of the reference I am interested in.
What I would like to have is the sequences of all individual reads in the region of interest. To illustrate this:
If the reference and the reads are (region of interest is bold)
ref:
CAGGCTTATG**NNNNNNNNNN**TAGCAAGCAG
reads:
CAGGCTGATG**ACTCAAAC[del]A**TAGCAAG[del]AG
CAGUCTTATG**AATATATTTT**TAGCAAGCAG
CCGGCTTATG**GCCCGCG[ins]TTA**TAGCAA[ins]GCAG
then I would like to extract
ACTCAAAC[del]A
AATATATTTT
GCCCGCG[ins]TTA
I can use "samtools view" to get the reads that cover (part of) the region of interest. This is not helpful, as I only care about the subsequence of the reads that is aligned the region of interest. "samtools mpileup" comes fairly close, but I can only get statistics about individual nucleotides (or do I misunderstand how this works?), not entire read subsequences.
Is there a tool that does what I want?
Thanks
It would help if you could post few examples from your SAM. For the later part, ( I removed
[
and]
).:input:
output with sed:
with bash:
The reason I posted a toy example is that the reads are fairly long. If it helps, here's the first four lines from the sam file:
The reference would be
The problem with using either regular expressions or trimming is that nanopore data contains lots of inserts and deletions, so the length and composition of the 'fixed' region next to the variable region is extremely variable in the reads. I don't have any guarantee that the sequence next to the variable region is what I want it to be.
I guess I could write a script that uses the leftmost index in conjunction with the CIGAR string to calculate where my region of interest begins and ends for any given read and then extracts it - assuming there's no command line tool to do this kind of stuff.
Thanks for posting the data. '
From your description above, fixed region is as good as variable region. If there is a regular pattern that you can throw light on, it can be done. Given that sam file is text file, there are several command line tools that can handle regex in *nix.
May be you can show us how variable fixed region is compared to variable region. From there, some one here can provide solution. Btw, to extract only sequence one can use
awk '{print $6}' test.txt
from sam file (for SE). gordon.freeman