Extraction of read subsequences based on a region from a sam file
1
0
Entering edit mode
6.2 years ago

Hi,

I am fairly new to bioinformatics and currently have issues with a task that seems like it should be fairly simple. The situation is the following:

I have a library of DNA molecules with a variable region surrounded by fixed sequences. What I want is to a) get statistics regarding the distribution of nucleotides in the variable sequence (e.g. calculate a sequence logo) b) possibly find families of sequences in the variable region via clustering.

The library was sequenced with Nanopore sequencing. Aligning the reads was straightforward and gave me a sam file. What I am stuck with is how to extract sequence information from the aligned reads for the specific part of the reference I am interested in.

What I would like to have is the sequences of all individual reads in the region of interest. To illustrate this:

If the reference and the reads are (region of interest is bold)

ref:

CAGGCTTATG**NNNNNNNNNN**TAGCAAGCAG

reads:

CAGGCTGATG**ACTCAAAC[del]A**TAGCAAG[del]AG

CAGUCTTATG**AATATATTTT**TAGCAAGCAG

CCGGCTTATG**GCCCGCG[ins]TTA**TAGCAA[ins]GCAG

then I would like to extract

ACTCAAAC[del]A

AATATATTTT

GCCCGCG[ins]TTA

I can use "samtools view" to get the reads that cover (part of) the region of interest. This is not helpful, as I only care about the subsequence of the reads that is aligned the region of interest. "samtools mpileup" comes fairly close, but I can only get statistics about individual nucleotides (or do I misunderstand how this works?), not entire read subsequences.

Is there a tool that does what I want?

Thanks

alignment sequence • 1.7k views
ADD COMMENT
0
Entering edit mode

It would help if you could post few examples from your SAM. For the later part, ( I removed [ and ]).:

input:

CAGGCTGATGACTCAAACdelATAGCAAGdelAG
CAGUCTTATGAATATATTTTTAGCAAGCAG
CCGGCTTATGGCCCGCGinsTTATAGCAAinsGCAG

output with sed:

$ sed -e 's/^[ACTGNU]\{10\}\(.*\)TAGCAA.*/\1/g' test.txt 
ACTCAAACdelA
AATATATTTT
GCCCGCGinsTTA

with bash:

$ grep -Po '(?<=^[ATGCNU]{10}).*(?=TAGCAA.*)' test.txt 
ACTCAAACdelA
AATATATTTT
GCCCGCGinsTTA
ADD REPLY
0
Entering edit mode

The reason I posted a toy example is that the reads are fairly long. If it helps, here's the first four lines from the sam file:

1   40  29S1M1D38M1I46M4D7M1D4M1I3M1I20M1I2M1D70M1D11M2D5M1I14M1D42M1I2M1I8M1D1M1D10M1D7M2D8M1I10M1I9M1D20M1D4M1I1M1I30M1D13M3S *   0   0   GGTGTACTTCGTTCAGTTACGTATTTGCTGTCTAATACGACTCACTATAGGAATCGTGCTTGTCGTATTGTATGTTGGGGATCCCGACTGGCGAGAGCCGGGTAACGAATGGATCCCCACGCCTTTCGGTAGAGCCAACTCAATCATGGCGGAGCTGTAAAACCACAGGCGGAGTGTCCACTGCTCTGCCTGGCCTCGTCCACTGAGGCGACGGTGAAGCCTCCCGCCGTCTGCGATTCCCGTCAGGCTCGCAGGTGAAGGAGGCACGCCCTTGTCATGTGTATGTTGGGGGCTCATTGGCAGCTCGGTGTCCGAGCCCCACATGCTGTTGACATCAAGTCAATCGCTGGCAAGTGGTAATCACTCAATCGCGGTGGACGATTGCCGTGACGGGCTCCGAGACGTGGAGCGCGGTCTGATAAACATGGG   *   MD:Z:1^T34A1G11N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0^NNNN5A0T0^A5T4G0T14A2^G7G62^A11^CG19^G40G2T5A0T1^A1^C10^C4A2^TT6G10A9^A10C0G8^A35^A13   NM:i:84 AS:i:1214   H0:i:1  ZE:f:1.29842e-82    ZF:f:0.919579   ZQ:i:429    ZR:i:418
1   40  31S29M2I2M2D4M1I5M2I46M2D18M1D5M2I23M1D8M1D43M1I26M1D17M1I23M1D6M1I33M1D7M2D8M1I2M1I17M1D24M1I1M1I30M1D2M1I2M1I7M1I2M   *   0   0   TTGTTGTACTTCGTTCCGGTTACGTATTGCTGTTCTAATACGACTCACTATAGGAATCATTAGCGTAAATGTGTGTATGTTGGGGGATCCCGACTGGCGAGAGCCAGGTAACGAATGGATCCCCCACATGCTTTGTTGAGTAACTTTCAATCATGGCAAGGCTGTAAAATACAGGCAGGTGTCCACTGCTCTGCCTGGCCTCGTCCACTGAGGCGACAGTGGAAACCTCCACGCCGTCTGCGCGATTCGTCAGGCTCGCAGGGTGGAAGGAGGCACGCCCTTGTCGTATTGTGTTTGGGGGCTCGTGTGGCTCATTGGCTCCGAGCCCCACATACTGTTGACGGCAAAGTCAATCATGGCAAGTGGTAATCCCTCCGTCGCGGTGAGACTGATGCCGTGACAGGCTCCGAGACGTGGAGCGCAGGTCCTGACGCGGCA  *   MD:Z:27G3^TT2C14N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0^NN7A10^G26G0C0^C6G1^A39G5G23^C36A1G1^G1A16A7A12^C7^TT7T19^A6A27G20^G7T0A0A0A2  NM:i:89 AS:i:1192   H0:i:1  ZE:f:4.80396e-81    ZF:f:0.897671   ZQ:i:438    ZR:i:418

1   40  30S84M6D3M1D6M2D21M1D3M1D7M1D3M1I2M1I34M1I7M1I16M1I11M1I22M1I7M1I2M1I11M1I4M1I4M1I15M71S    *   0   0   TTGTTGTACTTCGTTCAGTTACGTATTGCTGTTCTAATACGACTCACTATAGGAATCGTGCTTGTCATGTGTATGTTGGGGATCCCGACTGGCGAAACCAGGTAACGAATGGATCCCCATGCTGTTGAGGTAACTCAATCATGTAAGCTGTAAACTTTCCTGGTGGAGTGTCCACTGCTCTGCCTGGCCTCGTCCCACTGAGGGCGACGGTGAAGCCTCCCACGCCGTCTGCCGCGATTCCGTCAGGCTCGCAGGAGTGAAGGGAGAGCACGCCAGCAATCGTAGTATTATGTTGGGGGCTCGTCTGTGGTTGGGTTGGTTAGGCGGGTTGGGGATGGAATCGAACATGGAAACACCAAGTCTGTCCTCCCGGCA *   MD:Z:50N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0^NNNNNN3^A3A2^TT20G0^C3^G7^G1C0A1A2C103C0T0T0G2A3G16 NM:i:70 AS:i:716    H0:i:1  ZE:f:1.60618e-47    ZF:f:0.704296   ZQ:i:375    ZR:i:418

1   40  34S3M1I4M2I7M1D69M6D6M1I4M2D6M1D18M1I10M1I1M1D3M1I11M1I22M1I4M1D13M1I5M2D6M2I34M1I1M1I7M2I3M1I2M1D15M1I5M1I10M1D12M1D7M2D18M1D3M1I1M2I4M1D6M1D22M1I7M1I5M1I1M1D27M5S    *   0   0   TTGGTAGCCCATTACGTTCAGTTACGTATTTGCTGTTCCTAAGCCACGACTACTATAGGAATCGTGCTTGTCATGTGTATGTTGGGATCCCGACTGGCGAAACCAGGTAACGAATGGTCCCCCACGCTGCTGTTGAGTAACTCAATCGTAGCAGGAGCTGTAAAACTCCTGAGCGGAGTGTCCCACTGCTCTGCCTGGCCTCGTCCCACTGGGCGACAGTGGAGGCCTCCGCCGTCACTGCACGATTCCGTCAGGCTCGCAGGGTGAAGGAGAGGCACGCCCGCTTGGTGTGTGTATGTTGGGGGGCTCATTGTGGCTCGTGGCTCCGAGCCCCACGCCCTGATGACGTAAGTCAATCCGCCCAGCAGTGGTAATCCTCCGTCGCGGTGAGACGTGCCCGTGACGGAGCTCCCGGACGTGGAGCGCGAGTCTGATAAACATGCGGG  *   MD:Z:7T6^C35N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0^NNNNNN5A1A2^TT6^G10A1G3A9G2^A1A38^A6G3A7^AC9G42C0^A18G3A4A1T0^A12^C2A0T0A2^TT1T15A0^T1G6^A6^A35^A27    NM:i:104    AS:i:1046   H0:i:1  ZE:f:1.03977e-70    ZF:f:0.89082    ZQ:i:446    ZR:i:418

The reference would be

GTTCTAATACGACTCACTATAGGAATCGTGCTTGTCATGTGTATGTTGGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCACATACTTTGTTGAGGTAACTCAATCATGGCAAGGCTGTAAAGCCACAGGCGGAGTGTCCACTGCTCTGCCTGGCCTCGTCCACTGAGGCGACGGTGAAGCCTCCACGCCGTCTGCGCGATTCCGTCAGGCTCGCAGGGTGAAGGAGGCACGCCCTTGTCATGTGTATGTTGGGGGCTCGTGTAGCTCATTAGCTCCGAGCCCCCACATACTTTGTTGACGTAAGTCAATCATGGCAAGTGAGTAATCACTCCGTCGCGGTGAGACGTGCCGTGACGGGCTCCGAGACGTGGAGCGCGAGTCTGATAAACATCGCCACTACTCTC

The problem with using either regular expressions or trimming is that nanopore data contains lots of inserts and deletions, so the length and composition of the 'fixed' region next to the variable region is extremely variable in the reads. I don't have any guarantee that the sequence next to the variable region is what I want it to be.

I guess I could write a script that uses the leftmost index in conjunction with the CIGAR string to calculate where my region of interest begins and ends for any given read and then extracts it - assuming there's no command line tool to do this kind of stuff.

ADD REPLY
0
Entering edit mode

Thanks for posting the data. '

so the length and composition of the 'fixed' region next to the variable region is extremely variable in the reads

From your description above, fixed region is as good as variable region. If there is a regular pattern that you can throw light on, it can be done. Given that sam file is text file, there are several command line tools that can handle regex in *nix.

May be you can show us how variable fixed region is compared to variable region. From there, some one here can provide solution. Btw, to extract only sequence one can use awk '{print $6}' test.txt from sam file (for SE). gordon.freeman

ADD REPLY
0
Entering edit mode
6.2 years ago
sacha ★ 2.4k

If sequence margin are relatively constant, you can check cutadapt
A: Extract sequence from fasta using primer

ADD COMMENT

Login before adding your answer.

Traffic: 2465 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6