samtools extract the non-specified region
1
0
Entering edit mode
11 months ago
sapuizait ▴ 10

Hi all

I am trying to extract specific regions from sequences in fasta files

Example file:

>oneseq GAATATGCTTTTCGCTATTTTGGTGGCAGAACAAAAGCAATTATGTATGACCAGGATAAAGTTTTTGTTGTAAGTGAGAATTTCGGCAATGTAATCTTTGTTCCGGAGT
>twoseq TCTGGAGGGACTGCCGGCGCAAGCCGTGAGGAAGGATGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGTCGGCACAGCGGGAAGCCATATGGTGACATAGAGCGGAACCCGAAAGCCGGTCTCAGTTCGGATCGGAGTCTGCAA
>threeseq CATCCGGTAGTTTCTTTCCCATATCCTGCACCGCCGGAACCTTCTTCTCATAGGTCTGTAACTGTGTATTGTTTTCTGCCATACAAACACCGCCCTTTCTATGATATTTCAGATATTTCAAGCAATATTTCAAAAAATTAAATCTAATCTTAACTTTATTCCAACCCTT

I often use samtools to do it using sth like this:

samtools faidx test.fas oneseq:2-10 twoseq:3-10

However, this time, instead of this, I would like to extract the non-specified regions In the above example, I would like to get as output this:

>oneseq
G-TTCGCTATTTTGGTGGCAGAACAAAAGCAATTATGTATGACCAGGATAAAGTTTTTGTTGTAAGTGAGAATTTCGGCAATGTAATCTTTGTTCCGGAGT
>twoseq
TC-CTGCCGGCGCAAGCCGTGAGGAAGGATGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGTCGGCACAGCGGGAAGCCATATGGTGACATAGAGCGGAACCCGAAAGCCGGTCTCAGTTCGGATCGGAGTCTGCAA

OR even better :)

>oneseq-upstream 
G
>oneseq-downstream TTCGCTATTTTGGTGGCAGAACAAAAGCAATTATGTATGACCAGGATAAAGTTTTTGTTGTAAGTGAGAATTTCGGCAATGTAATCTTTGTTCCGGAGT
>twoseq-upstream 
TC
>twoseq-downstream CTGCCGGCGCAAGCCGTGAGGAAGGATGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGTCGGCACAGCGGGAAGCCATATGGTGACATAGAGCGGAACCCGAAAGCCGGTCTCAGTTCGGATCGGAGTCTGCAA

I am aware it is possible if I first first get the sequence length and then subtract from it the selected region - But, I was just wondering if there is an easier way to do this?

Thanks

samtools • 556 views
ADD COMMENT
4
Entering edit mode
11 months ago
JC 13k

one simple solution is a simple Perl script:

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[1] or die "use perl split_seqs.pl FASTA COORD1 COORD2 ... > OUTPUT\n";

my $fasta  = shift @ARGV;
my @coords = @ARGV;

my %reg = ();
foreach my $coord (@coords) {
    my ($seq_id, $seq_coord) = split (/:/, $coord);
    $reg{$seq_id} = $seq_coord;
}

$/="\n>"; # read fasta in sequence blocks
open (my $fh, "<", $fasta) or die "cannot read $fasta\n";
while (<$fh>) {
    s/>//g;
    my ($seq_id, @seq) = split (/\n/, $_);
    next unless (defined $reg{$seq_id});

    my $seq = join "", @seq;
    my ($ini, $end) = split (/-/, $reg{$seq_id});

    # upstream
    my $u_seq = substr($seq, 0, $ini - 1);
    print ">$seq_id-upstream\n$u_seq\n";
    # downstream
    my $d_seq = substr($seq, $end, (length $seq) - $end);
    print ">$seq_id-downstream\n$d_seq\n";
}
close $fh;

then you can run as:

$ perl split_seq.pl seqs.fa  oneseq:2-10 twoseq:3-10
>oneseq-upstream
G
>oneseq-downstream
TTCGCTATTTTGGTGGCAGAACAAAAGCAATTATGTATGACCAGGATAAAGTTTTTGTTGTAAGTGAGAATTTCGGCAATGTAATCTTTGTTCCGGAGT
>twoseq-upstream
TC
>twoseq-downstream
CTGCCGGCGCAAGCCGTGAGGAAGGATGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGTCGGCACAGCGGGAAGCCATATGGTGACATAGAGCGGAACCCGAAAGCCGGTCTCAGTTCGGATCGGAGTCTGCAA
ADD COMMENT

Login before adding your answer.

Traffic: 2026 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6