Fasta sequence extraction
0
0
Entering edit mode
9.0 years ago
Eva_Maria ▴ 190

I have tab limited file like this

GCF_000707685.1_2840 0 145
GCF_000706885.1_542 0 150
GCF_000593365.1_489 0 156
GCF_000593345.1_3957 256 289
GCF_000593325.1_3041 780 958

I want to extract sequence based on position like 0 to 145 , 256 to 289

>GCF_000707685.1_2840
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>GCF_000706885.1_542
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

How to do this? awk or perl

blast awk perl genome alignment • 2.8k views
ADD COMMENT
3
Entering edit mode

Hi,

The getfasta in bedtools can accomplish this job. Please follow the link http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html for details.

ADD REPLY
0
Entering edit mode

its not working for me (files about 7 gb). is there any programme or awk?

ADD REPLY
0
Entering edit mode

Hi, I write a perl script as follows. Usage: perl extr_seq.pl genome.fa list.txt. I am just wondering why bedtools does not work for you. Personally I think it's a good tool.

#!/usr/bin/perl -w
use strict;

my %data;
my $id = '';
my $seq = '';

open my $fasta, '<', $ARGV[0] or die $!;
while (<$fasta>) {
    chomp;
    s/\r//;
    next if /^\s*$/;
    if (/>(\S+)/) {
        $data{$id} = $seq if $seq ne '';
        $id = $1;
        $seq = '';
    } else {
        $seq .= $_;
    }
}
$data{$id} = $seq;
close $fasta;

open my $list, '<', $ARGV[1] or die $!;
while (<$list>) {
    chomp;
    s/\r//;
    next if /^\s*$/;
    my ($gene, $start, $end) = split;
    if (exists $data{$gene}) {
        my $extr_seq = substr($data{$gene}, $start - 1, $end - $start + 1);
        print ">$gene\n$extr_seq\n";
    }
}
close $list;
ADD REPLY
0
Entering edit mode

thanks it works

ADD REPLY
0
Entering edit mode

If bedtools is not working for you then try https://github.com/mdshw5/pyfaidx#cli-script-faidx

I think bedtools might read the entire FASTA into memory before extracting, but not sure.

ADD REPLY
0
Entering edit mode

Here is an awk, one line solution to you problem.

awk 'FS="\t" {if(($2==0) && ($3==145) || ($2==256) && ($3 == 289)) print $1}' < foo.file.txt

if this is your file:

GCF_000707685.1_2840 0 145
GCF_000706885.1_542 0 150
GCF_000593365.1_489 0 156
GCF_000593345.1_3957 256 289
GCF_000593325.1_3041 780  958

the output is:

$ awk 'FS="\t" {if(($2==0) && ($3==145) || ($2==256) && ($3 == 289)) print $1}' < foo.file.txt
GCF_000707685.1_2840
ADD REPLY
0
Entering edit mode

I want to do as a Batch processing because i have 20 mb text file and 256 mb fasta

ADD REPLY
0
Entering edit mode

what do you mean by "batch processing"?

ADD REPLY

Login before adding your answer.

Traffic: 1893 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6