Extract 6mer and headers from fasta and compile into list
3
1
Entering edit mode
9.6 years ago
omansn ▴ 10

I'm hoping you all can help me with a problem I have.

I have a fasta file of ~1300 genes, and I'm looking for all instances of 7 different 6nt sequences. I'm hoping to use my list of 7seven 6mers to query the fasta and extract the matches sequence (+/- about 7nt around it) as well as the match's respective fasta header.

This seems like something that would be really easy to do for someone who is good at programming. Unfortunately, I'm not. I've tried a couple different things- First I tried using grep, but the inflexibility of grep made it difficult to compile the data, especially keeping it attached to the FASTA header.

Next, I tried the following perl script (remember, I can't program) and I get errors each time.

use strict;
use Bio::Seq;
use Bio::DB::Fasta;

my $fastaFile = shift;
my $queryFile = shift;

my $db = Bio::DB::Fasta->new( $fastaFile );
open (IN, $queryFile);
while (<IN>){
    chomp;
    $seq = $_;
    my $sequence = $db->seq($seq);
    if  (!defined( $sequence )) {
            die "Sequence $seq not found. \n"
    }
    print ">$seq\n", "$sequence\n";
}

This gives me the output of :

Global symbol "$seq" requires explicit package name at seqfindr.pl line X.

for every line that contains $seq.

Now, I've tried adjusting my PATH to contain all bioperl modules (I imagine, $seq is contained in one) to no avail.

Can someone help me out here? I'm open to new ideas of how to generate my list, adjustments to the perl script, or reappropriation of other tools.

Thank you!

Stressed and under pressure from the boss :)

gene sequence • 2.9k views
ADD COMMENT
3
Entering edit mode
9.6 years ago
Dan D 7.4k

I don't use Perl anymore, but here's something I wrote in Python. It's a 10-minute effort, but seems to work fine. Put your target sequences in a text file, one target sequence per line. Call the program like this:

python fasta_target_extractor target_seqs.txt my_genome.fasta

For each match it will print the header of the matching FASTA contig, and the target sequence match plus up to 7 bases from each side of the matched target sequence.

#!/usr/bin/python
from sys import argv

#run the application like this:
#python fasta_target_extractor.py target_seqs.txt fasta_file.fasta

#The target_seqs list holds the 6-mers you're querying on
target_seqs = []
with open(argv[1]) as tsf:
    for read in tsf:
        target_seqs.append(read.strip())

current_header = ''

with open(argv[2]) as faf:
    for read in faf:
        if read.startswith('>'):
            current_header = read.strip()
            continue
        for ts in target_seqs:
            ts_index = read.find(ts)
            if ts_index != -1:
                r = read.strip()
                start = 0 if ts_index - 7 < 0 else ts_index - 7
                end = len(r) - 1 if ts_index + len(ts) + 7 >= len(r) else ts_index + len(ts) + 7
                print current_header
                print r[start:end]
ADD COMMENT
0
Entering edit mode

You're so nice! I'll run it tomorrow and let you know how it goes! Thanks Dan.

In general, would you recommend that I focus on Python instead of Perl for bioinformatics? While Python seems to be taking over, a lot of the tools out there are in Perl.

ADD REPLY
0
Entering edit mode

Uh oh, you opened that door. I think it's largely a matter of personal preference. The Python vs. Perl debate leaves lots of bodies in its wake. Here are some of the best biostars threads about it:

When To Choose Python Over Perl (And Vice Versa)?

I want to re-open the old debate: python or perl ?

ADD REPLY
0
Entering edit mode

Ahah! it worked! I am so grateful for your help, Dan. You have no idea how relieved I am!

ADD REPLY
0
Entering edit mode
9.6 years ago

I don't have an exact solution, but you can extract the sequences with BBDuk like this:

bbduk.sh in=data.fasta out=filtered.fasta k=6 mm=f rcomp=f literal=AAAAAA,AAATTT,CCCGGG

It's also possible to change the matching parts of the sequences to lowercase, which may be helpful, with the flag "kmask=lc". You can mask an additional 7 bases around each matching kmer with the flag "trimpad=7".

ADD COMMENT
0
Entering edit mode
9.6 years ago
zhuwei.cug ▴ 30

Since you have enable the 'strict' syntax mode via 'use strict', you have to declare every variable such as:

$seq="ATGC" is wrong, it should be my $seq="ATGC"

So in your script, change $seq = $_; to my $seq = $_;. That will eliminate the errors

ADD COMMENT

Login before adding your answer.

Traffic: 1638 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6