Trim sequences in FASTA file from Sanger sequencing
10.6 years ago

Is there a tool that trims FASTA sequences at the beginning and end?

By trimming I mean taking the input sequences below and generating the outputs shown. I have more than 3,000 sequences.

Thanks.

INPUT:

>Seq1
NNNNNNNNNNTTGNNNNGGATNTCCTTTCCGAATATTTTTGGTGCATTTGTAATAAATGTCATTTNTCTCCTTTTTAAAGGAATTGTCTTAGAAGAAAGAAGGCAAGCCACCATTTTACCCACGTAAATATATGAATATATTTCTGACATTGAGGTGTTCCAGAAGATGATAAAGAAATGATAGCAGCTCCAGAAATACCAACTGATTTTAATCTACTACAGTAAGTAAATTATATTCTGATAATTTTTAAATACTTGTTTATTCCACAAAATGGGGAATGCATTAACTTCAGTTAAATTTCCTTCTGCTCGAGAAGATCTAATATATAAAATAGCTTTTATGCTTTGCAAGAGTTTATATCAGNANCNNNNNNNNNNCNGN
>Seq2
NNNANNNNNGNNNNGTATGANGTTTTGGGGAACATCTTAATTACTTATAATGCTAATATGAAGTTTTGTAATGAGTTAACCAAGCCTTTCTTTTAGAAAATATGGCAAAAATTAGAAACTCAATATAAATTTCTAAGGAAGGGTTTTAATTCTTATCTTTCTGTCACAGGGAGTCAGAAACACATTTTTCTTCTGACACAGATTTTGAAGATATCGAAGGAAAAAACCAAAAGCAAGGCAAAGGCAAAGTATGTATCAAATATTTGACTTTATTTTGTTTCCTAAGATCTCACACACACACAGATTTAAGTTATGTCTCAGATAGTTTTATCTTTTAAAAATGGCTTTTTAAGGGGGTGGGAGCTGATTGGTATGGTAANCAN
>Seq3
NNNNNGNNNNNNNNNTNNNTNNNTNNNAAGTGGATGGAATTCTTTAGGGCAAGTTTAAGCATGTTATGTACCCTATCAGCTACTTCTACTGTAGCTGTGTTTTGAACTCTCAAGGATAGTGATATAACTTAACCACCTCGTATTTTTTATGCAGACTTGTAAAAAAGGCAAAAAGGGCCCAGCAGAAAAGGGCAAAGGTGGAAATGGAGGAGGAAAACCTCCTTCTGGTCCAAACCGAATGAATGGTCATCACCAACAGAATGGAGTGGAAAACATGATGTTGTTTGAAGTTGTTAAAATGGGCAAGAGTGCTATGCAGGTAAGATTTATGTTGTTCTTCCCAGTTCATTTGTACATTTTAAACTTTAATGAGTTATATAGAGTGTAGCTCTGNNNNNNNNNNTTGCAA
>Seq4
NNNNNNNCNCNNNNNGNGNNNNCNAAGTGACTATTTGAGAGCTGCTGATTTCAAAATAAATATATCTTACCTTTACAGCCTGAACACTGAATAAAAAAGTTGATAAGGTCAAGAAGTGCTATATCTCGGTCATGCTTGTATGATTCTATCCAATCATCTACCACCGACTACAGCAGAGGGAAAAAAATAAAATCATTAGCTTCTTCTAATTTTCTCAAAATCAATTAAGTCTGATAAAGTCATAAAATTCAAGATTATATAGTATCACATTACTTTAATATAAATACTTATACACTGAAATTTAAAGTTCAATTTTAACAATAATAAAATAGAATCGAATTCAGTAAAACAATTATCTGATAACACAAAATGACCTATCAATCTTCTATTTATTTTGCATTGAAAAGAATGTGGNNN
>Seq5
NNNNNNNNNAANNNNNNNNNNNNNNNNNNNTNNNNANNNNNNNNNNNTAAGTTATCAAAACACTTAAGGTAGTAAGTTACCTCATCGAATTCTTCAGTCATTTTTCGAATTATCTCAGAGTTCTGCATATGTCTAAACATTTCTGCTGTGACAACTCCTGAAATTTGCAAATGTCAGAAGTTAATATATGGTGTGATAAAAAAATAAAGAAAACTTCCAAGTAAGTCTCTAACACTAAGAAGTCTATGGTCACACAATAAAAGGCATACTTCTTCAACCATCATCTAATAATCTTTACCATGATACTCTAATCTATAAATAAAGCACAAACAAATGCTATCTATTCTCAGTATGCACAAGAAAACAGCCCCATACTTCTGACAGATATCTTTTTTCCTAACACAATTAACTTTGGCCATTTCTANNNNNNNNNNTTNNNNAAN

OUTPUT:

>Seq1
TCCTTTCCGAATATTTTTGGTGCATTTGTAATAAATGTCATTTNTCTCCTTTTTAAAGGAATTGTCTTAGAAGAAAGAAGGCAAGCCACCATTTTACCCACGTAAATATATGAATATATTTCTGACATTGAGGTGTTCCAGAAGATGATAAAGAAATGATAGCAGCTCCAGAAATACCAACTGATTTTAATCTACTACAGTAAGTAAATTATATTCTGATAATTTTTAAATACTTGTTTATTCCACAAAATGGGGAATGCATTAACTTCAGTTAAATTTCCTTCTGCTCGAGAAGATCTAATATATAAAATAGCTTTTATGCTTTGCAAGAGTTTATATCAGNANC
>Seq2
GTATGANGTTTTGGGGAACATCTTAATTACTTATAATGCTAATATGAAGTTTTGTAATGAGTTAACCAAGCCTTTCTTTTAGAAAATATGGCAAAAATTAGAAACTCAATATAAATTTCTAAGGAAGGGTTTTAATTCTTATCTTTCTGTCACAGGGAGTCAGAAACACATTTTTCTTCTGACACAGATTTTGAAGATATCGAAGGAAAAAACCAAAAGCAAGGCAAAGGCAAAGTATGTATCAAATATTTGACTTTATTTTGTTTCCTAAGATCTCACACACACACAGATTTAAGTTATGTCTCAGATAGTTTTATCTTTTAAAAATGGCTTTTTAAGGGGGTGGGAGCTGATTGGTATGGTAANCA
>Seq3
AAGTGGATGGAATTCTTTAGGGCAAGTTTAAGCATGTTATGTACCCTATCAGCTACTTCTACTGTAGCTGTGTTTTGAACTCTCAAGGATAGTGATATAACTTAACCACCTCGTATTTTTTATGCAGACTTGTAAAAAAGGCAAAAAGGGCCCAGCAGAAAAGGGCAAAGGTGGAAATGGAGGAGGAAAACCTCCTTCTGGTCCAAACCGAATGAATGGTCATCACCAACAGAATGGAGTGGAAAACATGATGTTGTTTGAAGTTGTTAAAATGGGCAAGAGTGCTATGCAGGTAAGATTTATGTTGTTCTTCCCAGTTCATTTGTACATTTTAAACTTTAATGAGTTATATAGAGTGTAGCTCTG
>Seq4
AAGTGACTATTTGAGAGCTGCTGATTTCAAAATAAATATATCTTACCTTTACAGCCTGAACACTGAATAAAAAAGTTGATAAGGTCAAGAAGTGCTATATCTCGGTCATGCTTGTATGATTCTATCCAATCATCTACCACCGACTACAGCAGAGGGAAAAAAATAAAATCATTAGCTTCTTCTAATTTTCTCAAAATCAATTAAGTCTGATAAAGTCATAAAATTCAAGATTATATAGTATCACATTACTTTAATATAAATACTTATACACTGAAATTTAAAGTTCAATTTTAACAATAATAAAATAGAATCGAATTCAGTAAAACAATTATCTGATAACACAAAATGACCTATCAATCTTCTATTTATTTTGCATTGAAAAGAATGTGG
>Seq5
TAAGTTATCAAAACACTTAAGGTAGTAAGTTACCTCATCGAATTCTTCAGTCATTTTTCGAATTATCTCAGAGTTCTGCATATGTCTAAACATTTCTGCTGTGACAACTCCTGAAATTTGCAAATGTCAGAAGTTAATATATGGTGTGATAAAAAAATAAAGAAAACTTCCAAGTAAGTCTCTAACACTAAGAAGTCTATGGTCACACAATAAAAGGCATACTTCTTCAACCATCATCTAATAATCTTTACCATGATACTCTAATCTATAAATAAAGCACAAACAAATGCTATCTATTCTCAGTATGCACAAGAAAACAGCCCCATACTTCTGACAGATATCTTTTTTCCTAACACAATTAACTTTGGCCATTTCTA

sanger sequencing trim NNNN • 6.2k views

Just to clarify your post: you want to trim back from both the 5' and 3' ends to the point where no more N's are visible within a certain distance?


Yes, that is correct.

10.6 years ago

Here's an awk solution (because why not). It slides a 5-base window in from each end and trims until it reaches a window that contains no N. You can change the window size as needed. Note that it expects each sequence on a single line directly after its header. Just change foo.fa to whatever your file is called and redirect the output to a new file.

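awk '{
# Current line is the header; the next line is assumed to hold the whole sequence.
header=$0;
getline;
# Slide a 5-base window in from the start until it contains no N.
for(five_prime=1;five_prime<length($1)-5;five_prime++) {
    s=substr($1,five_prime,5);
    if(index(s,"N")==0) break;
}
# Do the same from the end, stopping at the first N-free window.
for(three_prime=length($1)-4;three_prime>five_prime;three_prime--) {
    s=substr($1,three_prime,5);
    if(index(s,"N")==0) break;
}
# Print everything from the first clean window through the last one.
printf("%s\n%s\n",header,substr($1,five_prime,three_prime-five_prime+5));
}' foo.fa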

Edit: Fixed an off-by-one error.
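
For anyone who would rather have the same windowed idea in Python, here is a minimal sketch. It is not a line-for-line port of the awk above: the function names `trim_n_windows` and `read_fasta` and the default `window=5` are my own choices for illustration, and unlike the awk it joins sequences that wrap over several lines and returns an empty string when no N-free window exists.

```python
def trim_n_windows(seq, window=5):
    """Keep the span from the first N-free window to the last N-free window."""
    upper = seq.upper()  # compare case-insensitively, but slice the original
    n = len(upper)
    # Start positions of every window that contains no N.
    clean = [i for i in range(n - window + 1) if "N" not in upper[i:i + window]]
    if not clean:
        return ""  # no clean window anywhere, so nothing is kept
    return seq[clean[0]:clean[-1] + window]


def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

To process a file, loop over `read_fasta(sys.stdin)` and print each header followed by `trim_n_windows(seq)`.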

10.6 years ago
xb ▴ 420

How about this: FASTA/Q Trimmer


Thanks. I did look at this, and I am using Galaxy, but it does not trim the beginning and end of each sequence. It just removes sequences that have lots of N's in them.

10.6 years ago

A more generic solution to your problem could be to find the longest substring that is bounded by Ns. A simple Python script like the one below does that, treating each sequence line on its own:

import sys

for line in sys.stdin:
    line = line.strip()
    if line.startswith(">"):
        # Header line: print it unchanged.
        print(line)
        continue
    # Split on N and keep the longest N-free piece.
    pieces = line.split("N")
    longest = max(pieces, key=len)
    print(longest)

Run it with python trim.py < input.fasta


Won't this split the string if there are a few N's in the middle of a sequence? I would like to trim the ends as much as possible and tolerate some N's in the middle. Thanks.


Correct. As I said, this gives you the longest substring that is bounded by Ns.

It is just a different way to think about the problem, and thinking about it this way may reveal different requirements.
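
The requirement raised in the comments above (trim the ends hard, but tolerate a few N's in the middle) could be sketched roughly as follows. This is only one possible reading of that requirement: the function name `trim_tolerant`, the `window` and `max_n` parameters, and their default values are assumptions for illustration, not something proposed in this thread.

```python
def trim_tolerant(seq, window=10, max_n=1):
    """Trim both ends back to the outermost windows holding at most max_n N's."""
    upper = seq.upper()  # compare case-insensitively, but slice the original
    n = len(upper)
    # Start positions of every window that is "clean enough".
    ok = [i for i in range(n - window + 1)
          if upper[i:i + window].count("N") <= max_n]
    if not ok:
        return ""  # no acceptable window anywhere, so nothing is kept
    start, end = ok[0], ok[-1] + window
    # Drop any N's that still sit right at the two cut points.
    while start < end and upper[start] == "N":
        start += 1
    while end > start and upper[end - 1] == "N":
        end -= 1
    return seq[start:end]
```

With `max_n=0` this keeps only the span between the outermost N-free windows; raising `max_n` or shrinking `window` makes the trimming less aggressive, so it is worth trying a few settings on a handful of reads before running all of them.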
