Convert list of DNA sequences in text format to a single length?
8.6 years ago

Hi, I have a text file with a long list of DNA sequences.

I would like to pad them all to the same length, namely the length of the longest sequence. The character "D" should be appended to every shorter sequence until it reaches that length.

Is there any way to do this in R or Biopython, with a script along these lines:

1) Read sequences and find longest sequence

2) Loop through each sequence adding "D"s to match the length of the longest sequence
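The two steps above can be sketched in plain Python, without Biopython. This is a minimal sketch that assumes the text file holds one sequence per line; the function names `pad_sequences` and `read_sequences` are hypothetical, chosen only for illustration.

```python
def pad_sequences(sequences, pad_char="D"):
    """Right-pad every sequence so that all match the longest one."""
    # Step 1: find the length of the longest sequence (0 for an empty list).
    longest = max((len(seq) for seq in sequences), default=0)
    # Step 2: pad each shorter sequence on the right with pad_char.
    return [seq.ljust(longest, pad_char) for seq in sequences]


def read_sequences(path):
    """Read one sequence per line (assumed layout), skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


# Small in-memory demonstration; for a file, pass read_sequences(path) instead.
for seq in pad_sequences(["ACGT", "AC", "ACGTTG"]):
    print(seq)
```

`str.ljust` leaves a string unchanged when it is already at least as long as the target width, so the longest sequence passes through untouched.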

I was looking through the APE package in R as I imagine something must exist already to accomplish this.

Any help would be appreciated.

sequence R • 2.4k views

Is your file in fasta format? Not clear from your question.


No, the file is not yet in fasta format. Just a text document.

8.6 years ago
Anima Mundi ★ 2.9k

Hello, here is a quick and dirty solution in Python (for an input file named foo.fasta):

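# Pass 1: find the longest sequence line (header lines start with '>').
maxl = 0
with open('foo.fasta') as handle:
    for line in handle:
        if not line.startswith('>'):
            # Strip the newline so it does not count towards the length.
            maxl = max(maxl, len(line.rstrip('\n')))

# Pass 2: print headers unchanged and pad each sequence with 'D'.
with open('foo.fasta') as handle:
    for line in handle:
        line = line.rstrip('\n')
        if line.startswith('>'):
            print(line)
        else:
            print(line + 'D' * (maxl - len(line)))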

Hope it helps!

PS: the script assumes you have no other text than FASTA lines, and that your sequences are formatted as single lines.
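If the sequences are wrapped over several lines, the single-line assumption no longer holds. A possible workaround, sketched here in plain Python, is to join the lines of each record before measuring and padding. The helper names `read_fasta` and `pad_fasta` are hypothetical, and the sketch assumes every record starts with a `>` header line.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            # Sequence text before the first header is ignored (assumption).
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def pad_fasta(lines, pad_char="D"):
    """Return records with every sequence right-padded to the longest one."""
    records = list(read_fasta(lines))
    longest = max((len(seq) for _, seq in records), default=0)
    return [(head, seq.ljust(longest, pad_char)) for head, seq in records]


example = [">a\n", "ACG\n", "T\n", ">b\n", "AC\n"]
for head, seq in pad_fasta(example):
    print(head)
    print(seq)
```

`pad_fasta` accepts any iterable of lines, so an open file handle can be passed in place of the `example` list.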

8.6 years ago
natasha.sernova ★ 4.0k

See this post.

How to copy all fasta-seqs from fasta-files with the seq-lengths between minlen and maxlen

It contains many helpful script versions.

I use lh3's Perl script.

It's almost what you need, isn't it?


Thank you for this resource!
