Question

Converting Files Newline With Specific Pattern

0

Entering edit mode

12.5 years ago

Assa Yeroslaviz ★ 1.9k

Hi,

I would like to know how to remove the newline from a certain part of my file, but not all of it.

I am piping the result of my program into sed in order to convert the file to a specific format. The input file looks like that:

>sctg_0002_0001  length=2745
TCCCCCTCCCGTACCGGTTTGCGCTATTATACCGGCCTTGAATCGAGCAAAGGCTCCAAACAATTTCATTACAAACAGATTGGGGATGTATGACGTGGCT
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
TTGACACGCTTGTTTCTGATGTCATCACCCATGAAGAGCTGTTATTTGGCCACCTGGCGTTCCTGCCTAAGCGTTGAGTGAATATTAAACACCTCTGCCC
>sctg_0003_0001  length=2175
CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
...

This is what I need to do:

convert the ">" symbol into the string "SEQUENCE_ID="
remove everything after the double spaces in the header, e.g. " length=2745"
add part of the next header into the actual one - this should looks like that sctg00020001e00030001b
- 0002_0001 is part of the first header
- 0003_0001 is part of the second header.
delete the newline symbols from the sequence itself as if to make the fastA in one single line.
add the string "SEQUENCE_TEMPLATE=" to the sequence line.
add the symbol "=" after each sequence line

This is what I have done so far

perl convert_FastA.pl ScaffoldContigs.fasta | sed -e '/^>/  s/>/SEQUENCE_ID=/'  | sed -e ':a;N;$!ba;/^SEQUENCE_ID=/ !  s/\n//'

the results of the first part looks like the sample above. rst sed command replace the ">" with the pattern needed.

At the end it should look like that:

SEQUENCE_ID=sctg_0002_0001e_0003_0001b
SEQUENCE_TEMPLATE=CCCCCTCCCGTACCGGTTTGCGCTA...
=
SEQUENCE_ID=sctg_0003_0001e_0001_0001b
SEQUENCE_TEMPLATE=CAACAACCACTCTTAGCGCTGCTTG... 
=

I tried to delete the newline with sed, but it didn't work how I imgined it. It delete either all of them or none. Besides I couldn't find any way to "save" the next line in order to put it in the header of the sequence before that.

I would appreciate any help I can get.

Thanks, Assa

awk perl conversion programming primer • 4.2k views

ADD COMMENT • link updated 12.5 years ago by Kenosis ★ 1.3k • written 12.5 years ago by Assa Yeroslaviz ★ 1.9k

1

Entering edit mode

I don't completely understand the part where you are adding the next header into the current one. So you just want to add the next header in the current header separated by the letter 'e' and append a letter 'b' to the end?

It might be better to just write a script for something complicated like this.

ADD REPLY • link 12.5 years ago by Damian Kao 16k

1

Entering edit mode

The title and tags associated with your post do not cover the load at all. For future reference, a better use of tags and titles helps to get the right people looking at your question :)

ADD REPLY • link 12.5 years ago by Whetting ★ 1.6k

0

Entering edit mode

Your specs say, add part of the next header into the actual one. What do you do when there is no next header, i.e., you're processing header n?

ADD REPLY • link 12.5 years ago by Kenosis ★ 1.3k

0

Entering edit mode

the last one takes the second part of the first header, but I can probably do it manually, if not available by script.

ADD REPLY • link 12.5 years ago by Assa Yeroslaviz ★ 1.9k

Ram · Answer 1 · 2013-03-05

Here's a Perl option that processes your fasta records in one pass and uses Bio::SeqIO:

use strict;
use warnings;
use v5.10;
use Bio::SeqIO;

my $in = Bio::SeqIO->new( -fh => \*ARGV, -format => 'Fasta' );
my ( @last, $firstHeader );

while ( my $seq = $in->next_seq() ) {
    if ( !@last ) {
        @last = ( $seq->id, $seq->seq );
        next;
    }

    my $currHeader = $seq->id;
    $currHeader =~ s/sctg/e/;
    $firstHeader //= $currHeader;

    print formatRec( $currHeader, @last );
    @last = ( $seq->id, $seq->seq );
}

print formatRec( $firstHeader, @last );

sub formatRec {
    my ( $currHeader, $secID, $secSec ) = @_;
    return "SEQUENCE_ID=$secID$currHeader"
      . "b\nSEQUENCE_TEMPLATE=$secSec\n=\n";
}

Usage: perl script.pl fastaFile [>outFile]

Sample output on your data (no newlines in the sequences):

SEQUENCE_ID=sctg_0002_0001e_0003_0001b
SEQUENCE_TEMPLATE=TCCCCCTCCCGTACCGGTTTGCGCTATTATACCGGCCTTGAATCGAGCAAAGGCTCCAAACAATTTCATTACAAACAGATTGGGGATGTATGACGTGGCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTGACACGCTTGTTTCTGATGTCATCACCCATGAAGAGCTGTTATTTGGCCACCTGGCGTTCCTGCCTAAGCGTTGAGTGAATATTAAACACCTCTGCCC
=
SEQUENCE_ID=sctg_0003_0001e_0004_0001b
SEQUENCE_TEMPLATE=CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
=
SEQUENCE_ID=sctg_0004_0001e_0003_0001b
SEQUENCE_TEMPLATE=TCCCCCTCCCGTACCGGTTTGCGCTATTATACCGGCCTTGAATCGAGCAAAGGCTCCAAACAATTTCATTACAAACAGATTGGGGATGTATGACGTGGCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTGACACGCTTGTTTCTGATGTCATCACCCATGAAGAGCTGTTATTTGGCCACCTGGCGTTCCTGCCTAAGCGTTGAGTGAATATTAAACACCTCTGCCC
=

The script processes the file passed to it on the command line, passing its handle for the creation of a Bio::SeqIO object. Output is directed to STDOUT, so you can view the results on your screen or have it sent to a file by using the optional last parameter.

It saves the last record for printing. The first conditional statement checks if there is a last record. If not, it initializes the array that hold it and loops again. If a last record exists, then a substitution is done on the current header. The first header is stored for use later with the very last record.

A record is printed, the current record becomes the last one, and the next record is retrieved. The very last record is printed outside the while loop.

Hope this helps!

Edit:

Was inspired by Whetting's Python script and wanted to give such a try. The following Python script produces the same results as the Perl script:

import sys
import re
from Bio import SeqIO

handle = open(sys.argv[1], 'rU')
recs = [[rec.id, rec.seq] for rec in SeqIO.parse(handle, 'fasta')]
handle.close()

firstHeader = recs[1][0]

for i in range(0, len(recs)):
    addID = recs[i + 1][0] if i &lt; len(recs) - 1 else firstHeader
    recs[i][0] += re.sub(r'sctg', r'e', addID)
    print 'SEQUENCE_ID=' + recs[i][0] + 'b\nSEQUENCE_TEMPLATE=' \
        + recs[i][1] + '\n='

Usage: python script.py fastaFile [>outFile]

Instead of a single pass, the above first builds a list comprehension of IDs and sequences. The header used in the last record is first saved in firstHeader. The records are iterated over and a ternary expression is used to determine which ID to add to the current record, as the last record gets firstHeader. Next a substitution on the ID is done and the results are concatenated to the current record's ID. Finally, the record is printed.

score 1 · Answer 2 · 2013-03-05

Hi,
I do not know perl well enough to do this. however, I have some basic knowledge of python.
This script does what you want...

from Bio import SeqIO
out=open("out.txt","a")
ID=[]
for record in SeqIO.parse("input.fas","fasta"):
    ID.append("_".joinstrrecord.id).rsplit("_")[1:]))

n=1
for record in SeqIO.parse("input.fas","fasta"):
    if n==len(ID):
        #print >>out, n
        print >>out, "SEQUENCE_ID="+record.id+"e_"+ID[0]+"b"
        print >>out, "SEQUENCE_TEMPLATE ="+record.seq
        print >>out, "="
        n=n+1
        break
    else:
        #print >>out, n
        print >>out, "SEQUENCE_ID="+record.id+"e_"+ID[n]+"b"
        print >>out, "SEQUENCE_TEMPLATE ="+record.seq
        print >>out, "="
        n=n+1
out.close()

It will take your input in the file "input.fas"

and it returns:

SEQUENCE_ID=sctg_0002_0001e_0003_0001b
SEQUENCE_TEMPLATE =TCCCCCTCCCGTACCGGTTTGCGCTATTATACCGGCCTTGAATCGAGCAAAGGCTCCAAACAATTTCATTACAAACAGATTGGGGATGTATGACGTGGCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTGACACGCTTGTTTCTGATGTCATCACCCATGAAGAGCTGTTATTTGGCCACCTGGCGTTCCTGCCTAAGCGTTGAGTGAATATTAAACACCTCTGCCC
=
SEQUENCE_ID=sctg_0003_0001e_0004_0001b
SEQUENCE_TEMPLATE =CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
=
SEQUENCE_ID=sctg_0004_0001e_0005_0001b
SEQUENCE_TEMPLATE =CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
=
SEQUENCE_ID=sctg_0005_0001e_0002_0001b
SEQUENCE_TEMPLATE =CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
=

ps: The new lines are gone. sequence is being split because I copy-pasted from terminal.