Separate Files
11.0 years ago
wangjununo • 0

I have some biological sequence data like the following:

>b_comp_seq1
ACGCGGGGGAATTT
>b_comp_seq_2
ACGGGCTTTCACC
.....
>b_comp_seq_64
ACCCGGGAATT

I want to split these sequences across separate files, four sequences per file. That is, I have 64 sequences and want 16 files of 4 sequences each, every file with a different name. Is there a Perl script or another way to do this? Thank you.

perl • 3.3k views

Why don't you try "split"? Each entry takes 2 lines (1 for the header and 1 for the sequence), so you have a total of 64 x 2 = 128 lines. Now try this command on UNIX (--lines is the GNU long option; -l 8 is the portable form):

split --lines 8 Original_file (It should give you 16 files)
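A minimal sketch of the same idea with an explicit output prefix, assuming every sequence sits on one line directly after its header. The function name and both arguments are hypothetical; the output names (prefix followed by aa, ab, ...) are split's default suffixes.

```shell
# Split a two-line-per-record FASTA file into chunks of 4 records (8 lines).
# Assumption: each record is exactly one header line plus one sequence line.
split_fasta_by_records() {
    input=$1    # hypothetical input path, e.g. Original_file
    prefix=$2   # hypothetical output prefix, e.g. chunk_
    split -l 8 "$input" "$prefix"
}
```

Called as split_fasta_by_records Original_file chunk_, a 128-line input should yield chunk_aa through chunk_ap. Note that the headers are kept in each chunk.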


Unless I'm mistaken, that will not work. It appears that the OP wants only four extracted sequences in each of the 16 files, with no headers.


In that case he can make a new file using the following command: grep -v "^>" Original_file > Newfile. This file will contain only the sequences. He can then run split on Newfile with --lines 4 instead of --lines 8.
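The two steps can also be chained without the intermediate file. This is a sketch under the same one-line-per-sequence assumption; the function name and arguments are hypothetical.

```shell
# Drop header lines, then write the remaining sequences 4 per output file.
# Assumption: each sequence occupies exactly one line.
split_sequences_only() {
    input=$1    # hypothetical input path, e.g. Original_file
    prefix=$2   # hypothetical output prefix, e.g. seqs_
    grep -v '^>' "$input" | split -l 4 - "$prefix"
}
```

The "-" tells split to read from standard input, so the header-free lines never need to be stored in a separate Newfile.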


This is a nice solution.


There's probably not one floating around anywhere, but you could trivially write one.


Perhaps not so 'trivial' for the OP...


Then the OP should find a different field. Those who can't program at least a little have no business in bioinformatics.


I don't know what "a little" means, in this context--either operationally or stipulatively. I also didn't know that bioinformatics requires a programming background for admission. If not, however, perhaps the OP is just now nurturing his/her emerging programming skills, since a certain level of proficiency ("a little"?) is required at some point in the OP's matriculation.


Well, dpryan79 is not entirely wrong. He has helped a lot of beginners by answering their questions. But lately we have been getting many trivial questions. People don't try enough before posting their questions to the forum. The best thing would be to post whatever you have tried so far along with the actual question. This shows that the user has made a sincere effort to resolve the problem.


Your point is well made, and I didn't mean to suggest that dpryan79 was wrong. Even in my solution I mentioned that it would be nice to see some problem-solving attempts. I sometimes have a hard time with the use of the word 'trivial' in programming contexts, since I know that what would be 'trivial' to some programmers would make my brain hurt.

11.0 years ago
Pavel Senin ★ 1.9k

I think this will work for you: c-code

11.0 years ago
Kenosis ★ 1.3k

It would be good to see your solution attempts. Having said that, the following will do what you need:

use strict;
use warnings;
use autodie;

my ( $fh, $n );

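while (<>) {
    # $. is the current input line number; every 8 lines (4 header/sequence
    # pairs) open the next output file: file1.txt, file2.txt, ...
    open $fh, '>', 'file' . ++$n . '.txt' unless ( $. - 1 ) % 8;
    # Write only sequence lines; header lines starting with '>' are skipped.
    print $fh $_ unless /^>/;
}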

Usage: perl script.pl inFile

And as a one-liner:

perl -ne 'open $fh, ">", "file" . ++$n . ".txt" unless ( $. - 1 ) % 8; print $fh $_ unless /^>/' inFile
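For comparison, roughly the same behaviour can be sketched in awk from the shell. This is not the author's script. It counts sequences rather than input lines, so it does not depend on headers and sequences strictly alternating, but it still assumes one line per sequence. The function name is hypothetical; the file1.txt, file2.txt, ... naming mirrors the Perl above.

```shell
# Write sequences (no headers) 4 per file into file1.txt, file2.txt, ...
# in the current directory. Assumption: one line per sequence.
split_to_named_files() {
    input=$1    # hypothetical input path, e.g. inFile
    awk '
        /^>/ || NF == 0 { next }          # skip headers and blank lines
        {
            if (n % 4 == 0) {             # every 4th sequence starts a new file
                if (out != "") close(out)
                out = "file" (n / 4 + 1) ".txt"
            }
            print > out
            n++
        }
    ' "$input"
}
```

Changing the string that builds out is enough to give the files other names.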
11.0 years ago

Linearize the FASTA records with awk (one header line and one sequence line per record), then use the Linux command 'split':

$ awk '/^>/ {printf("%s%s\n",(N==0?"":"\n"),$0); ++N; next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\
split -l 8 - FASTA.

The OP wants only the "extracted" sequences in the files--no headers.
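If the sequences may be wrapped over several lines, linearizing and dropping the headers can be combined in one pass. The following is a sketch, not the command posted above; the function name and arguments are hypothetical.

```shell
# Join wrapped sequence lines, discard headers, and write 4 sequences per file.
split_linearized_sequences() {
    input=$1    # hypothetical input path, e.g. input.fasta
    prefix=$2   # hypothetical output prefix, e.g. FASTA.
    awk '
        /^>/ { if (seq != "") print seq; seq = ""; next }  # header: flush previous sequence
        { seq = seq $0 }                                   # append a wrapped sequence line
        END { if (seq != "") print seq }                   # flush the last sequence
    ' "$input" | split -l 4 - "$prefix"
}
```

Each output file then holds four complete sequences, one per line, regardless of how the input was wrapped.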
