How to split sequence in fasta file based on long runs of Ns?
3
0
Entering edit mode
4.7 years ago
stacy734 ▴ 40

Hi all, I have a fasta file with some long runs of Ns. I would like to split the sequences with strings of 100 Ns or longer by replacing these strings with a header: >some_string It isn't necessary to have the header strings be unique, I can renumber the headers afterwards. One issue: the fasta is normally formatted for 70 nt in a line, so the 100 Ns might be spread over multiple lines. Can anyone suggest a unix script or software package? Thanks for any advice. Stacy

fasta unix • 2.0k views
ADD COMMENT
0
Entering edit mode

Thank you! This is very useful.

With appreciation, Stacy

ADD REPLY
0
Entering edit mode

Please consider upvoting and accepting answers if your problem was solved.

ADD REPLY
1
Entering edit mode
4.7 years ago
zubenel ▴ 120

If you install bioperl module Bio::SeqIO you could use the script below. Just save it as split_fasta.pl and run like ./split_fasta input.fasta output.fasta 100. In this way sequences from input.fasta file will be divided by 100 Ns and saved to file output.fasta.

If the sequence called "some_string" is splitted to 3 new sequences these are named as: "some_string_1", "some_string_2", "some_string_3".

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

# Quit unless 3 command-line arguments are provided
my $num_args = $#ARGV + 1;
if ($num_args != 3) {
    print "\nUsage: split_fasta.pl input_fasta output_fasta N_length\n";
    exit;
}
my $input_fasta  = $ARGV[0];
my $output_fasta = $ARGV[1];
my $N_length     = $ARGV[2];

# Reading the input fasta file
my $seqio_in = Bio::SeqIO->new(-file => $input_fasta, 
                               -format => "fasta" );

# Creating the output fasta file                             
my $seqio_out = Bio::SeqIO->new(-file => ">$output_fasta", 
                                -format => "fasta" );

while ( my $seq = $seqio_in->next_seq ) {

    # Split fasta sequence by Ns of length at least $N_length
    my @splitted_seqs = split /[Nn]{$N_length,}/, $seq->seq;
    my $count = 0;
    my $old_name = $seq->display_id;

    # Save original sequence to output fasta if it was not splitted
    if ($#splitted_seqs == 0) {
        $seqio_out->write_seq($seq);            
    }

    # Save splitted sequences to output fasta by adding number to sequence name
    else {
        foreach my $new_seq (@splitted_seqs) {
            my $new_name = $old_name . "_$count";
            $seq->display_id($new_name);
            $seq->seq($new_seq);
            $seqio_out->write_seq($seq);
            $count++;       
        }
    }

}
ADD COMMENT
0
Entering edit mode

This works perfectly fine on fasta files (have provided +1). But I have a fastq file, and want to split both sequences and quality at each Ns. Is it possible? Thanks

ADD REPLY
0
Entering edit mode
4.7 years ago
boaty ▴ 220

I suggest you to use SeqIO.par in biopython or read.fasta() in sequinr. Those tools will transform fasta 4 lines format into a list of object which concatenate fasta sequence line into a string. Then just use split method to cut them

ADD COMMENT
0
Entering edit mode
4.7 years ago
GenoMax 147k

Prior answer of interest: A: split fastq sequence by motif (string of Ns)

You can linearize the fasta files by using some code from @Pierre found here:

ADD COMMENT

Login before adding your answer.

Traffic: 1839 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6