How to split sequence in fasta file based on long runs of Ns?

0

Entering edit mode

5.2 years ago

stacy734 ▴ 40

Hi all, I have a fasta file with some long runs of Ns. I would like to split the sequences with strings of 100 Ns or longer by replacing these strings with a header: >some_string It isn't necessary to have the header strings be unique, I can renumber the headers afterwards. One issue: the fasta is normally formatted for 70 nt in a line, so the 100 Ns might be spread over multiple lines. Can anyone suggest a unix script or software package? Thanks for any advice. Stacy

fasta unix • 2.3k views

ADD COMMENT • link updated 3.9 years ago by shriram • 0 • written 5.2 years ago by stacy734 ▴ 40

0

Entering edit mode

Thank you! This is very useful.

With appreciation, Stacy

ADD REPLY • link 5.2 years ago by stacy734 ▴ 40

0

Entering edit mode

Please consider upvoting and accepting answers if your problem was solved.

ADD REPLY • link 5.2 years ago by ATpoint 88k

1

Entering edit mode

5.2 years ago

zubenel ▴ 120

If you install bioperl module Bio::SeqIO you could use the script below. Just save it as split_fasta.pl and run like ./split_fasta input.fasta output.fasta 100. In this way sequences from input.fasta file will be divided by 100 Ns and saved to file output.fasta.

If the sequence called "some_string" is splitted to 3 new sequences these are named as: "some_string_1", "some_string_2", "some_string_3".

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

# Quit unless 3 command-line arguments are provided
my $num_args = $#ARGV + 1;
if ($num_args != 3) {
    print "\nUsage: split_fasta.pl input_fasta output_fasta N_length\n";
    exit;
}
my $input_fasta  = $ARGV[0];
my $output_fasta = $ARGV[1];
my $N_length     = $ARGV[2];

# Reading the input fasta file
my $seqio_in = Bio::SeqIO->new(-file => $input_fasta, 
                               -format => "fasta" );

# Creating the output fasta file                             
my $seqio_out = Bio::SeqIO->new(-file => ">$output_fasta", 
                                -format => "fasta" );

while ( my $seq = $seqio_in->next_seq ) {

    # Split fasta sequence by Ns of length at least $N_length
    my @splitted_seqs = split /[Nn]{$N_length,}/, $seq->seq;
    my $count = 0;
    my $old_name = $seq->display_id;

    # Save original sequence to output fasta if it was not splitted
    if ($#splitted_seqs == 0) {
        $seqio_out->write_seq($seq);            
    }

    # Save splitted sequences to output fasta by adding number to sequence name
    else {
        foreach my $new_seq (@splitted_seqs) {
            my $new_name = $old_name . "_$count";
            $seq->display_id($new_name);
            $seq->seq($new_seq);
            $seqio_out->write_seq($seq);
            $count++;       
        }
    }

}

ADD COMMENT • link 5.2 years ago by zubenel ▴ 120

0

Entering edit mode

This works perfectly fine on fasta files (have provided +1). But I have a fastq file, and want to split both sequences and quality at each Ns. Is it possible? Thanks

ADD REPLY • link 3.9 years ago by shriram • 0

0

Entering edit mode

5.2 years ago

boaty ▴ 220

I suggest you to use SeqIO.par in biopython or read.fasta() in sequinr. Those tools will transform fasta 4 lines format into a list of object which concatenate fasta sequence line into a string. Then just use split method to cut them

ADD COMMENT • link 5.2 years ago by boaty ▴ 220

0

Entering edit mode

5.2 years ago

GenoMax 151k

Prior answer of interest: A: split fastq sequence by motif (string of Ns)

You can linearize the fasta files by using some code from @Pierre found here:

Linearize a fasta sequence

awk -f linearizefasta.awk < input.fa

or

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa

Format back to fasta

tr "\t" "\n" < linearized.tsv

if you know your fasta header have a length < 60

tr "\t" "\n" < linearized.tsv | fold -w 60

view raw README.md hosted with ❤ by GitHub

	/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;}
	{printf("%s",$0);}
	END {printf("\n");}

view raw linearizefasta.awk hosted with ❤ by GitHub

ADD COMMENT • link 5.2 years ago by GenoMax 151k

Login before adding your answer.