Split sequence according the 'N' base
5
0
Entering edit mode
6.5 years ago
Alex ▴ 50

Hi, I have some sequence and want to split the sequence by the N base using python or biopython, but I have not good idea how to split it. The sequence like this:

>chr1
ATCGCTGATCGATNCTAGCTAGCTAG
CGTAGCTGCTAGNCGTAGCCGTAGCT

what I need like this

>chr1_1

ATCGCTGATCGAT

>chr1_2

CTAGCTAGCTAGCGTAGCTGCTAG

>chr1_3

CGTAGCCGTAGCT

have any idea about it ?

Thanks

Alex

sequence • 3.5k views
ADD COMMENT
0
Entering edit mode

What do you want to happen when you reach 100 Ns in a row?

ADD REPLY
0
Entering edit mode

Sorry,I have forget it. if there exist many N base , we also want to split it util there have 'A','T','G'or 'C' base

ADD REPLY
4
Entering edit mode
6.5 years ago

A python solution

Dependencies: Biopython

#!/usr/bin/python2.7

from Bio import SeqIO
import getopt,sys,re


def usage():
    print "Usage: split.py -i <input_scaffold_fasta> -o <output_contig_fasta>"

try:
    options, remainder=getopt.getopt(sys.argv[1:], 'i:o:h')

except getopt.GetoptError as err:
    print str(err)
    usage()
    sys.exit()

for opt, arg in options:
    if opt in ('-i'):
        input_file=arg
    if opt in ('-h'):
        usage()
    sys.exit()
    elif opt in ('-o'):
        output_file=arg

out=open(output_file, 'w')

sequence = ''.join([str(record.seq).strip() for record in SeqIO.parse(input_file, "fasta")])

m=re.sub('[nN]+','\n',sequence).split('\n')

for i in range(0,len(m)):
    out.write('>contig_'+str(i)+'\n')
    out.write(m[i]+'\n')

Usage

[comp]$ python split.py -i input_fasta_file.fa -o output_fasta.fa

Output

>contig_0
ATCGCTGATCGAT
>contig_1
CTAGCTAGCTAGCGTAGCTGCTAG
>contig_2
CGTAGCCGTAGCT
ADD COMMENT
0
Entering edit mode

Thanks, looks great

Aelx

ADD REPLY
0
Entering edit mode

Alex, you should upvote OR accept a post as answer if any one or several posts helped or answered your question

click here to know how

Remember, you can accept multiple answers, as long as they provide valid solution to your top level problem. Accepting an answer is your single most important contribution to the community, as it helps others that face the same problem you did.

ADD REPLY
4
Entering edit mode
6.5 years ago
JC 13k

just for fun:

#!/usr/bin/perl

use strict;
use warnings;

$/ = "\n>";

while (<>) {
    s/>//g;
    my ($id, @seq) = split (/\n+/, $_);
    my @slices = split (/N+/, join("", @seq));
    my $num = 0;
    foreach my $seq (@slices) {
        $num++;
        print ">$id\_$num\n";
        while ($seq) {
            print substr($seq, 0, 80), "\n";
            substr($seq, 0, 80) = '';
        }
    }
}

Run:

$ perl split_by_Ns.pl < test.fa
>chr1_1
ATCGCTGATCGAT
>chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1_3
CGTAGCCGTAGCT
ADD COMMENT
4
Entering edit mode
6.5 years ago

Here's a way that doesn't use slow Python libraries, works with repeated stretches of Ns, and should be agnostic to Python 2 or 3:

#!/usr/bin/env python

import sys
import re

not_first = False

def process_seq(header, seq):
    # reduce repeated Ns                                                                                                                                                                                                                          
    seq = re.sub(r'[^\w\s]|(.)(?=\1)', '', seq)
    # get indices of Ns                                                                                                                                                                                                                           
    match_idx = 1
    posn_idx = 0
    for match in re.finditer(r'N(.*?)', seq):
        sys.stdout.write("%s_%d\n%s\n" % (header, match_idx, seq[posn_idx:match.start()]))
        posn_idx = match.end()
        match_idx += 1
    sys.stdout.write("%s_%d\n%s\n" % (header, match_idx, seq[posn_idx:]))

for line in sys.stdin:
    if line.startswith('>'):
        if not_first:
            process_seq(header, seq)
        header = line.rstrip()
        seq = ''
        not_first = True
    else:
        seq += line.rstrip()

if not_first:
    process_seq(header, seq)

Output:

$ ./split_on_Ns.py < input.fa
>chr1_1
ATCGCTGATCGAT
>chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1_3
CGTAGCGTAGC
ADD COMMENT
3
Entering edit mode
6.5 years ago

using awk: can handle multiple Ns, case insensitive for Ns:

$  awk '{gsub("[Nn]+","\n>\n");}1' test.fa | awk '/^>/ {if ($0 == ">") {$0=prev} prev=$0}1' | awk '/^>/ {getline seq} {if(seq!="") {print $0"\n"seq}}' | awk '(/^>/ && a[$0]++) {$0=$0"_"a[$0]}1' 
>chr1
ATCGCTGATCGAT
>chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1_3
CGTAGCCGTAGCT
>chr2
CCTGA
>chr2_2
GTAGT
>chr3
ATG
>chr4
C
>chr4_2
T

input: $ cat test.fa

>chr1
ATCGCTGATCGATNCTAGCTAGCTAGCGTAGCTGCTAGNCGTAGCCGTAGCT
>chr2
CCTGANGTAGT
>chr3
ATGNNNNNN
>chr4
NNNNCnT

make sure that sequences are single line. If they are not, use following command instead of above line:

$ seqkit seq -w0 test.fa | awk '{gsub("[Nn]+","\n>\n");}1' | awk '/^>/ {if ($0 == ">") {$0=prev} prev=$0}1' | awk '/^>/ {getline seq} {if(seq!="") {print $0"\n"seq}}' | awk '(/^>/ && a[$0]++) {$0=$0"_"a[$0]}1'

Download seqkit from here.

ADD COMMENT
1
Entering edit mode

Another way producing same result of yours.

seqkit locate -P -p '[^Nn]+' seqs.fa  --gtf > contigs.gtf

seqkit subseq --gtf contigs.gtf seqs.fa | sed -r 's/_.+//' | seqkit rename | seqkit seq -i
ADD REPLY
2
Entering edit mode
6.5 years ago
microfuge ★ 1.9k

Another approach based on UCSC utils (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/) and bedtools . Only suitable for bigger files.

  • Convert fasta to twobit

faToTwoBit genome.fa genome.2bit

  • Get the locations of Ns in bed form

twoBitInfo -nBed genome.2bit N.bed

  • Use bedtools to get compliment of Ns and bedtools getfasta to get fasta for the compliments (The genome file can be generated with samtools faidx and cut ) samtools faidx genome.fa; cut -f 1,2 genome.fa.fai > genome.genome

bedtools complement -i N.bed -g genome.genome |bedtools getfasta -fi genome.fa -bed -

Advantage being speed and also getting the coordinates The output -

  • >chr1:0-13
  • ATCGCTGATCGAT
  • >chr1:14-38
  • CTAGCTAGCTAGCGTAGCTGCTAG
  • >chr1:39-52
  • CGTAGCCGTAGCT
ADD COMMENT

Login before adding your answer.

Traffic: 2691 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6