Merge contigs in fasta file
1
2
Entering edit mode
5.6 years ago
Rubal ▴ 350

Hello All,

I am running variant calling on some species whose reference genomes have a very high number of contigs (sometimes >400,000). The variant caller I am using splits the job by the number of chromosomes, and is overwhelmed when this number is too high. Therefore I would like to concatenate the contigs for a given species reference fasta file into ~30 contigs.

I believe I could use code such as below to merge all the contigs into one:

> grep -v "^>" test.fasta | awk 'BEGIN { ORS=""; print
> ">Sequence_name\n" } { print }' > new​.fasta

However I would like to merge them into ~30 contigs so the process can still be parallelised. I would also like to insert 1000 'N' characters between each of the merged contigs within these merged contigs, to avoid mapping issues that could be caused by merging contig sequences from different parts of the genome.

Does anyone have any advice for how to do this or know of any application that could do something similar?

Thanks in advance for your help.

fasta merge contigs • 6.4k views
ADD COMMENT
5
Entering edit mode
5.6 years ago

You can try this awk solution. But first you have to calculate how many contigs should be merge into a single sequence.

Save this script as merge_contigs.awk

function print_n(n)
 {
    text = ""
    for(j=0;j<n;j++) text=text"N"
    return text
 }

BEGIN {n=1} 
$0 ~ "^>" { 
    if(n<=i) {
        header = header""substr($0,2)" ";
        if(n > 1) seq=seq""print_n(1000)
        n++;
    } 
    else {
        print ">"header"\n"seq; 
        n=1; 
        header=substr($0,2); 
        seq=""
    } 

    next;
} 
{seq=seq$0} 
END {print ">"header"\n"seq;}

Run it like this:

$ awk -v i=140 -f merge_contigs.awk input.fa > output.fa

Set the value for -v i to the number of contigs that should get merged into a single sequence.

fin swimmer

ADD COMMENT
0
Entering edit mode

Thanks for this. Unfortunately when I save the file and try to run the script as suggested it only shows the awk help manual. It doesn't seem to recognise the -i option? The first lines of what is printed:

Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:      GNU long options:
    -f progfile     --file=progfile
    -F fs           --field-separator=fs
    -v var=val      --assign=var=val
ADD REPLY
0
Entering edit mode

Ooops, my mistake. It has to be -v i=140.

ADD REPLY
0
Entering edit mode

brilliant thank you

ADD REPLY
0
Entering edit mode

this script takes ages to run, is this normal? (i have a very large fragmented genome with 37mil contigs)

ADD REPLY
0
Entering edit mode

That is a very big file, with a lot of separate entries. These lightweight approaches are probably not appropriate for such a file. Can you do anything to reduce your dataset?

It sounds like you have bigger problems than merging contigs though. I'm not sure what your organism is, or the expected genome size, but 37M contigs sounds like a horrendous assembly to me (but I work in bacteria so I may be off base).

Forgive what might seem like a patronising question, but are you sure you mean _contigs_ and not reads?

ADD REPLY
0
Entering edit mode

yes its a horrendous genome ist from the european silver fir: https://www.g3journal.org/content/9/7/2039

those are the quast results:

########
QUAST Results
########

All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

Assembly                    Abal.1_1   
# contigs (>= 0 bp)         37192295   
# contigs (>= 1000 bp)      1276678    
# contigs (>= 5000 bp)      529013     
# contigs (>= 10000 bp)     343016     
# contigs (>= 25000 bp)     145508     
# contigs (>= 50000 bp)     46234      
Total length (>= 0 bp)      18167382048
Total length (>= 1000 bp)   13017811908
Total length (>= 5000 bp)   11361640463
Total length (>= 10000 bp)  10034318481
Total length (>= 25000 bp)  6872368770 
Total length (>= 50000 bp)  3406852776 
# contigs                   1887964    
Largest contig              297427     
Total length                13450974050
GC (%)                      38.76      
N50                         25814      
N75                         9780       
L50                         139726     
L75                         348468     
# N's per 100 kbp           1703.76  
ADD REPLY
0
Entering edit mode

OP is a little bit old. If you are trying to merge contigs, you can use contig assemblers like hera. If you are trying to group like sequences, you can use CD-HIT. If you are trying to merge sequence by partial or full ID (fasta header), you can use tools like seqkit. Please open a new post with example data and expected output.

ADD REPLY
0
Entering edit mode

thanks i will do that.

ADD REPLY

Login before adding your answer.

Traffic: 2261 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6