Hello All,
I am running variant calling on some species whose reference genomes have a very high number of contigs (sometimes >400,000). The variant caller I am using splits the job by the number of chromosomes, and is overwhelmed when this number is too high. Therefore I would like to concatenate the contigs for a given species reference fasta file into ~30 contigs.
I believe I could use code such as below to merge all the contigs into one:
> grep -v "^>" test.fasta | awk 'BEGIN { ORS=""; print ">Sequence_name\n" } { print } END { print "\n" }' > new.fasta
However, I would like to merge them into ~30 contigs so the process can still be parallelised. I would also like to insert 1000 'N' characters between the original contigs inside each merged contig, to avoid mapping issues that could be caused by joining contig sequences from different parts of the genome.
Does anyone have any advice for how to do this or know of any application that could do something similar?
Thanks in advance for your help.
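One possible way to do the grouping described above, as a minimal sketch under stated assumptions rather than a tested pipeline: count the headers, work out how many contigs go into each group, then let awk join the sequences and put a run of N between them. The function name `merge_contigs`, its three arguments and the `merged_N` header names are made up for illustration. Note that the original contig names are discarded, so any variant positions would be reported in merged coordinates.

```shell
# Hypothetical helper: merge_contigs <fasta> <target number of groups> <spacer length>
merge_contigs() {
  in=$1; groups=$2; gap=$3
  # Number of contigs in the input (one header line per contig).
  total=$(grep -c '^>' "$in")
  # Contigs per merged record, rounded up, so at most $groups records come out.
  per=$(( (total + groups - 1) / groups ))
  awk -v per="$per" -v gap="$gap" '
    BEGIN { for (k = 0; k < gap; k++) spacer = spacer "N" }  # build the N spacer once
    /^>/ {
      if (n % per == 0) {
        if (n > 0) printf "\n"            # close the previous merged record
        printf ">merged_%d\n", ++g        # start a new merged record
      } else {
        printf "%s", spacer               # N spacer between contigs in the same record
      }
      n++
      next
    }
    { printf "%s", $0 }                   # sequence lines are joined without line breaks
    END { if (n > 0) printf "\n" }
  ' "$in"
}

# Example call (assumed file names): about 30 merged contigs, 1000 N between contigs
# merge_contigs test.fasta 30 1000 > new.fasta
```

Each merged sequence is written on a single line, which can be very long; if a downstream tool needs wrapped lines, the output would have to be re-wrapped afterwards. Because the group size is rounded up, the number of merged records can come out somewhat below the requested count.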
Thanks for this. Unfortunately when I save the file and try to run the script as suggested it only shows the awk help manual. It doesn't seem to recognise the -i option? The first lines of what is printed:
Oops, my mistake. It has to be -v i=140.
Brilliant, thank you.
This script takes ages to run. Is this normal? (I have a very large, fragmented genome with 37 million contigs.)
That is a very big file, with a lot of separate entries. These lightweight approaches are probably not appropriate for such a file. Can you do anything to reduce your dataset?
It sounds like you have bigger problems than merging contigs though. I'm not sure what your organism is, or the expected genome size, but 37M contigs sounds like a horrendous assembly to me (but I work in bacteria so I may be off base).
Forgive what might seem like a patronising question, but are you sure you mean _contigs_ and not reads?
Yes, it's a horrendous genome. It's from the European silver fir: https://www.g3journal.org/content/9/7/2039
These are the QUAST results:
The original post is a little old. If you are trying to merge contigs, you can use contig assemblers like hera. If you are trying to group similar sequences, you can use CD-HIT. If you are trying to merge sequences by partial or full ID (FASTA header), you can use tools like seqkit. Please open a new post with example data and expected output.
Thanks, I will do that.