Genome Assembly Sorting
2
0
6 weeks ago
Umer ▴ 130

Hi,

I have generated some genome assemblies from Nanopore data with the Flye assembler.

I have also performed 3 rounds of polishing with Racon (using Nanopore data) and 5 rounds with Pilon (using Illumina data).

Now I have looked at my FASTA files and they are not sorted (longest to shortest contig), and the contig numbering is also not in numerical order.

>contig_1
>contig_102
>contig_103
>contig_104
>contig_105
>contig_106
>contig_107
>contig_11
>contig_110
>contig_111
>contig_112
>contig_113
>contig_114
>contig_117
>contig_120
>contig_121
>contig_122
>contig_124
>contig_125
>contig_128

My Questions

  1. Do I need to rename and sort the contigs at the first step, when I get the genome assembly FASTA files?
  2. Or should I sort the final polished assembly (longest to shortest) and rename the contigs then?
  3. Is there any specific way to rename contigs?
genome assembly sorting • 537 views
1

If you look, your contigs do appear to be sorted by numerical value; it's just not a "natural" sort, because the numbers do not all have the same number of digits and are not zero-padded.

As GenoMax points out though, this rarely matters.
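To make the difference concrete, here is a minimal Python sketch comparing a plain (lexicographic) sort with a "natural" sort on a few of the contig names from the question. The helper name `natural_key` and the padding width of 3 are assumptions chosen for illustration, not something Flye or any other tool prescribes.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so that 11 sorts before 102."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

# A few contig names taken from the question.
names = ["contig_1", "contig_102", "contig_103", "contig_11", "contig_110"]

lexicographic = sorted(names)                # compared character by character
natural = sorted(names, key=natural_key)     # numeric chunks compared as integers

# With zero-padded numbers, a plain sort and a natural sort give the same order.
padded = [f"contig_{int(n.split('_')[1]):03d}" for n in natural]

print(lexicographic)
print(natural)
print(padded)
```

The plain sort reproduces the order seen in the FASTA file, while the natural sort gives 1, 11, 102, 103, 110.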

0

Hi, I understand your point, but if you look at the answer shared above, this is what I mean.

3
6 weeks ago
GenoMax 147k

This is lexicographical order --> https://stackoverflow.com/questions/45950646/what-is-lexicographical-order

Why are you bothered by it? If you want to sort your contigs by length then use --> Order contig by size

0

Hi, the reason I am concerned is this.

When I generate a samtools faidx index of the original assembly, it looks like this:

contig_1    21276   10      60  61
contig_102  5361    21653   60  61
contig_103  7061    27116   60  61
contig_104  121625  34307   60  61
contig_105  41647   157972  60  61
contig_106  59630   200326  60  61
contig_107  62762   260962  60  61
contig_11   216926  324782  60  61
contig_110  11781   545336  60  61
contig_111  1667    557326  60  61
contig_112  2011    559033  60  61
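For reference, the five columns of a .fai index are the sequence name, its length in bases, the byte offset of its first base in the FASTA file, the number of bases per line, and the number of bytes per line. The index lists sequences in the order they occur in the file. The sketch below is a rough illustration only: `FaiRecord` and `parse_fai` are hypothetical names, and the input is the first five index lines shown above. It reads those columns and ranks contigs by length without touching the FASTA itself.

```python
from typing import NamedTuple

class FaiRecord(NamedTuple):
    name: str        # sequence name
    length: int      # sequence length in bases
    offset: int      # byte offset of the first base in the FASTA file
    linebases: int   # bases per line
    linewidth: int   # bytes per line, including the newline

def parse_fai(text):
    """Parse whitespace-separated .fai lines into FaiRecord tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, offset, linebases, linewidth = line.split()
        records.append(FaiRecord(name, int(length), int(offset),
                                 int(linebases), int(linewidth)))
    return records

# First five index lines from the post above.
fai_text = """\
contig_1    21276   10      60  61
contig_102  5361    21653   60  61
contig_103  7061    27116   60  61
contig_104  121625  34307   60  61
contig_105  41647   157972  60  61
"""

records = parse_fai(fai_text)
by_length = sorted(records, key=lambda r: r.length, reverse=True)

for rec in by_length:
    print(rec.name, rec.length)
```

The index order follows the file order, so a length ranking can be read from the second column whenever it is needed.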

When I sort assembly.fasta with seqkit sort -l -r assembly.fasta > assembly_sort.fasta and then index it with samtools, the contigs are rearranged by length, like this:

contig_43  6501589  11        60  61
contig_5   5413231  6609970   60  61
contig_18  5374465  12113433  60  61
contig_2   5140464  17577483  60  61
contig_4   4441101  22803632  60  61
contig_34  4232623  27318763  60  61
contig_28  3504136  31621941  60  61
contig_17  3273534  35184491  60  61
contig_47  3179198  38512595  60  61
contig_32  2708330  41744791  60  61
contig_25  2470542  44498271  60  61

So this made me wonder whether I have to sort the assembled genome. If so, should it be done before the polishing steps, or is it also OK to do it after polishing?

1

You are using seqkit with the options to sort by length (-l) in reverse order (-r), so it is not surprising that the contigs come out in a different order. If you used -n (or just ran seqkit sort with no options), the result should look identical to your original file.

Sorting by length simply rearranges the records; the sequences themselves are not changed.
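As a rough illustration of that point, the sketch below reorders FASTA records from longest to shortest in plain Python and checks that only the order changes. It is not how seqkit is implemented; `read_fasta`, `sort_by_length`, and the toy sequences are assumptions made up for the example.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def sort_by_length(records):
    """Return records ordered from longest to shortest sequence."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)

# Toy input; real contigs would be read from a file instead.
toy_fasta = """>contig_1
ACGT
>contig_102
ACGTACGTAC
GT
>contig_11
ACGTAC
"""

records = list(read_fasta(toy_fasta))
ordered = sort_by_length(records)

for header, seq in ordered:
    print(header, len(seq))

# Same records before and after; only their order differs.
assert sorted(records) == sorted(ordered)
```

Because the records are only rearranged, the sort can be done on the final polished assembly without affecting the sequences.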

2
6 weeks ago

I'd use a tool like RagTag if you have a related genome to scaffold the contigs into pseudochromosomes.

https://github.com/malonge/RagTag

Then you don't need to worry too much about contig names.
