Question

How to find out if contig size is limited by repeats

1

4.4 years ago

lagartija ▴ 160

Hi !

I am working on metagenomics and trying to work out whether my contig size is limited by repeats (i.e. where coverage is high enough, have we reached the full genome size, or are there still repeats that complicate the assembly?).

Do you know how I could do this? So far I have tried the EMBOSS repeat-finding tools (palindrome, einverted and equicktandem), but they report many repeats everywhere, and it is very hard to tell whether the repeats at the ends of the contigs are significant.

Thank you for your help,

Assembly metagenomics • 1.1k views

ADD COMMENT • link 4.4 years ago by lagartija ▴ 160

2

Probably the best thing to do would be to take the assembly graph and view it in Bandage. If you see lots of 'bubbles' and forks, then you have repetitive regions at the ends of those particular contigs.

ADD REPLY • link 4.4 years ago by Joe 21k

0

Yes, excellent idea, I had not thought of that. A high-throughput way to do this for metagenomics would be perfect: maybe Bandage could create one image per graph from the command line, or I could get one "tangleness score" per graph.

ADD REPLY • link 4.4 years ago by lagartija ▴ 160

0

I'm sure you could parse the graph file yourself and get a breakdown of the 'tangleness' on a per-contig basis.

I can't say I've ever looked deeply into a graph file, though, so I'm not sure what's involved.

ADD REPLY • link 4.4 years ago by Joe 21k

0

Good idea. Here is what a GFA file (from SPAdes) looks like:

S ... sequences
L       73      +       5       -       99M
L       3277    +       5       -       99M
L       325     -       338     -       99M
...
P       NODE_1_length_3725_cov_13.774407_1      271+    *
P       NODE_2_length_2509_cov_13.712863_1      307+    *
P       NODE_3_length_2404_cov_11.192191_1      342+    *
P       NODE_4_length_2263_cov_11.379852_1      340+,325+       *
P       NODE_5_length_2136_cov_11.691213_1      336+    *
...

Here I could say that NODE_1_length_3725_cov_13.774407_1 has a "tangleness" of 1, whereas NODE_4_length_2263_cov_11.379852_1 has a "tangleness" of 2, i.e. the number of graph segments in each contig's path. This gives me an overall picture of the complexity of the assembly, but I can't tell whether the complexity comes from repeats at the ends or in the middle of my contigs.
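A minimal sketch of that count, assuming a tab-separated GFA1 file in which each `P` line lists a contig's segments as a comma-separated field. The function name `segments_per_path` is made up for illustration, and the two sample lines are copied from the excerpt above.

```python
def segments_per_path(gfa_lines):
    """Map each path (contig) name to the number of segments it traverses."""
    counts = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # P lines: record type, path name, comma-separated oriented segments, overlaps
        if fields[0] == "P" and len(fields) >= 3:
            counts[fields[1]] = len(fields[2].split(","))
    return counts


# Two P lines from the excerpt above, re-typed with tab separators.
sample = [
    "P\tNODE_1_length_3725_cov_13.774407_1\t271+\t*",
    "P\tNODE_4_length_2263_cov_11.379852_1\t340+,325+\t*",
]

for name, n_segments in segments_per_path(sample).items():
    print(name, n_segments)
# NODE_1_length_3725_cov_13.774407_1 1
# NODE_4_length_2263_cov_11.379852_1 2
```

For a real assembly the same function can be fed an open file handle instead of the `sample` list, since it only iterates over lines.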

ADD REPLY • link updated 4.3 years ago by Joe 21k • written 4.3 years ago by lagartija ▴ 160

0

Well, my intuition (which could be wrong) is that if a repeat occurred in the middle of a contig, it wouldn't form a single contig in the first place, because the bubble would cause the contig to be split. So by definition, the breaks must be at the ends of the contigs, no?
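One way to check that intuition from the graph file is to count how many links touch the two outer ends of each contig's path: several links at an end suggest a fork, none suggest a dead end. The sketch below assumes tab-separated GFA1 `L` and `P` lines; the helper names `parse_gfa` and `end_degrees` and the path `example_contig` are hypothetical, and the three link lines are the ones from the excerpt earlier in the thread.

```python
from collections import defaultdict


def parse_gfa(lines):
    """Return (ends, paths) from GFA1 lines.

    ends maps (segment, side) to the set of neighbouring (segment, side)
    ends, where side is "L" (segment start) or "R" (segment end).
    paths maps a path name to its list of (segment, orientation) steps.
    """
    ends = defaultdict(set)
    paths = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "L" and len(fields) >= 5:
            src, src_orient, dst, dst_orient = fields[1:5]
            # A link leaves the source through its right end when the source
            # is read forward ("+"), and through its left end when reversed.
            src_end = (src, "R" if src_orient == "+" else "L")
            # It enters the target through the opposite side.
            dst_end = (dst, "L" if dst_orient == "+" else "R")
            ends[src_end].add(dst_end)
            ends[dst_end].add(src_end)
        elif fields[0] == "P" and len(fields) >= 3:
            paths[fields[1]] = [(s[:-1], s[-1]) for s in fields[2].split(",")]
    return ends, paths


def end_degrees(ends, steps):
    """Number of links at the (left, right) outer ends of a path."""
    first_seg, first_orient = steps[0]
    last_seg, last_orient = steps[-1]
    left = (first_seg, "L" if first_orient == "+" else "R")
    right = (last_seg, "R" if last_orient == "+" else "L")
    return len(ends.get(left, ())), len(ends.get(right, ()))


sample = [
    "L\t73\t+\t5\t-\t99M",
    "L\t3277\t+\t5\t-\t99M",
    "L\t325\t-\t338\t-\t99M",
    "P\texample_contig\t5+\t*",  # hypothetical path, for illustration only
]

ends, paths = parse_gfa(sample)
for name, steps in paths.items():
    print(name, end_degrees(ends, steps))
# example_contig (0, 2)
```

Here the end of segment 5 is reached by links from both 73 and 3277, so the right end of `example_contig` has degree 2, which is the kind of fork that would be expected at a repeat boundary. Treat the numbers as a rough screen only, since a fork can also come from strain variation or sequencing errors.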

ADD REPLY • link 4.3 years ago by Joe 21k

0

Looks like you won't even have to write your own parser: https://github.com/ggonnella/gfapy

Just wrap some analysis logic around it and you should be off to the races.

ADD REPLY • link 4.3 years ago by Joe 21k