4.4 years ago
lagartija
Hi!
I am doing metagenomics and I am trying to work out whether my contig sizes are limited by repeats (i.e. if the coverage is high enough, have we reached the genome size, or are there still repeats that complicate the assembly?).
Do you know how I could do this? For the moment I have tried the EMBOSS tools to find repeats (palindrome, einverted and equicktandem), but I get a lot of repeats everywhere and it is very hard to know whether the repeats at the ends of the contigs are significant.
Thank you for your help,
Probably the best thing to do would be to take the assembly graph and view it in Bandage. If you see lots of 'bubbles' and forks, then you have repetitive regions at the ends of those particular contigs.
Yes, excellent idea, I did not think about that. If there was a way to make that high-throughput for metagenomics it would be perfect: maybe a way for Bandage to create one image per graph from the command line, or a way I could get one "tangleness score" per graph.
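On the one-image-per-graph idea: Bandage can be driven from the command line, so a small wrapper script could loop over many graphs. Below is a minimal sketch, assuming your Bandage build provides the `image` subcommand (`Bandage image <graph> <output>`) and that the executable is on your PATH under the name `Bandage`. The function names and the directory layout are made up for illustration.

```python
import subprocess
from pathlib import Path


def bandage_image_command(graph_path, out_dir):
    """Build the argv list for rendering one graph to a PNG.

    Assumes the `Bandage image <graph> <output>` form of the CLI.
    """
    graph_path = Path(graph_path)
    out_png = Path(out_dir) / (graph_path.stem + ".png")
    return ["Bandage", "image", str(graph_path), str(out_png)]


def render_all(graph_dir, out_dir):
    """Render every *.gfa file in graph_dir to out_dir (hypothetical layout)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for graph_path in sorted(Path(graph_dir).glob("*.gfa")):
        # check=True stops the loop if Bandage exits with an error
        subprocess.run(bandage_image_command(graph_path, out_dir), check=True)
```

Calling `render_all("graphs", "images")` would then write one PNG per GFA file. This only gives pictures, not a score, so it is probably most useful for spot-checking a handful of bins.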
You could parse the graph file yourself I'm sure and get a breakdown of the 'tangleness' on a per-contig basis.
I can't say I've ever looked deep into a graph file though so I'm not sure what's involved.
Good idea. Here is what a GFA file (from SPAdes) looks like:
I could say here that NODE_1_length_3725_cov_13.774407_1 has a "tangleness" of 1, whereas NODE_4_length_2263_cov_11.379852_1 has a "tangleness" of 2. This gives me an overall picture of the complexity of the assembly, but I can't tell whether the complexity comes from repeats at the end or in the middle of my assembly.
Well, my intuition (which could be wrong) is that if the repeats occurred in the middle of a contig, it wouldn't form a contig in the first place, as the bubble would cause the contig to be split. So by definition, the breaks must be at the ends of the contigs, no?
Looks like you won't even have to write your own parser: https://github.com/ggonnella/gfapy
Just wrap some analysis logic around it and you should be off to the races.
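For what it's worth, here is a rough sketch of the kind of analysis logic that could sit on top of a graph file. It does not use gfapy; it reads GFA1 text directly with plain Python, so nothing here depends on that library's API. It assumes "tangleness" means the number of link (`L`) lines touching a segment, counted separately for the left and right end, which is my reading of the idea above rather than an established metric. The function name and the toy graph are invented for illustration.

```python
def end_link_counts(gfa_lines):
    """Count the links attached to each end of each segment in GFA1 text.

    Returns {segment_name: {"L": n_left, "R": n_right}}.
    """
    counts = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            # Register the segment even if no link ever touches it
            counts.setdefault(fields[1], {"L": 0, "R": 0})
        elif fields[0] == "L" and len(fields) >= 5:
            from_seg, from_orient, to_seg, to_orient = fields[1:5]
            # A link leaves the right end of a '+' segment (left end if '-')
            from_end = "R" if from_orient == "+" else "L"
            # ... and enters the left end of a '+' segment (right end if '-')
            to_end = "L" if to_orient == "+" else "R"
            counts.setdefault(from_seg, {"L": 0, "R": 0})[from_end] += 1
            counts.setdefault(to_seg, {"L": 0, "R": 0})[to_end] += 1
    return counts


# Toy graph (made up): segment a forks into b and c
toy = [
    "S\ta\t*",
    "S\tb\t*",
    "S\tc\t*",
    "L\ta\t+\tb\t+\t0M",
    "L\ta\t+\tc\t+\t0M",
]
toy_counts = end_link_counts(toy)
forked = sorted(name for name, ends in toy_counts.items()
                if max(ends.values()) > 1)
```

An end with more than one link is a fork, and an end with zero links is a dead end, so summing or averaging these counts per graph would give one crude "tangleness score" per assembly. With a real file you would pass `open("assembly_graph.gfa")` (a placeholder path) instead of the toy list.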