6.7 years ago
FAST_GENOME
Dear All, What are the drawbacks of mapping reads to a very fragmented reference genome? The genome is about 200 Mb, in ~10,200 contigs. What number of contigs is acceptable? Could you please give me some pointers? Thanks.
AL
One problem is read pairs whose two mates map to different contigs (contigs that follow each other in the real genome but could not be joined in the assembly because of missing data, repeats, etc.). Such pairs get labeled as not properly paired, which can skew your mapping statistics. Of course, these cross-contig pairs can also give you information to anchor the two contigs together.
Another problem with fragmented references is that you have no meaningful topology information, e.g. gene order.
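To make the cross-contig pair idea concrete, here is a minimal sketch that counts, per pair of contigs, how many read pairs have their mates on different contigs. It reads plain SAM text lines and uses only the standard SAM columns (FLAG, RNAME, RNEXT). The function name `cross_contig_links` and the toy records are made up for illustration. A real scaffolding tool would also look at orientation, insert size and mapping quality.

```python
from collections import Counter

def cross_contig_links(sam_lines):
    """Count read pairs whose two mates map to different contigs.

    Each pair is counted once, via its first-in-pair record.
    """
    links = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        rname, rnext = fields[2], fields[6]
        # keep only paired reads, and only the first mate of each pair
        if not flag & 0x1 or not flag & 0x40:
            continue
        # skip unmapped reads/mates and secondary/supplementary records
        if flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue
        # "=" means the mate is on the same contig
        if rnext in ("=", "*") or rnext == rname:
            continue
        links[tuple(sorted((rname, rnext)))] += 1
    return links

# Toy SAM records (hypothetical contig and read names)
toy_sam = [
    "@SQ\tSN:contigA\tLN:20000",
    "r1\t97\tcontigA\t100\t60\t50M\tcontigB\t200\t0\t*\t*",
    "r1\t145\tcontigB\t200\t60\t50M\tcontigA\t100\t0\t*\t*",
    "r2\t99\tcontigA\t300\t60\t50M\t=\t450\t200\t*\t*",
    "r3\t97\tcontigB\t500\t60\t50M\tcontigA\t900\t0\t*\t*",
    "r4\t73\tcontigC\t10\t60\t50M\t=\t10\t0\t*\t*",
]
links = cross_contig_links(toy_sam)
```

Contig pairs supported by many such links are candidates for being neighbours in the real genome. A handful of links can just as well come from repeats or chimeric fragments.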
In general, what you want is a low number of long contigs/scaffolds. Can you run e.g. QUAST to get some more descriptive statistics for your reference? 200 Mb in ~10k contigs means your contigs have an average length of about 20 kbp. The question now is what the length distribution looks like. It is also important that your reference covers enough of the gene space; otherwise, what can you do with it? You could run BUSCO to check which core genes are present in the reference.
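As a rough illustration of why the average alone says little, here is a small sketch that computes the mean contig length and the N50 from a list of contig lengths. The helper names `mean_length` and `n50` and the toy lengths are made up for the example. QUAST reports these and many other statistics for you, so this only shows what the numbers mean.

```python
def mean_length(lengths):
    """Average contig length."""
    return sum(lengths) / len(lengths)

def n50(lengths):
    """Length L such that contigs of length >= L contain
    at least half of all assembled bases."""
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise ValueError("empty list of contig lengths")

# Numbers from the question: ~200 Mb in ~10,200 contigs
average_from_question = 200_000_000 / 10_200  # a bit under 20 kbp

# Two toy assemblies with the same total size and contig count,
# hence the same mean, but very different length distributions
even = [20_000] * 10
skewed = [150_000] + [50_000 // 9] * 8 + [50_000 - 8 * (50_000 // 9)]
```

Both toy assemblies have the same mean contig length, but in the skewed one a single long contig holds most of the bases and the rest are short fragments, which is why the length distribution matters more than the contig count.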
Thank you so much for your reply
None on the face of it, if that is what you have to work with. You should get alignments in any case. If there are duplicated regions (that have not been cleaned up), then reads will multi-map to those regions when in reality they may come from a unique region. If parts of the genome are missing, then reads from those parts will not map. You may also see discordant alignments (if you have PE reads) where parts of the genome are split across two contigs when they should be in one.
Thank you so much for your reply