I have several questions regarding the scaffolds and contigs.
Are they pieces of DNA that are assembled into contigs and scaffolds but cannot be confidently combined with chromosome assembly?
Do they already have corresponding sequences in the assembled chromosomes, or is their sequence composition completely different from that of the chromosomes already assembled?
And finally, should we remove them from the reference genome before aligning?
Any continuous piece of DNA that is obtained by reliably overlapping shorter reads can be considered a contig. That means that a 200 bp stretch of continuous DNA would be a contig, but it would take a much longer piece to call it a scaffold. I don't know if there is a formal cutoff at which a contig becomes a scaffold, but let's just say that a scaffold is definitely a contig, while the reverse is not necessarily true.
Not sure I understand this question. If you already have a fully assembled chromosome and scaffold/contigs are not in it, that would mean they are potentially parts of a different chromosome. If you are talking about a chromosome from a reference assembly, then most scaffolds/contigs should be in it assuming an assembly without contamination.
It won't harm the alignment to the reference genome whether you remove them or not, though the final aligned fraction will be smaller if these scaffolds/contigs have no matches in the reference.
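If you do decide to drop them before aligning, here is a minimal sketch of one way to do it in plain Python. The helper names (`read_fasta`, `keep_named`), the toy sequences, and the record name `scaffold_demo_1` are all made up for illustration, and I am assuming the sequence name is the first whitespace-separated token of each FASTA header.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def keep_named(records, wanted):
    """Keep only records whose name (first header token) is in `wanted`."""
    for header, seq in records:
        tokens = header.split()
        name = tokens[0] if tokens else ""
        if name in wanted:
            yield header, seq

# Toy reference: two "chromosomes" and one hypothetical unplaced scaffold.
toy = ">1 chromosome\nACGT\nACGT\n>scaffold_demo_1 scaffold\nGGCC\n>MT\nTTAA\n"
kept = list(keep_named(read_fasta(io.StringIO(toy)), {"1", "MT"}))

for header, seq in kept:
    print(f">{header}\n{seq}")
```

For a real reference you would pass an open file instead of the `io.StringIO` toy and list the chromosome names you want to keep in `wanted`.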
Thanks for the reply. My second question is actually about the fasta entries in the "nonchromosomal" file in the following link: ftp://ftp.ensembl.org/pub/release-98/fasta/homo_sapiens/dna/
I am assuming these are scaffolds/contigs that are not part of the assembled chromosomes. Is that correct? If so, what could these sequences be? Are they actually part of the chromosomes but not confidently assembled? Can parts of these sequences be the same as the ones in assembled chromosomes?
In that directory you referenced there is a README file, and it says inside of it:
Non-chromosomal assembly sequences: e.g. mitochondrial genome, sequence contigs not yet mapped on chromosomes
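To see what is actually in a file like that, you can list the record names and their lengths. This is only a rough sketch: `sequence_lengths` is a hypothetical helper, and the toy headers and sequences below are invented, not copied from the Ensembl file.

```python
import io

def sequence_lengths(handle):
    """Map each FASTA record name (first header token) to its sequence length."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence lines are summed without keeping them in memory.
            lengths[name] += len(line)
    return lengths

# Invented example records standing in for a non-chromosomal FASTA file.
toy = ">MT example\nACGTAC\nGT\n>contig_demo_1 example\nGGCCA\n"
lengths = sequence_lengths(io.StringIO(toy))

# Print the longest records first.
for name, n in sorted(lengths.items(), key=lambda kv: -kv[1]):
    print(name, n)
```

On the real file you would open it (with `gzip.open(path, "rt")` if it is compressed) and pass the handle in place of the toy string.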
I read that, but what does "not yet mapped" mean? That still leaves my questions unanswered:
Are they actually part of the chromosomes (are there gaps in the current chromosome assembly to which these sequences could later be confidently mapped)? Can parts of these sequences be the same as the ones in the assembled chromosomes?