7 months ago
njornet
I want to use sequencing data for clinical diagnosis. I use minimap2 to align and qualimap, among other tools, to report on the alignment. If no reference is provided, my pipeline downloads one: I use ncbi-genome-download to get the GRCh38.p14 (GCF_000001405.40) assembly, which contains a number of unlocalized scaffolds. Should I get rid of these before aligning or keep them? I won't use these regions, as I only want to detect variants in already annotated genes.
Thank you!
Yes, you can remove those if you are only interested in annotated genes.
OK, thank you! Do you know if there is an easy way of doing so with ncbi-genome-download, or do I need to use another tool?
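If the filtering is done as a separate step after the download, it could look roughly like the sketch below. It assumes RefSeq-style headers in which assembled chromosomes and the mitochondrion carry `NC_` accessions while scaffolds carry other prefixes such as `NT_` or `NW_`; check this against the headers in your file first. The function name and file names are hypothetical.

```python
def keep_primary_records(lines, prefix="NC_"):
    """Yield only the FASTA lines of records whose ID starts with `prefix`.

    Assumption: primary sequences are identified by the accession prefix.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Decide once per record, based on the header line.
            keep = line[1:].startswith(prefix)
        if keep:
            yield line


# Hypothetical usage (file names are placeholders):
# with open("reference.fna") as src, open("primary_only.fna", "w") as dst:
#     dst.writelines(keep_primary_records(src))
```

The function streams line by line, so the whole genome never has to be held in memory.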
These regions are present in the genome, and removing them might incorrectly redirect alignments to known genes. I don't see why you would remove them. Align to everything, then subset to known genes.
OP is working with long reads, so the chance of reads aligning to unlocalized scaffolds is relatively small. Some testing may be sufficient to check this. If reads align preferentially to unlocalized scaffolds, they are not likely to be of any clinical use.
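One way to run that check is to summarise the output of `samtools idxstats`, which prints one tab-separated line per reference sequence (name, length, mapped count, unmapped count). The sketch below is only an illustration: the function name is hypothetical, and it assumes that primary sequences can be recognised by an `NC_` accession prefix, which should be adapted to the naming in your reference.

```python
def scaffold_read_share(idxstats_text, is_primary=lambda name: name.startswith("NC_")):
    """Summarise mapped reads on primary sequences versus everything else.

    `idxstats_text` is the text printed by `samtools idxstats`.
    """
    primary = 0
    other = 0
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":
            # The final "*" line only reports unplaced, unmapped reads.
            continue
        if is_primary(name):
            primary += int(mapped)
        else:
            other += int(mapped)
    total = primary + other
    return {
        "primary": primary,
        "other": other,
        "other_fraction": other / total if total else 0.0,
    }
```

A small `other_fraction` would suggest the scaffolds are not pulling many reads away from the chromosomes. Read counts ignore contig length, so a per-contig depth report is still worth a look.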
njornet: Do you have a specific reason for this? Technically, as ATPoint says, you should leave those in.
I don't want reads to be incorrectly aligned to these scaffolds and lose information about relevant regions, but as you said, that probably won't happen with long reads. That said, I aligned one run we did a while ago (with the old chemistry) and got much more depth in some of these scaffolds. I would guess that's not abnormal, but I don't know.
Another thing: I have a pipeline that takes a reference genome as input, or downloads it as explained in the post if none is provided. I want to select the regions of interest using a BED file with samtools view, but the chromosome/sequence names in the BED file must match those in the reference file. However, the chromosome names in the FASTA files differ from reference to reference. I wanted to modify the reference file by changing header lines containing the pattern "chromosome {number}", assuming all reference files follow this pattern. That would rename the chromosomes from whatever they are called in that reference file to "chr{number}", for example, which is what I have in my BED file. The problem is that the unlocalized scaffolds and extra sequences also contain the pattern "chromosome {number}", so those lines would also be changed to "chr{number}". I don't know if the explanation was clear, probably not... Maybe I'm over-complicating things and I should just manually modify the BED or the reference file, I don't know.
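A stricter pattern can keep the scaffolds out of the renaming. The sketch below assumes RefSeq-style headers such as ">NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly", where the chromosome label of an assembled chromosome is followed directly by a comma, whereas scaffold headers continue with text such as "unlocalized genomic scaffold". Other references may not follow this layout, so treat both the pattern and the function name as hypothetical.

```python
import re

# The comma right after the label is what excludes headers like
# "... chromosome 1 unlocalized genomic scaffold, ...".
# Assumption: assembled chromosomes use NC_ accessions.
PRIMARY_HEADER = re.compile(r"^>NC_\S+ .*\bchromosome (\d+|X|Y),")


def rename_header(line):
    """Return '>chrN' for assembled-chromosome headers, else the line unchanged."""
    match = PRIMARY_HEADER.match(line)
    if match:
        return f">chr{match.group(1)}\n"
    return line


# Hypothetical usage (file names are placeholders):
# with open("reference.fna") as src, open("renamed.fna", "w") as dst:
#     dst.writelines(rename_header(line) for line in src)
```

Sequence lines, scaffold headers and the mitochondrial header pass through untouched. The same header match could instead be used to build an accession-to-"chr" table and rewrite the BED file, leaving the reference as downloaded.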