I have a reference genome and a bunch of sequenced and assembled genomes of the same species (human). The sample genomes are lifted over to allow comparison with the reference genome. Some sites of known insertions have already been masked by repeatmasker.
I would like to detect novel insertions using BEDtools.
Please do not add answers unless you're answering the top level question. Instead, use Add Comment or Add Reply as appropriate.
"My advisor told me" is not a good enough reason to stick to a tool. Have you done your research to figure out if the tool can do what you need it to do, and how to do it as well as the limitations the tool poses?
No I have not. He said it's all I need. Would bedtools not be a good choice then? Or would it work better under certain conditions? If I were to choose bedtools how would I do it?(I'm not sure if he would accept something else)
It is unclear what data you even have especially the format. Do you have BED files, BAM files, Fastq? In general BEDtools is not among the standard tools for insertion search.
I have a BED file for the reference, FASTA for reference genome, BAM for other genomes. I'm not sure if there are BED files available for the other genomes, I think so but I'll have to check since I haven't received all the files yet. Assuming I had BED files for all, how would I proceed? What if I didn't have BED files?
I am not an expert in comparing genomes but I am reasonably sure that comparing intervals is (probably?) not accurate. Shouldn't you do like global alignments or local realignments of interesting regions? Intervals can be influenced by a lot of factors such as stretches of repetitive nucleotides or Ns. If variants indeed exist should be determine by comparing sequences, not intervals.
Hi,
I'm not 100% sure that I undertand what you're trying to do, but in any case bedtools would not be useful here because a bed file wouldn't contain the information you are looking for.
I am assuming that when you say "genomes are lifted over" you mean that they were mapped/aligned to the reference. If that's the case, then you should have the mapping result which could be of various formats, depending on the alignment software used (e.g. sam/bam or one of the Blast outputs). What you are looking for are alignment gaps, or in other words cases where the edges of a sample contig were mapped to the reference, but some "bulges" occur in the middle. You want the sequences of those bulges.
I had a similar problem lately and created this script. You are welcome to use it, but this will only work if you have sam/bam file as input. Also, it prints out fasta sequences - not sure if that's what you want.
Does this make sense?
I use a script like yours to find previously aligned sequence data to find the start and end positions of the alignments made.
Then afterwards maybe using another script/next step of last one, get the start:end positions + sequences of the areas not shown, the bulges.
And I have the extracted bulge sequences I can then closely examine them through other means like a prediction program followed by experimental validation.
Is that what you mean? Is there something I'm missing?
Well, it's been a while and I'm not sure where the script I linked to is... However, you can look at this one instead. It expects a Minimap2 PAF file as input. Hope this helps.
Just curious - is there a reason you're looking for solutions using bedtools?
My advisor told me it's what I should use.
Please do not add answers unless you're answering the top level question. Instead, use
Add Comment
orAdd Reply
as appropriate."My advisor told me" is not a good enough reason to stick to a tool. Have you done your research to figure out if the tool can do what you need it to do, and how to do it as well as the limitations the tool poses?
No I have not. He said it's all I need. Would bedtools not be a good choice then? Or would it work better under certain conditions? If I were to choose bedtools how would I do it?(I'm not sure if he would accept something else)
It is unclear what data you even have especially the format. Do you have BED files, BAM files, Fastq? In general BEDtools is not among the standard tools for insertion search.
I have a BED file for the reference, FASTA for reference genome, BAM for other genomes. I'm not sure if there are BED files available for the other genomes, I think so but I'll have to check since I haven't received all the files yet. Assuming I had BED files for all, how would I proceed? What if I didn't have BED files?
In case, you have annotation (BED format) for each genomes. you can use bedtools to compare between genomes. like:
intersect
,I am not an expert in comparing genomes but I am reasonably sure that comparing intervals is (probably?) not accurate. Shouldn't you do like global alignments or local realignments of interesting regions? Intervals can be influenced by a lot of factors such as stretches of repetitive nucleotides or Ns. If variants indeed exist should be determine by comparing sequences, not intervals.