Hi guys,
I have an insertion in my genome and I want to identify the insertion site. As far as I understand I should find the read pairs where one read is mapped to my insertion sequence, but mate is unmapped. I found these reads using samtools:
samtools view -F 4 -f 8 alignment.sam > alignment_filtered.sam
Now I need to find the unmapped mates. How can I do that? As far as I understand I should then 'de novo' build the insertion site sequences (left and rigth from my insertion) using the unmapped mates. I would be grateful if anyone could suggest me the tool for that and also comment on the whole procedure I am using.
Thank you in advance
Thank you very much Brian! Do you know how could I build the insertion cite sequences using the unmapped reads? Have you ever worked with any de novo alignment tools? Or maybe it is better to map these reads to the reference genome? Thanks in advance.
I would do this:
That will give you the insertion and its genomic context assembled together. Alternatively, rather than assembling, you could re-map the insert-mapped reads to the reference:
Running the generated shellscript bs.sh will generate a sorted, indexed bam file that you can observe in a genome browser (as suggested by Genomax) such as IGV, which will clearly show the exact site of the insertion.
Thank you very much, Brian! I just realized that this is you, who developed this aligner! Amazing! :) May I ask you one more question? As far as I understand bbmap is developed for short reads and my reads are length of 100. Is it ok?
bbmap will align reads of any length up to 6kbp (you will need to use
mapPacbio.sh
variant for reads over 600bp).Hi guys. I finally finished the computations and generated both assembled.fa and genomeMapped.sam for two of my insertion sequences. Now I want to look at genomeMapped.sam using IGV. I've opened the sam file and noticed that RNAME there is the name of my insertion sequence. How can I see the insertion site on my reference genome then?
One more question is related to assembled.fa. It is a file in a format of:
I want to compare two insertion cites and find out whether they are different or not. Can I do this by comparing two assembled.fa somehow or I should better use sam file and IGV for that?
Thank you in advance!
Am I right that it should be
instead of your script (without ref=reference_genome.fasta )?
There can be no spaces between options and = sign.
assembled.fa
(inspite of what the name says) appears to be in fastq format. How did that happen? I think you may have typed the name in asassembled.fq
when you ran thetadpole.sh
command.You mean use it as reference for the next stage? Like
where assembled.fa is converted from fastq to fa? I thought about using reference genome, not assembled sequence as I want to find out on what chromosome my insertion is located, what genes are nearby, etc. If I use assembled.fa, as far as I understand, I will not be able to get this information.
C: Identification of the sequence insertion site in the genome says that you do one of those two (or both if you want to). They are independent options though. That is the way I read it.
Sorry for that. May I ask you whether it will work the way I want it to work if I use whole reference genome sequence and reference for bbmap.sh?
If you use the whole genome as a reference for BBMap, then you will be able to look at the sorted, indexed bam in IGV and see the exact genomic context of the insertion event. Basically, there will be a point that reads don't span. This will be much easier to find if you only use pairs where at least one read maps to the insertion sequence, because then the only part of the genome that will have any coverage will be the part around the insertion.
Thank you so much @Brian and @genomax2! I very much appreciate your help!
@Brian would hopefully be along in 2-3 hours to answer that.
You are right! It was my typo!
If the BAM file represents a pretty small insertion, how can it be found once opened in IGV? Thanks.
You could visually check with the
alignment_filtered.sam
file above using a genome browser. The site should be easily apparent. What kind of genome is this?Human genome, if I understood your question correctly
Then you would need to locate reads (by name, you can search in IGV) that correspond to the ones present in
unmapped.sam
file. Those would be the ones mapped at/near the insertion site.I will try this! Thank you!