Hi,
I'm looking to get all the unmapped regions from a sequenced BAM file. The purpose of this is that I'm trying to find genetic recombination of a gene/region, so I'm guessing that a successful recombination would result in the read being marked as unmapped "trash" and that I could search through the unmapped regions and find it. The basic idea is
Ref Genome w/ Mapped Reads
AAA*****************************BBB
Unmapped Read
AAA**BBB
A couple of problems I'm having is that while I can use
samtools view -b -f 4 ORIGINAL.bam > UNMAPPED.bam
to presumably get all the unmapped regions in the bam file, I am unsure on how to go about searching through the reads individually and looking for recombination (where AAA and BBB would be closer than normal, present on the same read). I have a java program to take in reads as strings and identify recombination if the AAA and BBB regions are sufficiently close, recording the distance as well.
I'm basically stuck on extracting the reads, and also I've been wondering if this is the best way to go about finding recombination. I haven't been in the field long enough but I would assume that this or a similar workflow would be pretty common.
Thanks!
How divergent are the sequences that you suspect to be recombining? If divergence is low (below 10% identity), the read will map randomly between them.
I'm looking at VDJ recombination specifically, so I'm guessing the divergence would be high for the DJ sequence to move very close to the V region. I guess I could pull the mapped regions as well to make sure I get the reads that have low divergence but what I'm really having a hard time doing is getting the unmapped regions.
dnlsmy Have you tried running sensitive SV detection software?
I have not, would recombination be detected using that software? My main worry is that since the BAM file is already sequenced, recombination of VDJ regions would not be able to be mapped on the reference sequence (since the regions would be much larger than a simple SNV). My understanding is that in that case, the BAM file would store those reads as unmapped and searching through those for the V regions would allow me to check the degree of recombination (by checking to see how far along the D/J regions were upstream from the read).
If you mapped with BWA mem, there is a good chance these reads have split read mappings. Here is a little test to see if your hypothesis is correct.
This will tell you if there are extra unmapped reads in the regions.
HINT: UNMAPPED READS ARE ASSIGNED A POSTION and will be returned in the region.
getting the split reads:
Yes our lab develops SV software and we always have to filter out Ab parts and TCRs for our blood samples.
This is true for an SV caller which is sensitive for NAHR and another which is more sensitive to NHEJ.
https://en.wikipedia.org/wiki/Structural_variation#Software has a decent list however I do recommend Lumpy too.
SV callers are full of noise and the raw results are difficult to interpret, however for your question it seems that an SV caller will help you get an idea of probable sites of VDJ recombination.
Delly and Lumpy are two SV callers that can detect complex breaks and translocations. Delly was heavily used in the upcoming 1000 genomes phase 3 SV paper.