I am working on some data but need a little bit of help with a bit of an unusual task. We are looking at where lentiviral DNA has inserted itself in our host genome, and to do this we need to find the boundaries of the junctions between viral/human DNA. I need to be able to look at only the soft/hard clipped reads within my bam file. Does anyone know a way to do this with awk or some other tool? Thanks! What I have tried:
I know it was more than a year, but I'll just leave the answer here maybe it will help someone.
With the modified code of yours, the script will only print the alignments that have hardclip or softclip, that means you'll loose the header, which is what samtools complains about after you pipe tha awk result to it.
The solution is to either get the header first, concatenate it with the awk output and export as BAM:
Have you considered using a SV breakpoint calling tool (using a reference that includes both host and viral sequence), or a viral integration detection tool? Both of the above approaches should work if your integration site is clonally expanded. If not, there's other software designed for this purpose, although they usually require special sequencing protocol (e.g HIV integration site detection).
To actually answer you question, you can use the ExtractSVReads tool within gridss to do this natively within bam:
If you want multithreaded I/O the command-line is a pretty annoying (since you're running a tool within gridss, not the whole pipeline): you need to replace the java -Xmx4g -cp gridss.jar bit with:
If you want multithreaded I/O the command-line is a pretty annoying (since you're running a tool within gridss, not the whole pipeline): you need to replace the
java -Xmx4g -cp gridss.jar
bit with: