Hello, I am struggling to processing and analyse bam files (from bwa alignment), to extracting the chimeric read alignment. I am aligning human cell line RNA-seq data (paired end) to virus, aimed to find the viral integration sites in the genome. For that, after reading a bit here from following links like: Confirming Viruses detected from NGS data
RNA Seq analysis of virally infected cells
For find out the chimeric reads (partially mapping to one and partially to other genome),I have concatenated the two genomes (human + viral) and prepared the index using bwa index
bwa index -a bwtsw human_virus.fa
and then performed the mapping using bwa-mem and piping the output samtools of alignment to
bwa mem -R '@RG\tID:ID\tSM:SM\tPL:Illumina' index/human_virus.fa input_pair_1.fastq input_pair_2.fastq | samtools view -bS > input.bam
I have sorted and indexed the resulting input.bam (input_sorted.bam and input.bai)
Now for the post alignment processing, I am trying samtools view -f 0x04 (for unmapped reads) and flagstat to find other statistics.
My question question is how can I extract the numbers of reads that are partially mapped to either of the two genome, how many reads mapped to viral genome and which to human. Conceptually, I might not be understanding this problem, that is why I cab not implement proper samtool view options.
Plus I have approx 50 samples how could I evaluate how many reads have mapped to the viral genome in each sample.
I would really appreciate your inputs.
Thanks Pierre for your help . I am trying to understand the output of SamJdk, basically it has given me all he SQ tag with reference sequence name and corresponding length.
Additionally is it possible to retrieve the chimeric reads using samtools FLAG options. As in the manual states definition of a chimeric alignment is:
that's the purpose of
SAMUtils.getOtherCanonicalAlignments
. But you need some code to get the difference between virus/host.yes that make sense. Thank you so much for your help
Hi Pierre, thank you for the answer, but I still don't get the idea. I mapped the paired-end reads to the host genome to get a bam file, but I don't know how to input the virus genome (or mapped result) to the expression? Thank you.
Hi Pierre, so does samjdk support single-end reads for extracting chimeric reads? I've aligned my virus to the human reference, sorted the unmapped reads into a BAM file, and then aligned them as single-end to the virus reference. Now I have the virus-mapped BAM and I'm trying to extract chimeric reads. I'm not sure if I can use a similar script as the one you provided for paired-end reads. Any suggestions please?