Question

Extracting chimeric reads from mapping

0

Entering edit mode

7.3 years ago

Anushka ▴ 20

Hello, I am struggling to processing and analyse bam files (from bwa alignment), to extracting the chimeric read alignment. I am aligning human cell line RNA-seq data (paired end) to virus, aimed to find the viral integration sites in the genome. For that, after reading a bit here from following links like: Confirming Viruses detected from NGS data

RNA Seq analysis of virally infected cells

For find out the chimeric reads (partially mapping to one and partially to other genome),I have concatenated the two genomes (human + viral) and prepared the index using bwa index

bwa index -a bwtsw human_virus.fa

and then performed the mapping using bwa-mem and piping the output samtools of alignment to

bwa mem -R '@RG\tID:ID\tSM:SM\tPL:Illumina' index/human_virus.fa input_pair_1.fastq input_pair_2.fastq | samtools view -bS > input.bam

I have sorted and indexed the resulting input.bam (input_sorted.bam and input.bai)

Now for the post alignment processing, I am trying samtools view -f 0x04 (for unmapped reads) and flagstat to find other statistics.

My question question is how can I extract the numbers of reads that are partially mapped to either of the two genome, how many reads mapped to viral genome and which to human. Conceptually, I might not be understanding this problem, that is why I cab not implement proper samtool view options.

Plus I have approx 50 samples how could I evaluate how many reads have mapped to the viral genome in each sample.

I would really appreciate your inputs.

RNA-Seq alignment samtools bwa • 8.2k views

ADD COMMENT • link updated 13 months ago by doggie • 0 • written 7.3 years ago by Anushka ▴ 20

score 1 · Accepted Answer · 2017-09-22

1

Entering edit mode

7.3 years ago

Pierre Lindenbaum 164k

My question question is how can I extract the numbers of reads that are partially mapped to either of the two genome, how many reads mapped to viral genome and which to human. Conceptually, I might not be understanding this problem, that is why I cab not implement proper samtool view options.

using samjdk: http://lindenb.github.io/jvarkit/SamJdk.html .

For paired end we're searching for reads mapping one organism, and the mate mapping another. We also look for split reads.

java -jar dist/samjdk.jar -e '
    final Set<String> virus_chrom = new HashSet<>(Arrays.asList("viral_seg1", "viral_seg2"));
    if(record.getReadUnmappedFlag()) return false;
    if(record.getReadPairedFlag() && !record.getMateUnmappedFlag())
        {
        if( virus_chrom.contains(record.getReferenceName()) != 
            virus_chrom.contains(record.getMateReferenceName())
            )
            {
            return true;
            }
        }
    for(final SAMRecord other: SAMUtils.getOtherCanonicalAlignments(record))
        {
        if( virus_chrom.contains(record.getReferenceName()) != 
            virus_chrom.contains(other.getReferenceName())
            )
            {
            return true;
            }
        }
    return false;' input.bam

ADD COMMENT • link 7.3 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks Pierre for your help . I am trying to understand the output of SamJdk, basically it has given me all he SQ tag with reference sequence name and corresponding length.

Additionally is it possible to retrieve the chimeric reads using samtools FLAG options. As in the manual states definition of a chimeric alignment is:

Typically, one of the linear alignments in a chimeric alignment is considered the “representative” alignment, and the others are called “supplementary” and are distinguished by the supplementary alignment flag. All the SAM records in a chimeric alignment have the same QNAME and the same values for 0x40 and 0x80 flags

ADD REPLY • link 7.3 years ago by Anushka ▴ 20

1

Entering edit mode

Additionally is it possible to retrieve the chimeric reads using samtools FLAG options

that's the purpose of SAMUtils.getOtherCanonicalAlignments . But you need some code to get the difference between virus/host.

ADD REPLY • link 7.3 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

yes that make sense. Thank you so much for your help

ADD REPLY • link 7.3 years ago by Anushka ▴ 20

0

Entering edit mode

Hi Pierre, thank you for the answer, but I still don't get the idea. I mapped the paired-end reads to the host genome to get a bam file, but I don't know how to input the virus genome (or mapped result) to the expression? Thank you.

ADD REPLY • link 5.6 years ago by junghaowang • 0

0

Entering edit mode

Hi Pierre, so does samjdk support single-end reads for extracting chimeric reads? I've aligned my virus to the human reference, sorted the unmapped reads into a BAM file, and then aligned them as single-end to the virus reference. Now I have the virus-mapped BAM and I'm trying to extract chimeric reads. I'm not sure if I can use a similar script as the one you provided for paired-end reads. Any suggestions please?

ADD REPLY • link 13 months ago by doggie • 0