Question

Calculate Mapping Rate for an alignment to each sequence of a multi-fasta reference

0

Entering edit mode

6.7 years ago

kspata ▴ 90

Hi,

I have aligned a sample PE RNA-seq data to a multi-fasta reference. I wish to find out mapping rate for each sequence of the multi-fasta reference. How can I extract that information from SAM or BAM file?

I have bowtie2 mapping rate for the overall alignment.

**6259700** reads; of these:
  6259700 (100.00%) were paired; of these:
    6192017 (98.92%) aligned concordantly 0 times
    **2317** (0.04%) aligned concordantly exactly 1 time
    **65366** (1.04%) aligned concordantly >1 times
    ----
    6192017 pairs aligned concordantly 0 times; of these:
      **151** (0.00%) aligned discordantly 1 time
    ----
    6191866 pairs aligned 0 times concordantly or discordantly; of these:
      12383732 mates make up the pairs; of these:
        12379013 (99.96%) aligned 0 times
        **479** (0.00%) aligned exactly 1 time
        **4240** (0.03%) aligned >1 times
1.12% overall alignment rate

I looked up the formula on this link [http://seqanswers.com/forums/showthread.php?t=21333][1] to calculate mapping rate from this information.

{[(2317+65366)*2]+(151*2)+(470)+(4240)}/[6259700*2]
ANS: 1.121%

I realized that I will need to extract the information in BOLD for each reference sequence from a single BAM/SAM file to calculate mapping rate.

Can anyone please give an advice on how to achieve this?

RNA-Seq SAMtools flagstat bowtie2 Mapping Rate • 7.8k views

ADD COMMENT • link updated 6.7 years ago by h.mon 35k • written 6.7 years ago by kspata ▴ 90

1

Entering edit mode

However you calculate it, your mapping rate is really low, indicating problems with your data. Is this just an example, or are these numbers from your data?

ADD REPLY • link 6.7 years ago by h.mon 35k

0

Entering edit mode

Thank you for replying. These are the numbers from my data. The multi fasts file reference which I used is a synthetic construct. The total mRNA data was mapped to this reference. The data is Miseq with PE 300. What can I do to increase the mapping rate.? I used bowtie2 with -I 0 and -X 1000 with -fr option.

ADD REPLY • link 6.7 years ago by kspata ▴ 90

2

Entering edit mode

With a <1% mapping rate you have a serious issue with your data. Try the following:

Run fastqc on your input files. Are there any errors or warning?

Align to the reference genome for your species with default bowtie2 setting. Are you now getting most of your reads mapped? If so, your synthetic construct is likely to be garbage. Work out what when wrong and make a new one.

If you still have 99% of your reads not mapping to your reference genome then it's likely to be an issue with library preparation/sequencing. The most likely explanation is that your library preparation failed (e.g did you accidentally wash all your DNA away in one of your steps?) and there's nothing you can do bioinformatically to recover.

Try BLASTing a few hundred of the unmapped reads. Do they actually come from your species?

ADD REPLY • link 6.7 years ago by d-cameron ★ 2.9k

0

Entering edit mode

The multi fasts file reference which I used is a synthetic construct.

Could you add further details? Is this a complete genome, with a synthetic construct added? Or your reference genome consists of a synthetic construct only? What is the organism you are studying?

ADD REPLY • link 6.7 years ago by h.mon 35k

0

Entering edit mode

The reference is sequence of 4 transcripts. I performed a BLAST and came to know it mapped closely to immunoglobulin light and heavy chain. This is all the information which I could gather.

ADD REPLY • link 6.7 years ago by kspata ▴ 90

1

Entering edit mode

You can't map a RNAseq dataset to a reference of just four transcripts, there probably will be a lot of spurious mappings due to relatively similar transcripts mapping to your "reference", as the correct genes are not present in your reference.

Why are you doing this? What exactly do you expect to accomplish?

ADD REPLY • link 6.7 years ago by h.mon 35k

0

Entering edit mode

I wish to know variants in my data with respect to this reference. But even though the alignment rate is low I was interested to know if mapping rates for each sequence in a multi-fasta reference file can be calculated using SAM or BAM file?

ADD REPLY • link 6.7 years ago by kspata ▴ 90

0

Entering edit mode

You first need to address your very low mapping rate as addressed by d-cameron

You also should align to the full genome using a spliced aligner such as HISAT2 or STAR.

ADD REPLY • link 6.7 years ago by WouterDeCoster 47k

0

Entering edit mode

Thank you for the suggestion. I will use these tools and align input data to a complete genome.

ADD REPLY • link 6.7 years ago by kspata ▴ 90

1

Entering edit mode

When you say overall alignment rate (OVR), you need to consider a lot of factors

uniquely mapped reads
multi-mapped reads
paired reads and singletons
directionality and distance of reads while aligning (concordant and discordant reads)

All of these will contribute to the overall alignment rate which I have tried explaining here

Calculating OVR for all the sequences individually will be losing a lot of that information (for example, exon skipping, probable duplications, translocations etc.). If you want to know the just the number of raw reads, than that's a separate stuff.

ADD REPLY • link 6.7 years ago by lakhujanivijay 5.9k

0

Entering edit mode

6.7 years ago

Tm ★ 1.1k

You can extract the number of reads mapping to each of your reference sequence using samtools idxstats. In output file following are the columns:

represents reference sequence name
reference sequence length
number of reads mapping to reference genome
number of unmapped reads whose mate is mapped

Using 3rd column info, you can calculate the percentage of mapped reads on each reference.

ADD COMMENT • link 6.7 years ago by Tm ★ 1.1k

0

Entering edit mode

These are the results for

 samtools idxstats input.sorted.bam

    X1       3580    9483    0
    X2     2836    8110    0
    X3     2188    8220    0
    X4   1468    6698    0
    *       0       0       0

These are the results of samtools flagstat

samtools flagstat input.sorted.bam

32511 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
32511 + 0 mapped (100.00%:-nan%)
32511 + 0 paired in sequencing
16715 + 0 read1
15796 + 0 read2
31500 + 0 properly paired (96.89%:-nan%)
31522 + 0 with itself and mate mapped
989 + 0 singletons (3.04%:-nan%)
10 + 0 with mate mapped to a different chr
6 + 0 with mate mapped to a different chr (mapQ>=5)

How can I calculate mapping rate say for X1 from this data? Will it be

(number of reads mapped to X1/ Total mapped reads)*100  
(9483/32511)*100 = 29.17%

Please let me know if this is correct.

ADD REPLY • link 6.7 years ago by kspata ▴ 90

score 4 · Accepted Answer · 2018-05-09

I will summarize / add recommendations and leave further comments. I will make some assumptions, as not all details are clear from your post. My main assumption is that your data is whole transcriptome RNA-seq, not some kind of targeted library.

1) Map the reads to the whole genome / whole transcriptome, not to an artificial, hugely down-sized artificial reference. When using an artificial, small reference, most likely there will be spurious mappings of somewhat similar reads, which would map to the correct location, in case the correct location were present.

2) Do not use bowtie if mapping RNAseq to a reference genome - if using a reference transcriptome it can be used.

3) To evaluate mapping rate to regions of interest, you can use samtools or bedtools specifying the interval of interest to count reads mapping to specif genes. As Vijay pointed out, you will have to think about what include / exclude when calculating mapping rates.

4) Mapping reads to immunoglobulins is complicated, as these regions may undergo somatic recombination - you did not say the source material of your reads. See, for example, the first paragraph from the "Results" section from the paper Immunoglobulin transcript sequence and somatic hypermutation computation from unselected RNA-seq reads in chronic lymphocytic leukemia.

Hope I've helped.