Hi,
I have aligned a sample PE RNA-seq data to a multi-fasta reference. I wish to find out mapping rate for each sequence of the multi-fasta reference. How can I extract that information from SAM or BAM file?
I have bowtie2 mapping rate for the overall alignment.
**6259700** reads; of these:
6259700 (100.00%) were paired; of these:
6192017 (98.92%) aligned concordantly 0 times
**2317** (0.04%) aligned concordantly exactly 1 time
**65366** (1.04%) aligned concordantly >1 times
----
6192017 pairs aligned concordantly 0 times; of these:
**151** (0.00%) aligned discordantly 1 time
----
6191866 pairs aligned 0 times concordantly or discordantly; of these:
12383732 mates make up the pairs; of these:
12379013 (99.96%) aligned 0 times
**479** (0.00%) aligned exactly 1 time
**4240** (0.03%) aligned >1 times
1.12% overall alignment rate
I looked up the formula on this link [http://seqanswers.com/forums/showthread.php?t=21333][1] to calculate mapping rate from this information.
{[(2317+65366)*2]+(151*2)+(470)+(4240)}/[6259700*2]
ANS: 1.121%
I realized that I will need to extract the information in BOLD for each reference sequence from a single BAM/SAM file to calculate mapping rate.
Can anyone please give an advice on how to achieve this?
However you calculate it, your mapping rate is really low, indicating problems with your data. Is this just an example, or are these numbers from your data?
Thank you for replying. These are the numbers from my data. The multi fasts file reference which I used is a synthetic construct. The total mRNA data was mapped to this reference. The data is Miseq with PE 300. What can I do to increase the mapping rate.? I used bowtie2 with -I 0 and -X 1000 with -fr option.
With a <1% mapping rate you have a serious issue with your data. Try the following:
Run fastqc on your input files. Are there any errors or warning?
Align to the reference genome for your species with default bowtie2 setting. Are you now getting most of your reads mapped? If so, your synthetic construct is likely to be garbage. Work out what when wrong and make a new one.
If you still have 99% of your reads not mapping to your reference genome then it's likely to be an issue with library preparation/sequencing. The most likely explanation is that your library preparation failed (e.g did you accidentally wash all your DNA away in one of your steps?) and there's nothing you can do bioinformatically to recover.
Try BLASTing a few hundred of the unmapped reads. Do they actually come from your species?
Could you add further details? Is this a complete genome, with a synthetic construct added? Or your reference genome consists of a synthetic construct only? What is the organism you are studying?
The reference is sequence of 4 transcripts. I performed a BLAST and came to know it mapped closely to immunoglobulin light and heavy chain. This is all the information which I could gather.
You can't map a RNAseq dataset to a reference of just four transcripts, there probably will be a lot of spurious mappings due to relatively similar transcripts mapping to your "reference", as the correct genes are not present in your reference.
Why are you doing this? What exactly do you expect to accomplish?
I wish to know variants in my data with respect to this reference. But even though the alignment rate is low I was interested to know if mapping rates for each sequence in a multi-fasta reference file can be calculated using SAM or BAM file?
You first need to address your very low mapping rate as addressed by d-cameron
You also should align to the full genome using a spliced aligner such as HISAT2 or STAR.
Thank you for the suggestion. I will use these tools and align input data to a complete genome.
When you say overall alignment rate (OVR), you need to consider a lot of factors
All of these will contribute to the overall alignment rate which I have tried explaining here
Calculating OVR for all the sequences individually will be losing a lot of that information (for example, exon skipping, probable duplications, translocations etc.). If you want to know the just the number of raw reads, than that's a separate stuff.