I mapped pair end reads to CDS transcripts using BWA.
If I do sorting of bam fine by not using -n flag than its worting:
For sorting: /home/yog/software/samtools-1.3.1/samtools sort 216_5W_Ca1.bam -O BAM -o 216_5W_Ca1.sort.bam
For indexing: /home/yog/software/samtools-1.3.1/samtools index 216_5W_Ca1.sort.bam
For read count: /home/yog/software/samtools-1.3.1/samtools idxstats 216_5W_Ca1.sort.bam > readcount[bam_idxstats]
But read count I am getting like this:
Rs025080 1341 239 27
Rs035250 621 0 0
Rs035280 408 0 0
Rs035290 318 0 0
Rs035300 456 87 0
Why there is three count column for each transcripts.
I quick look into the manual gives you: "Retrieve and print stats in the index file. The output is TAB-delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads".
How do you have # unmapped reads per chromosome xD
Here are two reasons: Samtools Idxstats
Yes the 'Read is unmapped but has the chromosome of it's mapped pair" was my first thought, but then the sum of mapped/unmapped per chromosome would be an even number - but for the last entry it is not, so i don't think that's what's going on.
I didn't know about the BWA concatenating contigs though, that's new to me - thanks for the info :) And that could certainly be what's causing it. I also didn't know the SAM spec calls for unmapped pairs to be on the same chromosome - I always thought they came at the end with all the other unmapped reads. Two biologically meaningless but bioinformatically important bits of information today :)
It stands for "number" in many manuals.
That's not what John meant, but rather "if a read is unmapped - how do you know it belongs to a certain chromosome" ;-)
BWA for instance assigns both the unmapped flag and a position. In case the read maps somewhere but is not fulfilling the criteria of being reported as mapped (e.g. too many mismatches).
So the read "almost" maps there :p
Here I map raw reads to a CDS transcripts of a draft genome. So there are around 30% reads of a paired-end library do not map to CDS.
But I am not sure how number were given of unmapped reads according to CDS transcripts ID.
Could you reformulate this question in a more explicit fashion? I am not sure if I got the point. I also have a question: when you say:
Do you mean against a transcriptome, or against a genome with a GFF/GTF/GFF3 file that indicates the CDS of the transcripts, and you are considering only those?
I took CDS from this database. but its genome is not completely sequenced. it's a draft genome.
http://radish-genome.org/Data_resource/
As far as I know, the notation CDS defines those regions of the exons that are translated, because exons also have UTRs. So you might have some RNASeq reads that belong to UTRs and are therefore not present in your CDS sequences. Could this be the case?