While playing around with the ENCODE RNA-seq datasets, I noticed that some of the paired-end files have oddly set flags.
The following few lines are from the CSHL/wgEncodeCshlLongRnaSeqAdrenalAdult8wksAlnRep1.bam file.
PAN_0073:1:69:16755:10476#0 163 chr1 3190766 255 76M = 3190867 177 GTGGAATAATTTGTTAATTGTGAAGTGTATGGTTTTGTATTTTGAAACCAAACAACAGTAGCTGAGGTAGTTAAAT hhhghhhhhhhhhhhhhhhhhhgchhhhhhhhhhhhhghhhhhhghghhhgehghhhhhghhhhhhhedhghggef XS:A:-
PAN_0073:1:69:16755:10476#0 115 chr1 3190867 255 76M = 3190766 -177 TGAGAGAATGGAGAACCAATGTAAGGAGCCCAGACTCTTGCCATCTGGAAGCAGGCTCACCAAGTATGATGGTTTC ahhfhhhhfehehhhghghgfhhhhhhghhhchghghhhehhfggghhhhghghhghhghhhhhhhdhhhhhhhhh XS:A:-
PRESLEY_0042_FC627A8AAXX:2:95:2727:20071#0 163 chr1 3195839 255 76M = 3195919 156 GCCACTAATTGAGAAGAACTATCAGAGGGAAGTTTTTCTTGGAAAGAGCCAGTCTTGACATGAAGCTTCCTACGTG fggggggggggggfcgggggfggggcffdfggggggggggggggggegggggggggfgggggeggggggggggggg XS:A:-
PRESLEY_0042_FC627A8AAXX:2:95:2727:20071#0 115 chr1 3195919 255 76M = 3195839 -156 CCTTCTTTCCATGGTAGCCAGGCCTTGCCCTTTCATAAGAAGACATGTGAAGTACCATAATTATGGAGTGGCAGAG hebaee``bb[ahahhghghhgffhgehfhhhhfhhffafchcfhghhhhhghhghhhhhhhhghhhhhhhhfghh XS:A:-
As you can see, the flag for the forward read is set to 163, which is OK, but the flag for the read on the reverse strand is set to 115. That value has the mate-reverse bit (0x20) set, so it claims that the other read in the pair is on the reverse strand as well, which is the wrong strand.
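To see exactly which bits each of the two flags sets, a small decoder helps. This is only a minimal Python sketch: the table and the function name `decode_flag` are my own, and the bit meanings are taken from the FLAG table in the SAM specification.

```python
# Bit values and meanings follow the FLAG table of the SAM specification.
# The short names are my own labels, not official identifiers.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


for flag in (163, 115):
    print(flag, decode_flag(flag))
```

For 163 this lists paired, proper_pair, mate_reverse and second_in_pair; for 115 it lists paired, proper_pair, reverse, mate_reverse and first_in_pair.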
Is this a feature or a bug of ENCODE data?
If it is a bug, it would be extremely helpful if somebody from the consortium could use their supercomputing powers to fix the flags.
I'm sorry to bother you, but the reads you posted do not have the right flags. Correctly mapped paired-end reads should have one of two flag pairs: 99 and 147, or 83 and 163.
This link shows it in a nice way: http://ppotato.files.wordpress.com/2010/08/sam_output2.png
Your statement and the link are not correct in general. "Mapped in proper pair" is solely the judgement of the aligner, per the SAM specification (the full flag description is "each segment properly aligned according to the aligner"). So reads can map to the same strand and still be flagged as properly paired, if the aligner allows that. I assume this is strand-specific RNA-seq or something similar, and that the aligner settings reflect that. Read up on the protocol and aligner options to be sure.
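Separately from the proper-pair bit, one can check whether the strand bits of two mates agree with each other: each read's mate-reverse bit (0x20) should match the other read's own reverse bit (0x10). Below is a minimal Python sketch of that check. The function name `strand_bits_agree` is hypothetical, and the check only compares the two FLAG values; it says nothing about what the aligner considers a proper pair.

```python
REVERSE = 0x10       # read itself is on the reverse strand
MATE_REVERSE = 0x20  # mate is reported as being on the reverse strand


def strand_bits_agree(flag_a, flag_b):
    """True if each flag's mate-strand bit matches the other flag's own strand bit."""
    a_ok = bool(flag_a & MATE_REVERSE) == bool(flag_b & REVERSE)
    b_ok = bool(flag_b & MATE_REVERSE) == bool(flag_a & REVERSE)
    return a_ok and b_ok


# The two flag pairs mentioned earlier in the thread, plus the pair from the BAM file.
for pair in [(99, 147), (83, 163), (163, 115)]:
    print(pair, strand_bits_agree(*pair))
```

The pairs 99/147 and 83/163 pass this check, while 163/115 does not: 115 reports its mate as reverse, but 163 does not have the reverse bit set.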