Hi,
I have BAM files for pairs-end reads aligned using TopHat2. The BAM files were sorted based on read name using samtools -n
. But when I performed read count analysis, I found almost all reads to have missing mates. Below is the error message:
/.local/lib/python2.7/site-packages/HTSeq-0.6.1-py2.7-linux-x86_64.egg/HTSeq/__init__.py:622: UserWarning: 197873794 reads with missing mate encountered.
Here are the top few lines from BAM file. The read names from mates are differing by .1 and .2 extension. Is this an issue? I will highly appreciate if you could help me out in this regard.
@HD VN:1.0 SO:queryname
@SQ SN:chr1 LN:249250621
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr2 LN:243199373
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chrM LN:16571
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
@PG ID:TopHat VN:2.0.10
.
.
.
SRR1660308.1.1 81 chr8 145140609 50 38M138N2M = 145140331 -456 GTGCCCACTNCTGCGCTTGCCAAGCCTCATGGGCCCCGGA I
IGGHB;<0#FJIIGJJJIJIJJJJJJHHHHHFFFFFCCC AS:i:0 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:9G30 YT:Z:UU XS:A:+ NH:i:1
SRR1660308.1.2 161 chr8 145140331 50 39M = 145140609 456 CACTGCTGGCGATTTATGCAGCTGGCCTGGCCCTGCCCC 22AFHFFHJIJGIIBEF
BGIJJIEGIJ@HIHIJJG?FIE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:39 YT:Z:UU XS:A:+ NH:i:1
SRR1660308.3.1 97 chr17 40925519 50 28M204N12M = 40926678 1198 GGCAGTATCGCTTCCCACCCTTCTTTACGTNACAACCGAA @
CCFFFBDDFHHHIIIIIIIIIIIIIIIII#1:CFHGFIG AS:i:0 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:30T9 YT:Z:UU XS:A:+ NH:i:1
SRR1660308.3.2 145 chr17 40926678 50 39M = 40925519 -1198 GAGTCGATCCAGATTGTATTAGAGGAACTGAGGAAGAAA GGFAC94IGCJIGIGII
IJHEGGHGE<C4IGFDHDDA22 AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:39 YT:Z:UU XS:A:+ NH:i:1
SRR1660308.4.1 83 chr5 154194381 50 40M = 154194287 -134 TTTATTTTTNTCTGACCTTTCCATCCTTGAAAAAACGGGG JJJJJHG?1
#GJIGGJJIJIIJJJIIJHHHHHFFDFFCCC AS:i:-6 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:9C25T4 YT:Z:UU XS:A:+ NH:i:1
SRR1660308.4.2 163 chr5 154194287 50 39M = 154194381 134 CAGCCTCTCCCCACTCCCCTTGCCCCACACCCTTTACTC 22AFHHHHJJJJJJJ3C
GHJIIHIJIIJJJGHHIGGJII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:39 YT:Z:UU XS:A:+ NH:i:1
SRR1660308.5.1 99 chrM 8899 3 30M = 9023 154 CCTAGCCCACTTCTTACCACAAGGCACAAC =;:B7A++::;4D4+<CF@?D3AF=+<B)? AS:i:-2 X
N:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:28C1 YT:Z:UU NH:i:2 XS:A:- HI:i:1
Best regards,
Anil
Thanks a lot for you suggestion. Can awk function work with bam (binary)
No, but I suspect you can pipe to the dexseq counting script.
The easy way could be to convert BAM to SAM and then perform awk function
Nice catch, what should the header look like? Guess I ran in to the same problem.