CIGAR string with a really high N value
20 months ago

I'm trying to find gene fusions. When looking at the reads aligned by STAR, I find that some reads map only 3 bases on a gene and then have no other base matching at that locus. When I look at the CIGAR string I see something like 3M<N>N47M, where <N> is a very large number (over 400,000 bp). I used samtools view to inspect these reads. Why does this happen? Here is an example:

Analising fusion:  RP11-96H19.1::RP11-446N19.1

READS IN BAM

['SRR064438.11916954', 83, 12, 46387924, 255, '49M1S', '=', 46387805, -168, 'TAAGACCAGACCAAATCAAACCAAACCAAGCAAACCACGGGGAATGGAGA', 'A@@>9BBA@@B@B@B@BAB>B??@@BCCBABBB@BB@B?BBABBCBACAB', 'NH:i:1', 'HI:i:1', 'AS:i:97', 'nM:i:0', 'NM:i:0']

['SRR064439.375905', 163, 12, 46387934, 255, '39M11S', '=', 46652514, 264630, 'CCAAATCAAACCAAACCAAGCAAACCACGGGGAATGGAGATTATTGCCTG', 'ABBBBB@BBBAABBB@@BBBABBB@@BABBBB@ABBBACBBBBBBB=5BB', 'NH:i:1', 'HI:i:1', 'AS:i:84', 'nM:i:0', 'NM:i:0']
['SRR064438.12267353', 99, 12, 46387943, 255, '30M264417N20M', '=', 46652487, 264594, 'ACCAAACCAAGCAAACCACGGGGAATGGAGATTATTGCCTGCTCCTCCAA', 'BBBBBBA@BBB?AAB@?B?A@@A@AAAA<A?@A@A?<<8=*<@=5><8<=', 'NH:i:1', 'HI:i:1', 'AS:i:95', 'nM:i:0', 'NM:i:0']
['SRR064439.1473779', 163, 12, 46387962, 255, '11M82180N39M', '=', 46874958, 487046, 'GGGGAATGGAGGTCATGTGAGCACACAGCATAAAGGCAGCTGCCCACAAG', 'BCBBCCCBBCB@:CBBB>>BBACABCBBC?ACBC<ABBA94?:7>AA9B;', 'NH:i:1', 'HI:i:1', 'AS:i:97', 'nM:i:0', 'NM:i:0']
['SRR064439.2800317', 355, 12, 46387970, 3, '3M486844N47M', '=', 46874828, 486908, 'GAGGACCTGATGATTGATTTAGCATCTTTGGCATCCGGCCACTGCTCTGC', 'B@BAAABB>?A@@AAA?BBAB>A?@@BB???BA@B@::@AAA<<B=@;:;', 'NH:i:2', 'HI:i:2', 'AS:i:97', 'nM:i:0', 'NM:i:0']
['SRR064438.4051038', 99, 12, 46652386, 255, '7S43M', '=', 46652461, 123, 'GGGGACTACAGATTATTGCCTGCTCCTCCAAGCCCTTCACTGTAGAATGG', 'BBBB@=BAAAB@BB@BBB<;?A?B?8@@=@@@?8;B@<;<BB??B;?@A@', 'NH:i:1', 'HI:i:1', 'AS:i:89', 'nM:i:0', 'NM:i:0']


REFERENCE:

TAAGACCAGACCAAATCAAACCAAACCAAGCAAACCACGGGGAATGGA==GGTAGGTGAATAGCGCCAAAGAGAATGATGGCTCACAACACTTCTAAGCA

READS:

TAAGACCAGACCAAATCAAACCAAACCAAGCAAACCACGGGGAATGGA==GA
          CCAAATCAAACCAAACCAAGCAAACCACGGGGAATGGA==GATTATTGCCTG
                   ACCAAACCAAGCAAACCACGGGGAATGGA==GATTATTGCCTGCTCCTCCAA
                                      GGGGAATGGA==GGTCATGTGAGCACACAGCATAAAGGCAGCTGCCCACAAG
                                              GA==GGACCTGATGATTGATTTAGCATCTTTGGCATCCGGCCACTGCTCTGC

The 2nd, 3rd and 4th reads are the ones that raise this question. Should I count them as mapping here or not? Is there something I'm missing?

20 months ago

When aligning RNA-seq reads with STAR, a CIGAR string like 30M264417N20M usually means: 30 aligned bases (exonic), followed by 264,417 bp of reference skipped by the read (the N operation, typically an intron), and then another 20 aligned bases (exonic).
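If it helps to read these strings programmatically, here is a minimal sketch in plain Python that splits a CIGAR string into (length, operation) pairs and pulls out the N gaps. The names `parse_cigar` and `CIGAR_RE` are made up for this illustration and are not part of STAR or samtools.

```python
import re

# One CIGAR element is a run length followed by a single operation letter
# (operation letters as defined in the SAM format).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    pairs = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Rebuild the string to make sure nothing was silently skipped.
    if "".join(f"{n}{op}" for n, op in pairs) != cigar:
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return pairs


# CIGAR strings copied from the reads in the question.
for cigar in ("30M264417N20M", "11M82180N39M", "3M486844N47M"):
    ops = parse_cigar(cigar)
    gaps = [n for n, op in ops if op == "N"]  # skipped reference regions
    print(cigar, ops, "N gaps:", gaps)
```

Each N length is the distance on the reference between the two aligned blocks, not a count of mismatched bases in the read.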

This was likely a 50 bp Illumina read that spanned an exon-exon junction, so when you align it to the genome there is a huge gap where the intron was.
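As a rough check of that reading, the sketch below adds up how many bases each CIGAR consumes on the read and on the reference, and reports the shortest aligned block next to an N gap. It assumes the usual SAM convention for which operations consume the query and which consume the reference. `cigar_lengths` and `min_junction_overhang` are hypothetical helper names invented here.

```python
import re

QUERY_OPS = set("MIS=X")  # operations that consume bases of the read
REF_OPS = set("MDN=X")    # operations that consume bases of the reference


def cigar_lengths(cigar):
    """Return (read_length, reference_span) implied by a CIGAR string."""
    read_len = ref_span = 0
    for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(n)
        if op in QUERY_OPS:
            read_len += n
        if op in REF_OPS:
            ref_span += n
    return read_len, ref_span


def min_junction_overhang(cigar):
    """Shortest aligned block flanking an N gap, or None if unspliced."""
    blocks = [0]
    for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op == "N":
            blocks.append(0)        # a gap starts a new aligned block
        elif op in "M=X":
            blocks[-1] += int(n)    # aligned bases extend the current block
    return min(blocks) if len(blocks) > 1 else None


# CIGAR strings copied from the reads in the question.
for cigar in ("39M11S", "30M264417N20M", "11M82180N39M", "3M486844N47M"):
    read_len, ref_span = cigar_lengths(cigar)
    print(cigar, read_len, ref_span, min_junction_overhang(cigar))
```

All four strings come out as 50-base reads. The last one has only 3 bases on one side of the gap, so whether to count such a read as supporting this locus is a threshold you would have to choose yourself.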
