Transcript sequence aligns twice on the reference
4.1 years ago
pablo ▴ 310

Hi,

I generated transcript sequences with the Isoseq3 pipeline from PacBio. I then aligned these sequences against my FASTA reference with the pbmm2 align tool; it worked well.

My question is: why do some of my full-length transcripts (76 out of 94818) align twice (sometimes more, up to 4 times) on the reference, whereas all the others align only once? It seems odd that a full-length transcript (~1500 nt) with a high-quality sequence can match two different locations on the reference, right?

The answer can probably be found in the BAM alignment records (I don't show the whole sequence):

transcript/66333        16      Super-Scaffold_100015   1701509 60      1795S67=62N55=76N136=57N221=120N99=132N131=67N301=1X139N128=100N110=1366N211=2358N38=1967N109=1023N291=      *       0       0       CGTACGGAAACCAAAAAAACCTATTCGTCGGTGGACGGCAGGTTTTCGGTGTGTAGTCAGAGCTTTAGATCGTTGGCTATTTTTGACGCAATGTTCTTGAACCCGATGGACGTGCCCGATATGCGCTTAAACCGGACGCCGTTCAGCGACAGCCTGGGCAGCTTGCACACCTCGATCTCCCACTGGACCAGCGAGTCGGTGTTCGGGTCGCCGTGCACGCACAGCAACAGGAAGCGCTCGCGCTGTTCGTAATCGCAATTGTTGGCGTCCAGCACCTCTCTTATTTCAGCCATAATCTCGTTTGGGTCTCTTGTAGACGTCGTTTTCATACTCCATGTGAATCTCAACGATCTCGGTTTCATTTGATCGTCATTCACTGTATTACTAACAATATTATTTTTGGGCAGATTTTGATCCATTGGCCGTTTAACGAACTTGGATGATATTTTTGAAAAGAATGATGGCCTTTGCACAGATGGATCATGTGGACTGCCAGTGTTGTTAGTAGGTCC

transcript/66333        2048    Super-Scaffold_100015   1701660 60      1898S35=76N136=57N221=120N100=132N128=67N298=139N135=100N109=1366N210=2358N39=1967N109=1023N256=1I18=        *       0       0       TGTGTAATTTTTTTTCGTACGGAAACCAAAAAACCTATTCGTCGGTGGACGGCAGGTTTTCGGTGTGTAGTCAGAGCTTTAGATCGTTGGCTATTTTTGACGCAATGTTCTTGAACCCGATGGACGTGCCCGATATGCGCTTAAACCGGACGCCGTTCAGCGACAGCCTGGGCAGCTTGCACACCTCGATCTCCCACTGGACCAGCGAGTCGGTGTTCGGGTCGCCGTGCACGCACAGCAACAGGAAGCGCTCGCGCTGTTCGTAATCGCAATTGTTGGCGTCCAGCACCTCTCTTATTTCAGCCATAATCTCGTTTGGGTCTCTTGTAGACGTCGTTTTCATACTCCATGTGAATCTCAACGATCTCGGTTTCATTTGATCGTCATTCACTGTATTACTAACAATATTATTTTTGGGCAGATTTTGATCCATTGGCCGTTTAACGAACTTGGATGATATTTTTGAAAAGAATGATGGCCTTTGCACAGATGGATCATGTGGACTGCCAGTGGCTGGTGATGCAGGTGATACAGTGACACCCCTTGCTGGATCCATTA

Moreover, it is the second record, transcript/66333 2048 Super-Scaffold_100015 1701660 60 1898S35=76N136=57N221=120N100=132N128=67N298=139N135=100N109=1366N210=2358N39=1967N109=1023N256=1I18= * 0 0, that corresponds to the transcript generated with Isoseq. The first record has a different sequence from the transcript generated with Isoseq. Why?
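For reference, the second column of each record is the SAM FLAG, and decoding it may already explain part of this. Below is a minimal sketch in plain Python, assuming only the bit definitions from the SAM specification; `decode_flag` is a hypothetical helper written for this post, not a pbmm2 or samtools function.

```python
# FLAG bits as defined in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


# The two FLAG values seen for transcript/66333.
for flag in (16, 2048):
    print(flag, decode_flag(flag))
# 16 ['reverse_strand']
# 2048 ['supplementary']
```

Flag 2048 marks a supplementary alignment, that is, one piece of a split (chimeric) alignment of the same query rather than an independent second hit. Flag 16 means the record is aligned to the reverse strand, and in that case the SAM specification stores SEQ reverse-complemented, which would make it look different from the original Isoseq sequence.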

Best

alignement bam • 759 views
4.1 years ago
JC 13k

BLASTing your sequence, I got a Serine/threonine-protein kinase MARK2-like hit (99.6% identity), so it could be that you have gene families that are commonly duplicated in genomes or that have highly conserved domains.
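One way to check whether the two records point to different gene copies or to the same locus is to summarise their CIGAR strings. The following is a minimal sketch, assuming only the CIGAR operation semantics of the SAM specification; `summarize_cigar` is a hypothetical helper, and the positions and CIGAR strings are copied from the two records in the question.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_ALIGNED_OPS = set("MI=X")   # consume query bases inside the alignment
REF_CONSUMING_OPS = set("MDN=X")  # consume reference bases


def summarize_cigar(cigar):
    """Count soft-clipped bases, aligned query bases and reference span."""
    soft = aligned = ref_span = 0
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op == "S":
            soft += n
        if op in QUERY_ALIGNED_OPS:
            aligned += n
        if op in REF_CONSUMING_OPS:
            ref_span += n
    return {"soft_clipped": soft, "aligned_query": aligned, "ref_span": ref_span}


# (FLAG, POS, CIGAR) of the two records for transcript/66333.
records = [
    (16, 1701509,
     "1795S67=62N55=76N136=57N221=120N99=132N131=67N301=1X139N128=100N"
     "110=1366N211=2358N38=1967N109=1023N291="),
    (2048, 1701660,
     "1898S35=76N136=57N221=120N100=132N128=67N298=139N135=100N109="
     "1366N210=2358N39=1967N109=1023N256=1I18="),
]

for flag, pos, cigar in records:
    s = summarize_cigar(cigar)
    end = pos + s["ref_span"] - 1  # POS is 1-based, so the end is inclusive
    print(flag, s, f"ref interval {pos}-{end}")
```

With these two CIGAR strings, the soft-clipped and aligned parts add up to the same query length (3693) in both records, and the two reference intervals overlap on Super-Scaffold_100015. That would suggest two halves of one query aligning to the same locus rather than hits to two separate gene copies, but it is worth checking on the full BAM before drawing conclusions.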
