Question

Identifying covered genes from nanopore cDNA-seq reads

0

Entering edit mode

6.3 years ago

Florian Bernard ▴ 90

I am working with long reads generated by Nanopore technology.

I tried to write a small script in python (using pysam) that would allow, for every read, to 'detect' to which gene they actually map to. To do that, I was using the following informations: the chromosome, the strand, and the start and end genomic position of my read.

However, I just came to realize there is a flaw in my design. For a given gene, I will get reads that map to both strands (because of the way the library is prepared and the way the molecule is sequenced). So I cannot use the 'strand' information to detect which gene my read is actually mapping to. And I'm afraid that if I only use the chromosome information and the genomic coordinates, I will end up with 'false positive' in places where I have nested genes...

Would anyone have an idea how I could fix that ? Thanks in advance !

RNA-Seq gene alignment • 1.3k views

ADD COMMENT • link updated 6.3 years ago by WouterDeCoster 47k • written 6.3 years ago by Florian Bernard ▴ 90

2

Entering edit mode

This could help you: "nanopore_transcript_abundance.py estimates transcript abundance from reads mapped to a transcriptome. "

ADD REPLY • link 6.3 years ago by WouterDeCoster 47k

0

Entering edit mode

I see, it actually takes all possible genes and then compute the one that match the best. It's a great idea. You can see my post below where I figured it wasn't necessarily a problem to have more than one possible gene for a given read. It's not perfect but if I reduce the list of possible genes from 20K (every genes possible in C. elegans) to just a few, then I can actually carry on with downstream analysis in a more efficient way ! Thanks for your help.

ADD REPLY • link 6.3 years ago by Florian Bernard ▴ 90

0

Entering edit mode

What is the actual aim? Especially with long reads, you will see lots of reads overlapping multiple genes, e.g in human NKIRAS1 and RPL15.

ADD REPLY • link 6.3 years ago by ATpoint 85k

0

Entering edit mode

To know which gene my reads correspond to. It is long reads but coming from RNAs, so there shouldn't be a problem with overlapping.. unless with nested genes. if gene A is on the opposite strand of gene B, then by using genomic position I cannot distingues them unless I know the strand.

ADD REPLY • link 6.3 years ago by Florian Bernard ▴ 90

0

Entering edit mode

Can you clarify if you are using sequencing cDNA or RNA? With nanopore this terminology starts to get important.

ADD REPLY • link 6.3 years ago by WouterDeCoster 47k

1

Entering edit mode

In a perfect world I would only be using direct-RNA sequencing and the problem would be gone : for a given gene, all the reads would map to the 'correct' strand . But I've got some data generated with the classical 1D ligation kit (so depending which strand will actually enter the nanopore I will get reads mapping to one or the other strand)..

ADD REPLY • link 6.3 years ago by Florian Bernard ▴ 90

score 1 · Answer 1 · 2018-08-18

As pointed by some of you, there will be cases where it's absolutely impossible to determine if one read belong to gene A or gene B without having a strand information that I can actually rely on (like with data generated by direct-RNA or direct-cDNA kits). And after some thinking, I realized it's not necessarily a problem.

My "gene prediction" function is part of a larger script and it is just supposed to ease downstream computational work by first identifying the gene my reads correspond to, so that I can focus the downstream analysis on that particular gene and not all 20K genes in C. elegans. But I eventually figured that even if my "gene prediction" algorithm returns 2 or 3 possible genes, it is still a great improvement.

So I guess for all ambiguous cases where I can't rely on the strand information I will just make the function return all possible genes for a given read. And maybe add the possibility to input the type of kit used to generate the data and if it is direct-RNA or direct-cDNA kit then make it so the function takes in account the strand information and then only returns one possible gene.

Thanks a lot to everyone for your help, it made me see the problem in a different way and made me realized it wasn't necessarily a problem !

score 0 · Answer 2 · 2018-08-17

0

Entering edit mode

6.3 years ago

Friederike 9.0k

As ATpoint notes, there is no perfect solution because the ambiguity cannot be solved (at least not by looking at every read individually). Check out the code and info provided by htseqcount, they demonstrate the different strategies you could use. Which one you end up using should depend on the question you're trying to address.

ADD COMMENT • link 6.3 years ago by Friederike 9.0k

0

Entering edit mode

I am more concerned by case when one gene is entirely overlapping with another on the opposite strand ( see figure C here: http://researcherslinks.com/current-issues/Association-of-Types-Overlapping-Genes/20/1/983/figures). Which doesn't seem to be a case covered by htseqcount ?

ADD REPLY • link 6.3 years ago by Florian Bernard ▴ 90

1

Entering edit mode

I think, the second case from the bottom ("ambiguous") seems to fit the bill: from the read's perspective, gene A and gene B are completely overlapping and it has no way of telling where it's from unless it has the strand information handy.

ADD REPLY • link 6.3 years ago by Friederike 9.0k

0

Entering edit mode

I think for those cases, given your data, you simply have to admit you don't know, at least for those reads that doesn't contain any anchoring sequence that allows you to pin-point it to one gene or the other (i.e., the case that is shown third from the bottom of that table by HTSeqcount). Just out of curiosity: how many reads actually contain completely ambiguous sequences?

ADD REPLY • link 6.3 years ago by Friederike 9.0k

0

Entering edit mode

I would need to look more closely at the data but according to this papers (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2747821/), C. elegans has ~450 nested genes. I would need to look at how high this genes are expressed but the chances that some reads may map to more than one gene are somewhat high.. But you're right, there will be cases where it's absolutely impossible to determine if it's one or the other without having a strand information that I can rely on. And after some thinking, I realized it's not necessarily a problem (see my more detailed answer in the post below). Thanks for your time !

ADD REPLY • link 6.3 years ago by Florian Bernard ▴ 90