I am working with long reads generated by Nanopore technology.
I tried to write a small script in python (using pysam) that would allow, for every read, to 'detect' to which gene they actually map to. To do that, I was using the following informations: the chromosome, the strand, and the start and end genomic position of my read.
However, I just came to realize there is a flaw in my design. For a given gene, I will get reads that map to both strands (because of the way the library is prepared and the way the molecule is sequenced). So I cannot use the 'strand' information to detect which gene my read is actually mapping to. And I'm afraid that if I only use the chromosome information and the genomic coordinates, I will end up with 'false positive' in places where I have nested genes...
Would anyone have an idea how I could fix that ? Thanks in advance !
This could help you: "nanopore_transcript_abundance.py estimates transcript abundance from reads mapped to a transcriptome. "
I see, it actually takes all possible genes and then compute the one that match the best. It's a great idea. You can see my post below where I figured it wasn't necessarily a problem to have more than one possible gene for a given read. It's not perfect but if I reduce the list of possible genes from 20K (every genes possible in C. elegans) to just a few, then I can actually carry on with downstream analysis in a more efficient way ! Thanks for your help.
What is the actual aim? Especially with long reads, you will see lots of reads overlapping multiple genes, e.g in human NKIRAS1 and RPL15.
To know which gene my reads correspond to. It is long reads but coming from RNAs, so there shouldn't be a problem with overlapping.. unless with nested genes. if gene A is on the opposite strand of gene B, then by using genomic position I cannot distingues them unless I know the strand.
Can you clarify if you are using sequencing cDNA or RNA? With nanopore this terminology starts to get important.
In a perfect world I would only be using direct-RNA sequencing and the problem would be gone : for a given gene, all the reads would map to the 'correct' strand . But I've got some data generated with the classical 1D ligation kit (so depending which strand will actually enter the nanopore I will get reads mapping to one or the other strand)..