Hi,
I have some paired end reads which are messing up my assembly process. Some reads span across two distinct genes, which confuses my transcript assembly software. I end up with a very long transcript merging both genes.
I have a bed file which contains the positions of all the transcripts I should reasonably get (i.e., a list of separate intervals).
I can easily identify the erroneous transcripts with bedtools intersect (a single transcript intersect two different genes). Now, is there a way to find which pairs of reads are causing this issue, and how can I delete them? If I consider only the fused genes, is there a way to delete a pair of aligned reads whose inner mate distance is greater than a given value?
I would rather delete them directly from my current sorted bam files rather than from the FastQ file I used (the alignment and sorting process is a bit long in my case).
Thanks a lot
An example: top annotation is mm10, bottom is obtained de novo. You are looking at olfactory receptor genes which are close and similar.
What program are you using to generate the erroneous transcripts?
In this case it is IsoSCM. However, a previous assembly was realized using Cufflinks in an article. According to their materials section: ad hoc perl scripts were used to further refine the gene models produced for VR and OR genes, deleting those predictions that fuse adjacent receptor genes or that are antisense to the annotated gene.
I have created various alignments with more or less stringent limits on maximal intron size (7,10 and 25kb respectively), using STAR. I can choose to only consider models derived from a specific alignment, BUT here I would rather like to exclude a few faulty reads.
I have to pick between fused genes and a limit on intron size which might affect valid transcripts.
It is an effective approach, but in IGV it is very easy to see information on reads (like the position of a mate if it has been aligned). I have just no idea how to access this information (apart from using a parser on the sam file which would be ugly).
Ok, Bamutils has a filter option which allows me to exclude reads not in a region. I can try to consider only one BAM file with 25kb introns, and for each fused genes, take a window of say 10kb. It is still the same idea as before...