Question

Dealing with transcriptome sequences that are smaller than their respective genes

14 months ago

langziv ▴ 70

Hi

I did blastn of a transcriptome that was generated with Trinity against the assembly-level annotated genome of the bacterium I work with.
Out of the 11007 matches in the blastn results, 2474 are smaller than their respective CDSs from the annotated genome.
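A count like the one above could be reproduced along these lines. This is only a sketch: it assumes blastn was run with `-outfmt "6 qseqid sseqid pident length qlen slen"`, so that every row carries the transcript length (`qlen`) and the CDS length (`slen`). The function name and the sample rows are made up for illustration.

```python
import csv
import io

# Column order assumed to match: -outfmt "6 qseqid sseqid pident length qlen slen"
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qlen", "slen"]

def count_shorter_than_cds(handle):
    """Return (total_hits, hits_where_transcript_is_shorter_than_its_CDS)."""
    total = 0
    shorter = 0
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        total += 1
        if int(row["qlen"]) < int(row["slen"]):
            shorter += 1
    return total, shorter

# Made-up rows standing in for a real blastn output file.
example = io.StringIO(
    "TRINITY_DN1_c0_g1_i1\tcds_0001\t99.5\t400\t400\t900\n"
    "TRINITY_DN2_c0_g1_i1\tcds_0002\t100.0\t1200\t1500\t1200\n"
    "TRINITY_DN3_c0_g1_i1\tcds_0003\t98.0\t90\t90\t600\n"
)
total, shorter = count_shorter_than_cds(example)
print(total, shorter)
```

Note that this counts hits, not unique transcripts: a transcript with several hits is counted once per row.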

Would it be true to say that these 2474 sequences are of defective mRNA?

EDIT

Specifically, some of these sequences are shorter than their respective CDSs in the annotated genome, and others are under 100 bases.
Is there an approach to filter such sequences properly, so that coding ones are kept and the rest are filtered out?

blastn Trinity Transcriptome De-novo-transcriptome-assembly RNA-seq • 1.7k views

ADD COMMENT • link updated 13 months ago by Ram 44k • written 14 months ago by langziv ▴ 70

Would it be true to say that these 2474 sequences are of defective mRNA?

Not necessarily. They may be incomplete reverse transcription products (so not defective mRNA per se) or fragments produced during library prep.

ADD REPLY • link 14 months ago by GenoMax 147k

Thanks.
Is there an approach or a guideline to filter such sequences?

ADD REPLY • link 14 months ago by langziv ▴ 70

score 0 · Answer 1 · 2023-09-20

14 months ago

i.sudbery 20k

There are many reasons why a transcript might be shorter than the annotated gene.

The most obvious are:

A misassembly by Trinity - it's generally bad practice to assume that the results produced by any tool are 100% correct.
As GenoMax said, it could be a product of incomplete RT. It could also be a product of RNA degradation (i.e. ex vivo) between the lysing of the cell and the RT, or of fragmentation during library prep.
We should also talk about what you mean by "gene" and "transcript". At least in eukaryotes, genes can have multiple transcription start sites (TSSs) and multiple polyA sites. In humans, most genes have multiple TSSs and polyA sites. The region of a genome marked as a "gene" in an annotation is usually the region from the most 5' TSS to the most 3' polyA site. But in many cases, there is no transcript that covers the whole "gene". Consider the following:

transcript 1: |>>>>>>>>>>>>>>|---------------|>>>>>>|-------------|>>>>>>>>>>>>>>|
transcript 2:    |>>>>>>>>>>>|------------------------------------|>>>>>>>>>>>>>>|
transcript 3:                    |>>>>>|-----|>>>>>>|---------------------|>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>|
"gene region" |>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>|

All of the transcripts are shorter than the "gene". It's possible that one of these transcripts produces the "functional" protein, and that the others are defective. But it's equally possible that they all produce functional proteins. They might even produce the same protein (if the differences are all in the untranslated regions), or they might produce different proteins with different functions. It's very difficult to tell from purely RNA-seq data.
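To put numbers on the diagram, here is a small sketch with hypothetical exon coordinates (not taken from any real annotation) that loosely mirror the three transcripts drawn above. It shows that the annotated "gene region" is the union of the transcript spans, so every individual transcript can be shorter than it.

```python
# Hypothetical exon coordinates (start, end), 1-based inclusive, same strand.
transcripts = {
    "transcript 1": [(1, 150), (300, 370), (500, 650)],
    "transcript 2": [(30, 150), (500, 650)],
    "transcript 3": [(200, 260), (300, 370), (580, 900)],
}

def span(exons):
    """Genomic extent of a transcript: first exon start to last exon end."""
    return min(s for s, _ in exons), max(e for _, e in exons)

def spliced_length(exons):
    """Length of the mature transcript (sum of exon lengths)."""
    return sum(e - s + 1 for s, e in exons)

spans = {name: span(exons) for name, exons in transcripts.items()}

# The "gene region" runs from the most 5' start to the most 3' end.
gene_start = min(s for s, _ in spans.values())
gene_end = max(e for _, e in spans.values())
gene_length = gene_end - gene_start + 1

for name, exons in transcripts.items():
    s, e = spans[name]
    print(name, "span:", e - s + 1,
          "spliced:", spliced_length(exons),
          "gene region:", gene_length)
```

With these made-up coordinates, no single transcript spans the full gene region, yet none of them is necessarily defective.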

ADD COMMENT • link 14 months ago by i.sudbery 20k

Thanks!
I'm working with a bacterium - K. pneumoniae.

ADD REPLY • link 14 months ago by langziv ▴ 70

If you are working with bacteria, then a reference genome is certainly available, so why did you choose to do an assembly? You could also have used a tool like Rockhopper, which is meant for bacterial RNA-seq assembly, rather than Trinity.

ADD REPLY • link 14 months ago by GenoMax 147k

I have an assembly, not a genome, so I asked here on Biostars and searched the internet and Trinity was the best choice, according to the answers I got and information I read.
I also tried working with Rockhopper before using Trinity, and it didn't work because the strain was not recognized by the program. I asked the program developer and got no response, so I figured I should find something else.

ADD REPLY • link 14 months ago by langziv ▴ 70

Are you starting this analysis with RNA-seq data or DNA-seq data? If you have RNA-seq data, then K. pneumoniae is a common bacterium and there are 17K+ genomes available at NCBI. Are you not able to use one of the existing genome assemblies?

ADD REPLY • link 14 months ago by GenoMax 147k

I have RNA-seq. Rockhopper didn't recognize the strain's name, and using another K. pneumoniae genome is not an option.

ADD REPLY • link 14 months ago by langziv ▴ 70

I modified the question so that instead of genes the question refers to CDSs.

ADD REPLY • link 14 months ago by langziv ▴ 70

If it's difficult to tell from RNA-seq data, what other approaches should I use?

ADD REPLY • link 14 months ago by langziv ▴ 70

How is it possible to tell whether a transcript is a misassembly?

ADD REPLY • link 14 months ago by langziv ▴ 70

Align your data to one of the genomes available at NCBI. There is no way your strain is completely different than the common reference. If it is then it may not be K. pneumoniae.

ADD REPLY • link 14 months ago by GenoMax 147k

I have aligned the RNA to the strain's genome from NCBI. The issue is that some RNA sequences that match CDSs from the genome match them only partially, and I'm looking for a way to tell which of these partially matching RNAs are functional, and which aren't and should be discarded.
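A partial match can at least be quantified before deciding what to discard. The sketch below assumes the BLAST tabular output includes `sstart`, `send` and `slen` for each hit (the CDS being the subject); the function names, the sample hits and the 0.9 cutoff are all arbitrary choices for illustration. Coverage alone says nothing about whether the RNA is functional; it only separates near-full-length matches from fragments.

```python
def cds_coverage(sstart, send, slen):
    """Fraction of the CDS (BLAST subject) covered by a single alignment."""
    lo, hi = sorted((sstart, send))  # send < sstart for minus-strand hits
    return (hi - lo + 1) / slen

def label_hit(sstart, send, slen, min_coverage=0.9):
    """Label one hit; the default 0.9 cutoff is an arbitrary assumption."""
    if cds_coverage(sstart, send, slen) >= min_coverage:
        return "near-full-length"
    return "partial"

# Made-up hits: (qseqid, sseqid, sstart, send, slen)
hits = [
    ("TRINITY_DN1_c0_g1_i1", "cds_0001", 1, 900, 900),
    ("TRINITY_DN3_c0_g1_i1", "cds_0003", 600, 511, 600),
]
labels = {q: label_hit(sstart, send, slen) for q, _, sstart, send, slen in hits}
for query, label in labels.items():
    print(query, label)
```

This treats each row as one alignment. If a transcript has several HSPs against the same CDS, their intervals would need to be merged before computing coverage.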

ADD REPLY • link 14 months ago by langziv ▴ 70