Question

High downstream gene expression

22 months ago

yoser4 ▴ 10

Hello everyone,

I am a novice in bioinformatics and would like to ask some questions.

I have some RNA-seq data. I aligned the reads with STAR, used the bamCoverage tool to convert the resulting BAM file to bigWig format, and then loaded it into the UCSC Genome Browser. I found a long region downstream of the gene that shows high expression. My questions are:

What is the bioinformatic explanation for this highly expressed region downstream of the gene?
Does this highly expressed region correspond to some known component of the transcript? (Someone pointed me to poly(A) tails, but I don't know whether that explains it.)
How should I read the other tracks shown in the UCSC Genome Browser (such as GC Percent or Repeats)?

With regard to the above questions, can anyone recommend relevant articles or posts? I want to learn.

Any help will be appreciated!

downstream High Gene expression • 1.2k views

updated 22 months ago by i.sudbery 20k • written 22 months ago by yoser4 ▴ 10

score 2 · Answer 1 · 2023-02-03

You don't show which species this is. I've taken a quick look at the human and mouse genomes at those coordinates, and neither has any genes around there. Judging by the pattern of expression, it looks to me like you have a series of short exons followed by a long terminal exon, and it is this terminal exon that you are marking as downstream of the gene.

Most genes in eukaryotes do have a long terminal exon, which includes the final part of the coding sequence as well as the 3' UTR. In humans, for example, the average 3' UTR is about as long as the coding sequence and is mostly contained in a single exon.

In less well-studied genomes, the location of genes is often identified by aligning protein sequences from related organisms to the genome and finding regions that could give rise to proteins of a similar sequence. This works pretty well, particularly when combined with gene prediction software working purely from sequence, and with any EST/cDNA data that might exist for the species. However, while it works well for finding open reading frames/coding sequences (CDS), it works really badly, or not at all, for annotating UTRs. Often in such genomes the region that is annotated as the "gene" just spans from the start codon to the stop codon. But the expressed part of a gene spans from the transcription start site to the transcription termination site, which might be several kb upstream and downstream of the start and stop codons, respectively.
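To get a rough feel for how far the expressed region runs past an annotated gene end, you could walk downstream through the coverage values until the signal drops away. Below is a minimal Python sketch of that idea. The function name, the threshold `min_cov`, the gap tolerance `max_gap` and the toy coverage values are all hypothetical and not part of any tool mentioned in this thread. It assumes you have already extracted per-bin coverage (for example from your bigWig) into a list ordered 5' to 3' along the gene's strand.

```python
def downstream_extension(coverage, annotated_end, min_cov=5.0, max_gap=2):
    """Count how many bins past the annotated gene end still look expressed.

    coverage      -- per-bin signal, ordered 5'->3' on the gene's strand
    annotated_end -- index of the first bin AFTER the annotated gene end
    min_cov       -- (assumed) minimum signal for a bin to count as covered
    max_gap       -- (assumed) number of consecutive low bins tolerated
    """
    last_covered = annotated_end - 1  # nothing downstream seen yet
    gap = 0
    for i in range(annotated_end, len(coverage)):
        if coverage[i] >= min_cov:
            last_covered = i
            gap = 0
        else:
            gap += 1
            if gap > max_gap:
                break  # signal has dropped for too long; stop walking
    return last_covered - annotated_end + 1


# Toy example: annotation ends after bin 2, but signal continues.
toy_cov = [30, 28, 31, 25, 27, 26, 3, 24, 22, 0, 0, 0, 0, 15]
extra_bins = downstream_extension(toy_cov, annotated_end=3)
print(extra_bins)  # 6 bins of continuous signal past the annotated end
```

Multiplying the result by the bin size you used for the bigWig gives an approximate length in bp. This only describes the coverage; it cannot by itself distinguish an unannotated 3' UTR from read-through transcription or a neighbouring gene.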

In better-annotated genomes, much effort has been put into identifying the UTR sequences using lots of cDNA/RNA-seq/CAGE data, but such data just doesn't exist for non-model organisms. Even in humans, it's only in the last few years that the UTR annotations have been anything like reliable. Even then, we are coming to realise that UTRs are highly variable from cell type to cell type and condition to condition, and that existing annotations cover only a subset of the total possible variation.

score 1 · Answer 2 · 2023-02-03

It is impossible to tell from the information you are showing. It could be misalignments, read-through transcription (detection pipeline) or some other, weirder things. Caution is warranted in particular because the chromosome in question is chrX: interpreting RNA-seq results on the sex chromosomes requires special care, since there are, for example, several long non-coding RNAs such as Xist that could give rise to spurious signals. Also bear in mind that accurate assemblies of sex chromosomes are hard to produce, so for many less studied organisms the provided reference genome sequences should be taken with a grain of salt. It is best to check the scientific literature thoroughly first. For some organisms, the UCSC Genome Browser has a literature track, which lets you easily find publications that mention a specific region or sequence.
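It is surely not a poly(A) tail. Polyadenylation is very relevant, but the tail is added post-transcriptionally and is not encoded in the genome, so reads from it would not align to the reference as a long covered region downstream of the gene. Only some artificial expression vector systems encode the poly(A) tail in the template, as part of the vector backbone.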
You can click on any track in the genome browser to obtain more information about it, e.g. the window size used to calculate the GC percent or the tools used to call and classify the repeats. To download the underlying data, use the "Table Browser" in the "Tools" menu.
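To make the "window size" point concrete, here is a small Python sketch of how a windowed GC percentage can be computed from a sequence. It is only an illustration of the concept, not the code UCSC uses. The function name is made up, the default of 5 bases is an assumption based on the human GC Percent track being described as 5-base windows (check the track description page for your assembly), and the choice to ignore `N` bases is mine.

```python
def gc_percent_windows(seq, window=5):
    """Return (start, end, gc_percent) for consecutive, non-overlapping windows.

    Bases other than A/C/G/T (e.g. N) are ignored when computing the
    percentage; windows with no called bases are skipped (an assumption).
    """
    seq = seq.upper()
    results = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        called = [base for base in chunk if base in "ACGT"]
        if not called:
            continue  # nothing to compute for an all-N window
        gc_count = sum(base in "GC" for base in called)
        results.append((start, start + len(chunk), 100.0 * gc_count / len(called)))
    return results


for start, end, pct in gc_percent_windows("GGCCAATTAAGCGCN"):
    print(start, end, pct)
```

A smaller window gives a noisier, more detailed track; a larger window smooths it. That is why it is worth checking the window size before reading anything into peaks and dips in the GC track.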

Good luck!