Hello all - I just got into bioinformatics and have been stuck on a problem for days. I am attempting to re-analyze an RNA-seq dataset. The data is available as FASTQ at ebi.ac.uk under the accession number SRR5217848 (I downloaded it from the ENA Browser (ebi.ac.uk)). I uploaded it to Galaxy and ran FastQC; everything looked okay. I then ran Trimmomatic, kept the paired output (as opposed to the unpaired output, which I understand I do not strictly need for downstream analysis), and reached the point where I should align the reads to the genome.

The genome I have is in .fasta format; I downloaded it from UCSC (https://hgdownload.soe.ucsc.edu/hubs/GCA/002/082/055/GCA_002082055.1/GCA_002082055.1.fa.gz). I also have a .gtf annotation file describing where the exons/CDSs/mRNAs are, also taken from UCSC (https://hgdownload.soe.ucsc.edu/hubs/GCA/002/082/055/GCA_002082055.1/genes/GCA_002082055.1_nHd_3.1.xenoRefGene.gtf.gz).

I decided to use RNA STAR from Galaxy for the alignment. For "Custom or in-built reference genome", I chose "Use reference genome from history and build temporary index" and supplied the .fasta genome under "Select a reference genome". For "Build index with or without known splice junctions annotation", I chose "build index with gene-model" and supplied the .gtf file under "Gene model (gff3,gtf) file for splice junctions".

When I got the result, I downloaded the RNA STAR .bam file and opened it in a viewer (I'm using SeqMonk from the Babraham Institute). STAR seems to have aligned reads to the whole genome and not exclusively to the exons/CDSs/mRNAs described in the .gtf file (I am uploading a screenshot from the SeqMonk viewer). Could there be a reason for that, and how can I address this problem appropriately?
Thanks a lot :)! You're literally saving my life as this research is very important to me.
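One way to put a number on the observation above, rather than judging it by eye in a viewer, is to count how many alignments overlap an exon from the .gtf file. The following is only a minimal sketch under stated assumptions: the function names (`load_exons`, `overlaps_exon`, `exonic_fraction`) are hypothetical, the contig names and coordinates are made-up toy data, and the alignment intervals are assumed to have already been extracted from the .bam file by some other tool. It relies only on the standard GTF layout (tab-separated, feature type in column 3, 1-based inclusive start and end in columns 4 and 5).

```python
import bisect
from collections import defaultdict


def load_exons(gtf_lines):
    """Return {contig: sorted list of merged (start, end) exon intervals}."""
    raw = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # keep only exon records
        raw[fields[0]].append((int(fields[3]), int(fields[4])))

    merged = {}
    for contig, spans in raw.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= out[-1][1] + 1:  # overlapping or adjacent: extend
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[contig] = [tuple(span) for span in out]
    return merged


def overlaps_exon(exons, contig, start, end):
    """True if the 1-based inclusive interval [start, end] touches an exon."""
    spans = exons.get(contig, [])
    # Number of merged exons whose start is <= end; the last one is the
    # only candidate because merged intervals are disjoint and sorted.
    i = bisect.bisect_right(spans, (end, float("inf")))
    return i > 0 and spans[i - 1][1] >= start


def exonic_fraction(exons, alignments):
    """Fraction of (contig, start, end) alignments overlapping any exon."""
    if not alignments:
        return 0.0
    hits = sum(overlaps_exon(exons, c, s, e) for c, s, e in alignments)
    return hits / len(alignments)


# Toy data only; these are not records from the real annotation.
toy_gtf = [
    'scaffold0001\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1";',
    'scaffold0001\ttoy\texon\t150\t300\t.\t+\t.\tgene_id "g1";',
    'scaffold0001\ttoy\tCDS\t500\t600\t.\t+\t0\tgene_id "g1";',
    'scaffold0002\ttoy\texon\t10\t50\t.\t-\t.\tgene_id "g2";',
]
toy_alignments = [
    ("scaffold0001", 290, 340),  # overlaps the merged exon 100-300
    ("scaffold0001", 301, 350),  # just past the exon
    ("scaffold0002", 1, 10),     # touches the exon at position 10
    ("scaffold0003", 5, 9),      # contig with no annotation
]
toy_exons = load_exons(toy_gtf)
toy_fraction = exonic_fraction(toy_exons, toy_alignments)
```

On real data the same functions could be fed the lines of the decompressed .gtf file and alignment coordinates exported from the .bam file; a contig-name mismatch between the two files would show up here as a fraction close to zero.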
I'm not sure how STAR on Galaxy works, but STAR outputs a log file summarizing some alignment statistics. Do you have access to this, and can you share it here?
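If the log is long, the headline numbers can be pulled out with a few lines of Python. This is a rough sketch, assuming the log follows the layout of STAR's `Log.final.out`, where each statistic is a `name | value` line and section headers have no `|`. The helper names (`parse_star_final_log`, `percent`) are hypothetical, and the numbers in `example_log` are invented for illustration; they are not from this dataset.

```python
def parse_star_final_log(text):
    """Map each 'name | value' line of a STAR final log to a dict entry."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a value such as '80.00%' to the float 80.0."""
    return float(stats[key].rstrip("%"))


# Invented numbers, laid out like a Log.final.out excerpt.
example_log = """\
                          Number of input reads |\t1000
                                  UNIQUE READS:
                        Uniquely mapped reads % |\t80.00%
                             MULTI-MAPPING READS:
             % of reads mapped to multiple loci |\t5.00%
                                  UNMAPPED READS:
                 % of reads unmapped: too short |\t14.00%
"""

example_stats = parse_star_final_log(example_log)
unique_pct = percent(example_stats, "Uniquely mapped reads %")
```

The uniquely mapped, multi-mapped, and unmapped percentages are usually the first things worth posting.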
The analysis did produce a log; I am pasting the data here:
I am not sure whether this is what you were looking for. Also, I started wondering: do alignment algorithms such as STAR need annotation files such as .gtf files, or is a .fasta file of the genome sufficient, with STAR deciding on its own where the exons are?