Question

STAR aligned my .fastq reads to the whole genome as opposed to only to the coding sequences

2.3 years ago

Aleksandar • 0

Hello all - I just got into bioinformatics and have run into a problem that I've been trying to resolve for days. I am attempting to re-analyze an RNA-seq dataset. The data is available as fastq at ebi.ac.uk under the accession number SRR5217848 (I downloaded it from the ENA Browser at ebi.ac.uk). I uploaded it to Galaxy, ran FastQC (everything looked okay), ran Trimmomatic, and kept the paired output (as opposed to the unpaired output, which I understand I do not need for downstream analysis). That brought me to the point where I should align the reads to the genome.

The genome I have is in .fasta format; I downloaded it from UCSC (https://hgdownload.soe.ucsc.edu/hubs/GCA/002/082/055/GCA_002082055.1/GCA_002082055.1.fa.gz). I also have a .gtf annotation file describing where the exons/CDSs/mRNAs are, also taken from UCSC (https://hgdownload.soe.ucsc.edu/hubs/GCA/002/082/055/GCA_002082055.1/genes/GCA_002082055.1_nHd_3.1.xenoRefGene.gtf.gz).

I decided to use RNA STAR from Galaxy for the alignment. For "Custom or in-built reference genome", I chose "Use reference genome from history and build temporary index" and supplied the .fasta genome under "Select a reference genome"; for "Build index with or without known splice junctions annotation", I chose "build index with gene-model" and supplied the .gtf file under "Gene model (gff3,gtf) file for splice junctions".

When the job finished, I downloaded the RNA STAR .bam file and opened it in a viewer (I'm using SeqMonk from the Babraham Institute). STAR seems to have aligned reads to the whole genome and not exclusively to the exons/CDS/mRNA described in the .gtf file (I am uploading a screenshot from the SeqMonk viewer). Could there be a reason for that, and how can I address this problem appropriately?

Thanks a lot :)! You're literally saving my life as this research is very important to me.

[Image: screenshot from the SeqMonk viewer]

read RNA-seq STAR Galaxy alignment • 1.1k views

ADD COMMENT • link updated 2.3 years ago by lieven.sterck 15k • written 2.3 years ago by Aleksandar • 0

I'm not sure how STAR on Galaxy works, but STAR outputs a log file summarizing some alignment stats. Do you have access to this, and can you share it here?

ADD REPLY • link 2.3 years ago by rpolicastro 13k

A log did come out of the analysis; I am pasting its contents here:

                                   Started job on |   Aug 29 07:55:52
                               Started mapping on |   Aug 29 07:58:05
                                      Finished on |   Aug 29 08:21:08
         Mapping speed, Million of reads per hour |   23.51

                            Number of input reads |   9030721
                        Average input read length |   194
                                    UNIQUE READS:
                     Uniquely mapped reads number |   6882484
                          Uniquely mapped reads % |   76.21%
                            Average mapped length |   191.54
                         Number of splices: Total |   3902790
              Number of splices: Annotated (sjdb) |   346288
                         Number of splices: GT/AG |   3819213
                         Number of splices: GC/AG |   33882
                         Number of splices: AT/AC |   359
                 Number of splices: Non-canonical |   49336
                        Mismatch rate per base, % |   0.45%
                           Deletion rate per base |   0.03%
                          Deletion average length |   1.87
                          Insertion rate per base |   0.03%
                         Insertion average length |   1.60
                             MULTI-MAPPING READS:
          Number of reads mapped to multiple loci |   128787
               % of reads mapped to multiple loci |   1.43%
          Number of reads mapped to too many loci |   1682
               % of reads mapped to too many loci |   0.02%
                                  UNMAPPED READS:
    Number of reads unmapped: too many mismatches |   0
         % of reads unmapped: too many mismatches |   0.00%
              Number of reads unmapped: too short |   2014375
                   % of reads unmapped: too short |   22.31%
                  Number of reads unmapped: other |   3393
                       % of reads unmapped: other |   0.04%
                                  CHIMERIC READS:
                         Number of chimeric reads |   0
                              % of chimeric reads |   0.00%

I am not sure whether this is what you were looking for. Moreover, I started wondering: do alignment algorithms such as STAR need annotation files such as .gtf files, or is a .fasta of the genome sufficient, with STAR deciding on its own where the exons are?
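One way to approach that last question from the log itself: the two "Number of splices" lines show how many of the splice junctions STAR reported were already in the supplied annotation. The sketch below is a minimal, hypothetical helper (`parse_star_log` is not part of STAR or Galaxy) that reads `label | value` lines like the ones pasted above and computes that share. It assumes the log keeps this plain-text layout.

```python
def parse_star_log(text):
    """Parse 'label | value' lines from a STAR final log into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip blank lines and section headers such as 'UNIQUE READS:'
        label, value = line.split("|", 1)
        stats[label.strip()] = value.strip()
    return stats


# A few lines copied from the log pasted above
log_excerpt = """
    Number of input reads |   9030721
    UNIQUE READS:
    Uniquely mapped reads % |   76.21%
    Number of splices: Total |   3902790
    Number of splices: Annotated (sjdb) |   346288
    % of reads unmapped: too short |   22.31%
"""

stats = parse_star_log(log_excerpt)
annotated = int(stats["Number of splices: Annotated (sjdb)"])
total = int(stats["Number of splices: Total"])
annotated_fraction = annotated / total
print(f"Annotated splices: {annotated_fraction:.1%} of all splices")
```

For this run, roughly 9% of the reported splices match junctions from the .gtf; the rest were found by STAR from the reads alone. That suggests the annotation is not required for spliced alignment and that this particular annotation covers only part of what is transcribed.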

ADD REPLY • link updated 2.3 years ago by lieven.sterck 15k • written 2.3 years ago by Aleksandar • 0

score 0 · Answer 1 · 2022-08-29

2.3 years ago

swbarnes2 14k

The STAR alignment stats look fine. The gtf is a guide to help the reads be spliced properly, but if the best place to align a read is intergenic, it will be placed there. Which is what you want; you don't want to be forcing reads to align to the wrong place because you don't like where they really belong.

I don't quite get what the top visualization means but the bottom one looks like a variable pile up of reads on exons.
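If you want a number for how many reads land outside annotated exons, rather than judging by eye in a viewer, something like the following could work. It is a minimal sketch in plain Python, not a STAR or Galaxy feature: `load_exons` and `overlaps_exon` are hypothetical helper names, the contig name `scaffold1` and the coordinates are made up for the demo, and read positions are assumed to be 1-based inclusive like GTF coordinates.

```python
from bisect import bisect_right
from collections import defaultdict


def load_exons(gtf_lines):
    """Collect exon intervals per contig from GTF lines and merge overlapping ones."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        exons[fields[0]].append((int(fields[3]), int(fields[4])))

    merged = {}
    for chrom, intervals in exons.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1] + 1:      # overlapping or directly adjacent
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    return merged


def overlaps_exon(merged, chrom, start, end):
    """True if the read interval [start, end] overlaps any merged exon on chrom."""
    if chrom not in merged:
        return False
    starts, ends = merged[chrom]
    i = bisect_right(starts, end) - 1        # last exon starting at or before `end`
    return i >= 0 and ends[i] >= start


# Toy annotation (made-up contig and coordinates, for illustration only)
toy_gtf = [
    "scaffold1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id \"g1\";",
    "scaffold1\tdemo\texon\t150\t300\t.\t+\t.\tgene_id \"g1\";",
    "scaffold1\tdemo\tCDS\t400\t450\t.\t+\t0\tgene_id \"g1\";",
    "scaffold1\tdemo\texon\t500\t600\t.\t+\t.\tgene_id \"g2\";",
]
merged = load_exons(toy_gtf)
print(overlaps_exon(merged, "scaffold1", 250, 260))   # inside a merged exon
print(overlaps_exon(merged, "scaffold1", 350, 400))   # between exons
```

Feeding it the alignment start and end of each read from the BAM would give an exonic versus non-exonic tally. Keep in mind that "non-exonic" here only means "not in this .gtf", which is not the same as "not transcribed".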

ADD COMMENT • link 2.3 years ago by swbarnes2 14k

I get what you're saying - thanks a lot for the reply. Regarding the visualisation: the blue lane labelled mRNA on the far left shows where the genes/exons/CDS are annotated in this genome assembly, and the white lane labelled "Galaxy-12-[RNA_STAR...]" shows the reads from the RNA-seq dataset. What surprised me was the abundance of RNA-seq reads mapping outside of genes. That's all.

ADD REPLY • link 2.3 years ago by Aleksandar • 0

There may be an experimental explanation for that. A simple one could be contamination with DNA or expression from regions that were not known to be expressed before (two extremes to consider). Since you are using data downloaded from SRA (that you did not generate) you should consider all the outside possibilities.
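A rough way to tell those two extremes apart is to look at whether the reads outside annotated genes are spliced. In SAM/BAM, a spliced alignment carries an `N` (skipped region) operation in its CIGAR string. Reads from genomic DNA should not contain splice junctions, whereas reads from unannotated multi-exon transcripts often do. The sketch below is only a heuristic under that assumption, and the helper names are hypothetical. Unspliced reads can also come from single-exon genes or intronic pre-mRNA, so a low spliced fraction is compatible with DNA contamination but does not prove it.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_spliced(cigar):
    """True if the CIGAR string contains an N (skipped region / intron) operation."""
    return any(op == "N" for _, op in CIGAR_OP.findall(cigar))


def spliced_fraction(cigars):
    """Fraction of CIGAR strings that contain at least one N operation."""
    cigars = list(cigars)
    if not cigars:
        return 0.0
    return sum(is_spliced(c) for c in cigars) / len(cigars)


# Made-up CIGAR strings for illustration; '*' marks a read without an alignment
example_cigars = ["97M", "50M1200N47M", "10S87M", "*"]
print(spliced_fraction(example_cigars))
```

In practice the CIGAR strings would come from the reads that fall outside annotated exons in the STAR .bam file, and the resulting fraction would be compared with the fraction for reads inside annotated genes.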

ADD REPLY • link 2.3 years ago by GenoMax 147k