Question

Very low unique mapped reads - STAR alignment

0

Entering edit mode

6.3 years ago

jsl ▴ 50

Hi all, while looking at my log.final.out file, it dawned on me that my uniquely mapped reads is really low. See below:

       Mapping speed, Million of reads per hour |       113.33

                          Number of input reads |       11050047
                      Average input read length |       300
                                    UNIQUE READS:
                   Uniquely mapped reads number |       24100
                        Uniquely mapped reads % |       0.22%
                          Average mapped length |       279.06
                       Number of splices: Total |       31927
            Number of splices: Annotated (sjdb) |       6179
                       Number of splices: GT/AG |       11720
                       Number of splices: GC/AG |       315
                       Number of splices: AT/AC |       119
               Number of splices: Non-canonical |       19773
                      Mismatch rate per base, % |       1.65%
                         Deletion rate per base |       0.02%
                        Deletion average length |       3.24
                        Insertion rate per base |       0.02%
                       Insertion average length |       2.50
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       16802
             % of reads mapped to multiple loci |       0.15%
        Number of reads mapped to too many loci |       630
             % of reads mapped to too many loci |       0.01%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       99.61%
                     % of reads unmapped: other |       0.02%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

I used STAR to make an index and subsequently aligned it. My command for STAR is as follow

STAR --genomeDir /home/user/scratch60/hg38_index \
    --runThreadN 6 \
    --readFilesIn /home/user/scratch60/SRR7059136.fastq \
    --outFileNamePrefix /home/user/scratch60/STARresults/SRR7059136  \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMunmapped Within \
    --outSAMattributes Standard

If some one could shed some light onto this, that would be much appreciated. Thanks!

RNA-Seq • 8.5k views

ADD COMMENT • link updated 6.3 years ago by h.mon 35k • written 6.3 years ago by jsl ▴ 50

1

Entering edit mode

have you done QC of fastq files? And are you sure that you are mapping to a correct reference genome?

ADD REPLY • link 6.3 years ago by grant.hovhannisyan ★ 2.6k

1

Entering edit mode

"Reads too short" doesn't really mean that, it just means "didn't map".

As grant says, the first things to check are: is the fastq okay, or is it low quality garbage?

Second, are you sure this is the right reference?

ADD REPLY • link 6.3 years ago by swbarnes2 14k

0

Entering edit mode

Thanks guys, these fastq files are from published GEO SRA https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3109339 and I obtained the Fastq via fastq-dump.

They used GENCODE GRCh38 v26 gene annotation, while I used these to make the index,

GCF_000001405.26_GRCh38_genomic.fna 
GCF_000001405.26_GRCh38_genomic.gff

but shouldn't make a difference?

ADD REPLY • link 6.3 years ago by jsl ▴ 50

1

Entering edit mode

Could you naively try to blast a few unmapped reads?

ADD REPLY • link 6.3 years ago by WouterDeCoster 47k

0

Entering edit mode

yeah, they do belong to the right GRCh38 assembly... i'm puzzled!

ADD REPLY • link 6.3 years ago by jsl ▴ 50

0

Entering edit mode

% of reads unmapped: too short | 99.61%

You lose a lot of reads here...

ADD REPLY • link 6.3 years ago by WouterDeCoster 47k

score 3 · Accepted Answer · 2018-08-18

Solution: Download the properly separated forward and reverse fastq files from ENA, or use fastq-dump --split-files. Downloading from ENA is faster and less error-prone. Then map the files as paired-end reads, not as single-end.

Diagnostic:

The page you linked says:

Data processing Paired-end reads were mapped to reference genome using STAR version 2.5.2b

How did you download the fastq files? You are mapping as single-end, but the original data is paired end:

SRR7059136 - Illumina HiSeq 4000 paired end sequencing

I suspect by not splitting the forward and reverse reads, the reads that originally overlaped now become artifact inverted duplicates. Look at the coordinates of this blast, made with the first read from the record you linked, but not split as it should.

 Homo sapiens clone LA13_165F6 sequence
Sequence ID: KC876030.1 Length: 36220 Number of Matches: 2
Related Information
Range 1: 22403 to 22552
Next Match
Previous Match
Alignment statistics for match #1 Score Expect  Identities  Gaps    Strand
260 bits(288)   1e-65   147/150(98%)    0/150(0%)   Plus/Plus

Query  1      TCTCAGGCTCCCTCTCCGGAATCGAACCCTGATTCCCCGTCACCCGTGGTCACCATGGTA  60
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  22403  TCTCAGGCTCCCTCTCCGGAATCGAACCCTGATTCCCCGTCACCCGTGGTCACCATGGTA  22462

Query  61     GGCACGGNGACTACCATCGAAAGTTGATAGGGCAGACGTTCGAATGGGTCGTCGCCGCCA  120
              ||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  22463  GGCACGGCGACTACCATCGAAAGTTGATAGGGCAGACGTTCGAATGGGTCGTCGCCGCCA  22522

Query  121    CGGGGGGCGTGCGNTCGGNCCGAGGTTATC  150
              ||||||||||||| |||| |||||||||||
Sbjct  22523  CGGGGGGCGTGCGATCGGCCCGAGGTTATC  22552


Range 2: 22408 to 22557
Next Match
Previous Match
First Match
Alignment statistics for match #2 Score Expect  Identities  Gaps    Strand
210 bits(232)   2e-50   134/150(89%)    0/150(0%)   Plus/Minus

Query  151    CTCTNGATAACCTCGGGCCGATNGCACGCCCCCNGTGGCGGCGACGACCCATTCGANCGA  210
              |||| ||||||||||||||||| |||||||||| |||||||||||||||||||||| || 
Sbjct  22557  CTCTAGATAACCTCGGGCCGATCGCACGCCCCCCGTGGCGGCGACGACCCATTCGAACGT  22498

Query  211    NTNCCCTNTCAACNNTNGATNGTAGNCGCCGTGCCTACCATGGTGACCACGGGAGACGGG  270
               | |||| |||||  | ||| |||| ||||||||||||||||||||||||||| ||||||
Sbjct  22497  CTGCCCTATCAACTTTCGATGGTAGTCGCCGTGCCTACCATGGTGACCACGGGTGACGGG  22438

Query  271    GAATCAGGGTTCGATTCNGGAGAGGGAACC  300
              ||||||||||||||||| ||||||||| ||
Sbjct  22437  GAATCAGGGTTCGATTCCGGAGAGGGAGCC  22408