Question

Only 1% of reads are used as "input reads" in STAR

6.5 years ago

caggtaagtat ★ 1.9k

Hi everybody,

I just aligned my samples, and for one of them STAR used only 1% of the reads in the trimmed FASTQ file for mapping. Does anyone know what the reason could be? The references I used worked fine for the rest of the data, and I have run out of ideas about where the error lies.

The FASTQ file contains around 300 million reads, and STAR only uses 3 million. This is the command I used (it's the last alignment step of a 2-pass run):

STAR --outFilterType BySJout --outFilterMismatchNmax 10 --outFilterMismatchNoverLmax 0.04 --alignEndsType EndToEnd -runThreadN 8 --outSAMtype BAM SortedByCoordinate --alignSJDBoverhangMin 4 --alignIntronMax 300000 --alignSJoverhangMin 8 --alignIntronMin 20 --genomeDir /path/to/Genome/ --sjdbOverhang 149 --quantMode GeneCounts --sjdbGTFfile /path/to/hg91.gtf --readFilesIn /path/to/file.fq > STAR.log

This is the Final log of the STAR run:

                             Started job on |   May 14 16:56:28
                         Started mapping on |   May 14 16:59:07
                                Finished on |   May 14 17:02:06
   Mapping speed, Million of reads per hour |   65.72

                      Number of input reads |   3267930
                  Average input read length |   134
                                UNIQUE READS:
               Uniquely mapped reads number |   3111505
                    Uniquely mapped reads % |   95.21%
                      Average mapped length |   135.04
                   Number of splices: Total |   1497184
        Number of splices: Annotated (sjdb) |   1497124
                   Number of splices: GT/AG |   1483304
                   Number of splices: GC/AG |   12329
                   Number of splices: AT/AC |   1093
           Number of splices: Non-canonical |   458
                  Mismatch rate per base, % |   0.18%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.85
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.51
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   116052
         % of reads mapped to multiple loci |   3.55%
    Number of reads mapped to too many loci |   492
         % of reads mapped to too many loci |   0.02%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.75%
             % of reads unmapped: too short |   0.41%
                 % of reads unmapped: other |   0.06%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

Any help is greatly appreciated!

RNA-Seq STAR mapping input • 2.9k views

Is there a chance that the input file is somehow corrupt? Do you see any errors anywhere?

6.5 years ago by GenoMax 147k

Mapping with salmon worked, and I don't see any errors during trimming.

Edit: with salmon, 330 million reads were mapped.

6.5 years ago by caggtaagtat ★ 1.9k

There's likely still an error in the FASTQ file that salmon happens to work around. Rerun without SortedByCoordinate (e.g. --outSAMtype BAM Unsorted) and check whether the last read in the output file is around the 3.2 millionth read in the FASTQ file.
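
One way to do that check is sketched below in Python. It assumes a plain-text (uncompressed) FASTQ with four lines per record up to the read in question, and that the name of the last read in the unsorted output has already been extracted (for example with samtools view). The function name `fastq_record_index` is made up for this illustration.

```python
def fastq_record_index(fastq_path, read_name):
    """Return the 1-based record number of read_name, or None if not found.

    Assumption: plain-text FASTQ with exactly four lines per record,
    at least up to the read being searched for.
    """
    with open(fastq_path, "r") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:
                continue  # only every fourth line is a header line
            # A header looks like "@name optional description"
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if line.startswith("@") and name == read_name:
                return line_number // 4 + 1
    return None
```

If the returned index is close to the "Number of input reads" in the Final log, the FASTQ file is probably damaged right after that record.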

6.5 years ago by Devon Ryan 104k

I will try that, thank you.

6.5 years ago by caggtaagtat ★ 1.9k

3 million out of 300 million is 1%, not 10%. Do you really have one sample with 300 million reads for RNAseq?
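
For reference, the fraction can be computed from the Final log itself. This is a minimal sketch: the helper names are made up for illustration, and the 300 million total is the approximate figure given in the question, not an exact count.

```python
def input_reads_from_final_log(log_text):
    """Extract 'Number of input reads' from the text of a STAR Final log."""
    for line in log_text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" have no value
        key, value = line.split("|", 1)
        if key.strip() == "Number of input reads":
            return int(value.strip())
    raise ValueError("no 'Number of input reads' line found")


def percent_used(input_reads, fastq_reads):
    """Percentage of the reads in the FASTQ file that STAR took as input."""
    return 100.0 * input_reads / fastq_reads


# Excerpt copied from the Final log in the question
log_excerpt = """
                      Number of input reads |   3267930
                  Average input read length |   134
"""

used = input_reads_from_final_log(log_excerpt)
# 300_000_000 is the approximate read count stated in the question
print(f"{percent_used(used, 300_000_000):.2f}%")  # prints 1.09%
```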

6.5 years ago by h.mon 35k

Oh, you're right. I corrected it in the question.

6.5 years ago by caggtaagtat ★ 1.9k

score 2 · Accepted Answer · 2018-05-18

OK, I learned that my FASTQ file was in fact corrupted after using SortMeRNA to remove rRNA reads.

It seems that an error can occur during the SortMeRNA run that inserts a blank line into the FASTQ file, and STAR then cuts off the whole file at that position. I hope that everything will be fine once the line is removed.

Salmon did not have any problems with the extra blank line.

Here is the link to the answer by matt.shenton, who knew about this error!

Edit: Removing the blank line did the trick
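
For anyone hitting the same problem, here is a minimal Python sketch of how the blank line could be located and removed. It assumes a plain-text (uncompressed) FASTQ file and that any blank line is spurious. Note that zero-length reads left over from trimming also show up as empty lines, and deleting those would break the four-line record structure, so the third function is a sanity check to run on the cleaned file. All function names are made up for this illustration.

```python
def find_blank_lines(fastq_path):
    """Return the 1-based line numbers of empty or whitespace-only lines."""
    with open(fastq_path) as handle:
        return [n for n, line in enumerate(handle, start=1) if not line.strip()]


def remove_blank_lines(fastq_path, cleaned_path):
    """Copy the FASTQ to cleaned_path without blank lines; return lines dropped."""
    dropped = 0
    with open(fastq_path) as src, open(cleaned_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(line)
            else:
                dropped += 1
    return dropped


def has_four_line_records(fastq_path):
    """Rough sanity check: '@' headers, '+' separators, line count divisible by 4."""
    count = 0
    with open(fastq_path) as handle:
        for count, line in enumerate(handle, start=1):
            position = (count - 1) % 4
            if position == 0 and not line.startswith("@"):
                return False
            if position == 2 and not line.startswith("+"):
                return False
    return count > 0 and count % 4 == 0
```

Typical use would be to call `find_blank_lines` first to see where the file is damaged, then `remove_blank_lines` to write a cleaned copy, and finally `has_four_line_records` on that copy before rerunning STAR.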