Hi everybody,
I just aligned my samples, and for one of them STAR used only 1% of the reads in the trimmed FASTQ file for mapping. Does anyone know what the reason could be? The references I used worked fine for the rest of the data, and I have run out of ideas about where the error lies.
The FASTQ file contains around 300 million reads, but STAR only uses about 3 million. This is the command I used (it's the last alignment step of a 2-pass run):
STAR --outFilterType BySJout --outFilterMismatchNmax 10 --outFilterMismatchNoverLmax 0.04 --alignEndsType EndToEnd --runThreadN 8 --outSAMtype BAM SortedByCoordinate --alignSJDBoverhangMin 4 --alignIntronMax 300000 --alignSJoverhangMin 8 --alignIntronMin 20 --genomeDir /path/to/Genome/ --sjdbOverhang 149 --quantMode GeneCounts --sjdbGTFfile /path/to/hg91.gtf --readFilesIn /path/to/file.fq > STAR.log
This is the final log (Log.final.out) of the STAR run:
Started job on | May 14 16:56:28
Started mapping on | May 14 16:59:07
Finished on | May 14 17:02:06
Mapping speed, Million of reads per hour | 65.72
Number of input reads | 3267930
Average input read length | 134
UNIQUE READS:
Uniquely mapped reads number | 3111505
Uniquely mapped reads % | 95.21%
Average mapped length | 135.04
Number of splices: Total | 1497184
Number of splices: Annotated (sjdb) | 1497124
Number of splices: GT/AG | 1483304
Number of splices: GC/AG | 12329
Number of splices: AT/AC | 1093
Number of splices: Non-canonical | 458
Mismatch rate per base, % | 0.18%
Deletion rate per base | 0.01%
Deletion average length | 1.85
Insertion rate per base | 0.01%
Insertion average length | 1.51
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 116052
% of reads mapped to multiple loci | 3.55%
Number of reads mapped to too many loci | 492
% of reads mapped to too many loci | 0.02%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.75%
% of reads unmapped: too short | 0.41%
% of reads unmapped: other | 0.06%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Any help is greatly appreciated!
Is there a chance of the input file being somehow corrupt? Do you see any errors anywhere?
Mapping with salmon worked, and I don't see any errors during trimming.
Edit: with salmon, 330 million reads were mapped.
There's likely still an error in the FASTQ file that salmon happens to work around. Don't use SortedByCoordinate, and check whether the last read in the output file is around the 3.2 millionth read in the FASTQ file.
I will try that, thank you.
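One way to look for the kind of FASTQ error suggested above is to scan the file record by record and report the first one that looks malformed. The following is only a rough sketch, not a replacement for a proper validator: it assumes plain 4-line FASTQ records (no wrapped sequences), and the function names first_bad_record and check_fastq are made up for this illustration.

```python
import gzip
from typing import Iterable, Optional, Tuple


def first_bad_record(lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (1-based record number, reason) for the first malformed
    4-line FASTQ record, or None if every record looks well-formed."""
    it = iter(lines)
    record_no = 0
    while True:
        header = next(it, None)
        if header is None:
            return None  # clean end of input
        record_no += 1
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            return record_no, "truncated record"
        # Strip line endings before comparing lengths.
        header, seq, plus, qual = (
            s.rstrip("\r\n") for s in (header, seq, plus, qual)
        )
        if not header.startswith("@"):
            return record_no, "header does not start with '@'"
        if not plus.startswith("+"):
            return record_no, "separator line does not start with '+'"
        if len(seq) != len(qual):
            return record_no, "sequence and quality lengths differ"


def check_fastq(path: str) -> Optional[Tuple[int, str]]:
    """Scan a plain or gzip-compressed FASTQ file (assumption: a '.gz'
    suffix means gzip) and return the first problem found, if any."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return first_bad_record(handle)
```

Calling check_fastq("/path/to/file.fq") on the trimmed file would return None if no problem is detected. If it instead reports a record number close to the 3,267,930 input reads in the STAR log, that would be consistent with STAR stopping at a damaged record.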
3 million out of 300 million is 1%, not 10%. Do you really have one sample with 300 million reads for RNAseq?
Oh, you're right, I edited it in the question.