Question

[STAR] Why are there so many more mapped reads than unmapped reads in my .sam output file?

0

Entering edit mode

7.6 years ago

Kristin Muench ▴ 640

Hello,

I ran STAR with the command:

STAR --runThreadN ${SLURM_NPROCS:-1} \\
         --genomeDir \$STAR_HG19_GENOME \\
         --readFilesIn ${workingDir}/$Read1 ${workingDir}/$Read2 \\
         --outSAMunmapped Within \\
         --outReadsUnmapped Fastx \\
         --outFileNamePrefix ${workingDir}/${outputFileLoc}/${sample}

Since the output SAM file should include both mapped and unmapped transcripts within the file (--outSAMunmapped Within), I tried extracting the mapped and unmapped reads using these commands:

samtools view -F4 sample.bam > sample.mapped.sam
samtools view -f4 sample.bam > sample.unmapped.sam

Here is a sample of what the output files looked like.

head sample.mapped.sam
NS500127:25:HJM5YBGX2:1:11101:20416:1050    163 chr1    76779521    255 1S41M33S    =   76779576    130 NCAGCGTTCCTTTTCCNGCTGGNTNTGCNTNNNNTNAANNAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN #AAAAEEEEEEEEEAE#EEEEE#E#EEE#E####E#EE##EE################################# NH:i:1  HI:i:1  AS:i:102    nM:i:0
NS500127:25:HJM5YBGX2:1:11101:20416:1050    83  chr1    76779576    255 75M =   76779521    -130    GCTACTAAACTGCTTTGGACAACCTGGTACAAAGTGGATACCATTCTCCTACACATACAGGCGGCCCCTNCGAAC E6EEEEAAEEE6EEEE6EEAEE/EEAEEEEEEEAE<EEEE6EEEEEEEEEEEEEEEEEEEEEE/EEA6E#AAAA/ NH:i:1  HI:i:1  AS:i:102    nM:i:0

tail sample.unmapped.sam
NS500127:25:HJM5YBGX2:4:23612:23666:18750   77  *   0   0   *   *   0   0   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEA/EE NH:i:0  HI:i:0  AS:i:96 nM:i:0  uT:A:1
NS500127:25:HJM5YBGX2:4:23612:23666:18750   141 *   0   0   *   *   0   0   CCACCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTT    A6/AA///E6/EE/E///E//EEEEE//AAA/AAEEEEEA//////////A/////////<E/////A///////A    NH:i:0  HI:i:0  AS:i:96 nM:i:0  uT:A:1

My Log.progress.out file reports that I have 81.7% alignment of reads. However, I notice that sample.unmapped.sam has way more lines (=reads) than sample.mapped.sam.

wc -l *mapped*
   5407730 sample.mapped.sam
   8218835 sample.unmapped.sam

How could this be? Any ideas?

(Another point of confusion - the mate1/mate2 files associated with this file, which should also contain these unmapped reads thanks to --outReadsUnmapped Fastx, were completely blank. For another test file, however, they had 12,000,000-23,000,000 lines. Very confusing!)

Thank you, Kristin

RNA-Seq star • 4.4k views

ADD COMMENT • link updated 7.6 years ago by Jeffin Rockey ★ 1.3k • written 7.6 years ago by Kristin Muench ▴ 640

0

Entering edit mode

Can you paste the Log.final.out here?

ADD REPLY • link 7.6 years ago by Jeffin Rockey ★ 1.3k

0

Entering edit mode

Sure:

$ cat Log.final.out
                                     Started job on |   Feb 23 14:01:52
                                 Started mapping on |   Feb 23 14:03:02
                                        Finished on |   Feb 23 15:46:23
           Mapping speed, Million of reads per hour |   49.63

                          Number of input reads |   85488542
                      Average input read length |   150
                                    UNIQUE READS:
                   Uniquely mapped reads number |   69865284
                        Uniquely mapped reads % |   81.72%
                          Average mapped length |   149.76
                       Number of splices: Total |   31974041
            Number of splices: Annotated (sjdb) |   31393833
                       Number of splices: GT/AG |   31667307
                       Number of splices: GC/AG |   222211
                       Number of splices: AT/AC |   23142
               Number of splices: Non-canonical |   61381
                      Mismatch rate per base, % |   0.44%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.45
                        Insertion rate per base |   0.01%
                       Insertion average length |   1.33
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   10382834
             % of reads mapped to multiple loci |   12.15%
        Number of reads mapped to too many loci |   174608
             % of reads mapped to too many loci |   0.20%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |   0.00%
                 % of reads unmapped: too short |   5.89%
                     % of reads unmapped: other |   0.04%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

ADD REPLY • link 7.6 years ago by Kristin Muench ▴ 640

0

Entering edit mode

You have in fact 93 % alignment. Need more information. Would you please provide the output of the below for checks ?

*

samtools view  -c -F260 sample.bam
samtools view  -c -F256 sample.bam
samtools view  -c -f4 sample.bam
samtools view  -c -F4 sample.bam

*

ADD REPLY • link 7.6 years ago by Jeffin Rockey ★ 1.3k

0

Entering edit mode

I see - so alignment is not just Uniquely mapped reads %, but Uniquely mapped reads % + % of reads mapped to multiple loci?

Also, I didn't produce any .bam files with STAR - just .sam - so I'll run the commands you suggested on that.

samtools view  -c -F260 sample.bam
[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] Parse error at line 161121672
[main_samview] truncated file.

samtools view  -c -F256 sample.bam
[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] Parse error at line 161121672
[main_samview] truncated file.

samtools view  -c -f4 sample.bam
[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] Parse error at line 161121672
[main_samview] truncated file.

samtools view  -c -F4 sample.bam
[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] Parse error at line 161121672
[main_samview] truncated file.

Hmmm, I sense a theme...What does it mean for SEQ and QUAL to be of different length? Does the "truncated file" refer to the .fastq or some other file? These same .fastqs aligned normally with TopHat...

ADD REPLY • link 7.6 years ago by Kristin Muench ▴ 640

score 2 · Accepted Answer · 2018-02-28

2

Entering edit mode

7.6 years ago

Jeffin Rockey ★ 1.3k

Yes, the alignment is unique + multiple loci . Too many loci should not be included.

Your bam seems to be truncated or so,most likely while converting from sam. Would you please do that conversion of STAR output bam completely using samtools view -bhS -o sample.bam sample.sam and do the countings again.It should be fine then.

ADD COMMENT • link 7.6 years ago by Jeffin Rockey ★ 1.3k

0

Entering edit mode

UPDATE: You're right - there was a step when I wasn't converting .bam to .sam appropriately. Thanks for your help!

ADD REPLY • link 7.6 years ago by Kristin Muench ▴ 640