paired reads (headers) show up in 3 lines of sam file instead of expected 2
1
0
Entering edit mode
8.5 years ago
raplayer ▴ 10

Problem:

Try to count features of sam file:

htseq-count -m union -r name -t exon -i gene_id sorted1.mm10.sam mm10.gtf > sorted1.mm10.counts
100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
690428 GFF lines processed.
Error occured when processing SAM input (line 108 of file sorted1.mm10.sam):
  'pair_alignments' needs a sequence of paired-end alignments
  [Exception type: ValueError, raised in __init__.py:603]

check line 108 (compensating for header lines):

head -170 sorted1.mm10.sam | tail | awk '{print $1}'
MONK:312:C2GR3ACXX:6:1101:1609:83468
MONK:312:C2GR3ACXX:6:1101:1609:83468
MONK:312:C2GR3ACXX:6:1101:1609:87644
MONK:312:C2GR3ACXX:6:1101:1609:87644
MONK:312:C2GR3ACXX:6:1101:1609:89965
MONK:312:C2GR3ACXX:6:1101:1609:89965
MONK:312:C2GR3ACXX:6:1101:1609:93571
MONK:312:C2GR3ACXX:6:1101:1609:93571
MONK:312:C2GR3ACXX:6:1101:1609:93571
MONK:312:C2GR3ACXX:6:1101:1610:15680

So we see there are 3 alignments for "1609:93571"...

MONK:312:C2GR3ACXX:6:1101:1609:93571    97  chr5    87555882    60  101M    =   87559761    3942    CTTTTTCTCTNATGTTAGTGCCTGACAAAAGAGACTGAGAAGAAACCAACTCTCCCTAGATCTCTGATCTAAACTTCAGTGTTGAATCTTCCATTTTCTTG   CCCFFFFFHH#2AFHIIIIJJJJJJJIJJJJIIIIJJJIJJJJJJJIJJJJJJJJIJJJJJJJJJJIJJIGHGHHHFFFFFFFDCEDCEEEDADFEEEDDC   NM:i:1  MD:Z:10A90  AS:i:99 XS:i:19
MONK:312:C2GR3ACXX:6:1101:1609:93571    145 chr5    87559761    60  38S63M  =   87555882    -3942   CCAGCTCTTCACATGATCATACCAGGGACCAAAGCTCACTTGTCCAGCCATGAATTTCTCTAGGAACTCTTCCCAGGTGCCAGGCTCTGGGTGGATTTTTG   DDDCDCDDCCCDDEDD@DC@;DEECECBECFHEAA>E@GHAF@JIIIJIIHHJJIHEHIHEJIJIIJIJIFJJJIIJJJJJIJJJJJIHHHHHFFFFFCCC   NM:i:0  MD:Z:63 AS:i:63 XS:i:19 SA:Z:chr5,87558545,-,40M61S,60,0;
MONK:312:C2GR3ACXX:6:1101:1609:93571    401 chr5    87558545    60  40M61H  =   87555882    -2703   CCAGCTCTTCACATGATCATACCAGGGACCAAAGCTCACT    DDDCDCDDCCCDDEDD@DC@;DEECECBECFHEAA>E@GH    NM:i:0  MD:Z:40 AS:i:40 XS:i:19 SA:Z:chr5,87559761,-,38S63M,60,0;

check input fastq:

grep "1609:93571" sorted1.fastq 
@MONK:312:C2GR3ACXX:6:1101:1609:93571/1
@MONK:312:C2GR3ACXX:6:1101:1609:93571/2

Questions:

What would cause a third alignment to be placed into a sam file from a pair of reads?

Is there a way to force only 1 alignment for each read (I thought I read in the manual this is the default)?

Is there a way to force htseq-count to ignore orphaned alignments like this?

Thanks!

bwa mem alignment sam • 2.2k views
ADD COMMENT
2
Entering edit mode
8.5 years ago

The third alignment with a flag of 401 is not a primary alignment. This can mean different things depending on the aligner you used to generate the sam file. Some aligners will output multiple alignments and designate the best one as primary.

If you want to filter out all non-primary alignments you can:

samtools view -F 256 alignments.sam > filtered.sam
ADD COMMENT

Login before adding your answer.

Traffic: 1832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6