When using my current sam file (which I generated from a cram file, which in turn was generated from a bam file) in my analysis I kept getting following error:
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::substr: __pos (which is 18446744073709551615) > this->size() (which is 51)
or
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::substr: __pos (which is 18446744073709551615) > this->size() (which is 101)
After some digging, I found out that the following read caused the problem:
HWI-ST1222:6:1101:1083:200937#0 141 * 0 0 * * 0 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB SM:Z:TCGA-DE-A0Y3-10A-01D-A10R RG:Z:120617_SN1222_0117_BD1414ACXX_6
I have no clue how the sequencer could even generate such a read but I assume that the fact that it includes only Ns makes my analysis fall apart.
So my question is: How can I filter out reads with only Ns from the sam file?
By the way, following neighboring read gave no trouble:
HWI-ST1222:6:1101:1083:200937#0 77 * 0 0 * * 0 0 CAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB SM:Z:TCGA-DE-A0Y3-10A-01D-A10R RG:Z:120617_SN1222_0117_BD1414ACXX_6
I am currently using whole-genome sequencing data to detect human endogenous retroviruses (HERVs) with the Specific Transposable Element Aligner (HERV-K).
The simplest solution would be to remove unmapped reads as
samtools view -F 4 -o out.bam in.bam
Not an optimal solution as I expect to find relevant sequences particularly in the unmapped reads.
If you don't want to remove all unmapped reads then try the following which should only remove reads that have one or more N's (Note: this will remove reads that may have mapped otherwise but have an N). Not tested but should work. From BBTools package (
samtools
should be in your$PATH
):Adding
mappedonly=t
will achieve the same thing as @ATPoint's solution. You could also setmaxns=N
whereN
is a higher number like the last example you posted (if you want to keep that read).The problem is that the read length can be variable and therefore I cannot determine a maximum number of Ns that I would want to exclude.
That is something you are going to need to decide for any solution (even if it is something you custom code). I would say try using
reformat.sh
with 0,1,2 etc N's to see how many alignments you lose.