Question

SAM to BAM conversion issues

2

Entering edit mode

8.4 years ago

ioannis ▴ 50

Hello community,

I have just started to analyse my data. As it seems it's not that easy to start bioinfomatics with this kind of data. I have been analysing methylation reads but for some reason samtools is giving me this error:

[samopen] SAM header is present: 5724 sequences.
[sam_read1] reference '256' is recognized as '*'.
Parse error at line 5732: unmatched CIGAR operation
Aborted (core dumped)

From the beginning: The fastq file has all the reads from the run and I extract only the reads that start with ''CCGG'' because of the MspI restriction enzyme. Please correct me if any of the commands I am using is wrong.

cat bsq_S1_R1_001.fastq | paste - - - - | awk -F '\t'  '(substr($2,1,4)=="CCGG")' | tr "\t" "\n" > bsq_tagged.fastq

After this step, every read in the new fastq file starts with CCGG and looks like this:

@NB501461:9:HGFFTBGX2:1:11101:22748:1061 1:N:0:CGATGT
CCGGGATCCACCTTTACACACCCTCATCCAAGCATGTCAAACTCGGCGCCCGTCTCTGCCGCGTCCACCAGGGA
+
AAAAA#EEEEEEEEEEEAE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

So I trim this fastq file with trim_galore --rrbs and -a '3' adapter_sequence' to bsq_tagged_trimmed.fq

I align the bsq_tagged_trimmed.fq to the reference genome using bowtie2-2.3.1 and I get the bsq_tagged_trimmed.sam file. The .sam file has 5724 headers and then all the aligned scaffolds with the scaffold positions in a format like this:

@HD     VN:1.0  SO:unsorted
@SQ     SN:Scaffold   LN:13623339
.
.
and so on

Then I use samtools view -b -S -o out.bam in.sam and I get the error.

I have been using for some weeks now these parameters for alignment and trimming and I never had any problem. Thing changed: Since I update the bowtie2 to version 2.3.1 everything is going bad. I also use samtools 0.1.19 but when I ran the example files from bowtie2 directory, samtools works fine. Could this be compatibility issue of trim_galore and cutadapt with the latest bowtie2 version? How can I see the line 5732 that has unmatched CIGAR operation?

software error • 3.6k views

ADD COMMENT • link updated 8.4 years ago by John 13k • written 8.4 years ago by ioannis ▴ 50

2

Entering edit mode

That error is indicative of wrong genome reference file. Check that you are using right reference in bowtie - especially '256' looks like a weird name for reference.

How can I see the line 5732 that has unmatched CIGAR operation?

try to see the 5732th line of sam by head (also do another check by taking the number of header lines into account)

ADD REPLY • link 8.4 years ago by Santosh Anand 5.8k

0

Entering edit mode

I will try to run one more time my pipeline with a different genome as a reference. I am working with a non-model organism and the genome is just in scaffolds. :(

ADD REPLY • link 8.4 years ago by ioannis ▴ 50

2

Entering edit mode

You should also upgrade to latest samtools. At this time 0.1.19 is really outdated.

ADD REPLY • link 8.4 years ago by GenoMax 153k

1

Entering edit mode

I am not the only one using the server and several people probably need this version of samtools to analyse their data. Since the example file from bowtie2 is running on samtools then i guess the version is not the problem. I will ask if it is possible to update but if I can not really prove that the samtools version causes errors then I will probably get a NO. :/

ADD REPLY • link 8.4 years ago by ioannis ▴ 50

0

Entering edit mode

There's no reason that they can't provide multiple versions, I think I give my people 5 versions of samtools to choose from...

ADD REPLY • link 8.4 years ago by Devon Ryan 105k

0

Entering edit mode

You can always install it just for yourself as user. Conda is great for that, too.

ADD REPLY • link 8.4 years ago by WouterDeCoster 48k

score 3 · Answer 1 · 2017-04-09

3

Entering edit mode

8.4 years ago

Devon Ryan 105k

MspI isn't a blunt cutter, it's more likely that reads will start with CGG. Anyway, I would suggest that you not do any filtering based on how reads start, you're just tossing data that could be useful.

I assume you're not directly aligning your data with bowtie2, but rather doing so via bismark or something like that (don't directly use bowtie2, the results will be complete crap). Make sure you're using the most recent version of whatever you're using for alignment.

[sam_read1] reference '256' is recognized as '*'.

The error message above looks like the output has a few columns missing (I assume that 256 is supposed to be the flag).

To see a random line, one can use awk: awk '{if(NR==5732) print}' alignments.sam.

ADD COMMENT • link 8.4 years ago by Devon Ryan 105k

0

Entering edit mode

The data consists mainly of hydroxymethylated positions cut with MspI but the adapters are not the usual Illumina ones. Both adapters are modified and the hydroxymethylated (5hmC) reads should be reads that start with ''CCGG''. [According to the literature: Petterson,A. et al. (2014). "RRHP: a tag-based approach for 5-hydroxymethylcytosine mapping at single-site resolution." Genome Biol 15(9): 456.] I extract these "CCGG" reads from the raw data, then trim the 3' adapter. I build an index using the bowtie2-build with the reference genome (downloaded from ensembl) and then bowtie2 to align my reads according to this index. The results are a bit crap because 8% is not aligned, 48% is aligned once and 44% is aligned more than once. So basically I can use only the 48% of my reads. According to the literature they use bowtie 0.12.8 for the alignment. If I am not wrong, biomark is for bisulfite treated reads but I have no bisulfite treatment in my library preparation. I have only enzymatic restrictions that cut hydroxymethylCytosines. Either way, I am currently aligning again with a new index from a different genome reference. I will post back the results. Thank you very much for your input.

ADD REPLY • link 8.4 years ago by ioannis ▴ 50

0

Entering edit mode

Normally when people mention doing an MspI digest, they're doing RRBS, which is the source of the confusion.

ADD REPLY • link 8.4 years ago by Devon Ryan 105k

score 0 · Answer 2 · 2017-04-09

Ok, so I changed from the top.level to non-chromosomal (it's in scaffolds) reference genome. I built a new index out of it with bowtie2-build but i get the same error with both genome ref indices.

I checked the lines 5731, 5732, 5733 and I get this:

5731

NB501461:9:HGFFTBGX2:1:11101:24840:1059 256     GL831471.1      515062  0       75M     *       0  CCGGCNGACAGACTCGACATTCCCAGTGAACTGAGGAAGAAAAGACGGGGGTGTCGTGCTGGGAAGGAGCGCCGT      AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE AS:i:-26        XS:i:-21        XN:i:0  XM:i:6      XO:i:0  XG:i:0  NM:i:6  MD:Z:5G8T11C2G13T29A1   YT:Z:UU

5732

@NB501461:9:HGFFTBGX2:1:11101:24840:1059 1:N:0:CGATGT%0ACCGGCNGACAGACTCGACATTCCCAGTGAACTGAGGAAGAAAAGACGGGGGTGTCGTGCTGGGAAGGAGCGCCGT%0A+%0AAAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE%0A

5733

NB501461:9:HGFFTBGX2:1:11101:24840:1059 272     GL831679.1      58773   0       75M     *       0  ACGGCGCTCCTTCCCAGCACGACACCCCCGTCTTTTCTTCCTCAGTTCACTGGGAATGTCGAGTCTGTCNGCCGG      EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE#AAAAA AS:i:-26        XS:i:-21        XN:i:0  XM:i:6      XO:i:0  XG:i:0  NM:i:6  MD:Z:14A5A24C15T7C2T2   YT:Z:UU

Well obviously something is wrong...it seems like the specific read has no scaffold or position in the genome, however it has been reported in the "aligned" file. It starts with a "@" instead of the true header. I guess samtools is fine, but bowtie2 messes up the alignment. Any suggestions?

score 0 · Answer 3 · 2017-04-09

Maybe i've got it wrong, but don't you want to identify your special CCGG reads after removing adapters?

If that's the case, i'd follow Devon's advice and just start over. Remap with a proper bisulphite mapper (I use BWA-meth, but i don't know if it works with 5hmC), do the QC trimming the Bisulphite-seq way, which is usually a little more aggressive than standard, and then once you have a normal, clean, mapped BAM file of BS reads, then go through that BAM file and tag all the reads that you think are important (i.e. with "important:Z:CGG" and "important:Z:CCGG") by using the mapping positions, the reference genome, and knowledge of how many bases have been trimmed off the end. This will be better than using the sequence and looking for CCGG implicitly, as low-quality sequencing of CCGG at the start of the read will hide it from you, which is a particular concern for BS-Seq. It will ignore reads that start with CGG/CCGG but do not map, which is probably a good thing.

This is all under the assumption that you have [sequence adapter 1][CCGG......][sequence adapter 2] and the important CCGG isn't in the adapter somewhere