SAM to BAM conversion issues
3
2
Entering edit mode
7.7 years ago
ioannis ▴ 50

Hello community,

I have just started to analyse my data. As it seems it's not that easy to start bioinfomatics with this kind of data. I have been analysing methylation reads but for some reason samtools is giving me this error:

[samopen] SAM header is present: 5724 sequences.
[sam_read1] reference '256' is recognized as '*'.
Parse error at line 5732: unmatched CIGAR operation
Aborted (core dumped)

From the beginning: The fastq file has all the reads from the run and I extract only the reads that start with ''CCGG'' because of the MspI restriction enzyme. Please correct me if any of the commands I am using is wrong.

cat bsq_S1_R1_001.fastq | paste - - - - | awk -F '\t'  '(substr($2,1,4)=="CCGG")' | tr "\t" "\n" > bsq_tagged.fastq

After this step, every read in the new fastq file starts with CCGG and looks like this:

@NB501461:9:HGFFTBGX2:1:11101:22748:1061 1:N:0:CGATGT
CCGGGATCCACCTTTACACACCCTCATCCAAGCATGTCAAACTCGGCGCCCGTCTCTGCCGCGTCCACCAGGGA
+
AAAAA#EEEEEEEEEEEAE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

So I trim this fastq file with trim_galore --rrbs and -a '3' adapter_sequence' to bsq_tagged_trimmed.fq

I align the bsq_tagged_trimmed.fq to the reference genome using bowtie2-2.3.1 and I get the bsq_tagged_trimmed.sam file. The .sam file has 5724 headers and then all the aligned scaffolds with the scaffold positions in a format like this:

@HD     VN:1.0  SO:unsorted
@SQ     SN:Scaffold   LN:13623339
.
.
and so on

Then I use samtools view -b -S -o out.bam in.sam and I get the error.

I have been using for some weeks now these parameters for alignment and trimming and I never had any problem. Thing changed: Since I update the bowtie2 to version 2.3.1 everything is going bad. I also use samtools 0.1.19 but when I ran the example files from bowtie2 directory, samtools works fine. Could this be compatibility issue of trim_galore and cutadapt with the latest bowtie2 version? How can I see the line 5732 that has unmatched CIGAR operation?

software error • 2.7k views
ADD COMMENT
2
Entering edit mode

That error is indicative of wrong genome reference file. Check that you are using right reference in bowtie - especially '256' looks like a weird name for reference.

How can I see the line 5732 that has unmatched CIGAR operation?

try to see the 5732th line of sam by head (also do another check by taking the number of header lines into account)

ADD REPLY
0
Entering edit mode

I will try to run one more time my pipeline with a different genome as a reference. I am working with a non-model organism and the genome is just in scaffolds. :(

ADD REPLY
2
Entering edit mode

You should also upgrade to latest samtools. At this time 0.1.19 is really outdated.

ADD REPLY
1
Entering edit mode

I am not the only one using the server and several people probably need this version of samtools to analyse their data. Since the example file from bowtie2 is running on samtools then i guess the version is not the problem. I will ask if it is possible to update but if I can not really prove that the samtools version causes errors then I will probably get a NO. :/

ADD REPLY
0
Entering edit mode

There's no reason that they can't provide multiple versions, I think I give my people 5 versions of samtools to choose from...

ADD REPLY
0
Entering edit mode

You can always install it just for yourself as user. Conda is great for that, too.

ADD REPLY
3
Entering edit mode
7.7 years ago

MspI isn't a blunt cutter, it's more likely that reads will start with CGG. Anyway, I would suggest that you not do any filtering based on how reads start, you're just tossing data that could be useful.

I assume you're not directly aligning your data with bowtie2, but rather doing so via bismark or something like that (don't directly use bowtie2, the results will be complete crap). Make sure you're using the most recent version of whatever you're using for alignment.

[sam_read1] reference '256' is recognized as '*'.

The error message above looks like the output has a few columns missing (I assume that 256 is supposed to be the flag).

To see a random line, one can use awk: awk '{if(NR==5732) print}' alignments.sam.

ADD COMMENT
0
Entering edit mode

The data consists mainly of hydroxymethylated positions cut with MspI but the adapters are not the usual Illumina ones. Both adapters are modified and the hydroxymethylated (5hmC) reads should be reads that start with ''CCGG''. [According to the literature: Petterson,A. et al. (2014). "RRHP: a tag-based approach for 5-hydroxymethylcytosine mapping at single-site resolution." Genome Biol 15(9): 456.] I extract these "CCGG" reads from the raw data, then trim the 3' adapter. I build an index using the bowtie2-build with the reference genome (downloaded from ensembl) and then bowtie2 to align my reads according to this index. The results are a bit crap because 8% is not aligned, 48% is aligned once and 44% is aligned more than once. So basically I can use only the 48% of my reads. According to the literature they use bowtie 0.12.8 for the alignment. If I am not wrong, biomark is for bisulfite treated reads but I have no bisulfite treatment in my library preparation. I have only enzymatic restrictions that cut hydroxymethylCytosines. Either way, I am currently aligning again with a new index from a different genome reference. I will post back the results. Thank you very much for your input.

ADD REPLY
0
Entering edit mode

Normally when people mention doing an MspI digest, they're doing RRBS, which is the source of the confusion.

ADD REPLY
0
Entering edit mode
7.7 years ago
ioannis ▴ 50

Ok, so I changed from the top.level to non-chromosomal (it's in scaffolds) reference genome. I built a new index out of it with bowtie2-build but i get the same error with both genome ref indices.

I checked the lines 5731, 5732, 5733 and I get this:

5731

NB501461:9:HGFFTBGX2:1:11101:24840:1059 256     GL831471.1      515062  0       75M     *       0  CCGGCNGACAGACTCGACATTCCCAGTGAACTGAGGAAGAAAAGACGGGGGTGTCGTGCTGGGAAGGAGCGCCGT      AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE AS:i:-26        XS:i:-21        XN:i:0  XM:i:6      XO:i:0  XG:i:0  NM:i:6  MD:Z:5G8T11C2G13T29A1   YT:Z:UU

5732

@NB501461:9:HGFFTBGX2:1:11101:24840:1059 1:N:0:CGATGT%0ACCGGCNGACAGACTCGACATTCCCAGTGAACTGAGGAAGAAAAGACGGGGGTGTCGTGCTGGGAAGGAGCGCCGT%0A+%0AAAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE%0A

5733

NB501461:9:HGFFTBGX2:1:11101:24840:1059 272     GL831679.1      58773   0       75M     *       0  ACGGCGCTCCTTCCCAGCACGACACCCCCGTCTTTTCTTCCTCAGTTCACTGGGAATGTCGAGTCTGTCNGCCGG      EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE#AAAAA AS:i:-26        XS:i:-21        XN:i:0  XM:i:6      XO:i:0  XG:i:0  NM:i:6  MD:Z:14A5A24C15T7C2T2   YT:Z:UU

Well obviously something is wrong...it seems like the specific read has no scaffold or position in the genome, however it has been reported in the "aligned" file. It starts with a "@" instead of the true header. I guess samtools is fine, but bowtie2 messes up the alignment. Any suggestions?

ADD COMMENT
2
Entering edit mode

If you're not using the most recent version, then upgrade. If it still occurs then use something else (bbmap, bwa, hisat, whatever).

ADD REPLY
0
Entering edit mode

I would check that there isn't something wrong with the fastq file you used for alignment with bowtie. Use 'grep' to find all occurrences of the read ID that appears in line 5732 (in the sam file) in your fastq file.

ADD REPLY
0
Entering edit mode
7.7 years ago
John 13k

Maybe i've got it wrong, but don't you want to identify your special CCGG reads after removing adapters?

If that's the case, i'd follow Devon's advice and just start over. Remap with a proper bisulphite mapper (I use BWA-meth, but i don't know if it works with 5hmC), do the QC trimming the Bisulphite-seq way, which is usually a little more aggressive than standard, and then once you have a normal, clean, mapped BAM file of BS reads, then go through that BAM file and tag all the reads that you think are important (i.e. with "important:Z:CGG" and "important:Z:CCGG") by using the mapping positions, the reference genome, and knowledge of how many bases have been trimmed off the end. This will be better than using the sequence and looking for CCGG implicitly, as low-quality sequencing of CCGG at the start of the read will hide it from you, which is a particular concern for BS-Seq. It will ignore reads that start with CGG/CCGG but do not map, which is probably a good thing.

This is all under the assumption that you have [sequence adapter 1][CCGG......][sequence adapter 2] and the important CCGG isn't in the adapter somewhere

ADD COMMENT
1
Entering edit mode

Around 80% of the raw data are reads that start with CCGG. According to the protocol the reads that are hydroxymethylated should start only with CCGG. According to literature, trimming is done with trim_galore using --rrbs -a '3'adapter' The only reason I tried to extract these reads from the beggining (first step according to literature) was to avoid trimming and mapping reads that I am not interested in. I think the problem is not the alignment method but the latest bowtie2 version. Since I update the bowtie2 I get inconsistent .sam files. Either way, I have nothing to lose. I will go back to the bcl files, create fastq files and extract CCGG reads after trimming. Thanks for your input.

ADD REPLY

Login before adding your answer.

Traffic: 1894 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6