Flawed SAM Header Regex?
13.4 years ago

According to the SAM format white paper (http://samtools.sourceforge.net/SAM1.pdf), header lines should match /^@[A-Za-z][A-Za-z](\t[A-Za-z][A-Za-z0-9]:[ -~])+$/ or /^@CO\t.*/. However, the regex does not seem to match the example in the white paper itself, or real-world data.

ruby -e 'puts "@HD\tVN:1.3\tSO:coordinate" =~ /^@[A-Za-z][A-Za-z](\t[A-Za-z][A-Za-z0-9]:[ -~])+$/'

So is the SAM header or the regex flawed?

Cheers

Martin

sam format • 2.1k views
13.4 years ago
Michael Barton ★ 1.9k

I use Rubular for testing regexes. You could add your test SAM string there and then play with the regex until you get the correct match.


Hey, that is pretty cool!

13.4 years ago
brentp 24k

Hm, it looks like the regex is flawed. It is currently:

/^@[A-Za-z][A-Za-z](\t[A-Za-z][A-Za-z0-9]:[ -~])+$/

It seems it should be:

/^@[A-Za-z][A-Za-z](\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/

The extra '+' allows more than one character to follow the ':'.
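
A quick way to check the difference is to run both patterns against the header from the question. This is only a minimal sketch: the constant names SPEC_RE and FIXED_RE are made up for illustration, and the second pattern is the fix proposed here, not something confirmed by the spec authors.

```ruby
# Regex as printed in the white paper: exactly one character after the colon.
SPEC_RE  = /^@[A-Za-z][A-Za-z](\t[A-Za-z][A-Za-z0-9]:[ -~])+$/
# Proposed fix (an assumption): one or more printable characters after the colon.
FIXED_RE = /^@[A-Za-z][A-Za-z](\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/

header = "@HD\tVN:1.3\tSO:coordinate"

p SPEC_RE.match?(header)   # false: "VN:1" is consumed, then "." is neither a tab nor end of line
p FIXED_RE.match?(header)  # true: each value runs up to the next tab or the end of the line
```

The tab character is outside the [ -~] range, so the added '+' cannot run past a field boundary.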


Funny, to my understanding [ -~]+ allows one or more characters from the group space, dash and tilde. How it matches '1.3' and 'coordinate' baffles me.


@masha it does look odd, but it's the same format as [A-Z]: it means match any character between " " and "~", endpoints included. ord(" ") == 32 and ord("~") == 126
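
To see the range for yourself, you can list every 7-bit ASCII character that the class accepts. A small sketch, with PRINTABLE and accepted as illustrative names only:

```ruby
# Character class from the SAM header regex: a range from space to tilde.
PRINTABLE = /[ -~]/

# Collect every 7-bit ASCII character the class accepts.
accepted = (0..127).map(&:chr).select { |c| c.match?(PRINTABLE) }

p accepted.size        # 95
p accepted.first.ord   # 32  (space)
p accepted.last.ord    # 126 (tilde)
p "\t".match?(PRINTABLE)  # false: tab (9) is below the range
```

So digits, letters, '.' and the rest of the printable ASCII set all fall inside the range, which is why '1.3' and 'coordinate' match.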


Of course. I have submitted the question to the samtools mailing list and am waiting for an answer.
