Question

sed newbie question: removing a specific letter in a word from fasta heading using

0

Entering edit mode

11.3 years ago

akhst7 ▴ 40

Hi,

I've been trying to remove the letters "chr" from fasta headings of hg19 ref. genome. I tried a following sed script;

sed 's/[chr,]//g'

I am just wondering if there are any other sed or awk scripts to the same.

Thanks.

RNA-Seq • 5.6k views

ADD COMMENT • link updated 3.9 years ago by Ram 45k • written 11.3 years ago by akhst7 ▴ 40

1

Entering edit mode

Try more. Here is a link that can help you http://www.brunolinux.com/02-The_Terminal/Find_and%20Replace_with_Sed.html

Remember to use "g" for global replacement.
Perform hit and trial using a small dataset so that you can perform hit and trial quickly.
Most of the sed commands don't produce new file but modify the existing file , so make sure you have the back up of the original file before you try.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 11.3 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Ashutosh, unless you use -i, all sed commands write to stdout and do not affect the original file.

ADD REPLY • link 11.3 years ago by Ram 45k

1

Entering edit mode

Yes, you are right. It was a warning for newbies who may copy and use a wrong command and accidentally modifying their original file.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 11.3 years ago by Ashutosh Pandey 12k

0

Entering edit mode

I agree, but IMHO, we must advice them to check for the -i option rather than saying that sed overwrites files. Anyway, moot point as OP looks pretty at home with UNIX (grep at least)

ADD REPLY • link 11.3 years ago by Ram 45k

0

Entering edit mode

Why other scripts, does this not work?

ADD REPLY • link 11.3 years ago by Ram 45k

1

Entering edit mode

I am in a learning process. By looking at a wide rage of sed and awk scripts having same functions on the web, I am just curious to see what alternatives are. Plus I am not yet sure this script fully works.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 11.3 years ago by akhst7 ▴ 40

0

Entering edit mode

I usually stick to sed for file content editing. My motto is to use a tool that we all know works. We could use Perl or Python or awk for this, but I'm comfortable with the simplicity and integration that sed gives me.

I like your approach though - makes for quite a bit of learning!

ADD REPLY • link 11.3 years ago by Ram 45k

0

Entering edit mode

Another good resource for find, grep, sed and awk: http://wilsonericn.wordpress.com/2011/08/25/find-grep-sed-and-awk/

ADD REPLY • link 5.6 years ago by Ram 45k

0

Entering edit mode

There may be something wrong with this script. A size of output.fa is around 2.86GB whereas the original input.fa is around 3.16GB. I don't think eliminating the three characters from headers would result in the loss of nearly 300kb ?

This script did delete 'chr' from the header showing the result of

grep -o -E "^>\w+" input.fa | tr -d ">"

chrM
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY

and

grep -o -E "^>\w+" output.fa | tr -d ">"

M
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
X
Y

Am I missing something here?

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 11.3 years ago by akhst7 ▴ 40

0

Entering edit mode

That sure sounds a bit unexpected. There are only 25 headers and 75 characters should not amount to a 300 MB difference.

Is there a reason for the [chr,] usage though? Why not just chr?

Also, maybe a difference in the wc -c values before and after the script would make it clear of the number of characters being deleted is >75. This is an expensive process though.

ADD REPLY • link 5.6 years ago by Ram 45k

0

Entering edit mode

Hi,

For [chr,], I was just following some examples on the net. I thought [] was necessary to specify certain letters/characters in the word (in this case 'chr' at begging of the word) but I was wrong. I tried just 'chr' and really worked. Overall, the file size remains similar. It will be long way to master a usage of sed.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 11.3 years ago by akhst7 ▴ 40

0

Entering edit mode

By 'file size remains similar', you mean that using [chr,] or chr doesn't make a difference, right?

ADD REPLY • link 5.6 years ago by Ram 45k

0

Entering edit mode

They did make the significant difference. As in my previous post, the script with [chr,] chewed up around 300kb whereas the one with 'chr' did not change the file size before and after.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 11.2 years ago by akhst7 ▴ 40

0

Entering edit mode

Ah, of course, the expression [chr,] deleted all Cs, Hs, Rs and ,s. This effectively eliminated 1/4th of the genome sequence. How could I not have seen that!

[chr,] translates to 'c' or 'h' or 'r' or ','. 'chr' translates to the exact string 'chr'.

ADD REPLY • link 5.6 years ago by Ram 45k

Ram · Answer 1 · 2014-05-16

0

Entering edit mode

11.3 years ago

Ram 45k

You're streaming the contents of a FASTA file and you would like to edit this stream. sed is your best bet. If you'd like to edit the file in-place and not create a new file with the output of sed, you'll want to try sed -i. If you're on a Mac and would like to use GNU-sed specific options, you might wanna use the gsed utility.

Quick question: Why is the search RegEx [chr,] and not just chr?

ADD COMMENT • link 3.9 years ago by Ram 45k

0

Entering edit mode

Thanks for the suggestion. Please see above.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 11.3 years ago by akhst7 ▴ 40