sed newbie question: removing a specific letter in a word from fasta heading using
1
0
Entering edit mode
10.5 years ago
akhst7 ▴ 40

Hi,

I've been trying to remove the letters "chr" from fasta headings of hg19 ref. genome. I tried a following sed script;

sed 's/[chr,]//g'

I am just wondering if there are any other sed or awk scripts to the same.

Thanks.

RNA-Seq • 4.9k views
ADD COMMENT
1
Entering edit mode

Try more. Here is a link that can help you http://www.brunolinux.com/02-The_Terminal/Find_and%20Replace_with_Sed.html

  1. Remember to use "g" for global replacement.
  2. Perform hit and trial using a small dataset so that you can perform hit and trial quickly.
  3. Most of the sed commands don't produce new file but modify the existing file , so make sure you have the back up of the original file before you try.
ADD REPLY
0
Entering edit mode

Ashutosh, unless you use -i, all sed commands write to stdout and do not affect the original file.

ADD REPLY
1
Entering edit mode

Yes, you are right. It was a warning for newbies who may copy and use a wrong command and accidentally modifying their original file.

ADD REPLY
0
Entering edit mode

I agree, but IMHO, we must advice them to check for the -i option rather than saying that sed overwrites files. Anyway, moot point as OP looks pretty at home with UNIX (grep at least)

ADD REPLY
0
Entering edit mode

Why other scripts, does this not work?

ADD REPLY
1
Entering edit mode

I am in a learning process. By looking at a wide rage of sed and awk scripts having same functions on the web, I am just curious to see what alternatives are. Plus I am not yet sure this script fully works.

ADD REPLY
0
Entering edit mode

I usually stick to sed for file content editing. My motto is to use a tool that we all know works. We could use Perl or Python or awk for this, but I'm comfortable with the simplicity and integration that sed gives me.

I like your approach though - makes for quite a bit of learning!

ADD REPLY
0
Entering edit mode

Another good resource for find, grep, sed and awk: http://wilsonericn.wordpress.com/2011/08/25/find-grep-sed-and-awk/

ADD REPLY
0
Entering edit mode

There may be something wrong with this script. A size of output.fa is around 2.86GB whereas the original input.fa is around 3.16GB. I don't think eliminating the three characters from headers would result in the loss of nearly 300kb ?

This script did delete 'chr' from the header showing the result of

grep -o -E "^>\w+" input.fa | tr -d ">"
  • chrM
  • chr1
  • chr2
  • chr3
  • chr4
  • chr5
  • chr6
  • chr7
  • chr8
  • chr9
  • chr10
  • chr11
  • chr12
  • chr13
  • chr14
  • chr15
  • chr16
  • chr17
  • chr18
  • chr19
  • chr20
  • chr21
  • chr22
  • chrX
  • chrY

and

grep -o -E "^>\w+" output.fa | tr -d ">"
  • M
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • X
  • Y

Am I missing something here?

ADD REPLY
0
Entering edit mode

That sure sounds a bit unexpected. There are only 25 headers and 75 characters should not amount to a 300 MB difference.

Is there a reason for the [chr,] usage though? Why not just chr?

Also, maybe a difference in the wc -c values before and after the script would make it clear of the number of characters being deleted is >75. This is an expensive process though.

ADD REPLY
0
Entering edit mode

Hi,

For [chr,], I was just following some examples on the net. I thought [] was necessary to specify certain letters/characters in the word (in this case 'chr' at begging of the word) but I was wrong. I tried just 'chr' and really worked. Overall, the file size remains similar. It will be long way to master a usage of sed.

ADD REPLY
0
Entering edit mode

By 'file size remains similar', you mean that using [chr,] or chr doesn't make a difference, right?

ADD REPLY
0
Entering edit mode

They did make the significant difference. As in my previous post, the script with [chr,] chewed up around 300kb whereas the one with 'chr' did not change the file size before and after.

ADD REPLY
0
Entering edit mode

Ah, of course, the expression [chr,] deleted all Cs, Hs, Rs and ,s. This effectively eliminated 1/4th of the genome sequence. How could I not have seen that!

[chr,] translates to 'c' or 'h' or 'r' or ','. 'chr' translates to the exact string 'chr'.

ADD REPLY
0
Entering edit mode
10.5 years ago
Ram 44k

You're streaming the contents of a FASTA file and you would like to edit this stream. sed is your best bet. If you'd like to edit the file in-place and not create a new file with the output of sed, you'll want to try sed -i. If you're on a Mac and would like to use GNU-sed specific options, you might wanna use the gsed utility.

Quick question: Why is the search RegEx [chr,] and not just chr?

ADD COMMENT
0
Entering edit mode

Thanks for the suggestion. Please see above.

ADD REPLY

Login before adding your answer.

Traffic: 1574 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6