Hi,
I've been trying to remove the letters "chr" from fasta headings of hg19 ref. genome. I tried a following sed script;
sed 's/[chr,]//g'
I am just wondering if there are any other sed or awk scripts to the same.
Thanks.
Hi,
I've been trying to remove the letters "chr" from fasta headings of hg19 ref. genome. I tried a following sed script;
sed 's/[chr,]//g'
I am just wondering if there are any other sed or awk scripts to the same.
Thanks.
You're streaming the contents of a FASTA file and you would like to edit this stream. sed
is your best bet. If you'd like to edit the file in-place and not create a new file with the output of sed
, you'll want to try sed -i
. If you're on a Mac and would like to use GNU-sed specific options, you might wanna use the gsed
utility.
Quick question: Why is the search RegEx [chr,]
and not just chr
?
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Try more. Here is a link that can help you http://www.brunolinux.com/02-The_Terminal/Find_and%20Replace_with_Sed.html
Ashutosh, unless you use -i, all sed commands write to stdout and do not affect the original file.
Yes, you are right. It was a warning for newbies who may copy and use a wrong command and accidentally modifying their original file.
I agree, but IMHO, we must advice them to check for the -i option rather than saying that sed overwrites files. Anyway, moot point as OP looks pretty at home with UNIX (grep at least)
Why other scripts, does this not work?
I am in a learning process. By looking at a wide rage of sed and awk scripts having same functions on the web, I am just curious to see what alternatives are. Plus I am not yet sure this script fully works.
I usually stick to sed for file content editing. My motto is to use a tool that we all know works. We could use Perl or Python or awk for this, but I'm comfortable with the simplicity and integration that sed gives me.
I like your approach though - makes for quite a bit of learning!
Another good resource for find, grep, sed and awk: http://wilsonericn.wordpress.com/2011/08/25/find-grep-sed-and-awk/
There may be something wrong with this script. A size of output.fa is around 2.86GB whereas the original input.fa is around 3.16GB. I don't think eliminating the three characters from headers would result in the loss of nearly 300kb ?
This script did delete 'chr' from the header showing the result of
and
Am I missing something here?
That sure sounds a bit unexpected. There are only 25 headers and 75 characters should not amount to a 300 MB difference.
Is there a reason for the
[chr,]
usage though? Why not justchr
?Also, maybe a difference in the
wc -c
values before and after the script would make it clear of the number of characters being deleted is >75. This is an expensive process though.Hi,
For
[chr,]
, I was just following some examples on the net. I thought[]
was necessary to specify certain letters/characters in the word (in this case 'chr' at begging of the word) but I was wrong. I tried just'chr'
and really worked. Overall, the file size remains similar. It will be long way to master a usage of sed.By 'file size remains similar', you mean that using
[chr,]
orchr
doesn't make a difference, right?They did make the significant difference. As in my previous post, the script with
[chr,]
chewed up around 300kb whereas the one with'chr'
did not change the file size before and after.Ah, of course, the expression
[chr,]
deleted allC
s,H
s,R
s and,
s. This effectively eliminated 1/4th of the genome sequence. How could I not have seen that![chr,]
translates to 'c' or 'h' or 'r' or ','.'chr'
translates to the exact string 'chr'.