How to rename chromosome names in a GTF file?
7.3 years ago
biomagician ▴ 410

Hi,

I have a GTF file with the following head:

head celegans.gtf
CHROMOSOME_I    Coding_transcript   exon    4119    4358    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";
CHROMOSOME_I    Coding_transcript   exon    5195    5296    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";
CHROMOSOME_I    Coding_transcript   exon    6037    6327    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";

However, my FASTA file has the following chromosome names:

grep '>' celegans.fa
>I
>II
>III
>IV
>V
>X
>MtDNA

This discrepancy causes problems in downstream analyses. Does anyone know of a tool or way to rename the chromosome names in my GTF file to correspond to the chromosome names in the FASTA file?

Thanks.

Best, C.

GTF FASTA annotation genome • 7.4k views
7.3 years ago

Try running sed -e 's/CHROMOSOME_//g' celegans.gtf and check the output. If that looks right, apply the change in place with: sed -i 's/CHROMOSOME_//g' celegans.gtf


Hi,

sed -e 's/CHROMOSOME_//g' celegans.gtf

works but the command with the '-i' option gives the following error:

sed: 1: "output/genome/celegans/ ...": invalid command code o

Does '-i' mean 'in-place', i.e. it changes the file directly? I am going to try redirecting the output of the '-e' command to the file itself.

Oops, this erased the content of the file. So the 'return' of the 'sed -e' command is NULL?

Best, C.
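
The empty file is most likely the shell's doing rather than sed's: with sed ... file > file, the shell truncates the file before sed starts reading it, so sed sees no input. A minimal sketch of a safer pattern, assuming a made-up two-line file (demo.gtf is hypothetical and lives in a scratch directory): write to a temporary file, then move it over the original only if sed succeeded.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical two-line stand-in for celegans.gtf.
printf 'CHROMOSOME_I\tCoding_transcript\texon\t4119\t4358\n' >  demo.gtf
printf 'CHROMOSOME_II\tCoding_transcript\texon\t100\t200\n' >> demo.gtf

# Write to a temporary file first; replace the original only if sed succeeded.
sed 's/CHROMOSOME_//g' demo.gtf > demo.gtf.tmp && mv demo.gtf.tmp demo.gtf

# Show the renamed first column.
cut -f1 demo.gtf
```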


Yes, unfortunately so. Before posting, I tried it with example data and it worked (I am on Ubuntu with sed v4.2.2). I guess you are on macOS; the sed -i issue is discussed here, and a workaround is given at the end of the post.

:~/Desktop $ sed -e 's/CHROMOSOME_//g' test.gtf 
head celegans.gtf
I    Coding_transcript   exon    4119    4358    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";
I    Coding_transcript   exon    5195    5296    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";
I    Coding_transcript   exon    6037    6327    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";

:~/Desktop $ sed -i 's/CHROMOSOME_//g' test.gtf 

~/Desktop $ cat test.gtf 
head celegans.gtf
I    Coding_transcript   exon    4119    4358    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";
I    Coding_transcript   exon    5195    5296    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";
I    Coding_transcript   exon    6037    6327    .   -   transcript_id "Transcript:Y74C9A.3.1"; gene_id "Gene:Y74C9A.3";

$ uname -a
Linux genomics 4.8.0-53-generic #56~16.04.1-Ubuntu SMP Tue May 16 01:18:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
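
The error message quoted earlier is consistent with BSD sed (the macOS default), where -i expects a backup suffix as its next argument; without one, the substitution is taken as the suffix and the file path is parsed as the sed script. A sketch of a form that, as far as I know, works with both GNU and BSD sed is to attach the suffix directly to -i. The file demo.gtf and the .bak suffix are illustrative choices, not from the thread.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical one-line stand-in for the GTF file.
printf 'CHROMOSOME_X\tCoding_transcript\texon\t1\t10\n' > demo.gtf

# -i.bak (no space) edits in place and keeps the original as demo.gtf.bak.
sed -i.bak 's/CHROMOSOME_//g' demo.gtf

cut -f1 demo.gtf        # edited file
cut -f1 demo.gtf.bak    # untouched backup
```

On macOS alone, sed -i '' 's/CHROMOSOME_//g' file edits in place without keeping a backup; GNU sed takes sed -i 's/CHROMOSOME_//g' file instead.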

Correct guess, thanks. It works now. So you made use of the fact that the GTF file just had 'CHROMOSOME_' prepended to all my FASTA chromosome names, right? Do you mind explaining this: 's/CHROMOSOME_//g'?


Correct. The sed substitution syntax is s/old string/new string/, where / is the delimiter separating the parts. g stands for global replacement: every match on each line is replaced. Otherwise only the first match of the old string on each line would be replaced. The entire substitution is in quotes. In the line above, CHROMOSOME_ is the old string and it is replaced with an empty string; in short, it gets removed.
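
To see what the g flag changes, here is a small sketch on a made-up line that contains the pattern twice. The ^ anchor in the last command is my addition, not part of the answer above: it restricts the match to the start of each line, i.e. the seqname column, in case the same text ever shows up in an attribute.

```shell
# Made-up input line with two occurrences of the pattern.
line='CHROMOSOME_I CHROMOSOME_I'

# Without g: only the first match on each line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/CHROMOSOME_//')

# With g: every match on each line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/CHROMOSOME_//g')

# Anchored with ^: only a match at the very start of the line is replaced.
anchored=$(printf '%s\n' "$line" | sed 's/^CHROMOSOME_//')

printf '%s\n' "$first_only" "$all_matches" "$anchored"
```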


Maybe you can consolidate your comments into an answer and I can accept it?


You could redirect to a new file and use that; there is no real need for -i:

sed 's/CHROMOSOME_//g' celegans.gtf > celegans.noChr.gtf
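
Whichever way the file is rewritten, it may be worth confirming that the names now agree with the FASTA. A rough sketch, assuming tiny made-up stand-ins for both files (demo.gtf and demo.fa are hypothetical, and one GTF line is deliberately left unrenamed): list the unique sequence names on each side and print any GTF name with no FASTA counterpart.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical inputs: one renamed GTF line, one still carrying the prefix.
printf 'I\tCoding_transcript\texon\t4119\t4358\n' >  demo.gtf
printf 'CHROMOSOME_II\tCoding_transcript\texon\t1\t10\n' >> demo.gtf
printf '>I\nACGT\n>II\nACGT\n' > demo.fa

# Unique seqnames in column 1 of the GTF (skipping comment lines).
grep -v '^#' demo.gtf | cut -f1 | sort -u > gtf.names

# FASTA names: header lines, minus '>' and anything after the first space.
grep '^>' demo.fa | sed 's/^>//' | cut -d' ' -f1 | sort -u > fa.names

# Names present in the GTF but absent from the FASTA (empty means all match).
missing=$(comm -23 gtf.names fa.names)
printf '%s\n' "$missing"
```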