Hi,
I'm a newbie to Linux and NGS. I have a genome.fasta file and a transcripts.gff3 file. The two files use different chromosome naming patterns, so I am unable to use them in Cufflinks, which gives the following error:
Warning: couldn't find fasta record for 'CA_chr13_BGI-A2_v1.0'!
This contig will not be bias corrected.
Can someone kindly help me make the chromosome names the same in both files?
I mean that the chromosome names in the fasta file should be exactly the same as those in the gff3 file, or vice versa.
The chromosome names in the fasta file are as follows:
gnl|BGIA2|CA_chr1
gnl|BGIA2|CA_chr2
while in the gff3 file they are as follows:
CA_chr4_BGI-A2_v1.0
CA_chr6_BGI-A2_v1.0
It would be more convenient if they were shortened to simply chr1, chr2, chr3 and so on, rather than this long naming. Further, chr1 in the fasta file should be exactly chr1 in the gff3 file.
It is working absolutely fine now. Thank you so much for the kind help.
Can you please help me resolve the same issue with the fasta file?
I need the same chromosome naming in the fasta file too. The suggestions given below work for only a few chromosome headers and not for others, i.e. the name is changed for some chromosomes but not for others.
Your help would be highly appreciated.
The awk version is GNU Awk 3.1.7, and gawk is also available.
Yes, it is working absolutely fine. I am extremely thankful for your kind help. The effort and time you have spent guiding a newbie like me are much appreciated.
Try again and let me know. There was a copy/paste mistake. Check whether you are using gawk, and type awk --version in the console to find the awk version. Please also check the quote characters (' and ") after copying, as copying special characters from the web sometimes creates issues.
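While you could change the names in your fasta file by doing the following (the trailing $ keeps the chr1 pattern from also matching chr10 to chr13, and assumes nothing follows the name on the header line)
sed -e 's/gnl|BGIA2|CA_chr1$/CA_chr1_BGI-A2_v1.0/' -e 's/gnl|BGIA2|CA_chr2$/CA_chr2_BGI-A2_v1.0/' fasta > new_fasta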
you will also need to run samtools reheader on the alignment files (if you have already done the alignments). It may be simpler to just redo the alignments after changing the names, to ensure that everything remains in sync.
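For the reheader route, something along these lines might work. It is a rough sketch that assumes the existing BAM header carries the fasta-style names (gnl|BGIA2|CA_chrN) in its @SQ lines; rename_sq_lines and the file names are placeholders, not something from this thread.

```shell
# Rewrite the SN: field of @SQ header lines from the fasta-style name
# (assumed to be gnl|BGIA2|CA_chrN) to the gff3-style name.
# Reads a SAM header on stdin and writes the edited header to stdout.
rename_sq_lines() {
    sed '/^@SQ/ s/SN:gnl|BGIA2|CA_\(chr[0-9][0-9]*\)/SN:CA_\1_BGI-A2_v1.0/'
}

# Usage (aln.bam and the output names are placeholders):
#   samtools view -H aln.bam | rename_sq_lines > new_header.sam
#   samtools reheader new_header.sam aln.bam > aln.renamed.bam
```

Only the header is rewritten, so it is worth checking the result with samtools view -H before running Cufflinks again.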
If you want to simply replace the names with chr* then do
sed -e 's/gnl|BGIA2|CA_chr1/chr1/' -e 's/gnl|BGIA2|CA_chr2/chr2/' fasta > new_fasta
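If all the headers follow the gnl|BGIA2|CA_chrN pattern shown in the question, a single pattern could cover every chromosome instead of one -e per name. This is only a sketch under that assumption; rename_fasta_headers and the file names are placeholders. Note that it also drops anything that follows the name on the header line.

```shell
# Shorten FASTA headers of the assumed form ">gnl|BGIA2|CA_chrN ..." to ">chrN".
# [0-9][0-9]* matches one or more digits, so chr10 to chr13 are handled too.
# Sequence lines do not start with ">" and pass through unchanged.
rename_fasta_headers() {
    sed 's/^>gnl|BGIA2|CA_\(chr[0-9][0-9]*\).*$/>\1/'
}

# Usage (file names are placeholders):
#   rename_fasta_headers < genome.fasta > genome.renamed.fasta
#   grep '^>' genome.renamed.fasta    # inspect the new headers
```

Writing to a new file rather than editing in place keeps the original fasta as a backup.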
You can try the following. Take a backup of your data and try it on sample data first. If you are on a Mac, sed -i requires a backup-suffix argument (it can be empty: sed -i ''). Assuming that all the sequence headers (in the fasta) and sequence IDs (in the gff3) follow the naming pattern given in the original post, the OP can try the following code. Please note that the example gff3 is not in valid gff3 format: it contains only the first column, because the sequence IDs are in the first column and that is all the example code needs.
input gff3 with first column only:
output:
input fasta:
output fasta:
Thank you so much for the kind help. I'll try it and let you know. Kindly explain one thing: does chr[1-9] mean chromosomes 1-9? In that case I think it would work only for the first 9 chromosomes, while I have 13 chromosomes.
It is chr[1-9]+, where + matches one or more occurrences of [1-9] (so two-digit chromosome numbers are covered). However, it skips 10, because 0 is not in [1-9]. I updated the code and example data.
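To see the difference between the two character classes, a quick check like the one below can help. It is just an illustration; match_whole is a made-up helper name, and the four names are sample input.

```shell
# Print only the input lines that match the given extended regex in full.
match_whole() {
    grep -E -x "$1"
}

# [1-9]+ accepts digits 1-9 only, so chr10 is rejected because of the 0.
printf 'chr1\nchr9\nchr10\nchr13\n' | match_whole 'chr[1-9]+'
# prints chr1, chr9 and chr13

# [0-9]+ accepts any digit, so all four names match.
printf 'chr1\nchr9\nchr10\nchr13\n' | match_whole 'chr[0-9]+'
# prints chr1, chr9, chr10 and chr13
```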
Dear cpado, I have tried this code on my gff3 file,
but there is no change in the naming of the first column. The chromosome names are the same as in the parent file, i.e.
CA_chr4_BGI-A2_v1.0
CA_chr6_BGI-A2_v1.0
I simply placed your code before my file name. I could see the terminal output as follows:
chr4
chr4
chr4
chr1
chr1
chr1
But there is no change in the first column of the file itself, and if I redirect the output to a new file it gives me only the first column, while all other columns are deleted.
I need your kind guidance, please.
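One way to change only the first column and still keep the other columns is to edit $1 in awk and then print the whole record. The sketch below assumes tab-separated gff3 and seqids of the form CA_chrN_BGI-A2_v1.0; rename_gff3_seqids and the file names are placeholders, not the code posted earlier in the thread.

```shell
# Rename gff3 seqids of the assumed form CA_chrN_BGI-A2_v1.0 to chrN.
# Only column 1 is edited; the full record is printed, so the remaining
# columns are kept. Lines starting with "#" are passed through unchanged.
rename_gff3_seqids() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/  { print; next }
         {
             sub(/^CA_/, "", $1)             # drop the leading prefix
             sub(/_BGI-A2_v1\.0$/, "", $1)   # drop the trailing suffix
             print
         }'
}

# Usage (file names are placeholders):
#   rename_gff3_seqids < transcripts.gff3 > transcripts.renamed.gff3
```

The output goes to a new file, so the input gff3 is never modified. Any ##sequence-region lines would still carry the old names, since comment lines are left alone here.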
Could you please post a few gff3 records here? For example, the output of:
head file.gff3
Yes, please see below.
Try this and let me know. The assumption is that GLEAN is always in the second column.
Thank you so much for the kind help and your precious time. I have tried this. I am getting good results for some of the entries, while others are unchanged. I am pasting some examples from the output file (opened in Excel) below:
and
Please help me fix it.
Thanks
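To find out which entries were left unchanged, it may help to list the seqids that occur in the gff3 but have no record of the same name in the fasta. This is only a rough sketch; seqids_missing_from_fasta and the file names are placeholders, and it assumes mktemp and comm are available.

```shell
# List seqids used in the GFF3 that have no matching FASTA record.
# Arguments: $1 = FASTA file, $2 = GFF3 file (both paths are placeholders).
seqids_missing_from_fasta() {
    fa_names=$(mktemp) && gff_names=$(mktemp) || return 1
    # FASTA record name = header text after ">" up to the first whitespace.
    sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$1" | sort -u > "$fa_names"
    # GFF3 seqid = first tab-separated column of every non-comment line.
    grep -v '^#' "$2" | cut -f1 | grep -v '^$' | sort -u > "$gff_names"
    # comm -13 keeps the lines that appear only in the second (GFF3) list.
    comm -13 "$fa_names" "$gff_names"
    rm -f "$fa_names" "$gff_names"
}

# Usage (file names are placeholders):
#   seqids_missing_from_fasta genome.renamed.fasta transcripts.renamed.gff3
```

If nothing is printed, every seqid in the gff3 has a fasta record of the same name.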