Hello BioStars Community,
I’m currently working with GTF files downloaded directly from NCBI RefSeq for well-studied species such as dog (Canis lupus familiaris) and ferret (Mustela putorius furo). When I convert these GTF files to BED format, the sequence names are not mapped to standard chromosome names (e.g., chr11). For ferret this is expected, because chromosome-level information is not yet available.
Here’s an example of the GTF file I’m working with:
#!annotation-source NCBI Mustela putorius furo Annotation Release 102
NW_025421256.1 Gnomon gene 5221 29828 . + . gene_id "LOC123387000"; transcript_id ""; db_xref "GeneID:123387000"; gbkey "Gene"; gene "LOC123387000"; gene_biotype "lncRNA";
However, after converting it to a BED file with a tool such as gff2bed, I get this:
NC_020638.1 0 69 . . + RefSeq exon . gene_id "unassigned_gene_1"; transcript_id "unassigned_transcript_610"; product "tRNA-Phe"; transcript_biotype "tRNA"; exon_number "1";
My expected output, however, is something like this:
0 NM_001291928.1 chr1 - 134199214 134234856 134202950 134234733 2 134199214,134234662, 134203590,134234856, 0 Adora1 cmpl cmpl 2,0,
Please note that the three snippets above are not necessarily related, so the IDs do not match the species I'm asking about; they are only there to illustrate what I need.
I know it is possible to obtain BED files in the desired format, because some online tools such as the UCSC Table Browser provide them. For dog and ferret, however, the assembly version they provide is not the one I'm working with, so that is not an option for me.
Does anyone know of a reliable way to do this?
Juke34 may be able to provide some insight.
Looking forward to it! Thank you for the comment.
Have you seen this: https://agat.readthedocs.io/en/latest/gff_to_bed.html
I tried with bedops, however I think that PASA
I did not. I will check whether I can get the desired output.
Juke34, do you have any advice? Any comment is greatly appreciated.
A GTF/GFF-to-BED conversion does not involve any name mapping: the sequence identifier in the first column of the GTF/GFF is reported as-is in the first column of the BED file. You can see this in the mini review I wrote here: https://agat.readthedocs.io/en/latest/gff_to_bed.html
So I don’t understand what your issue with the scaffold names is.
Anyway, if you read the information at that link, you will see that the conversion done by bedops (gff2bed) is quite particular: only the first 6 columns are what a BED file expects.
If you want to stick to the correct BED output, you should probably prefer AGAT.
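To illustrate the point about the first column, here is a minimal Python sketch (not what AGAT or bedops actually do internally) that turns one GTF line into a BED6 line. The function name `gtf_line_to_bed6` and the optional `rename` dictionary are hypothetical; the dictionary is only an assumption about how you could swap RefSeq accessions for chromosome names if you build such a table yourself, for example from the assembly's own documentation.

```python
import re
from typing import Dict, Optional


def gtf_line_to_bed6(line: str, rename: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Convert one tab-separated GTF line to BED6; return None for comments/blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    seqid, _source, feature, start, end, score, strand = fields[:7]
    attributes = fields[8] if len(fields) > 8 else ""

    # The sequence ID is copied as-is unless the (hypothetical) rename
    # table has an entry for it; unknown IDs pass through unchanged.
    chrom = (rename or {}).get(seqid, seqid)

    # Use gene_id as the BED name, falling back to the feature type.
    match = re.search(r'gene_id "([^"]*)"', attributes)
    name = match.group(1) if match and match.group(1) else feature

    # GTF coordinates are 1-based inclusive; BED is 0-based half-open,
    # so only the start is shifted by one.
    bed_start = str(int(start) - 1)

    # GTF uses "." for a missing score; write 0 instead.
    bed_score = "0" if score == "." else score

    return "\t".join([chrom, bed_start, end, name, bed_score, strand])
```

Without a `rename` table the accession (e.g. NW_025421256.1) simply ends up in column 1, which is the behaviour described above. Any chr-style name has to come from a mapping you supply yourself.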
Hi,
I don't see the problem here. If the ferret genome is not full-fledged, use it as is. Your 'expected output' is not a BED file but the output of the UCSC Table Browser. The result you show in the middle is a correct BED file. For more information on the BED file format, look e.g. here.
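To make the difference concrete: the 'expected output' row looks like UCSC's extended genePred table layout (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...). A rough Python sketch of how such a row relates to BED12 is below. The function name `genepred_ext_to_bed12` is hypothetical, and the sketch assumes whitespace-separated columns with a leading `bin` column, as in the row shown in the question.

```python
def genepred_ext_to_bed12(row: str) -> str:
    """Convert one genePred-extended row (with leading 'bin' column) to a BED12 line."""
    fields = row.split()
    (_bin, name, chrom, strand, tx_start, tx_end, cds_start, cds_end,
     exon_count, exon_starts, exon_ends) = fields[:11]

    # Exon lists are comma-terminated, e.g. "100,200,".
    starts = [int(x) for x in exon_starts.rstrip(",").split(",")]
    ends = [int(x) for x in exon_ends.rstrip(",").split(",")]

    # BED12 stores block sizes and block starts relative to chromStart.
    block_sizes = ",".join(str(e - s) for s, e in zip(starts, ends)) + ","
    block_starts = ",".join(str(s - int(tx_start)) for s in starts) + ","

    return "\t".join([
        chrom, tx_start, tx_end, name,
        "0",            # score (not taken from the genePred row)
        strand,
        cds_start,      # thickStart
        cds_end,        # thickEnd
        "0",            # itemRgb
        exon_count, block_sizes, block_starts,
    ])
```

Applied to the Adora1 row from the question, this gives a BED12 line with block sizes `4376,194,` and block starts `0,35448,`. Both formats use 0-based starts, so no coordinate shift is needed in this direction.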
Thank you for your quick reply. You are right about the ferret: I should expect that output because no chromosome numbers are available, since the genome is not full-fledged. For dog, however, I expected to get chromosome names when generating a BED file from the GTF file with the usual tools, but I cannot get the desired output. My expected output is probably a pseudo-BED file; the following link shows it in more detail: TOGA-bed-output. Any comments are welcome! Thank you again.