How to make the length of the CDS of my bed file divisible by 3?
3 months ago
blumina.r • 0

Hello,

I want to use the annotation tool TOGA for my research. One of its required inputs is a .bed file with the annotation of a reference genome, in which the length of every CDS must be divisible by 3. I already have the .bed file, which I obtained by converting a GFF file to BED12 using AGAT.

How can I check if all CDSs in my file are divisible by 3? And if some of them are not, how could I fix that?

These are the first 5 lines of my .bed file:

ContigUN    14393   67988   SAGMID_R016814  0.02    +   14393   67988   255,0,0 8   136,120,87,87,87,53,192,750 0,3911,11188,13427,27578,33791,42499,52845
ContigUN    20826   91148   SAGMID_R016815  0   -   20826   91148   255,0,0 2   138,480 0,69842
ContigUN    768722  773351  SAGMID_R017391  10.78   +   768722  773351  255,0,0 2   414,477 0,4152
ContigUN    808514  825537  SAGMID_R017392  15.50   -   808514  825537  255,0,0 2   626,682 0,16341
ContigUN    1279233 1293463 SAGMID_R019034  0.02    +   1279233 1293463 255,0,0 6   102,84,154,165,63,110   0,415,7381,10596,12350,14120

I found this smart solution, but it is for FASTA files. Does anyone know how I could do the same for my .bed file, so that it can be used as input for TOGA?

Thank you in advance.

annotation CDS TOGA bed • 458 views
It might be easier to check the gff file before you convert it.

Thank you for your answer! How could I check all CDS lengths in the GFF file? The first lines look like this:

LG09    GeneWise    mRNA    107215837   107256092   .   -   .   ID=SAGMID_R000001;genename=IGHV1-2;Shift=0;;upRepair=Augustus10376;downRepair=Augustus10376;Level=2;
LG09    AUGUSTUS    CDS 107256047   107256092   .   -   .   Parent=SAGMID_R000001;
LG09    AUGUSTUS    CDS 107255673   107255961   .   -   .   Parent=SAGMID_R000001;
LG09    AUGUSTUS    CDS 107248954   107249078   .   -   .   Parent=SAGMID_R000001;
LG09    AUGUSTUS    CDS 107248749   107248867   .   -   .   Parent=SAGMID_R000001;

Basically, for each CDS, sum the lengths of its exons and check whether the total is divisible by three. Are you comfortable in any programming language?
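
A minimal sketch of that check for the GFF file, using awk inside a shell function. It assumes a tab-separated file in which every CDS row carries a single `Parent=` attribute, as in the lines posted above. The name `gff_cds_check` is made up for this example and is not part of AGAT or TOGA.

```shell
#!/bin/bash
# Sum CDS segment lengths per Parent ID and flag totals that are not multiples of 3.
# Assumption: tab-separated GFF with one Parent=... value per CDS row.
gff_cds_check() {
    awk -F'\t' '
    $3 == "CDS" && match($9, /Parent=[^;]+/) {
        id = substr($9, RSTART + 7, RLENGTH - 7)   # text after "Parent="
        total[id] += $5 - $4 + 1                   # GFF coordinates are 1-based, inclusive
    }
    END {
        for (id in total)
            printf "%s\t%d\t%s\n", id, total[id],
                   (total[id] % 3 == 0 ? "ok" : "not_divisible_by_3")
    }' "$@" | sort
}
```

Something like `gff_cds_check annotation.gff | grep not_divisible_by_3` would then list only the problem transcripts (`annotation.gff` is a placeholder file name).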

Not yet, but I just asked ChatGPT and I think I have what I needed: a bash script that does just that and corrects the lines whose length is not divisible by three:

#!/bin/bash
workdir="/Users/Blumina/Doctorado/TOGA"

# Create the output directory if it does not exist
mkdir -p "${workdir}/bed_adjusted"

for file in ${workdir}/bed_files/* 
do 
    filebasename=$(basename "$file")
    while read -r line
    do
        chrom=$(echo "$line" | awk '{print $1}')
        start=$(echo "$line" | awk '{print $2}')
        end=$(echo "$line" | awk '{print $3}')
        others=$(echo "$line" | cut -f 4-)

        length=$((end - start))
        mod=$(($length % 3))

        if [ $mod -eq 0 ]; then
            echo -e "${chrom}\t${start}\t${end}\t${others}" >> ${workdir}/bed_adjusted/${filebasename}
        elif [ $mod -eq 1 ]; then
            end=$((end - 1))
            echo -e "${chrom}\t${start}\t${end}\t${others}" >> ${workdir}/bed_adjusted/${filebasename}
        elif [ $mod -eq 2 ]; then
            end=$((end - 2))
            echo -e "${chrom}\t${start}\t${end}\t${others}" >> ${workdir}/bed_adjusted/${filebasename}
        fi
    done < "$file"
done

I am testing it at the moment. Thank you very much!

This approach is overly simplistic, as is typical of AI-generated code. Yes, it checks whether the start-to-end length of each line is divisible by 3, but that is not the point, because that span also includes the introns. Starting from the GFF file, you would first have to add up all CDS stretches of a transcript into one combined length and test that.
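
The same combined length can also be read straight from the BED12 file, as sketched below. This assumes the standard BED12 layout (12 whitespace-separated columns, with thickStart/thickEnd in columns 7 and 8 marking the CDS boundaries and blockSizes/blockStarts in columns 11 and 12). The name `bed12_cds_check` is made up for this example.

```shell
#!/bin/bash
# Report the summed CDS length of each BED12 record and whether it is a multiple of 3.
# Assumption: columns 7-8 (thickStart/thickEnd) bound the CDS; blocks outside are UTR.
bed12_cds_check() {
    awk '
    NF >= 12 {
        n = split($11, bsize, ",")        # blockSizes
        split($12, bstart, ",")           # blockStarts, relative to chromStart
        cds = 0
        for (i = 1; i <= n; i++) {
            if (bsize[i] == "") continue  # a trailing comma leaves an empty element
            s = $2 + bstart[i]            # block start on the chromosome
            e = s + bsize[i]              # block end (exclusive)
            if (s < $7) s = $7            # clip the block to thickStart
            if (e > $8) e = $8            # clip the block to thickEnd
            if (e > s) cds += e - s
        }
        printf "%s\t%d\t%s\n", $4, cds, (cds % 3 == 0 ? "ok" : "not_divisible_by_3")
    }' "$@"
}
```

For the five records in the question, the block sizes add up to 1512, 618, 891, 1308 and 678, all multiples of 3, whereas the start-to-end spans of the second, fourth and fifth records are not. The bash script above would therefore trim three records that are already fine.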

However, the case may be further complicated by the phase, so I would try checking with this script: https://agat.readthedocs.io/en/latest/tools/agat_sp_fix_cds_phases.html
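
To see what the phase column would be expected to hold, here is a rough sketch that walks each transcript's CDS segments in transcription order and prints the phase implied by the bases already covered. It assumes a tab-separated GFF, one `Parent=` value per CDS row, and that the first CDS segment starts in frame (phase 0), which does not hold for partial gene models. The name `gff_expected_phase` is made up for this example. It is not the AGAT script and does not replace it.

```shell
#!/bin/bash
# Print: transcript ID, CDS start, CDS end, phase in the file, phase expected from position.
# Assumption: the first CDS segment of every transcript begins at phase 0.
gff_expected_phase() {
    awk -F'\t' '
    $3 == "CDS" && match($9, /Parent=[^;]+/) {
        id  = substr($9, RSTART + 7, RLENGTH - 7)
        key = ($7 == "-") ? -$4 : $4      # negate so minus-strand segments sort 5-prime first
        printf "%s\t%d\t%d\t%d\t%s\n", id, key, $4, $5, $8
    }' "$@" |
    sort -t "$(printf '\t')" -k1,1 -k2,2n |
    awk -F'\t' '{
        if ($1 != prev) { covered = 0; prev = $1 }   # new transcript: reset the counter
        expected = (3 - covered % 3) % 3             # bases to skip to reach the next codon
        printf "%s\t%d\t%d\t%s\t%d\n", $1, $3, $4, $5, expected
        covered += $4 - $3 + 1
    }'
}
```

Rows where the fourth and fifth output columns disagree (including a `.` in the file, as in the GFF lines posted above) are the ones worth inspecting.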
