Maybe this is a very basic question but I must tell that I am not a bioinformatics person. I just need to count total number of positive and negative genes located on chromosome in .genes or .gff3 file.
Can you please comment how can I do that?
Thanks in advance
Converting to BED is not strictly required, but it makes sense if you're doing counting operations, which are essentially set operations. BED is a better format for set operations than GFF.
Also, note, if you are using GFF3 files from NCBI the feature type in column 3 is not always gene. For example, pseudogenes have pseudogene in column 3. You can change the awk command to ($3=="gene"||$3=="pseudogene") will fix that.
#First lets define a couple of variables to act as counters for each strand ( + & - )
forward=0
reverse=0
#Now we create a loop with your lines of interest from the gff3 file as the changing variable
for line in $(cat file.gff3 | grep '.*gene.*[[:space:]]+[[:space:]]')
do
forward=$(expr "$forward" + 1) #this will add 1 to the counter for each line found by grep
done
#Depending on your grep version you might need to scape the "+" character, BSD grep doesn't have to
#Same process but for the reverse strand
for line in $(cat file.gff3 | grep '.*gene.*[[:space:]]-[[:space:]]')
do
reverse=$(expr "$reverse" + 1)
done
#Finally we print the results
echo -e "\nGenes present in the forward (positive) strand: $forward"
echo -e "Genes present in the reverse (negative) strand: $reverse\n"
Why bother converting the gff3 file to bed when the strand information is already present in the gff3 file?
Also, note, if you are using GFF3 files from NCBI the feature type in column 3 is not always gene. For example, pseudogenes have
pseudogene
in column 3. You can change the awk command to($3=="gene"||$3=="pseudogene")
will fix that.Thank you to both of you! It worked.. Thanks!!!