Question

counting number of genes located on positive or negative strands

0

Entering edit mode

6.1 years ago

arsilan324 ▴ 90

Hi all,

Maybe this is a very basic question but I must tell that I am not a bioinformatics person. I just need to count total number of positive and negative genes located on chromosome in .genes or .gff3 file.

Can you please comment how can I do that? Thanks in advance

gff3 htseq • 1.8k views

ADD COMMENT • link updated 6.1 years ago by n,n ▴ 370 • written 6.1 years ago by arsilan324 ▴ 90

score 5 · Accepted Answer · 2018-10-29

5

Entering edit mode

6.1 years ago

Alex Reynolds 36k

Via BEDOPS, extract gene features from your GFF file and convert to BED:

$ awk '$3=="gene"' annotations.gff | gff2bed - > genes.bed

Then filter on strand and pass to wc -l to count the number of lines:

$ awk '$6=="+"' genes.bed | wc -l
*number of forward-strand genes*

And:

$ awk '$6=="-"' genes.bed | wc -l
*number of reverse-strand genes*

Converting to BED is not strictly required, but it makes sense if you're doing counting operations, which are essentially set operations. BED is a better format for set operations than GFF.

ADD COMMENT • link 5.8 years ago by Alex Reynolds 36k

2

Entering edit mode

Why bother converting the gff3 file to bed when the strand information is already present in the gff3 file?

awk '$3=="gene" && $7=="+"' annotations.gff | wc -l
awk '$3=="gene" && $7=="-"' annotations.gff | wc -l

Also, note, if you are using GFF3 files from NCBI the feature type in column 3 is not always gene. For example, pseudogenes have pseudogene in column 3. You can change the awk command to ($3=="gene"||$3=="pseudogene") will fix that.

ADD REPLY • link 6.1 years ago by vkkodali_ncbi ★ 3.8k

0

Entering edit mode

Thank you to both of you! It worked.. Thanks!!!

ADD REPLY • link 6.1 years ago by arsilan324 ▴ 90

score 3 · Accepted Answer · 2018-10-30

Here is a bash solution as well:

#First lets define a couple of variables to act as counters for each strand ( + & - )

forward=0
reverse=0

#Now we create a loop with your lines of interest from the gff3 file as the changing variable

for line in $(cat file.gff3 | grep '.*gene.*[[:space:]]+[[:space:]]')
do
  forward=$(expr "$forward" + 1) #this will add 1 to the counter for each line found by grep
done

#Depending on your grep version you might need to scape the "+" character, BSD grep doesn't have to

#Same process but for the reverse strand

 for line in $(cat file.gff3 | grep '.*gene.*[[:space:]]-[[:space:]]')
 do
   reverse=$(expr "$reverse" + 1)
 done

#Finally we print the results

echo -e "\nGenes present in the forward (positive) strand: $forward"
echo -e "Genes present in the reverse (negative) strand: $reverse\n"

Hope this helps.