Hi,
As part of a script I am writing, I am trying to make a validator that checks the format of the input file, i.e. whether the input FASTA file contains protein or nucleic acid data.
I have made a start on this, but it is not as simple as a single match - e.g. if match(/[^atcg]/i) ...
for the DNA, since characters other than A, T, C and G will almost certainly appear in the sequence description lines.
Since the input for the script would be either the full genome data or the full proteome data for a species, the files will be quite large. Thus, I was planning to test just the first 500 sequences (otherwise, I suppose this could take quite some time with large files). But I would welcome any opinions on this.
At the moment I plan to use BioRuby to take the first 500 sequences and then try matching each sequence - if match(/[^atcg]/i)
for the DNA, and make a similar one for protein.
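A minimal sketch of that plan in plain Ruby, without BioRuby. Within a Ruby character class only the leading ^ negates, so /[^acgt]/i is enough to match any character other than A, C, G or T. The helper names first_sequences and dna_only? are hypothetical, and the check is deliberately strict: ambiguity codes such as N are not allowed here, which is an assumption you may want to relax.

```ruby
# Matches any character that is not A, C, G or T (case-insensitive).
NON_DNA = /[^acgt]/i

# Hypothetical helper: collect the sequence strings of the first `limit`
# FASTA records, skipping the '>' header lines so that description text
# is never tested against the pattern.
def first_sequences(fasta_text, limit = 500)
  records = []
  fasta_text.each_line do |line|
    line = line.strip
    next if line.empty?
    if line.start_with?('>')
      break if records.size >= limit   # stop once enough records are read
      records << String.new            # start a new, mutable sequence buffer
    elsif !records.empty?
      records.last << line             # append a wrapped sequence line
    end
  end
  records
end

# Hypothetical helper: true if the sequence is non-empty and contains
# nothing but A, C, G and T.
def dna_only?(sequence)
  !sequence.empty? && sequence !~ NON_DNA
end
```

For a real file you could pass File.read(path), although for very large genomes reading line by line with File.foreach would avoid loading everything into memory.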
But before I start on this properly (and the reason behind this post), being a beginner, I was wondering whether there are any tools (e.g. within BioRuby) that would do this for me...
Many Thanks
Since this would be part of a Ruby script, any answers need to be in Ruby.
Count the frequency of A, T, G, and C !! :) If the sum is not approximately equal to 1, then it's not nucleotide !!
I am not sure exactly what you mean - isn't that differentiating between a single nucleotide and a sequence? I am trying to differentiate between a protein sequence and a DNA sequence, not between a sequence and a nucleotide...
Yes, I understand that !! First you count the occurrences of A, then the occurrences of T, then the occurrences of G, and then the occurrences of C !! Then get the complete length of the sequence as well.
If occurrences of A + occurrences of T + occurrences of G + occurrences of C = complete length, then it's a nucleotide sequence.
Else start reading the file from the start, and the moment you get any character apart from A, T, G and C, break and print that it's amino acids !! It is highly unlikely that you would NOT find any character apart from these 4 in amino acid data over a length of, say, 1000 sequences !!
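The counting idea could be sketched in Ruby roughly like this. The names acgt_fraction and guess_type are hypothetical, and the threshold parameter is an assumption added so that a few ambiguity codes (e.g. N) can be tolerated; with the default of 1.0 it is exactly the "sum equals complete length" test.

```ruby
# Hypothetical helper: fraction of characters in `sequence` that are
# A, C, G or T (either case). String#count counts every character that
# belongs to the given set.
def acgt_fraction(sequence)
  return 0.0 if sequence.empty?
  sequence.count('ACGTacgt').to_f / sequence.length
end

# Hypothetical helper: classify a sequence string.
# `threshold` is an assumed tuning parameter; 1.0 means every character
# must be A, C, G or T for the sequence to count as nucleotide.
def guess_type(sequence, threshold = 1.0)
  return :unknown if sequence.empty?
  acgt_fraction(sequence) >= threshold ? :nucleotide : :protein
end
```

Call it on the sequence string only, not on the header line, e.g. guess_type(seq) or guess_type(seq, 0.9) if some N characters are expected in the genome data.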