Hi, I am very new to this, so apologies if this is a simple question. I have been trying to figure this out for days to no avail, and my Python skills are not quite there yet!
I am trying to extract the positions of all Ns from novel bacterial sequences that have been aligned to a member of the same genus. I would like the start and end positions of every run of Ns, e.g. GTCAGNNNNNTGGT.
Is there an existing tool for this, or how could I go about writing one in Python?
Many thanks.
Try regular expressions: https://docs.python.org/2/library/re.html
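A minimal sketch of the regular-expression approach, using only the standard library. The function names `read_fasta` and `find_n_runs` are my own, and the sketch assumes each FASTA record fits in memory (a 4 Mb genome does) and that you want 1-based, inclusive coordinates:

```python
import re

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def find_n_runs(sequence):
    """Return (start, end) for every run of N, 1-based and inclusive."""
    # m.start() is 0-based; m.end() is exclusive, so it already equals
    # the 1-based position of the last N in the run.
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[Nn]+", sequence)]

# Toy record built from the motif in the question.
example = [">toy_record", "GTCAGNNNNN", "TGGT"]
for name, seq in read_fasta(example):
    print(name, find_n_runs(seq))  # toy_record [(6, 10)]
```

For a real file, pass an open file handle instead of the list, e.g. `read_fasta(open("strain.fasta"))`, where `strain.fasta` is a placeholder name.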
Do you want to search for a specific pattern or for every location in which an 'N' is present?
Every location at which an N is present. I have about 36 different strains and would like to produce a list of N locations for each FASTA file, then compare these lists to find the N locations unique to each strain.
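The comparison step could be sketched with Python sets, roughly as follows. The strain names and sequences are made-up toy data, the helper names are my own, and the sketch assumes all strains share the same coordinate system (for example, consensus sequences of equal length built against the same reference); otherwise the positions are not directly comparable:

```python
def n_positions(sequence):
    """Return the set of 1-based positions at which the sequence has an N."""
    return {i + 1 for i, base in enumerate(sequence) if base in "Nn"}

def unique_positions(positions_by_strain):
    """For each strain, keep only the N positions found in no other strain."""
    unique = {}
    for strain, positions in positions_by_strain.items():
        others = set().union(
            *(p for s, p in positions_by_strain.items() if s != strain)
        )
        unique[strain] = positions - others
    return unique

# Hypothetical toy sequences standing in for the per-strain FASTA records.
strains = {
    "strainA": "ACGNNTAC",
    "strainB": "ACGNATNC",
}
positions_by_strain = {name: n_positions(seq) for name, seq in strains.items()}
print(unique_positions(positions_by_strain))
# {'strainA': {5}, 'strainB': {7}}
```

Position 4 is an N in both toy strains, so it is dropped from both results.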
So the output would be the chromosomal locations, right? Sounds like a job that can be done using Biopython. What have you tried?
If the sequence is not long, you can do it without special software: just open the FASTA file in WordPad.
That's not very helpful.
Thanks, but the sequences are > 4 million bases.
Just use SeqKit. shenwei356 has even provided a detailed example below.