Extracting N positions from fasta file
2
2
Entering edit mode
8.1 years ago

Hi I am very new to this so appologies if this is a simple question - I have been trying to figure this out for days to no avail and my python skills are not quite there yet!

I am trying to extract all N positions from novel bacterial sequences which have been aligned to a member of the same genus. I would like the start and end positions of all N motifs eg. GTCAGNNNNNTGGT

Is there an existing tool / how could I go about creating this in python?

Many thanks.

sequence • 5.2k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Do you want to search for a specific pattern or for every location in which an 'N' is present?

ADD REPLY
0
Entering edit mode

every location at which a N is present. I have something like 36 different strains and would like to produce a list of N locations in each FASTA file and then compare these lists to find the unique N locations for each strain.

ADD REPLY
0
Entering edit mode

So the output would be the chromosomal locations, right? Sounds like a job that can be done using Biopython. What have you tried?

ADD REPLY
0
Entering edit mode

If the sequence is not long you can do it without software, open the file fasta by the wordpad

ADD REPLY
0
Entering edit mode

That's not very helpful.

ADD REPLY
0
Entering edit mode

thanks but the sequences are > 4 million bases

ADD REPLY
2
Entering edit mode

Just use SeqKit. shenwei356 has even provided a detailed example below.

ADD REPLY
7
Entering edit mode
8.1 years ago

Use SeqKit. SeqKit supports Windows/Linux/Mac OS X.

$ echo -en '>seq\nGTCAGNNNNNTGGT\n' | seqkit locate --ignore-case --only-positive-strand --pattern "N+" 
seqID   patternName     pattern strand  start   end     matched
seq     N+      N+      +       6       10      NNNNN

aligned result

$ echo -en '>seq\nGTCAGNNNNNTGGT\n' | seqkit locate --ignore-case --only-positive-strand --pattern "N+" | column -t
seqID  patternName  pattern  strand  start  end  matched
seq    N+           N+       +       6      10   NNNNN

or from file and save to file

$ seqkit locate --ignore-case --only-positive-strand --pattern "N+" seqs.fa > result.xls
ADD COMMENT
0
Entering edit mode
2.9 years ago

picard ScatterIntervalsByNs https://broadinstitute.github.io/picard/command-line-overview.html#ScatterIntervalsByNs

Writes an interval list based on splitting a reference by Ns. This tool identifies positions in a reference where the bases are 'no-calls' and writes out an interval-list using the resulting coordinates. This can be used to create an interval list for whole genome sequence (WGS) for e.g. scatter-gather purposes, as an alternative to using fixed-length intervals. The number of contiguous nocalls that can be tolerated before creating a break is adjustable from the command line.

ADD COMMENT

Login before adding your answer.

Traffic: 1752 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6