Question

lowercase fasta coordinates to GFF

1

Entering edit mode

9.7 years ago

Biomonika (Noolean) 3.2k

I have fasta file which has been soft-masked to lowercase. Now, I would like to get GFF (BED) file with coordinates of all soft-masked sequences. I was wondering if such a tool is already available. Thank you.

lowercase fasta GFF • 4.4k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Biomonika (Noolean) 3.2k

Ram · Answer 1 · 2015-03-17

5

Entering edit mode

9.7 years ago

Pierre Lindenbaum 164k

Using gnu flex? (not fully tested, should work)

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks for an answer! I haven't used flex before, could you please explain in greater detail how to install/run this code? (flex: can't open jeter.l)

ADD REPLY • link 9.7 years ago by Biomonika (Noolean) 3.2k

1

Entering edit mode

in the current context GNU flex runs like a awk for C.

for example

^>.*\n

is a pattern for the a fasta header.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 9.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Perfect! Thank you very much. I will just add note for the others that code reports are 0-based.

ADD REPLY • link 9.7 years ago by Biomonika (Noolean) 3.2k

0

Entering edit mode

sorry I renamed the file when copy+paste, that should be flex code.l not flex jeter.l

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 9.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Hi,

I'm trying to do the same but while testing the above script, I figured that I'm not getting the coordinates of the softmasked region at the end of the sequence.

For example, testing the script on the following three cases,

>QWEQ
qweqertADEADASDAWDW
>ASAD
ASQWEQQqweqeASAasa
>adad
asasasdaa

I get back,

QWEQ    0       7
ASAD    7       12
ASAD    15      18

But how would I change the flex script to give back the coordinates of the sequence "adad"?

Thanks,

Cheers,

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by ramvizard • 0

0

Entering edit mode

ah yes, I forgot the trailing lowercase letters. I've added a <<EOF>> condition.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

This is very useful, but how do I run it with a locally stored .fasta file? I tried the following and I got an empty co-ordinate file and it doesn't finish running either. Also, is there a way to use multiple threads to run this?

flex code.l && \

gcc lex.yy.c && \

./a.out genome_softmasked.fasta > co-ords.gff

ADD REPLY • link 4.8 years ago by Biostar • 0

GenoMax · Answer 2 · 2020-06-09

You can do this in Perl in two steps:

1) convert your FASTA file to 1 sequence per line:

perl -lne 'if(/^(>.*)/){ $head=$1 } else { $fa{$head} .= $_ } END{ foreach $s (sort(keys(%fa))){ print "$s\n$fa{$s}\n" }}'  infile.fasta > infile.1line.fasta

2) extract lower-case segments from that FASTA file:

perl -lne 'if(/^>(\S+)/){ $n=$1} else { while(/([a-z]+)/g){ printf("%s\t%d\t%d\n",$n,pos($_)-length($1),pos($_)) } }' infile.1line.fasta > infile.bed

In case you only want sements of some min length, 60 in the example:

perl -lne 'if(/^>(\S+)/){ $n=$1} else { while(/([a-z]{60,})/g){ printf("%s\t%d\t%d\n",$n,pos($_)-length($1),pos($_)) } }' infile.1line.fasta > infile.60.bed