Question

Extracting Positions From A Fasta For A Particular Mask

0

Entering edit mode

11.0 years ago

diviya.smith ▴ 60

Does anyone know of a quick way to extract the positions for a particular mask from a fasta file? So for example if I wanted to know the position for all missing sites for chr 1 and this is coded as "." in my fasta file - how can I generate a bed file with the list of these positions? I can code something in perl where I check each line separately but I was wondering if there are any programs out there like bedtools which have already implemented this?

Input:

>chr1

AAAAA.NNNNCCCCTTTT..A

output: (positions start at 0)

chr1: 5 5

chr1: 18 19

Thank you in advance.

fasta • 4.1k views

ADD COMMENT • link updated 11.0 years ago by Alex Reynolds 36k • written 11.0 years ago by diviya.smith ▴ 60

score 0 · Answer 1 · 2013-11-25

0

Entering edit mode

11.0 years ago

Alex Reynolds 36k

I don't have any advice on using bedtools to do this, but these two outputs are not correct BED elements:

output: (positions start at 0)
chr1: 5 5
chr1: 18 19

The correct, 0-indexed set would look like:

chr1   5    6
chr1   18   20

You might use something like awk to process FASTA data. Here's some untested code:

$ awk ' \
{ \
    if ($0 ~ /^>/) { \
        chr = substr($1, 2); \
        totalLength = 0; \
        previousChar = 'Z'; \
    } \
    else { \
        split($0, chars, ""); \
        for (idx = 1; idx <= length(chars); idx++) { \
             currentChar = chars[idx]; \
             if ((currentChar == ".") && (previousChar != currentChar)) { \
                 start = idx - 1; \
                 stop = start + 1; \
             } \
             else if ((currentChar == ".") && (previousChar == currentChar)) { \
                 stop++; \
             } \
             else if ((currentChar != ".") && (previousChar == ".")) { \
                 print chr"\t"start"\t"stop; \
             } \
             previousChar = currentChar; \
        } \
        totalLength += length(chars); \
    } \
}' input.fasta

ADD COMMENT • link 11.0 years ago by Alex Reynolds 36k

0

Entering edit mode

Yes, thats right. The output should be as you pointed out. Thank you for the correction. However, I tried your command as an awk script but I get the wrong answer. chr1 19 21. Also any suggestions no how to expand this script for multiline fasta files? For the example I just used a single line but the file of course has 1000s of lines of data.

ADD REPLY • link 11.0 years ago by diviya.smith ▴ 60

0

Entering edit mode

Actually adding an additional condition, if ((previousChar == '.') && (currentChar != '.')) { print chr"\t"start"\t"stop; } will print the output as needed. Still trying to figure out the multiple line issue though - suggestions would be very welcome.

ADD REPLY • link 11.0 years ago by diviya.smith ▴ 60

0

Entering edit mode

Perils of running untested code. I modified the awk script that I think will deal with the bugs you noted.

The modified script appears to run okay on your test FASTA input:

$ awk '...' foo.fa
chr1    5       6
chr1    18      20

I modified the FASTA data to include a trailing period and multiline input support:

$ more bar.fa
>chr1
AAAAA.NNNNCCCCTTTT...
.ACGT
>chr2
GATTICA...GATTICA...
.CAT..T

Running the script on this:

$ awk '...' bar.fa
chr1    5       6
chr1    18      21
chr1    22      23
chr2    7       10
chr2    17      20
chr2    21      22
chr2    25      27

ADD REPLY • link 11.0 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks so much for your help. I got it to work up to this point too. I added an end condition instead of the if (newElementFlag == 1) { \ print chr"\t"start"\t"stop; \ } Any suggestion for merging cases where "..." continues on multiple line. Ideally, I would like to merge the cases, chr1 18 21 chr1 22 23

as chr1 18 23

ADD REPLY • link 11.0 years ago by diviya.smith ▴ 60

0

Entering edit mode

Given input:

$ more bar.fa
>chr1
AAAAA.NNNNCCCCTTTT...
.ACGT
>chr2
GATTICA...GATTICA...
.CAT..T

Here is output from the modified script:

$ awk '...' bar.fa  
chr1    5    6          
chr1    18    22
chr2    7    10
chr2    17    21
chr2    4    6

I seemed to have forgotten that awk arrays are 1-based.

ADD REPLY • link 11.0 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks. This is very helpful. Just a couple of changes (positions were not correct as the "totalLength" was not included in updating start and including the case if the last character is "."

BEGIN { \
totalLength = 0; \
newElementFlag = 0; \
char ="-";
previousChar = "Z"; \
 } \
{   if ($0 ~ /^>/) { \
    chr = substr($1, 2); \
} \
else { \
    split($0, chars, ""); \
    for (idx = 1; idx <= length(chars); idx++) { \
         currentChar = chars[idx]; \
         if ((currentChar == char) && (previousChar != currentChar)) { \
             start = idx + totalLength - 1; \
             stop = start + 1; \
         } \
         else if ((currentChar == char) && (previousChar == currentChar)) { \
             stop++; \
         } \
         else if ((currentChar != char) && (previousChar == char)) { \
             print chr"\t"start"\t"stop; \
         } \
         previousChar = currentChar; \
    } \
    totalLength += length(chars); \
} \
}END {
    if (currentChar == char) {
            print chr"\t"start"\t"stop;
    }
}

ADD REPLY • link 11.0 years ago by diviya.smith ▴ 60