Extracting Positions From A Fasta For A Particular Mask
1
0
Entering edit mode
11.0 years ago
diviya.smith ▴ 60

Does anyone know of a quick way to extract the positions for a particular mask from a fasta file? So for example if I wanted to know the position for all missing sites for chr 1 and this is coded as "." in my fasta file - how can I generate a bed file with the list of these positions? I can code something in perl where I check each line separately but I was wondering if there are any programs out there like bedtools which have already implemented this?

Input:

>chr1

AAAAA.NNNNCCCCTTTT..A

output: (positions start at 0)

<chr> <start_pos> <end_pos>

chr1: 5 5

chr1: 18 19

Thank you in advance.

fasta • 4.1k views
ADD COMMENT
0
Entering edit mode
11.0 years ago

I don't have any advice on using bedtools to do this, but these two outputs are not correct BED elements:

output: (positions start at 0)
chr1: 5 5
chr1: 18 19

The correct, 0-indexed set would look like:

chr1   5    6
chr1   18   20

You might use something like awk to process FASTA data. Here's some untested code:

$ awk ' \
{ \
    if ($0 ~ /^>/) { \
        chr = substr($1, 2); \
        totalLength = 0; \
        previousChar = 'Z'; \
    } \
    else { \
        split($0, chars, ""); \
        for (idx = 1; idx <= length(chars); idx++) { \
             currentChar = chars[idx]; \
             if ((currentChar == ".") && (previousChar != currentChar)) { \
                 start = idx - 1; \
                 stop = start + 1; \
             } \
             else if ((currentChar == ".") && (previousChar == currentChar)) { \
                 stop++; \
             } \
             else if ((currentChar != ".") && (previousChar == ".")) { \
                 print chr"\t"start"\t"stop; \
             } \
             previousChar = currentChar; \
        } \
        totalLength += length(chars); \
    } \
}' input.fasta
ADD COMMENT
0
Entering edit mode

Yes, thats right. The output should be as you pointed out. Thank you for the correction. However, I tried your command as an awk script but I get the wrong answer. chr1 19 21. Also any suggestions no how to expand this script for multiline fasta files? For the example I just used a single line but the file of course has 1000s of lines of data.

ADD REPLY
0
Entering edit mode

Actually adding an additional condition, if ((previousChar == '.') && (currentChar != '.')) { print chr"\t"start"\t"stop; } will print the output as needed. Still trying to figure out the multiple line issue though - suggestions would be very welcome.

ADD REPLY
0
Entering edit mode

Perils of running untested code. I modified the awk script that I think will deal with the bugs you noted.

The modified script appears to run okay on your test FASTA input:

$ awk '...' foo.fa
chr1    5       6
chr1    18      20

I modified the FASTA data to include a trailing period and multiline input support:

$ more bar.fa
>chr1
AAAAA.NNNNCCCCTTTT...
.ACGT
>chr2
GATTICA...GATTICA...
.CAT..T

Running the script on this:

$ awk '...' bar.fa
chr1    5       6
chr1    18      21
chr1    22      23
chr2    7       10
chr2    17      20
chr2    21      22
chr2    25      27
ADD REPLY
0
Entering edit mode

Thanks so much for your help. I got it to work up to this point too. I added an end condition instead of the if (newElementFlag == 1) { \ print chr"\t"start"\t"stop; \ } Any suggestion for merging cases where "..." continues on multiple line. Ideally, I would like to merge the cases, chr1 18 21 chr1 22 23

as chr1 18 23

ADD REPLY
0
Entering edit mode

Given input:

$ more bar.fa
>chr1
AAAAA.NNNNCCCCTTTT...
.ACGT
>chr2
GATTICA...GATTICA...
.CAT..T

Here is output from the modified script:

$ awk '...' bar.fa  
chr1    5    6          
chr1    18    22
chr2    7    10
chr2    17    21
chr2    4    6

I seemed to have forgotten that awk arrays are 1-based.

ADD REPLY
0
Entering edit mode

Thanks. This is very helpful. Just a couple of changes (positions were not correct as the "totalLength" was not included in updating start and including the case if the last character is "."

BEGIN { \
totalLength = 0; \
newElementFlag = 0; \
char ="-";
previousChar = "Z"; \
 } \
{   if ($0 ~ /^>/) { \
    chr = substr($1, 2); \
} \
else { \
    split($0, chars, ""); \
    for (idx = 1; idx <= length(chars); idx++) { \
         currentChar = chars[idx]; \
         if ((currentChar == char) && (previousChar != currentChar)) { \
             start = idx + totalLength - 1; \
             stop = start + 1; \
         } \
         else if ((currentChar == char) && (previousChar == currentChar)) { \
             stop++; \
         } \
         else if ((currentChar != char) && (previousChar == char)) { \
             print chr"\t"start"\t"stop; \
         } \
         previousChar = currentChar; \
    } \
    totalLength += length(chars); \
} \
}END {
    if (currentChar == char) {
            print chr"\t"start"\t"stop;
    }
}
ADD REPLY

Login before adding your answer.

Traffic: 1736 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6