Does anyone know of a quick way to extract the positions of a particular mask character from a FASTA file? For example, if I wanted to know the positions of all missing sites on chr1, and these are coded as "." in my FASTA file, how can I generate a BED file listing these positions? I can code something in Perl that checks each line separately, but I was wondering whether any existing programs, like bedtools, already implement this.
Input:
>chr1
AAAAA.NNNNCCCCTTTT..A
output: (positions start at 0)
<chr> <start_pos> <end_pos>
chr1: 5 5
chr1: 18 19
Thank you in advance.
Yes, that's right. The output should be as you pointed out. Thank you for the correction. However, I tried your command as an awk script and I get the wrong answer: chr1 19 21. Also, any suggestions on how to extend this script to multiline FASTA files? For the example I used a single line, but the real file of course has thousands of lines of data.
Actually, adding an additional condition, if ((previousChar == ".") && (currentChar != ".")) { print chr"\t"start"\t"stop; }, prints the output as needed. I'm still trying to figure out the multiline issue, though; suggestions would be very welcome.
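For anyone following along, here is a minimal Python sketch of the same transition test on a single sequence string. It is not the awk script from this thread; the function name mask_runs is hypothetical, and it assumes BED-style 0-based, half-open intervals, so the end coordinate is one past the last masked position.

```python
def mask_runs(seq, mask="."):
    """Return 0-based, half-open (start, end) runs of `mask` in `seq`."""
    runs = []
    start = None  # start of the run currently open, if any
    for pos, char in enumerate(seq):
        if char == mask and start is None:
            start = pos                # a run begins here
        elif char != mask and start is not None:
            runs.append((start, pos))  # previous char was mask, current is not
            start = None
    if start is not None:              # sequence ended while a run was open
        runs.append((start, len(seq)))
    return runs


for start, end in mask_runs("AAAAA.NNNNCCCCTTTT..A"):
    print(f"chr1\t{start}\t{end}")
```

On the example sequence from the question this should print chr1 5 6 and chr1 18 20. The check after the loop covers a sequence that ends in ".", which the transition test alone would miss.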
Perils of running untested code. I modified the awk script in a way that I think deals with the bugs you noted.
The modified script appears to run okay on your test FASTA input:
I modified the FASTA data to include a trailing period and to span multiple lines:
Running the script on this:
Thanks so much for your help. I got it to work up to this point too. I added an end condition instead of the if (newElementFlag == 1) { print chr"\t"start"\t"stop; } block. Any suggestions for merging cases where a run of "." continues across multiple lines? Ideally, I would like to merge the cases
chr1 18 21
chr1 22 23
as
chr1 18 23
Given input:
Here is output from the modified script:
I seem to have forgotten that awk arrays are 1-based.
Thanks. This is very helpful. Just a couple of changes: the positions were not correct because "totalLength" was not included when updating start, and I added the case where the last character is ".".
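For comparison, here is a rough Python sketch of the multiline case discussed above. It is an alternative, not the awk script from this thread: the function name fasta_mask_to_bed is hypothetical, and it assumes 0-based, half-open BED coordinates. A running offset plays the role of "totalLength", and because the open run is only closed when a non-mask character (or a new header, or end of file) is seen, runs that continue across line breaks are merged automatically.

```python
import io


def fasta_mask_to_bed(handle, mask="."):
    """Yield (chrom, start, end) for runs of `mask`; 0-based, half-open."""
    chrom = None
    offset = 0    # bases already read in the current record
    start = None  # start of the open run, in record coordinates
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if start is not None:      # previous record ended inside a run
                yield chrom, start, offset
            fields = line[1:].split()
            chrom = fields[0] if fields else ""
            offset, start = 0, None
            continue
        for i, char in enumerate(line):
            pos = offset + i           # position within the whole record
            if char == mask:
                if start is None:
                    start = pos
            elif start is not None:
                yield chrom, start, pos
                start = None
        offset += len(line)            # a run left open carries to the next line
    if start is not None:              # file ended inside a run
        yield chrom, start, offset


example = io.StringIO(">chr1\nAAAAA.NNNNCCCCTTTT...\n..AAAA.\n")
for chrom, start, end in fasta_mask_to_bed(example):
    print(f"{chrom}\t{start}\t{end}")
```

For this made-up two-line record, the run that starts at position 18 on the first line and continues onto the second is reported once, as chr1 18 23, and the trailing "." is reported as chr1 27 28. To run it on a real file, pass an open file handle in place of the io.StringIO object.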