11.3 years ago
Ram
Does anybody know how to convert PGF (probe group file) format to FASTQ or FASTA format?
Thanks a lot.
OK, try this out. If you're on a Linux box, save it as a file and then make the file executable (e.g. chmod 0555 [filename]) or execute the file using perl [filename]. Let me know if it works for you. It worked on your test file.
#!/usr/bin/perl
# pgfToFasta.pl - A utility to construct 75-mers based on the sequence data contained within a Probe Group Format (PGF) file.
# This script takes one argument, which is a file in probe group format. It outputs a file of the same name with a .fasta extension, next to the input file.
# Only minimal sanity checking is performed.
use warnings;
use strict;

# prep source and destination files
my $fileName = $ARGV[0];
die "Usage: perl pgfToFasta.pl [filename]\n" unless defined $fileName;
my $fastaName;
if ($fileName =~ /(.+)\.pgf$/) {
    $fastaName = "$1.fasta";
} else {
    die "The input file does not have a PGF extension.\n";
}
open(my $source, '<', $fileName) or die "Cannot open $fileName: $!\n";
open(my $fasta, '>', $fastaName) or die "Cannot write $fastaName: $!\n";

# prep variables for conversion
my $probeSeqCounter = 1;
my $inHeader = 1;
my $currentSequence = '';
my $fastaCounter = 1;

while (<$source>) {
    if ($inHeader) {
        if ($_ =~ /^#/) {
            next;
        } else {
            $inHeader = 0;
            next; # skips the column headers
        }
    }
    next unless ($_ =~ /\S+/); # skip the blank lines
    s/\R\z//; # trim newline character (platform agnostic)
    my @columns = split /\s+/, $_; # split the line of text into columns based on any kind of whitespace
    $currentSequence .= pop(@columns); # In this initial version, we don't care about anything except the sequence data
    # every third probe sequence completes one 75-mer (3 x 25 bases)
    unless (++$probeSeqCounter < 4) {
        print $fasta ">Sequence$fastaCounter\n$currentSequence\n";
        ++$fastaCounter;
        $currentSequence = '';
        $probeSeqCounter = 1;
    }
}

# cleanup
close($source);
close($fasta) or die "Cannot finish writing $fastaName: $!\n";
I added some sample lines to your file and got the following output:
>Sequence1
ATTGCTTATCATAGACTAGCTACTGATTGCTTATCATAGACTAGCTACTGATTGCTTATCATAGACTAGCTACTG
>Sequence2
ATTGCTTATCATAGACTATTTTTTGATTGCTTATCATAGACTAGAAAAAGATTGCTTATCATAGACTAGCCCCCG
Probe group format doesn't have per-base quality information, so true conversion to FASTQ is probably a no-go.
I can help you with a script to convert to FASTA if you can tell me how you would like to construct your identifiers in the FASTA file based on the probe-group file information.
If the OP needs a FASTQ file, they are probably using a utility that expects FASTQ. You could assign a value of # for all of the quality scores.
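For what it's worth, here is a minimal sketch of that placeholder-quality idea in Perl. It takes FASTA text and writes each record as a four-line FASTQ record whose quality string is # repeated once per base. The sub name fasta_to_fastq and the script name are made up for illustration, and it assumes one sequence line per record. Under the common Phred+33 encoding, # means quality 2, so downstream tools will see every base as very low quality.

```perl
#!/usr/bin/perl
# fastaToFastq.pl - hypothetical helper for illustration only.
use strict;
use warnings;

# Turn FASTA text (ASSUMPTION: one sequence line per record) into FASTQ text,
# using '#' as a placeholder quality character for every base.
sub fasta_to_fastq {
    my ($fastaText) = @_;
    my $fastq = '';
    my $header;
    for my $line (split /\R/, $fastaText) {
        next unless $line =~ /\S/;              # skip blank lines
        if ($line =~ /^>(.*)/) {
            $header = $1;                       # remember the identifier
        } elsif (defined $header) {
            my $quality = '#' x length($line);  # one '#' per base
            $fastq .= "\@$header\n$line\n+\n$quality\n";
            undef $header;
        }
    }
    return $fastq;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";
    my $text = do { local $/; <$in> };          # slurp the whole file
    close($in);
    print fasta_to_fastq(defined $text ? $text : '');
}
```

Run it as perl fastaToFastq.pl [filename].fasta > [filename].fastq. Whether a placeholder quality is acceptable depends entirely on the tool that consumes the FASTQ file.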
Thanks a lot for replying! It would be better to convert to FASTA format. The identifier in the FASTA file can be constructed as
OK, that's a start, thanks. Am I understanding correctly that you want to concatenate your sequences in sets of three? If you could provide a dropbox link to a sample file, that would really be helpful. The format isn't strictly specified, so being able to write code around one of your real files will ensure better results.
Thanks, Deedee, for replying! Here is the link to the PGF file: https://www.dropbox.com/s/z1etxewjrx633jw/new%20file%20.pgf?m I would like to remove everything else in the file and keep 75 bases per line through to the end, which would make it a FASTA file. Thanks
Hello Deedee, have you got any suggestions on how to proceed with it?
Thanks
Hey, sorry I didn't get back to you. It's been very hectic. When I took a look at your file originally, the whitespace was very confusing. It seems your columns are space-delimited rather than tab-delimited, so it's hard for me to understand the construction of the file. Is this file part of a "real" instrument-generated PGF file? Anyway, I'll go ahead and write a script based on my best guess. I'm also not understanding how you want to construct the FASTA identifier for each 75-mer based on what you've already said and what I see in your example file, but I'll go ahead and make my script write arbitrary incrementing FASTA headers and we can fix it if need be. Expect a follow-up post in about two hours. Gotta take care of some stuff and then I'll write it.
Thank you so much! I will go ahead with it and let you know if it works. If any error or problem comes up, I will try to solve it myself without troubling you any further. I am really sorry if I took up any of your working hours.
Thanks
I'm happy to help! It's good practice for me; I just couldn't make much sense of the source file because it wasn't as regular as something generated by an instrument. Definitely let me know how the script works and I can adjust it accordingly.
Thank you so much for your help!! It works really well. Thanks a lot.. :))
Glad to hear it! Can you mark the answer as "accepted" if indeed it solved your problem?
Hi Deedee, can I ask one more thing: is it possible to output headers like probeset_id with atom_id, and to restrict the sequences to 25-mers rather than 75-mers?
Thank you so much!
So let me see if I understand this. You want the header for each FASTA sequence to be probeset_id plus a counter (probeset_id1, probeset_id2, probeset_id3...). Then each sequence of the PGF file should be put into its own FASTA sequence instead of three at a time being concatenated. Correct?
Yes, like:
>16457848_1
ATTGCTTATCATAGACTAGCTACTG
Thanks.
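A rough sketch of how that could look, not a tested drop-in: each probe sequence becomes its own FASTA record with a probeset_id_atom_id header. The layout is guessed, since the sample file's whitespace was irregular. The sketch assumes that probeset lines start in the first column with a numeric ID, that atom lines are indented and start with a numeric ID, and that probe lines end with the sequence. Lines fitting none of these patterns are skipped. The script and sub names are made up, and nothing is truncated: each probe is assumed to already be a 25-mer, as in the example.

```perl
#!/usr/bin/perl
# pgfToProbeFasta.pl - hypothetical variant for illustration only.
use strict;
use warnings;

# Convert PGF text to FASTA text, one record per probe sequence,
# with headers of the form >probesetId_atomId.
# ASSUMPTIONS: probeset lines are unindented, atom lines are indented,
# and a probe line has several columns with the sequence last.
sub pgf_to_probe_fasta {
    my ($pgfText) = @_;
    my ($probesetId, $atomId);
    my $fasta = '';
    for my $line (split /\R/, $pgfText) {
        next if $line =~ /^#/;                  # header/comment lines
        next unless $line =~ /\S/;              # blank lines
        my $indented = ($line =~ /^\s/);
        my @columns = split ' ', $line;         # split on any whitespace
        if (@columns >= 2 && $columns[-1] =~ /^[ACGTN]+$/i) {
            # probe line: emit it once a probeset and an atom are known
            next unless defined $probesetId && defined $atomId;
            $fasta .= ">${probesetId}_$atomId\n$columns[-1]\n";
        } elsif (!$indented && $columns[0] =~ /^\d+$/) {
            $probesetId = $columns[0];          # new probeset starts
            undef $atomId;
        } elsif ($indented && $columns[0] =~ /^\d+$/) {
            $atomId = $columns[0];              # new atom within the probeset
        }
    }
    return $fasta;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $fileName = $ARGV[0];
    (my $fastaName = $fileName) =~ s/\.pgf$/.fasta/
        or die "The input file does not have a PGF extension.\n";
    open(my $in, '<', $fileName) or die "Cannot open $fileName: $!\n";
    my $text = do { local $/; <$in> };          # slurp the whole file
    close($in);
    open(my $out, '>', $fastaName) or die "Cannot write $fastaName: $!\n";
    print $out pgf_to_probe_fasta(defined $text ? $text : '');
    close($out) or die "Cannot finish writing $fastaName: $!\n";
}
```

If an atom holds more than one probe, the headers will repeat, so a per-probe counter or the probe_id may be a safer suffix in that case.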