Format Conversion: PGF (Probe Group File) Format to FASTQ
11.3 years ago
Ram ▴ 190

Does anybody know how to convert PGF (probe group file) format to FASTQ or FASTA format?

Thanks a lot.

format • 3.4k views

Probe group format doesn't have per-base quality information, so true conversion to FASTQ is probably a no-go.

I can help you with a script to convert to FASTA if you can tell me how you would like to construct your identifiers in the FASTA file based on the probe-group file information.

If the OP needs a FASTQ file, they are probably using a utility that expects FASTQ. You could assign a value of # for all of the quality scores.
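A minimal Perl sketch of that placeholder-quality idea might look like the following. The sub name fastq_record and the identifier probe_1 are made up for illustration. Note that '#' corresponds to Q2 in the Phred+33 encoding, so any downstream tool that filters on quality will treat every base as very poor; pass a different character as the third argument if that matters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one FASTQ record with a constant placeholder quality character.
# The quality string carries no real information; it only satisfies the
# four-line FASTQ layout (header, sequence, '+', qualities of equal length).
sub fastq_record {
    my ($id, $seq, $qual_char) = @_;
    $qual_char = '#' unless defined $qual_char;   # '#' is Q2 in Phred+33
    return "\@$id\n$seq\n+\n" . ($qual_char x length($seq)) . "\n";
}

# Hypothetical identifier; the sequence is the 25-mer quoted in this thread.
print fastq_record('probe_1', 'ATTGCTTATCATAGACTAGCTACTG');
```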

Thanks a lot for replying! It would be better to convert to FASTA format. The identifier in the FASTA file can be something like gi|1503109. Since each probe_sequence in the file has a probe_length of 25, it would be good to put three probe_sequence values on one line (75 bases per line). As for the identifier, I can also edit it afterwards. Thanks a lot for your help!

OK, that's a start, thanks. Am I understanding correctly that you want to concatenate your sequences in sets of three? If you could provide a dropbox link to a sample file, that would really be helpful. The format isn't strictly specified, so being able to write code around one of your real files will ensure better results.

Thanks Deedee for replying! Here is the link to the PGF file: https://www.dropbox.com/s/z1etxewjrx633jw/new%20file%20.pgf?m The idea is to remove everything else from the file and keep 75 bases per line through to the end, which would make it a FASTA file. Thanks

Hello Deedee, Have you got any suggestion on how to proceed with it?

Thanks

Hey, sorry I didn't get back to you. It's been very hectic. When I took a look at your file originally, the whitespace was very confusing. It seems your columns are space-delimited rather than tab-delimited, so it's hard for me to understand the construction of the file. Is this file part of a "real" instrument-generated PGF file? Anyway, I'll go ahead and write a script based on my best guess. I'm also not understanding how you want to construct the FASTA identifier for each 75-mer based on what you've already said and what I see in your example file, but I'll go ahead and make my script write arbitrary incrementing FASTA headers and we can fix it if need be. Expect a follow-up post in about two hours. Gotta take care of some stuff and then I'll write it.

Thank you so much! I will go ahead with it and let you know if it works. If any error or problem comes up, I will try to solve it myself without giving you any more trouble. I am really sorry if I took up important time during your working hours.

Thanks

I'm happy to help! It's good practice for me; I just couldn't make much sense of the source file because it wasn't as regular as something generated by an instrument. Definitely let me know how the script works and I can adjust it accordingly.

Thank you so much for your help!! It works really well. Thanks a lot.. :))

Glad to hear it! Can you mark the answer as "accepted" if indeed it solved your problem?

Hi Deedee, can I ask one more thing: is it possible to write headers made of the probeset_id with the atom_id, and to restrict the sequences to 25-mers rather than 75-mers?

Thank you so much!

So let me see if I understand this. You want the header for each FASTA sequence to be probeset_id plus a counter (probeset_id1, probeset_id2, probeset_id3...). Then each sequence of the PGF file should be put into its own FASTA sequence instead of three at a time being concatenated. Correct?

Yes, like:

>16457848_1
ATTGCTTATCATAGACTAGCTACTG
>16457848_2
ATTGCTTATCATAGACTAGCTACTG
>17632785_1
ATTGCTTATCATAGACTAGCTACTG
>17632785_2
ATTGCTTATCATAGACTAGCTACTG
>17632785_3
ATTGCTTATCATAGACTAGCTACTG

where 17632785 is the probeset_id and 3 is the sequence number within that probeset_id.

Thanks.
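
One way the requested layout could be produced is sketched below in Perl. This is a rough sketch under stated assumptions, not a tested converter for real PGF files: it assumes a probeset line starts in column 0 with a numeric probeset_id, that a probe line carries the probe sequence as its last whitespace-separated field, and that every other line can be ignored. The sub name pgf_to_fasta_records and the demo lines (including the probe IDs 101, 102 and 201) are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only. Assumptions (not confirmed anywhere in this thread):
#   - a probeset line starts in column 0 with a numeric probeset_id
#   - a probe line has the probe sequence as its last whitespace-separated field
#   - anything else (# header lines, column headers, atom_id lines) is ignored
sub pgf_to_fasta_records {
    my @lines = @_;
    my $probeset_id = 'unknown';   # used if a probe appears before any probeset line
    my $n = 0;                     # sequence number within the current probeset
    my @records;
    for my $line (@lines) {
        next if $line =~ /^#/;         # header lines
        next unless $line =~ /\S/;     # blank lines
        my @fields = split ' ', $line; # any whitespace, leading whitespace ignored
        if ($fields[-1] =~ /^[ACGTN]+$/i) {
            ++$n;
            push @records, ">${probeset_id}_$n\n$fields[-1]\n";
        }
        elsif ($line =~ /^(\d+)/) {
            ($probeset_id, $n) = ($1, 0);  # new probeset: restart numbering
        }
    }
    return @records;
}

# Made-up demo lines; only the probeset IDs and the sequence come from the thread.
my @demo = (
    "#%demo header line\n",
    "16457848\tmain\n",
    "\t1\n",
    "\t\t101\tATTGCTTATCATAGACTAGCTACTG\n",
    "\t\t102\tATTGCTTATCATAGACTAGCTACTG\n",
    "17632785\tmain\n",
    "\t2\n",
    "\t\t201\tATTGCTTATCATAGACTAGCTACTG\n",
);
print pgf_to_fasta_records(@demo);
```

With the demo lines above this prints three records with the headers >16457848_1, >16457848_2 and >17632785_1. If the indentation in a file has been lost, atom_id lines would be mistaken for probeset lines, so the line classification would need adjusting for such a file.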

11.2 years ago
Dan D 7.4k

OK, try this out. If you're on a Linux box, save it as a file and then either make the file executable (e.g. chmod 0755 [filename]) or run it with perl [filename], passing your PGF file as the only argument. Let me know if it works for you. It worked on your test file.

#!/usr/bin/perl
# pgfToFasta.pl - A utility to construct 75-mers from the sequence data contained in a Probe Group Format (PGF) file.
# This script takes one argument, a file in probe group format. It writes a file with the same base name and a .fasta extension alongside the input file.
# Only minimal sanity checking is performed.
use warnings;
use strict;

# prep source and destination files
my $fileName = $ARGV[0];
die "Usage: $0 file.pgf\n" unless defined $fileName;
my $fastaName;

if ($fileName =~ /(.+)\.pgf$/) {
    $fastaName = "$1.fasta";
} else {
    die "The input file does not have a PGF extension.\n";
}

# three-argument open with lexical handles; stop with a message if a file cannot be opened
open(my $source, '<', $fileName) or die "Cannot read $fileName: $!\n";
open(my $fasta, '>', $fastaName) or die "Cannot write $fastaName: $!\n";

# prep variables for conversion
my $probeSeqCounter = 0; # probe sequences collected for the current FASTA record
my $inHeader = 1;
my $currentSequence = '';
my $fastaCounter = 1;

while (<$source>) {
    if ($inHeader) {
        if (/^#/) {
            next;
        } else {
            $inHeader = 0;
            next; # skips the column headers
        }
    }
    next unless /\S/; # skip the blank lines
    s/\R\z//; # trim the line ending (platform agnostic; \R needs Perl 5.10+)
    my @columns = split ' ', $_; # split the line into columns on any kind of whitespace
    $currentSequence .= pop(@columns); # in this initial version, we don't care about anything except the sequence data
    if (++$probeSeqCounter == 3) {
        print $fasta ">Sequence$fastaCounter\n$currentSequence\n";
        ++$fastaCounter;
        $currentSequence = '';
        $probeSeqCounter = 0;
    }
}

# write any leftover sequences (fewer than three) instead of silently dropping them
print $fasta ">Sequence$fastaCounter\n$currentSequence\n" if $currentSequence ne '';

# cleanup
close($source);
close($fasta) or die "Error closing $fastaName: $!\n";

I added some sample lines to your file and got the following output:

>Sequence1 
ATTGCTTATCATAGACTAGCTACTGATTGCTTATCATAGACTAGCTACTGATTGCTTATCATAGACTAGCTACTG 
>Sequence2 
ATTGCTTATCATAGACTATTTTTTGATTGCTTATCATAGACTAGAAAAAGATTGCTTATCATAGACTAGCCCCCG