Question

Matching and extracting contents of two NCBI genome files

13.2 years ago

Kiran ▴ 80

Hello friends,

I am new to Perl programming and still need to practice regular expressions and NCBI file handling, but I have a task to complete. I have done half of it; could anybody help with the rest?

File 1:

Candida glabrata CBS 138 chromosome D, complete genome - 1..651701
283 proteins
Location    Strand    Length    PID        Gene    Synonym        Code    COG Product
17042..17914    -    290    50285983    -    CAGL0D00154g    -    -               
23693..25075    +    460    50285985    -    CAGL0D00176g    -    -    
27559..28710    +    383    50285987    -    CAGL0D00198g    -    -    
29345..29914    +    189    50285989    -    CAGL0D00220g    -    -

and so on, for 40 lines.

File 2 contains:

>ref|NC_006027.1|:c17914-17042 hypothetical protein [Candida glabrata CBS 138]
ATGGAAACAGAACATCAGGCAGACAAAAATGCGGAATTGGGTTATGACAGTGGATCAACCGTTGCTCCCC
CCAATAAATATAGTACATTACGCTCTAGGTTCAATTTAGGACCTGACACTATGAGAAATCATGTTATTGC
CTTTTTTGGGGAGTTGGTTGGCACATTCATGTTTTTATGGTGTGCCTATGTTATTGCAAATATTGCAAAT

>ref|NC_006027.1|:23693-25075 hypothetical protein [Candida glabrata CBS 138]
ATGTCTTCTCAAGTTAACGAACCAGAATTTCAACAAGCTTACCACGAAGTTGTTTCCTCTTTGAAGGACT
CTTCTTTGTTCGAAAAGCACCCAAAATATGCTAAGGTTCTTCCAGTTGTCTCTGTCCCAGAGAGAATCAT

and so on. The number of locations in File 1 equals the number of sequences in File 2.

Here is what I have to do: if a location in File 1, e.g. "17042..17914", matches the coordinates in a header of File 2, e.g. "c17914-17042" (a match on either the upper or the lower limit is enough), then the FASTA header in File 2 should be removed and replaced with ">CAGL0D00154g", i.e. the value in the Synonym column of File 1 for that location.

My output file should then look as follows:

File 3:

>CAGL0D00154g
ATGGAAACAGAACATCAGGCAGACAAAAATGCGGAATTGGGTTATGACAGTGGATCAACCGTTGCTCCCC
CCAATAAATATAGTACATTACGCTCTAGGTTCAATTTAGGACCTGACACTATGAGAAATCATGTTATTGC
CTTTTTTGGGGAGTTGGTTGGCACATTCATGTTTTTATGGTGTGCCTATGTTATTGCAAATATTGCAAAT

>CAGL0D00176g
ATGTCTTCTCAAGTTAACGAACCAGAATTTCAACAAGCTTACCACGAAGTTGTTTCCTCTTTGAAGGACT
CTTCTTTGTTCGAAAAGCACCCAAAATATGCTAAGGTTCTTCCAGTTGTCTCTGTCCCAGAGAGAATCAT

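Here is what I have done so far:

foreach my $line (@File1) {
    chomp $line;

    # Columns: Location, Strand, Length, PID, Gene, Synonym
    my ($f1, $f2, $f3, $f4, $f5, $f6) = split /\t+/, $line;
    push @F1, $f1;    # Location
    push @F2, $f2;    # Strand
    # ... and so on for the remaining columns
}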

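@F1 contains the Location column (17042..17914, ...) and @F6 contains the Synonym column (CAGL0D00176g, ...).

In the same way, I collected the upper limits of the locations in File 2 (17914, 25075, ...) into @B using:

foreach my $line (@File2) {
    chomp $line;
    # Header lines only: capture the number after the hyphen.
    # Note: for complement headers (c17914-17042) this is the lower coordinate.
    if ($line =~ /^>\S*:c?\d+-(\d+)/) {
        push @B, $1;
    }
}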

Could anybody help me write code to produce the output specified above?

Looking forward to your code.

Thank you

ncbi genome • 2.4k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 13.2 years ago by Kiran ▴ 80

Please can you reformat this question to make it readable. As it stands, no-one is likely to answer because it is almost unintelligible.

ADD REPLY • link 13.2 years ago by Neilfws 49k

...and improve the spelling, grammar, etc...

ADD REPLY • link 13.2 years ago by Casey Bergman 18k

score 2 · Answer 1 · 2011-09-22

You do not specify any file size, but if File 2 is fairly small (or you have a 64-bit machine with plenty of memory), I would suggest this easy solution:

Load File 2 into a hash where the key is the FASTA header (either the original header or a processed form, such as the lookup fragment you captured). Any value from File 1 can then be looked up directly in the hash, and you can write out the multi-FASTA with the modified headers on the fly.
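A minimal sketch of that idea, under a few assumptions that are mine rather than the poster's: the sub names (coord_key, index_fasta, relabel) are hypothetical, the FASTA headers always carry coordinates in the form ":17042-17914" or ":c17914-17042", the Synonym is the sixth whitespace-separated field of each table row, and both coordinates are required to match (stricter than the "either limit" wording in the question). The key is the coordinate pair normalised to "low-high", so complement-strand headers and table locations end up with the same key. The demo data are shortened copies of the records in the question.

```perl
use strict;
use warnings;

# Normalise two coordinates to "low-high" so that c17914-17042 (FASTA header)
# and 17042..17914 (table location) produce the same hash key.
sub coord_key {
    my ($x, $y) = @_;
    return $x <= $y ? "$x-$y" : "$y-$x";
}

# Index FASTA lines: coordinate key => sequence text (newline-terminated lines).
sub index_fasta {
    my ($lines) = @_;
    my (%seq_for, $key);
    for my $raw (@$lines) {
        (my $line = $raw) =~ s/\r?\n\z//;      # strip line ending on a copy
        if ($line =~ /^>\S*?:c?(\d+)-(\d+)/) { # header with coordinates
            $key = coord_key($1, $2);
            $seq_for{$key} = '';
        }
        elsif ($line =~ /^>/) {                # header without coordinates:
            undef $key;                        # ignore this record
        }
        elsif (defined $key && $line =~ /\S/) {
            $seq_for{$key} .= "$line\n";       # sequence line
        }
    }
    return \%seq_for;
}

# Walk the protein table and build FASTA text with the Synonym as the header.
sub relabel {
    my ($table_lines, $seq_for) = @_;
    my $out = '';
    for my $line (@$table_lines) {
        # Data rows start with "start..end"; title and column-header rows do not.
        next unless $line =~ /^(\d+)\.\.(\d+)\s/;
        my $key     = coord_key($1, $2);
        my $synonym = (split ' ', $line)[5];   # assumed: 6th field is Synonym
        next unless defined $synonym && exists $seq_for->{$key};
        $out .= ">$synonym\n$seq_for->{$key}";
    }
    return $out;
}

# Small demo using shortened records from the question.
my @table = (
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct",
    "17042..17914\t-\t290\t50285983\t-\tCAGL0D00154g\t-\t-",
    "23693..25075\t+\t460\t50285985\t-\tCAGL0D00176g\t-\t-",
);
my @fasta = (
    ">ref|NC_006027.1|:c17914-17042 hypothetical protein [Candida glabrata CBS 138]",
    "ATGGAAACAGAACATCAGGC",
    ">ref|NC_006027.1|:23693-25075 hypothetical protein [Candida glabrata CBS 138]",
    "ATGTCTTCTCAAGTTAACGA",
);
print relabel(\@table, index_fasta(\@fasta));
```

For real files you would read the lines from two filehandles instead of the inline arrays and print the result to a third. Rows in File 1 with no matching record in File 2 are silently skipped here; printing a warning for them may be more useful in practice.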