Question

Scripting problem: replace strings in one file using matches from another file

10.2 years ago

rob234king ▴ 610

I have a map file and a GFF file. For each line in the map file, I want to find every match of its first column in the GFF file and replace it with the second column from the map file. I've tried the map ID script from MAKER2, but it doesn't work beyond the mRNA feature: the Parent field of the CDS lines is not changed, so it's not much use to me.

GFF file (tab-delimited):

Chromosome_1   Geneious   mRNA   421    2049   .   -   .   ID=12;Parent=12-gene
Chromosome_1   Geneious   gene   421    2049   .   -   .   ID=12-gene
Chromosome_1   Geneious   CDS    421    2049   .   -   .   ID=12:cds;Parent=12
Chromosome_1   Geneious   CDS    3747   6569   .   -   .   ID=rmRNA-1:cds;Parent=rmRNA-1
Chromosome_1   maker      mRNA   3747   6569   .   -   .   ID=rmRNA-1;Parent=gene1
Chromosome_1   maker      gene   3747   6569   .   -   .   ID=gene1

Map file (tab-delimited):

12         B1
rmRNA-1    B2

output desired:

Chromosome_1   Geneious   mRNA   421    2049   .   -   .   ID=B1;Parent=B1-gene
Chromosome_1   Geneious   gene   421    2049   .   -   .   ID=B1-gene
Chromosome_1   Geneious   CDS    421    2049   .   -   .   ID=B1:cds;Parent=B1
Chromosome_1   Geneious   CDS    3747   6569   .   -   .   ID=B2:cds;Parent=B2
Chromosome_1   maker      mRNA   3747   6569   .   -   .   ID=B2;Parent=gene1
Chromosome_1   maker      gene   3747   6569   .   -   .   ID=gene1

gff • 3.5k views

updated 2.6 years ago by Ram 45k • written 10.2 years ago by rob234king ▴ 610

Thanks to both responders; the answers accomplish exactly what I was hoping for. I had spent too much time contemplating doing it by hand, which would have been insane, and I was getting quite frustrated in the end, when it should have been as simple as the answers given below.

10.2 years ago by rob234king ▴ 610

Ram · Accepted Answer · 2015-05-20

If you are in a Linux/Unix environment, you could use sed like this:

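# read each old/new ID pair from the map file
while read -r old new
do
  # replace every occurrence of $old anywhere in the file
  sed "s/$old/$new/g" tmp.gff > tmp2.gff
  mv tmp2.gff tmp.gff
done < tmp.map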

Make sure you keep a copy of your original GFF file, because this code overwrites it. $old is the value in the first column of the map file and $new is the value in the second. The loop goes through each line of the map file in turn and performs the replacement. However, this is naive: it will replace, for example, every 12 in the file with B1, including any that occur inside coordinates. Below I've added some pre- and post-processing with awk to avoid this and do the replacement only in column 9. You can edit it to include other columns if you need to:

# print column 9 to a file by itself
awk -F'\t' '{print $9}' tmp.gff > col9.txt

# print all other columns, tab-separated, to a separate file
awk -F'\t' -v OFS='\t' '{print $1,$2,$3,$4,$5,$6,$7,$8}' tmp.gff > other.txt

# run the replacement on column 9 only
while read -r old new
do
  sed "s/$old/$new/g" col9.txt > tmp.txt
  mv tmp.txt col9.txt
done < tmp.map

# paste the files back together (paste joins with a tab by default)
paste other.txt col9.txt > final.gff
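
Since the concern above is that a global replacement can silently rewrite coordinates, a quick sanity check on the result may be worthwhile. The following is only a sketch in Python: the function name `lines_changed_outside_column9` and the sample record (with a start coordinate of 1200) are made up for illustration, and it assumes both files have the same number of tab-delimited lines in the same order.

```python
def lines_changed_outside_column9(original_lines, renamed_lines):
    """Return 1-based numbers of lines whose columns 1-8 differ."""
    changed = []
    # zip stops at the shorter input, so equal line counts are assumed
    pairs = zip(original_lines, renamed_lines)
    for number, (before, after) in enumerate(pairs, start=1):
        before_cols = before.rstrip("\n").split("\t")[:8]
        after_cols = after.rstrip("\n").split("\t")[:8]
        if before_cols != after_cols:
            changed.append(number)
    return changed


# Made-up record whose start coordinate happens to contain "12"
original = ["Chromosome_1\tGeneious\tmRNA\t1200\t2049\t.\t-\t.\tID=12;Parent=12-gene\n"]

# Naive replacement over the whole line also rewrites the coordinate
naive = [line.replace("12", "B1") for line in original]

# Replacement restricted to the last (attributes) column
safe = []
for line in original:
    cols = line.rstrip("\n").split("\t")
    cols[-1] = cols[-1].replace("12", "B1")
    safe.append("\t".join(cols) + "\n")

print(lines_changed_outside_column9(original, naive))  # [1]
print(lines_changed_outside_column9(original, safe))   # []
```

For real files, the two lists could be filled with `open(...).readlines()` on the backup copy and on final.gff; an empty result suggests that only column 9 was touched.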

Ram · Accepted Answer · 2015-05-21

This is a little Python script that will do the job.

gff_file = 'gff.txt'
map_file = 'map.txt'
out_file = 'output.txt'

# Read the map file into a dict: old ID -> new ID
my_map = dict()
with open(map_file, 'r') as f:
    for line in f:
        data = line.rstrip('\n').split('\t')
        if len(data) < 2:
            continue  # skip blank or malformed lines
        my_map[data[0]] = data[1]

with open(gff_file, 'r') as f, open(out_file, 'w') as out:
    for line in f:
        data = line.rstrip('\n').split('\t')

        # Only the last column (the attributes) is rewritten
        for key, value in my_map.items():
            if key in data[-1]:
                data[-1] = data[-1].replace(key, value)

        # Join with tabs so no stray tab is left at the end of the line
        out.write('\t'.join(data) + '\n')
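
One caveat with plain substring replacement is that a short key such as 12 would also match inside a longer ID such as 112 or rmRNA-12, and replacing keys one after another can rename an ID twice if a new name equals another old name. A possible way around both is a single regular-expression pass, sketched below. The helper `rename_attributes` is hypothetical, and the rule it uses (an ID comes right after "=" and is followed by ":", "-", ";" or the end of the string) is an assumption based on the sample data in the question, not a general GFF rule.

```python
import re


def rename_attributes(attributes, id_map):
    """Rename whole IDs in a GFF attribute string in a single pass."""
    if not id_map:
        return attributes
    # Longest keys first, so a longer ID is preferred when one key
    # is a prefix of another.
    keys = sorted(id_map, key=len, reverse=True)
    alternation = "|".join(re.escape(key) for key in keys)
    # Assumption: an ID directly follows '=' and is followed by
    # ':', '-', ';' or the end of the string.
    pattern = re.compile(r"(?<==)(" + alternation + r")(?=[:;-]|$)")
    # Each match is looked up once, so replacements are never chained.
    return pattern.sub(lambda match: id_map[match.group(1)], attributes)


id_map = {"12": "B1", "rmRNA-1": "B2"}
print(rename_attributes("ID=12;Parent=12-gene", id_map))
print(rename_attributes("ID=rmRNA-1:cds;Parent=rmRNA-1", id_map))
print(rename_attributes("ID=112;Parent=rmRNA-12", id_map))
```

With the sample map, the first two strings come out as ID=B1;Parent=B1-gene and ID=B2:cds;Parent=B2, while the third is left unchanged. Note that any attribute value matching a key is renamed, not only ID and Parent, so the pattern may need tightening for files with other attributes.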