Question

How to match genes of one file to another but in multiple columns

0

Entering edit mode

6.8 years ago

Bweil2 ▴ 10

I want to see which genes in archloro.txt match both lines of ArabidopsisParalogue.txt. I tried grep -f but that would just find the genes in both files have in common and not 2 genes that are in the same line. Any ideas on how to do this?

Basically i want to see how many and which rows in ArarbidopsisParalogue.txt match with 2 genes from archloro.txt.

head archloro.txt

AT1G01080
AT1G01090
AT1G01100
AT1G01110
AT1G01250
AT1G01550
AT1G01690
AT1G01730
AT1G01860
AT1G01950
AT1G02060
AT1G21400
AT1G24180
AT1G59900
AT1G17480
AT5G09300
AT5G34780
AT5G50250

head ArabidopsisParalogue.txt

AT1G01080       AT5G50250
AT1G01090       AT1G21400
AT1G01090       AT1G24180
AT1G01090       AT1G59900
AT1G01090       AT5G09300
AT1G01090       AT5G34780
AT1G01100       AT4G00810
AT1G01100       AT5G24510
AT1G01100       AT5G47700
AT1G01110       AT1G17480

Expected output

AT1G01080      AT5G50250
AT1G01090      AT1G21400
AT1G01090      AT1G24180
AT1G01090      AT1G59900
AT1G01090      AT5G09300
AT1G01090      AT5G34780
AT1G01100      AT1G17480

genome sequence gene alignment • 4.5k views

ADD COMMENT • link 6.8 years ago by Bweil2 ▴ 10

2

Entering edit mode

It's not clear to me from your question and example data how a gene ID from archloro.txt can match both IDs in one row of ArabidopsisParalogue.txt, when the IDs in the latter file are not identical. Could you clarify the question with some expected output?

ADD REPLY • link 6.8 years ago by Neilfws 49k

0

Entering edit mode

due to formatting issues, I could include head ArabidopsisParalogue.txt. There are two columns of genes.

ADD REPLY • link 6.8 years ago by Bweil2 ▴ 10

0

Entering edit mode

These commands are not working for me!

ADD REPLY • link 6.8 years ago by Bweil2 ▴ 10

1

Entering edit mode

Unless you show data from both files, one can't give any suggestions.

BTW, please move this answer to comment as it is not an answer.

ADD REPLY • link 6.8 years ago by venu 7.1k

0

Entering edit mode

The last line of your expected output is incorrect because AT1G17480 is not present in archloro.txt (in the data that you've shown)

ADD REPLY • link 6.8 years ago by Kevin Blighe 88k

0

Entering edit mode

and AT5G34780 is present twice in first file.

ADD REPLY • link 6.8 years ago by cpad0112 21k

0

Entering edit mode

Yep (and good to see you again).

Your solution also works, it seems. bweil2, if you could devote a few moments more to looking at the other 2 new answers to see if they work (and Accept them also, if so), then that would be great.

ADD REPLY • link 6.8 years ago by Kevin Blighe 88k

1

Entering edit mode

thanks and you seem to be in good spirits always @kevin

ADD REPLY • link 6.8 years ago by cpad0112 21k

1

Entering edit mode

6.8 years ago

venu 7.1k

I tried grep -f but that would just find the genes in both files have in common and not 2 genes

 after grep command | awk -F'\t' '$1==$2'

above works if ArabidopsisParalogue.txt is tab delimited file. If not change -F'\t'.

ADD COMMENT • link 6.8 years ago by venu 7.1k

0

Entering edit mode

Yes it is tab delimited but its still not working. This is the input

grep -f Achloroplast.txt ArabidopsisParalogues.txt |  awk -F'\t' '$1==$2' >> MatchingParalogues.txt

ADD REPLY • link 6.8 years ago by Bweil2 ▴ 10

0

Entering edit mode

When you tried/implemented some answer from any forum, make sure you are updating with the result, i.e whether the solution worked or not?

P.S: It didn't work if I'm correct. Try this

cat Achloroplast.txt | cut -d "|" -f1 | sort -u | grep -Fwf - ArabidopsisParalogues.txt | awk -F'\t' '$1==$2'

ADD REPLY • link 6.8 years ago by venu 7.1k

0

Entering edit mode

Did not work

Now it says grep: -: No such file or directory

So, I took out the "-" before ArabidopsisParalogues.txt

Getting no values with that

ADD REPLY • link 6.8 years ago by Bweil2 ▴ 10

1

Entering edit mode

6.8 years ago

robertobfisher ▴ 30

You could write a short script in a language like Python. Pseudocode:

f1 = archoloro.txt
f2 = ArabidopsisParalogue.txt

s = construct set from f1
t = empty set

for each line in f2
    g1, g2 = split line at "\t"

    if g1 and g2 in s
        add g1, g2 to s

If your file is large, there's two optimizations you can do:

Specifically use a HashSet or similar structure that permits low-cost test of membership
Before checking if g1 and g2 are in s, check whether they are both already in t (which should be much smaller) first

ADD COMMENT • link 6.8 years ago by robertobfisher ▴ 30

0

Entering edit mode

I've only been using terminal on my mac. I do have R but I never used python.

ADD REPLY • link 6.8 years ago by Bweil2 ▴ 10

0

Entering edit mode

Unfortunately I don't know many terminal commands so I wrote pseudocode instead. Hopefully you can convert it to a language you're familiar with.

ADD REPLY • link 6.7 years ago by robertobfisher ▴ 30

0

Entering edit mode

I've figured out the exact code for all of them except for "add g1, g2 to s." Is this just adding two files to another? Any ideas on the exact command?

ADD REPLY • link 6.8 years ago by Bweil2 ▴ 10

0

Entering edit mode

s here should be a data structure that resembles a mathematical set (all elements unique). In Python, you can do s = set() to create an empty set and add elements with s.add(g1) and s.add(g2).

ADD REPLY • link 6.7 years ago by robertobfisher ▴ 30

1

Entering edit mode

6.8 years ago

igor 13k

I think you are looking for the join command. You can do something like: join archloro.txt ArabidopsisParalogue.txt.

More info and examples here: http://www.theunixschool.com/2012/01/join-command.html

ADD COMMENT • link 6.8 years ago by igor 13k

score 4 · Accepted Answer · 2018-03-28

4

Entering edit mode

6.8 years ago

Damian Kao 16k

You can use this python script:

match = set(open('archloro.txt').read().strip().split('\n'))
for line in open('ArabidopsisParalogue.txt'):
    a,b = line.strip().split('\t')
    if a in match and b in match:
        print line.strip()

Save as match.py run by python match.py

ADD COMMENT • link 6.8 years ago by Damian Kao 16k

1

Entering edit mode

For the record following python code works fine as well. I think @Damian forgot to add brackets in the print line.

match = set(open('archloro.txt').read().strip().split('\n'))
for line in open('ArabidopsisParalogue.txt'):
    a,b = line.strip().split('\t')
    if a in match and b in match:
        print(line.strip())

ADD REPLY • link 6.8 years ago by GenoMax 148k

0

Entering edit mode

Damian wrote Python2.

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

0

Entering edit mode

I guess there are people in the field that learned python from v3 onwards. I feel like a dinosaur now.

ADD REPLY • link 6.8 years ago by Damian Kao 16k

0

Entering edit mode

Yeah, I started out on 2.7 but realised quickly enough that I should write Python3 so now I'm used to that.

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

Ram · Accepted Answer · 2018-03-28

3

Entering edit mode

6.8 years ago

Kevin Blighe 88k

~~I think that this may be it:~~

awk 'FNR==NR {idx[$1]=$1; next} {print idx[$(1)]"\t"$2}' FS=' ' archloro.txt FS=' ' ArabidopsisParalogue.txt AT1G01080 AT5G50250 AT1G01090 AT1G21400 AT1G01090 AT1G24180 AT1G01090 AT1G59900 AT1G01090 AT5G09300 AT1G01090 AT5G34780 AT1G01100 AT4G00810 AT1G01100 AT5G24510 AT1G01100 AT5G47700 AT1G01110 AT1G17480

ADD COMMENT • link updated 6.8 years ago by Ram 44k • written 6.8 years ago by Kevin Blighe 88k

1

Entering edit mode

These are lines from ArabidopsisParalogue.txt in which both genes are also present in ararchloro.txt:

awk 'FNR==NR {idx[$1]=$1; next} {if (idx[$1] && idx[$2]) print}' FS=' ' archloro.txt FS=' ' ArabidopsisParalogue.txt

AT1G01080       AT5G50250
AT1G01090       AT1G21400
AT1G01090       AT1G24180
AT1G01090       AT1G59900
AT1G01090       AT5G09300
AT1G01090       AT5G34780

ADD REPLY • link 6.8 years ago by Kevin Blighe 88k

0

Entering edit mode

Upon further review, I am getting the same number of a particular gene from one column in the other. For example, AT1G01090 shows up 5 times in the left column. But further down the right column it shows up 5 times matching with the same genes just inverted.

AT1G01080       AT5G50250
AT1G01090       AT1G21400
AT1G01090       AT1G24180
AT1G01090       AT1G59900
AT1G01090       AT5G09300
AT1G01090       AT5G34780

Further down, (not sorted)

AT5G50250       AT1G01080
AT1G21400       AT1G01090
AT1G24180       AT1G01090
AT1G59900       AT1G01090
AT5G09300       AT1G01090
AT5G34780       AT1G01090

We know all these genes are targeted to the chloroplast from your script but we're getting some duplicates. I know I can just cut these into two files and than cat them together to get the unique number of genes to find the line count. But I need a way to align them back up the original way. In other words, I need a code that will take out the reciprocal genes matches.

ADD REPLY • link 6.7 years ago by Bweil2 ▴ 10

0

Entering edit mode

I'd switch to R if I were you.

ADD REPLY • link 6.7 years ago by Ram 44k

0

Entering edit mode

You may want to try this, but please rigorously check this:

awk 'FNR==NR {idx[$1]=$1; next} {if (idx[$1] && idx[$2]) {print}}' FS=' ' archloro.txt FS=' ' ArabidopsisParalogue.txt | awk '{print $0"\t"$1$2"\t"$2$1}' | awk '{if (!duplicate[$3]++ && !duplicate[$4]++) {print $1"\t"$2}}'

This works by creating a unique key (for each record) that is based on Col1+Col2 and also Col2+Col1

ADD REPLY • link 6.7 years ago by Kevin Blighe 88k

score 2 · Accepted Answer · 2018-03-28

2

Entering edit mode

6.8 years ago

cpad0112 21k

$ sort -k2 ArabidopsisParalogue.txt | join -t $'\t' -1 2 -2 1 - archloro.txt | sort -k2| join -t $'\t'  -1 1 -2 2  archloro.txt -| uniq
AT1G01080   AT5G50250
AT1G01090   AT1G21400
AT1G01090   AT1G24180
AT1G01090   AT1G59900
AT1G01090   AT5G09300
AT1G01090   AT5G34780

ADD COMMENT • link 6.8 years ago by cpad0112 21k