How to match genes of one file to another but in multiple columns
6
0
Entering edit mode
6.8 years ago
Bweil2 ▴ 10

I want to see which genes in archloro.txt match both lines of ArabidopsisParalogue.txt. I tried grep -f but that would just find the genes in both files have in common and not 2 genes that are in the same line. Any ideas on how to do this?

Basically i want to see how many and which rows in ArarbidopsisParalogue.txt match with 2 genes from archloro.txt.

head archloro.txt

AT1G01080
AT1G01090
AT1G01100
AT1G01110
AT1G01250
AT1G01550
AT1G01690
AT1G01730
AT1G01860
AT1G01950
AT1G02060
AT1G21400
AT1G24180
AT1G59900
AT1G17480
AT5G09300
AT5G34780
AT5G50250

head ArabidopsisParalogue.txt

AT1G01080       AT5G50250
AT1G01090       AT1G21400
AT1G01090       AT1G24180
AT1G01090       AT1G59900
AT1G01090       AT5G09300
AT1G01090       AT5G34780
AT1G01100       AT4G00810
AT1G01100       AT5G24510
AT1G01100       AT5G47700
AT1G01110       AT1G17480

Expected output

AT1G01080      AT5G50250
AT1G01090      AT1G21400
AT1G01090      AT1G24180
AT1G01090      AT1G59900
AT1G01090      AT5G09300
AT1G01090      AT5G34780
AT1G01100      AT1G17480
genome sequence gene alignment • 4.5k views
ADD COMMENT
2
Entering edit mode

It's not clear to me from your question and example data how a gene ID from archloro.txt can match both IDs in one row of ArabidopsisParalogue.txt, when the IDs in the latter file are not identical. Could you clarify the question with some expected output?

ADD REPLY
0
Entering edit mode

due to formatting issues, I could include head ArabidopsisParalogue.txt. There are two columns of genes.

ADD REPLY
0
Entering edit mode

These commands are not working for me!

ADD REPLY
1
Entering edit mode

Unless you show data from both files, one can't give any suggestions.

BTW, please move this answer to comment as it is not an answer.

ADD REPLY
0
Entering edit mode

The last line of your expected output is incorrect because AT1G17480 is not present in archloro.txt (in the data that you've shown)

ADD REPLY
0
Entering edit mode

and AT5G34780 is present twice in first file.

ADD REPLY
0
Entering edit mode

Yep (and good to see you again).

Your solution also works, it seems. bweil2, if you could devote a few moments more to looking at the other 2 new answers to see if they work (and Accept them also, if so), then that would be great.

ADD REPLY
1
Entering edit mode

thanks and you seem to be in good spirits always @kevin

ADD REPLY
4
Entering edit mode
6.8 years ago

You can use this python script:

match = set(open('archloro.txt').read().strip().split('\n'))
for line in open('ArabidopsisParalogue.txt'):
    a,b = line.strip().split('\t')
    if a in match and b in match:
        print line.strip()

Save as match.py run by python match.py

ADD COMMENT
1
Entering edit mode

For the record following python code works fine as well. I think @Damian forgot to add brackets in the print line.

match = set(open('archloro.txt').read().strip().split('\n'))
for line in open('ArabidopsisParalogue.txt'):
    a,b = line.strip().split('\t')
    if a in match and b in match:
        print(line.strip())
ADD REPLY
0
Entering edit mode

Damian wrote Python2.

ADD REPLY
0
Entering edit mode

I guess there are people in the field that learned python from v3 onwards. I feel like a dinosaur now.

ADD REPLY
0
Entering edit mode

Yeah, I started out on 2.7 but realised quickly enough that I should write Python3 so now I'm used to that.

ADD REPLY
3
Entering edit mode
6.8 years ago

I think that this may be it:

awk 'FNR==NR {idx[$1]=$1; next} {print idx[$(1)]"\t"$2}' FS=' ' archloro.txt FS=' ' ArabidopsisParalogue.txt
AT1G01080   AT5G50250
AT1G01090   AT1G21400
AT1G01090   AT1G24180
AT1G01090   AT1G59900
AT1G01090   AT5G09300
AT1G01090   AT5G34780
AT1G01100   AT4G00810
AT1G01100   AT5G24510
AT1G01100   AT5G47700
AT1G01110   AT1G17480

ADD COMMENT
1
Entering edit mode

These are lines from ArabidopsisParalogue.txt in which both genes are also present in ararchloro.txt:

awk 'FNR==NR {idx[$1]=$1; next} {if (idx[$1] && idx[$2]) print}' FS=' ' archloro.txt FS=' ' ArabidopsisParalogue.txt

AT1G01080       AT5G50250
AT1G01090       AT1G21400
AT1G01090       AT1G24180
AT1G01090       AT1G59900
AT1G01090       AT5G09300
AT1G01090       AT5G34780
ADD REPLY
0
Entering edit mode

Upon further review, I am getting the same number of a particular gene from one column in the other. For example, AT1G01090 shows up 5 times in the left column. But further down the right column it shows up 5 times matching with the same genes just inverted.

AT1G01080       AT5G50250
AT1G01090       AT1G21400
AT1G01090       AT1G24180
AT1G01090       AT1G59900
AT1G01090       AT5G09300
AT1G01090       AT5G34780

Further down, (not sorted)

AT5G50250       AT1G01080
AT1G21400       AT1G01090
AT1G24180       AT1G01090
AT1G59900       AT1G01090
AT5G09300       AT1G01090
AT5G34780       AT1G01090

We know all these genes are targeted to the chloroplast from your script but we're getting some duplicates. I know I can just cut these into two files and than cat them together to get the unique number of genes to find the line count. But I need a way to align them back up the original way. In other words, I need a code that will take out the reciprocal genes matches.

ADD REPLY
0
Entering edit mode

I'd switch to R if I were you.

ADD REPLY
0
Entering edit mode

You may want to try this, but please rigorously check this:

awk 'FNR==NR {idx[$1]=$1; next} {if (idx[$1] && idx[$2]) {print}}' FS=' ' archloro.txt FS=' ' ArabidopsisParalogue.txt | awk '{print $0"\t"$1$2"\t"$2$1}' | awk '{if (!duplicate[$3]++ && !duplicate[$4]++) {print $1"\t"$2}}'

This works by creating a unique key (for each record) that is based on Col1+Col2 and also Col2+Col1

ADD REPLY
2
Entering edit mode
6.8 years ago
$ sort -k2 ArabidopsisParalogue.txt | join -t $'\t' -1 2 -2 1 - archloro.txt | sort -k2| join -t $'\t'  -1 1 -2 2  archloro.txt -| uniq
AT1G01080   AT5G50250
AT1G01090   AT1G21400
AT1G01090   AT1G24180
AT1G01090   AT1G59900
AT1G01090   AT5G09300
AT1G01090   AT5G34780
ADD COMMENT
1
Entering edit mode
6.8 years ago
venu 7.1k

I tried grep -f but that would just find the genes in both files have in common and not 2 genes

 after grep command | awk -F'\t' '$1==$2'

above works if ArabidopsisParalogue.txt is tab delimited file. If not change -F'\t'.

ADD COMMENT
0
Entering edit mode

Yes it is tab delimited but its still not working. This is the input

grep -f Achloroplast.txt ArabidopsisParalogues.txt |  awk -F'\t' '$1==$2' >> MatchingParalogues.txt
ADD REPLY
0
Entering edit mode

When you tried/implemented some answer from any forum, make sure you are updating with the result, i.e whether the solution worked or not?

P.S: It didn't work if I'm correct. Try this

cat Achloroplast.txt | cut -d "|" -f1 | sort -u | grep -Fwf - ArabidopsisParalogues.txt | awk -F'\t' '$1==$2'
ADD REPLY
0
Entering edit mode

Did not work

Now it says grep: -: No such file or directory

So, I took out the "-" before ArabidopsisParalogues.txt

Getting no values with that

ADD REPLY
1
Entering edit mode
6.8 years ago

You could write a short script in a language like Python. Pseudocode:

f1 = archoloro.txt
f2 = ArabidopsisParalogue.txt

s = construct set from f1
t = empty set

for each line in f2
    g1, g2 = split line at "\t"

    if g1 and g2 in s
        add g1, g2 to s

If your file is large, there's two optimizations you can do:

  • Specifically use a HashSet or similar structure that permits low-cost test of membership
  • Before checking if g1 and g2 are in s, check whether they are both already in t (which should be much smaller) first
ADD COMMENT
0
Entering edit mode

I've only been using terminal on my mac. I do have R but I never used python.

ADD REPLY
0
Entering edit mode

Unfortunately I don't know many terminal commands so I wrote pseudocode instead. Hopefully you can convert it to a language you're familiar with.

ADD REPLY
0
Entering edit mode

I've figured out the exact code for all of them except for "add g1, g2 to s." Is this just adding two files to another? Any ideas on the exact command?

ADD REPLY
0
Entering edit mode

s here should be a data structure that resembles a mathematical set (all elements unique). In Python, you can do s = set() to create an empty set and add elements with s.add(g1) and s.add(g2).

ADD REPLY
1
Entering edit mode
6.8 years ago
igor 13k

I think you are looking for the join command. You can do something like: join archloro.txt ArabidopsisParalogue.txt.

More info and examples here: http://www.theunixschool.com/2012/01/join-command.html

ADD COMMENT

Login before adding your answer.

Traffic: 2210 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6