Extract rows from file two having the 4th column ( which contain /1 /2 signs ) identical to column 7 from file 1 ( which doesn't have /1 /2 signs )
2
0
Entering edit mode
6.1 years ago
i.jabre26 ▴ 10

File1

 chr  start1 end1 start2 end2 score Tag name
1     1000  2000  3000   4000   0  K00230:40:HNWJLBBXX:4:1101:1002:24155 
1     8000  9000  10000  11000  0  K00230:40:HNWJLBBXX:4:1101:1002:35936

File2

chr  start1 end1 score  Tag name 
1     8000  9000   0    K00230:40:HNWJLBBXX:4:1101:1002:35936/1 
1     10000 11000  0    K00230:40:HNWJLBBXX:4:1101:1002:35936/2
1     1000  2000   0    K00230:40:HNWJLBBXX:4:1101:1002:35900/1 
1     3000  4000   0    K00230:40:HNWJLBBXX:4:1101:1002:35900/2

expected output

  chr  start1 end1 score  Tag name  
  1    8000  9000   0    K00230:40:HNWJLBBXX:4:1101:1002:35936/1 
  1    10000 11000  0    K00230:40:HNWJLBBXX:4:1101:1002:35936/2

May I please have some help to get the desired output

Thanks

python • 1.5k views
ADD COMMENT
0
Entering edit mode

I already worked that out :)

ADD REPLY
2
Entering edit mode
6.1 years ago
shawn.w.foley ★ 1.3k

You can do this pretty easily in python, you'll read through File 1 and save the 7th column in a list. Then read through File 2, delimit the 5th column on the slash ("/") and if the element before the slash is in your list then print the line. Example code would be:

inFile1 = open('file1','r')
inFile2 = open('file2','r')
printList = []

for line in inFile1:
    lineList = line.rstrip().split('\t') #Generate a list split on the tab
    printList.append(lineList[6])  #Append your tag to printList

header = inFile2.readline()  #Remove header
print(header.rstrip())

for line in inFile2:
    lineList = line.rstrip().split('\t') #Generate a list split on the tab
    (tag,junk) = lineList[4].split('/') #Split your tag on the slash
    if tag in printList:
        print(line.rstrip())

Personally, I'd recommend doing this in awk since you can get through it with just one line of code:

awk 'BEGIN{OFS="\t";FS="\t"} NR==FNR{tagArray[$7];next} NR>FNR{split($5,ARRAY,"/"); if (ARRAY[1] in tagArray) print $0}' file1 file2

This uses tabs as delimiters (BEGIN{OFS="\t";FS="\t}) then it reads in the first file and saves the 7th column in an array called tagArray (NR==FNR{tagArray[$7];next}). After it starts reading in the second file, it splits the 4th column on the slash, and if the first element of the 5th column is in tagArray is prints the line (NR>FNR{split($4,ARRAY,"/"); if (ARRAY[1] in tagArray) print $0}).

ADD COMMENT
0
Entering edit mode

Thank you for your codes. I have tried them by both are giving me an error. Did you try them on the test example I posted. Maybe the problem is from my files...

ADD REPLY
0
Entering edit mode

Apologies, I edited the above code. The tag name is in column 5 of the file2, the original code assumed it was in column 4. I also added a line in the python script to remove the header from file2.

ADD REPLY
0
Entering edit mode

Thank you for editing the codes :).. I'm still getting an error tho. Please see below the error I'm getting for the python code

Traceback (most recent call last): File "iNPS_tags/Trial.py", line 8, in <module> printList.append(lineList[6]) #Append your tag to printList IndexError: list index out of range

Concerning the awk code it is printing the following where I'm only intrested in printing only the tag ending with 35936 since they exist in both files 1 and 2. AS you can see tags ending with 35900 are not needed since they don;t exist in both files.. Thank you so much in advance !

chr  start1 end1 score  tag 
1     8000  9000   0    K00230:40:HNWJLBBXX:4:1101:1002:35936/1 
1     10000 11000  0    K00230:40:HNWJLBBXX:4:1101:1002:35936/2
1     1000  2000   0    K00230:40:HNWJLBBXX:4:1101:1002:35900/1 
1     3000  4000   0    K00230:40:HNWJLBBXX:4:1101:1002:35900/2
ADD REPLY
0
Entering edit mode

The error from the python code, printList.append(lineList[6]) #Append your tag to printList IndexError: list index out of range indicates that the script reaches a line where there are not seven columns in you file. Is this a typo in the file?

And I'm not sure why you're having an issue with the awk code, using the example you provided I see the expected output:

awk 'BEGIN{OFS="\t";FS="\t"} NR==FNR{tagArray[$7];next} NR>FNR{split($5,ARRAY,"/"); if (ARRAY[1] in tagArray) print $0}' file1 file2
1   8000    9000    0   K00230:40:HNWJLBBXX:4:1101:1002:35936/1 
1   10000   11000   0   K00230:40:HNWJLBBXX:4:1101:1002:35936/2

Am I misunderstanding your question or are you getting a different result?

ADD REPLY
0
Entering edit mode

Thank you so much for your reply. No you didn't misunderstand my question at all. This is the desired output. you were very helpful. I guess I'm doing something wrong because I'm getting different result using the awk code

Thank you so much !

ADD REPLY
1
Entering edit mode

The code worked perfectly fine ! thanks

ADD REPLY
0
Entering edit mode

Hello again, I was wondering if for the python code or for the awk code if we can specify the core number., Many thanks :)

ADD REPLY
0
Entering edit mode
6.1 years ago

output (add header to the output):

$ tail -n+2 test1.txt  | cut -f7 | grep -f - test2.txt 

1   8000    9000    0   K00230:40:HNWJLBBXX:4:1101:1002:35936/1 
1   10000   11000   0   K00230:40:HNWJLBBXX:4:1101:1002:35936/2

input;

$ tail -n+1 *.txt
==> test1.txt <==
chr start1  end1    start2  end2    score   Tag name
1   1000    2000    3000    4000    0   K00230:40:HNWJLBBXX:4:1101:1002:24155   
1   8000    9000    10000   11000   0   K00230:40:HNWJLBBXX:4:1101:1002:35936

==> test2.txt <==
chr start1  end1    score   Tag name    
1   8000    9000    0   K00230:40:HNWJLBBXX:4:1101:1002:35936/1 
1   10000   11000   0   K00230:40:HNWJLBBXX:4:1101:1002:35936/2
1   1000    2000    0   K00230:40:HNWJLBBXX:4:1101:1002:35900/1 
1   3000    4000    0   K00230:40:HNWJLBBXX:4:1101:1002:35900/2
ADD COMMENT

Login before adding your answer.

Traffic: 1637 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6