Compare specific parts of two columns in a text file in Linux
5.0 years ago
bobbyle0210 ▴ 10

I have a text file with several columns separated by tab character as below:

1    ATGCCCAGA  AS:i:10   XS:i:10  
2    ATGCTTGA   AS:i:10   XS:i:5  
3    ATGGGGGA   AS:i:10   XS:i:1  
4    ATCCCCGA   AS:i:20   XS:i:20

I now want to compare the last two columns AS:i:(n1) and XS:i:(n2) to obtain only lines with n1 different to n2. So, my desired output would be:

2    ATGCTTGA   AS:i:10   XS:i:5  
3    ATGGGGGA   AS:i:10   XS:i:1

Could you suggest some ways to compare n1 and n2 and print the matching lines? Thanks in advance.

linux samfile alignment aligning_score • 2.3k views

AS:i:(n1) and XS:i:(n2) to obtain only lines with identical n1 and n2.

In your "desired output", n1 is not identical to n2.

Anyway, you're looking for awk. Look at https://www.unixtutorial.org/awk-delimiter and https://www.unix.com/shell-programming-and-scripting/274247-how-compare-two-column-using-awk.html

5.0 years ago
tim.booth ▴ 110

Hi,

The last answer from @cpad0112 looks right to me, in that it directly answers your question and extracts and compares n1 and n2 from the third and fourth columns.

However, I note that the example file looks rather like a simplified version of SAM data, and you did ask for alternative approaches. If your original source of data really is a SAM/BAM file, then a more robust approach is to use htslib to parse the whole file. In Python, the pysam library gives access to htslib, as documented here:

https://pysam.readthedocs.io/en/latest/api.html

In the fourth example on that page, under the heading "You can also write to a AlignmentFile" there is a prototypical filter script. In the example the test is on read.is_paired but you could instead test on read.get_tag('AS') != read.get_tag('XS'). Other command-line tools like 'bamtools' and 'samtools' have various filter options but I'm not aware of any that can compare two tags.
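As a rough illustration of the filter described above, here is a minimal sketch. The function names `scores_differ` and `filter_bam` and the file paths are hypothetical, and the choice to skip reads that lack either tag is an assumption (`get_tag` raises `KeyError` on a missing tag, so some policy is needed). It only uses `pysam.AlignmentFile`, `has_tag`, `get_tag` and `write`.

```python
def scores_differ(read):
    """Return True when the read has both AS and XS tags and their values differ."""
    # Assumption: reads missing either tag are skipped; flip this if you want them kept.
    if not (read.has_tag("AS") and read.has_tag("XS")):
        return False
    return read.get_tag("AS") != read.get_tag("XS")


def filter_bam(in_path, out_path):
    """Copy reads whose AS and XS differ from in_path to out_path (both BAM)."""
    import pysam  # third-party; imported here so scores_differ works without it

    with pysam.AlignmentFile(in_path, "rb") as infile, \
         pysam.AlignmentFile(out_path, "wb", template=infile) as outfile:
        for read in infile:
            if scores_differ(read):
                outfile.write(read)
```

A call such as `filter_bam("input.bam", "filtered.bam")` would then write the filtered reads; both file names are placeholders.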


This is probably the way to go if it is an alignment file. @bobbyle0210, please use a dedicated tool for such operations.

5.0 years ago

Code that works with the example data (it is not a general comparison of n1 and n2): @bobbyle0210

$ awk 'a[$3]++' file.txt 
2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1

A variation of this would be:

$ cat file.txt 
1   ATGCCCAGA   AS:i:10 XS:i:10 
2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1  
4   ATCCCCGA    AS:i:20 XS:i:20
5   CTGATCGAT   AS:i:10 XS:i:10

$ awk '!a[$4]++ && a[$3]++' file.txt 
2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1

Hi, thank you for your help. Could you explain in detail how the command works? I am not very familiar with Linux commands. Thank you :)


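The first command prints every row whose column 3 value has already appeared on an earlier line.

The second command prints every row whose column 4 value appears for the first time and whose column 3 value has already been counted on an earlier line (because && short-circuits, column 3 is only counted on rows where column 4 is new).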
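To make the `a[$3]++` idiom concrete, here is a small Python sketch of what the first command does. The helper name `awk_seen_before` and the two sample rows are made up for illustration. It shows that the idiom tracks repeated column 3 values and never compares n1 with n2.

```python
from collections import defaultdict


def awk_seen_before(lines, column=3):
    """Mimic awk 'a[$N]++': keep a line only if its Nth field was seen earlier."""
    counts = defaultdict(int)
    kept = []
    for line in lines:
        fields = line.split()  # same idea as awk's default whitespace splitting
        key = fields[column - 1] if len(fields) >= column else ""
        if counts[key] > 0:    # post-increment: test the old count first...
            kept.append(line)
        counts[key] += 1       # ...then add one
    return kept


# Hypothetical rows where n1 equals n2 on both lines.
rows = [
    "1 AAA AS:i:10 XS:i:10",
    "2 CCC AS:i:10 XS:i:10",
]
print(awk_seen_before(rows))
```

The second row is kept even though its AS and XS values are equal, which is why this approach only suits data shaped like the example file.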

If the prefixes AS:i: and XS:i: are fixed, you can use the following, which prints rows where the value at the end of column 3 differs from the value at the end of column 4:

$ awk 'substr($3, 6) != substr($4, 6)' file.txt 

2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1

You can use the following to print rows where the value at the end of column 3 equals the value at the end of column 4:

$ awk 'substr($3, 6) == substr($4, 6)' file.txt 

1   ATGCCCAGA   AS:i:10 XS:i:10 
4   ATCCCCGA    AS:i:20 XS:i:20

If AS and XS are not fixed, but the value is the last of exactly three colon-separated subfields in each column, you can also use:

$ awk '{split ($3,a,":"); split ($4,b,":"); if (a[3]!=b[3]) print}' file.txt 

2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1
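For comparison, the same split-on-colon idea can be sketched in Python. The names `tag_value` and `differing_lines` are hypothetical, and the sketch assumes whitespace-separated columns with the value after the last colon being an integer.

```python
def tag_value(field):
    """Return the integer after the last colon, e.g. 'AS:i:10' -> 10."""
    return int(field.rsplit(":", 1)[1])


def differing_lines(lines, as_col=2, xs_col=3):
    """Yield lines whose AS and XS values differ (0-based column indexes)."""
    for line in lines:
        fields = line.split()
        if len(fields) <= max(as_col, xs_col):
            continue  # skip blank or short lines
        if tag_value(fields[as_col]) != tag_value(fields[xs_col]):
            yield line.rstrip("\n")


# The example data from the question.
example = [
    "1\tATGCCCAGA\tAS:i:10\tXS:i:10",
    "2\tATGCTTGA\tAS:i:10\tXS:i:5",
    "3\tATGGGGGA\tAS:i:10\tXS:i:1",
    "4\tATCCCCGA\tAS:i:20\tXS:i:20",
]
for line in differing_lines(example):
    print(line)
```

Comparing integers rather than strings means values such as 5 and 05 would be treated as equal, which may or may not be what you want.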
5.0 years ago
# Keep only lines whose AS and XS columns (3rd and 4th) differ.
with open("file.txt") as infile, open("output.txt", "w") as outfile:
    for line in infile:
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # [3:] drops the tag name and first colon, e.g. "AS:i:10" -> "i:10"
        if fields[2][3:] != fields[3][3:]:
            print("\t".join(fields), file=outfile)

Quick and dirty Python solution. Note that this compares i:n1 with i:n2. If you want to omit the i:, change both 3: slices in the for loop to 5:
