Question

Compare specific parts of two columns in a text file in Linux

5.5 years ago

bobbyle0210 ▴ 10

I have a text file with several tab-separated columns, as below:

1    ATGCCCAGA  AS:i:10   XS:i:10  
2    ATGCTTGA   AS:i:10   XS:i:5  
3    ATGGGGGA   AS:i:10   XS:i:1  
4    ATCCCCGA   AS:i:20   XS:i:20

I now want to compare the last two columns, AS:i:(n1) and XS:i:(n2), and keep only the lines where n1 differs from n2. So my desired output would be:

2    ATGCTTGA   AS:i:10   XS:i:5  
3    ATGGGGGA   AS:i:10   XS:i:1

Could you suggest some ways to compare n1 and n2 and print the output? Thanks in advance.

linux samfile alignment aligning_score • 2.6k views

updated 5.5 years ago by bioinformatics2020 ▴ 830 • written 5.5 years ago by bobbyle0210 ▴ 10

You wrote: "AS:i:(n1) and XS:i:(n2) to obtain only lines with identical n1 and n2."

But in your "desired output", n1 is not identical to n2.

Anyway, you're looking for awk. Look at https://www.unixtutorial.org/awk-delimiter and https://www.unix.com/shell-programming-and-scripting/274247-how-compare-two-column-using-awk.html

5.5 years ago by Pierre Lindenbaum 166k

score 2 · Answer 1 · 2019-11-25

Hi,

The last answer from @cpad0112 looks right to me, in that it directly answers your question and extracts and compares n1 and n2 from the third and fourth columns.

However I note that the example file looks rather like a simplified version of SAM data, and you did ask for alternative approaches. If your original source of data really is a SAM/BAM file then a more robust approach is to use htslib to parse the whole file. In Python, the pysam library gives access to htslib as documented here:

https://pysam.readthedocs.io/en/latest/api.html

In the fourth example on that page, under the heading "You can also write to a AlignmentFile" there is a prototypical filter script. In the example the test is on read.is_paired but you could instead test on read.get_tag('AS') != read.get_tag('XS'). Other command-line tools like 'bamtools' and 'samtools' have various filter options but I'm not aware of any that can compare two tags.
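As a rough illustration of that suggestion, the sketch below swaps the `read.is_paired` test for a tag comparison. It assumes pysam is installed and that the input is a BAM file. The helper names `tags_differ` and `filter_bam` are made up for this example, and dropping reads that lack either tag is my assumption, not something the pysam documentation prescribes.

```python
def tags_differ(read, first="AS", second="XS"):
    """Return True when both tags are present and their values differ."""
    # Assumption: a read missing either tag is dropped rather than kept.
    if not (read.has_tag(first) and read.has_tag(second)):
        return False
    return read.get_tag(first) != read.get_tag(second)


def filter_bam(in_path, out_path):
    """Copy reads whose AS and XS tags differ from in_path to out_path."""
    import pysam  # imported here so tags_differ works without pysam installed

    with pysam.AlignmentFile(in_path, "rb") as infile, \
            pysam.AlignmentFile(out_path, "wb", template=infile) as outfile:
        # until_eof=True reads the file sequentially, so no index is needed.
        for read in infile.fetch(until_eof=True):
            if tags_differ(read):
                outfile.write(read)
```

Calling `filter_bam("input.bam", "filtered.bam")` (both paths are placeholders) would write the filtered reads to a new BAM file. Checking `has_tag` first matters because `get_tag` raises `KeyError` for a missing tag, and aligners do not always emit XS.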

score 1 · Answer 2 · 2019-11-25

5.5 years ago

cpad0112 21k

@bobbyle0210 Code that works with the example data:

$ awk 'a[$3]++' file.txt 
2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1

A variation of this would be:

$ cat file.txt 
1   ATGCCCAGA   AS:i:10 XS:i:10 
2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1  
4   ATCCCCGA    AS:i:20 XS:i:20
5   CTGATCGAT   AS:i:10 XS:i:10

$ awk '!a[$4]++ && a[$3]++' file.txt 
2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1

5.5 years ago by cpad0112 21k

Hi, thank you for your help. Could you explain in detail how the command works? I am not very familiar with Linux commands. Thank you :)

5.5 years ago by bobbyle0210 ▴ 10

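The first command prints every row whose column 3 value has already appeared on an earlier line.

The second command prints every row whose column 4 value appears for the first time and whose column 3 value has already appeared on an earlier line. Neither command compares n1 with n2 directly, so both depend on the layout of the example data.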
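To make the counting idiom concrete, here is a rough Python equivalent of `awk 'a[$3]++'`, offered only as a sketch. The function name `seen_before_filter` is invented for this illustration. It splits on any whitespace, as awk does by default.

```python
from collections import defaultdict


def seen_before_filter(lines, column=3):
    """Yield lines whose value in the given 1-based column appeared on an earlier line."""
    counts = defaultdict(int)  # plays the role of the awk array a
    for line in lines:
        fields = line.split()
        # awk treats a missing field as an empty string.
        key = fields[column - 1] if len(fields) >= column else ""
        # a[$3]++ evaluates to the count *before* the increment,
        # so a line is printed only if its key was seen before.
        if counts[key] > 0:
            yield line
        counts[key] += 1
```

On the four example rows this yields rows 2 and 3, because AS:i:10 first appears on row 1. It would also yield a fifth row such as `5 CTGATCGAT AS:i:10 XS:i:10`, even though n1 equals n2 there, which is why the second command adds the test on column 4.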

If the AS:i: and XS:i: prefixes are fixed, you can compare the values directly. substr($3, 6) returns everything after the five-character prefix, so the following prints the rows where the value in column 3 differs from the value in column 4:

$ awk 'substr($3, 6) != substr($4, 6)' file.txt 

2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1

To print the rows where the value in column 3 equals the value in column 4, use the following:

$ awk 'substr($3, 6) == substr($4, 6)' file.txt 

1   ATGCCCAGA   AS:i:10 XS:i:10 
4   ATCCCCGA    AS:i:20 XS:i:20

If the AS and XS prefixes are not fixed, but each of these columns consists of exactly three parts separated by :, you can also use:

$ awk '{split ($3,a,":"); split ($4,b,":"); if (a[3]!=b[3]) print}' file.txt 

2   ATGCTTGA    AS:i:10 XS:i:5  
3   ATGGGGGA    AS:i:10 XS:i:1
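For readers more at home in Python, the same split-on-colon idea could be sketched as below. The function name `differing_rows` is invented for this example, and converting the values with `int` assumes they are always integers, as the `i` type code suggests.

```python
def differing_rows(lines):
    """Yield lines where the third colon-separated part of columns 3 and 4 differs."""
    for line in lines:
        fields = line.split()  # split on any whitespace, like awk's default
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        # "AS:i:10".split(":") gives ["AS", "i", "10"]; index 2 is the value.
        n1 = int(fields[2].split(":")[2])
        n2 = int(fields[3].split(":")[2])
        if n1 != n2:
            yield line
```

Passing an open file object, for example `differing_rows(open("file.txt"))`, yields the matching lines unchanged.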

5.5 years ago by cpad0112 21k

score 0 · Answer 3 · 2019-11-25

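# Write the rows whose AS and XS values differ to output.txt.
with open("file.txt") as infile, open("output.txt", "w") as outfile:
    for line in infile:
        # Split on tabs and strip stray whitespace from each field.
        fields = [field.strip() for field in line.rstrip("\n").split("\t")]
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        # fields[2] is AS:i:n1 and fields[3] is XS:i:n2; [3:] leaves i:n.
        if fields[2][3:] != fields[3][3:]:
            outfile.write("\t".join(fields) + "\n")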

Quick and dirty Python solution. Note that this compares i:n1 with i:n2. If you want to leave out the i:, change each 3: in the comparison to 5: