Question

How to fetch only those lines that contain more than one kind of character.

6.8 years ago

kartikayprasad ▴ 10

Hello friends, I have a file with multiple nucleotides per line in a comma-separated format. I want to keep only those lines that contain at least one different nucleotide. I don't want the lines that contain only one kind of nucleotide throughout. For example, my file looks like the one shown below, though the number of nucleotides in a single line can be 100 as well.

T,T,

G,G,

T,T,

G,G,

A,A,

T,T,T,

G,G,

C,C,C,

A,A,A,

T,T,T,

A,A,

T,T,

A,A,

G,G,G,

T,T,T,

A,A,

A,T,A,

G,G,

C,C,

G,G,G,

T,T,

A,A,

G,T,

A,T,A,

T,C,

C,C,C,

G,G,

C,C,C,

A,A,

G,G,

A,A,

T,T,

C,C,

A,A,

T,T,

T,T,T,

C,C,C,

T,T,

A,A,

T,T,

G,G,G,

T,T,

A,A,A,

T,T,

A,A,

T,T,

G,G,G,

A,A,

G,G,G,A,

T,T,

C,C,

A,A,

T,T,

A,A,A,

T,T,

T,T,T,

C,C,

A,A,.

A,T,G,C

I want the answer to look like this:

A,T,A,

G,G,G,A,

A,T,G,C

It would be very helpful if there is a one-liner for this. Please explain the code as well, so that I can understand it properly and use it in the future.

Thanks in advance to all the helpers.

RNA-Seq SNP next-gen assembly • 2.0k views

ADD COMMENT • link updated 6.8 years ago by steve ★ 3.5k • written 6.8 years ago by kartikayprasad ▴ 10

As a side comment: you can shorten your post a lot, by adding only some example lines of your file that are enough for us to understand the problem. Answer is coming right away (in the answers below).

ADD REPLY • link 6.8 years ago by Matteo Schiavinato ★ 3.6k

Macspider, thanks for the comment, I will keep it in mind. Waiting for the answer :)

ADD REPLY • link 6.8 years ago by kartikayprasad ▴ 10

Quite a few people have made a good effort here. Could you take the time to test each and then up-vote or accept the answers that have helped?

Thanks!

ADD REPLY • link 6.8 years ago by Kevin Blighe 88k

Yes, I will try all of the code and I will surely upvote the answers. Thanks once again for the help.

ADD REPLY • link 6.8 years ago by kartikayprasad ▴ 10

score 2 · Answer 1 · 2018-01-31

6.8 years ago

mxs ▴ 530

perl -lne '$a[0] = ($_ =~ tr/A//);$a[1]= ($_ =~ tr/T//);$a[2]=($_ =~ tr/C//);$a[3]=($_ =~ tr/G//); my $b =0; foreach(@a){$b++ if $_>0} print $_ if $b >1 ' myfile

This assumes that only A, T, C and G occur in the file.

ADD COMMENT • link 6.8 years ago by mxs ▴ 530

Another Perl solution:

perl -lne '%a = (); @b = split /,/, $_; foreach $b (@b) { $a{$b}++; }; @b = keys %a; print if ($#b > 0)' < in.txt

A,T,A,
G,T,
A,T,A,
T,C,
G,G,G,A,
A,T,G,C

ADD REPLY • link 6.8 years ago by JC 13k

Hey, thanks for the code. Can you please help me a little more? I tried to add something to your code but it failed. What I want is for the code to also print \n for those lines that contain only identical nucleotides. For example, the code would also print \n for A,A and for T,T, along with the result it already provides.

ADD REPLY • link 6.8 years ago by kartikayprasad ▴ 10

Hi, I'm not sure what you want to print: just a "\n" whenever the line contains a homozygous base, or only for A,A and T,T?

What you need is to extend the last if, something like:

perl -lne '%a = (); @b = split /,/, $_; foreach $b (@b) { $a{$b}++; }; @b = keys %a; if ($#b > 0) { print } else { print "\n" } ' < in.txt

ADD REPLY • link 6.8 years ago by JC 13k
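
For readers who prefer Python, the same "keep the line or emit a blank placeholder" idea can be sketched as below. This is only an illustration, not the code from the reply above: the function name mark_lines is made up for this sketch, and it assumes that a uniform line should be replaced by exactly one empty line so that line positions stay aligned with the input.

```python
def mark_lines(lines):
    """Return each mixed line unchanged and an empty string for uniform lines.

    Hypothetical helper: the name and the one-blank-line convention are
    assumptions made for this sketch.
    """
    out = []
    for line in lines:
        # Drop the newline, split on commas, and ignore the empty field
        # that a trailing comma would otherwise create.
        bases = [b for b in line.strip().split(",") if b]
        # More than one distinct base means the line is kept as-is.
        out.append(line.strip() if len(set(bases)) > 1 else "")
    return out


if __name__ == "__main__":
    for row in mark_lines(["A,A,", "A,T,A,", "T,T,", "G,G,G,A,"]):
        print(row)
```

Because print() already appends a newline, an empty string here yields a single blank line per uniform input line.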

hi mxs, thanks for the reply. Can you please explain the code? Thanks

ADD REPLY • link 6.8 years ago by kartikayprasad ▴ 10

The idea is that you count each character separately. If a line contains only one type of character, the count for that character will be > 0 while the counts for the rest will be 0. So you count how many characters have a count > 0; if that number is > 1, the line is not poly-something, and that is the line you print. Learn Perl one-liners or awk so that you don't waste your time on such trivial tasks; it adds up as you start doing bioinformatics professionally :)

PS: there is a shorter version of this solution, can you figure it out? :)

ADD REPLY • link 6.8 years ago by mxs ▴ 530
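
The counting idea described above can also be written out in Python, which may make the logic easier to follow. This is a rough sketch rather than a translation endorsed by the answer's author: has_mixed_bases is a hypothetical name, and, like the Perl version, it assumes that only A, T, C and G matter.

```python
def has_mixed_bases(line, alphabet="ATCG"):
    """Return True when at least two different letters of `alphabet` occur.

    Hypothetical helper; `alphabet` defaults to the four bases that the
    Perl one-liner counts.
    """
    # str.count plays the role of Perl's tr/X// counting idiom.
    present = [base for base in alphabet if line.count(base) > 0]
    # One base present means a uniform line, two or more means mixed.
    return len(present) > 1


if __name__ == "__main__":
    for line in ["T,T,", "A,T,A,", "G,G,G,A,"]:
        if has_mixed_bases(line):
            print(line)
```

Any character outside the chosen alphabet (commas, an N, a stray period) is simply ignored by this test.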

score 2 · Answer 2 · 2018-01-31

6.8 years ago

Kevin Blighe 88k

Assuming that each line in your file does not actually end in a comma:

awk '{strPrevBase=$1; boolDiff=0; for (i=2; i<=NF; i++) {if ($(i)!=strPrevBase) {boolDiff=1}} if (boolDiff==1) {print $0}} ' FS="," test
A,T,A
G,T
A,T,A
T,C
G,G,G,A
A,T,G,C

Note that I identified 2 extra lines in your pasted data where at least one base differs.

Kevin

ADD COMMENT • link 6.8 years ago by Kevin Blighe 88k
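
The assumption about the trailing comma matters, because splitting "T,T," on commas produces an empty last field that differs from the first one. The Python sketch below mimics the compare-with-the-first-field test to show the effect; differs_from_first and its drop_empty flag are names invented for this illustration.

```python
def differs_from_first(line, drop_empty):
    """Mimic the awk test: does any field differ from the first field?

    Hypothetical helper; `drop_empty` controls whether empty fields
    (such as the one after a trailing comma) are discarded first.
    """
    fields = line.rstrip("\n").split(",")
    if drop_empty:
        # Discard the empty field created by a trailing comma.
        fields = [f for f in fields if f != ""]
    return any(f != fields[0] for f in fields[1:])


print(differs_from_first("T,T,", drop_empty=False))  # True: "" != "T"
print(differs_from_first("T,T,", drop_empty=True))   # False: all fields are "T"
```

So if the file really does have a comma at the end of every line, the empty field needs to be removed (or the comma stripped) before comparing.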

thank you very much for the help

ADD REPLY • link 6.8 years ago by kartikayprasad ▴ 10

Hey, thanks for the code. Can you please help me a little more? I tried to add something to your code but it failed. What I want is for the code to also print \n for those lines that contain only identical nucleotides. For example, the code would print \n for A,A and for T,T.

What I added to your code is an else condition at the end, but it is not working. Can you please help?

awk '{strPrevBase=$1; boolDiff=0; for (i=2; i<=NF; i++) {if ($(i)!=strPrevBase) {boolDiff=1}} if (boolDiff==1) {print $0} else {print "\n"}} ' FS="," test

ADD REPLY • link 6.8 years ago by kartikayprasad ▴ 10

Sure, can you try this (I think that this is what you want):

awk '{strPrevBase=$1; boolDiff=0; for (i=2; i<=NF; i++) {if ($(i)!=strPrevBase) {boolDiff=1}} if (boolDiff==1) {print $0} else print "\n"} ' FS="," test

ADD REPLY • link 6.8 years ago by Kevin Blighe 88k

score 2 · Answer 3 · 2018-01-31

Here's a oneliner that will do the trick. It's 95% based on python, so I hope it's good. Substitute "test.txt" with your file name.

command

cat test.txt | python2.7 -c 'import sys; lst=[[line.rstrip("\n"), list(set(line.rstrip("\b\r\n, ").split(",")))] for line in sys.stdin]; tmp=[x[0] for x in lst if len(x[1])>1]; sys.stdout.write("\n".join(tmp) + "\n")'

explanation

We cat the file and pipe it to python2.7, using the -c option to pass a command within quotes (''). We first import the sys module just to make reading input and writing output easy (or at least, I like it haha). Every statement of the python command is separated by a semicolon (;).

We create a list (lst). We read the input file through the python list comprehension syntax (see the end of the command: ... for line in sys.stdin). What we declare before that is the value stored in the list for each line. In this case, it is another list composed of two elements. The first item of this sub-list is the raw element you want to print out, the second is a processed version of it.

The first item of the sub-list is simply the line stripped of its newline character (rstrip("\n")). The second is processed more. We remove the trailing control characters, commas and spaces (rstrip("\b\r\n, ")). We then split this item at every comma (split(",")). This produces an output like ["A", "T", "A"], a list where each item is one of the ones you had separated by commas. So each line at this point looks like this:

["A,T,A", ["A", "T", "A"]]

A list of two elements: the raw line in string format and the processed line in form of list.

Since you want only the lines which contain more than one "letter", one neat way to do so is to "unique" the list and see if the final length is > 1 (i.e. there is more than one letter). To do so in python: list(set()). set() will remove the duplicates in the list, and list() will re-format the output as a list again. So each line at this point looks like this (the order of the items coming out of a set is not guaranteed):

["A,T,A", ["A", "T"]]

Note that the second A has disappeared, being a duplicate.

The next statement in the python part selects only those lines whose uniqued list has more than one item, meaning the ones that you are interested in. It does so with the list length (if len(x[1])>1). Each selected item is a list of two elements, where the first is the raw input line. We make a list, which I here call tmp, that contains only the raw input line for each selected item. That is what we now print out: with sys.stdout.write("\n".join(tmp) + "\n") we join() the elements of this list with a newline character, forming the line-formatted output, and we add a final newline to complete it (+ "\n").
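
The steps explained above can also be unrolled into a short, readable script. The version below is a sketch of the same logic rather than the original one-liner: the function name filter_mixed is made up here, it uses a generator instead of building the intermediate lists, and it strips only carriage returns, commas and spaces before splitting.

```python
def filter_mixed(lines):
    """Yield the raw lines that contain more than one distinct item.

    Hypothetical helper written for this illustration.
    """
    for line in lines:
        # Raw version of the line, kept for output.
        raw = line.rstrip("\n")
        # Processed version: strip trailing separators, split, deduplicate.
        unique = set(raw.rstrip("\r, ").split(","))
        if len(unique) > 1:
            yield raw


if __name__ == "__main__":
    sample = ["A,T,A,\n", "G,G,\n", "G,G,G,A,\n", "A,T,G,C\n"]
    for kept in filter_mixed(sample):
        print(kept)
```

To run it on a real file, the sample list could be replaced by sys.stdin (after import sys) or by an open file handle, since both yield one line at a time.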

score 2 · Answer 4 · 2018-01-31

6.8 years ago

steve ★ 3.5k

Python

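import sys
input_file = "nucleotides.csv"
with open(input_file) as f:
    for line in f:
        # split on commas and drop empty fields (e.g. after a trailing comma)
        parts = [x for x in line.strip().split(',') if x != '']
        # True when every field matches the first one (also True for an empty line)
        all_equal = all( x == parts[0] for x in parts)
        # print only non-empty lines that contain at least one differing field
        if not all_equal and len(parts) > 0:
            sys.stdout.write(line)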

ADD COMMENT • link 6.8 years ago by steve ★ 3.5k