Here's a one-liner that should do the trick. It's 95% Python, so I hope that works for you. Replace "test.txt" with your file name.
command
cat test.txt | python2.7 -c 'import sys; lst=[[line.rstrip("\n"), list(set(line.rstrip("\b\r\n, ").split(",")))] for line in sys.stdin]; tmp=[x[0] for x in lst if len(x[1])>1]; sys.stdout.write("\n".join(tmp) + "\n")'
explanation
We cat the file and pipe it to python2.7, using the -c option to pass the program as a quoted string (''). We first import the sys module, simply because it makes reading the input and writing the output easy (or at least, I like it, haha). The statements of the Python command are separated by semicolons (;).
We create a list (lst). We read the input through Python's list comprehension syntax (see the end of that statement: ... for line in sys.stdin). What comes before the for is the expression stored in the list for each line: in this case, another list of two elements. The first item of this sub-list is the raw line you want to print out; the second is a processed version of it.
The first item of the sub-list is simply the line stripped of its trailing newline (rstrip("\n")). The second is processed further. We remove trailing backspace, carriage-return and newline characters, commas and spaces (rstrip("\b\r\n, ")). We then split the result at every comma (split(",")). This produces a list like ["A", "T", "A"], where each item is one of the values that were separated by commas. So at this point each line looks like this:
["A,T,A", ["A", "T", "A"]]
A list of two elements: the raw line as a string and the processed line as a list.
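To make this step concrete, here is a minimal sketch of how one line becomes that two-element list. The input line "A,T,A" is just a hypothetical example, and the names raw, items and pair are mine, not part of the one-liner.

```python
# Hypothetical input line, as it would arrive from sys.stdin (newline included).
line = "A,T,A\n"

# First element: the raw line, with only the trailing newline removed.
raw = line.rstrip("\n")

# Second element: strip trailing backspace/CR/LF, commas and spaces, then split on commas.
items = line.rstrip("\b\r\n, ").split(",")

pair = [raw, items]
print(pair)  # ['A,T,A', ['A', 'T', 'A']]
```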
Since you want only the lines that contain more than one distinct "letter", one neat way is to "unique" the list and check whether the resulting length is > 1 (i.e. there is more than one distinct letter). To do so in Python: list(set()). set() removes the duplicates from the list, and list() turns the result back into a list. So at this point each line looks like this:
["A,T,A", ["A", "T"]]
Note that the second A has disappeared, being a duplicate. (A set is unordered, so the order of the remaining items is not guaranteed, but only the length matters here.)
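As a small illustration of the "unique and count" idea, the sketch below applies list(set()) to two hypothetical lists. The variable names are made up for the example, and sorted() is used only so that the printed order is stable.

```python
mixed = ["A", "T", "A"]   # hypothetical line with two distinct letters
same = ["G", "G", "G"]    # hypothetical line with a single repeated letter

unique_mixed = list(set(mixed))
unique_same = list(set(same))

print(sorted(unique_mixed))   # ['A', 'T']
print(len(unique_mixed) > 1)  # True: this line would be kept
print(len(unique_same) > 1)   # False: this line would be dropped
```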
The next statement in the Python part selects only the lines whose uniqued list has more than one item, i.e. the ones you are interested in. It does so by checking the list length (if len(x[1])>1). Each selected item is a list of two elements, of which the first is the raw input line. We build a list, here called tmp, that contains only the raw input line of each selected item. That is what we print: with sys.stdout.write("\n".join(tmp) + "\n") we join() the elements of this list with newline characters, forming the line-formatted output, and we append a final newline to complete it (+ "\n").
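If the one-liner is hard to read, the same logic can be spelled out as a small script. This is only a sketch of the steps described above, not a replacement for the command: the function name keep_mixed_lines and the sample data are my own, and it should run under both Python 2.7 and Python 3.

```python
def keep_mixed_lines(lines):
    """Return the raw lines whose comma-separated values are not all identical."""
    kept = []
    for line in lines:
        raw = line.rstrip("\n")                            # what gets printed
        values = set(line.rstrip("\b\r\n, ").split(","))   # distinct values on the line
        if len(values) > 1:                                # more than one distinct value
            kept.append(raw)
    return kept

# Hypothetical sample input; each string mimics one line read from a file.
sample = ["A,T,A\n", "G,G,G\n", "C,C,T\n"]
result = keep_mixed_lines(sample)
print("\n".join(result))
# A,T,A
# C,C,T
```

To use it on real data, pass sys.stdin (after import sys) or an open file object instead of sample.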
As a side comment: you can shorten your post a lot by including only a few example lines of your file, enough for us to understand the problem. An answer is coming right away (in the answers below).
Macspider, thanks for the comment, I will keep it in mind. Waiting for the answer :)
Quite a few people have made a good effort here. Could you take the time to test each and then up-vote or accept the answers that have helped?
Thanks!
Yes, I will try all of the code and I will surely upvote the answers. Thanks once again for the help.