How to compare several ASCII files in LINUX
6.2 years ago
jomagrax ▴ 40

Hi everyone, I have several ASCII files containing genes expressed in different experimental conditions (apple.conditionA, apple.conditionB, apple.conditionC). The first column of each file contains the gene name; the other columns hold information such as the chromosome the gene is on, its direction, etc. I need to extract into a .txt file the gene names that are expressed only in condition A (apple.genesA) using Linux commands.

Thanks in advance

rna-seq LINUX • 2.1k views
ADD COMMENT
5
Entering edit mode

Please be more specific. ASCII doesn’t narrow down what type of file you are trying to analyse, only the text encoding style.

ADD REPLY
0
Entering edit mode

OK, sorry. The files all have the same structure, with several columns. I need to compare the first column of all of them (where the gene names are) and then extract the genes that are unique to one particular file.

ADD REPLY
0
Entering edit mode

You still haven’t told us what the files are.

Edit your question to include some example input data and the kind of structure you would like output.

ADD REPLY
0
Entering edit mode

OK, I hope it's clear now. Thank you for your time.

ADD REPLY
0
Entering edit mode

It's not. Tell us how you obtained the files, which software was used and show an example.

ADD REPLY
0
Entering edit mode

We are not mind readers. How are we supposed to know what ‘condition A’ is, if you don’t show us the data? At the moment “is expressed in condition A” could be a Boolean, it could be some integer value, a floating point > some threshold?

If the data is confidential or something, you can make a mockup of the file which follows the same patterns with different context.

Please put in more effort, or else we will just close this post.

ADD REPLY
0
Entering edit mode

MDP0000303933 MDP0000303933 chr1 - 4276 5447

This is, for instance, the first line of the apple.conditionA file. The first column holds the gene name, the second column holds the RNA read that was sequenced and assigned to the gene in column one, and the remaining columns give the chromosome, the direction of the gene, and its coordinates.

All three files have the same structure. Using the Linux command line, is there a way to extract the genes that appear only in apple.conditionA, compared with apple.conditionB and apple.conditionC?

Sorry for the vagueness of my questions but this is all very new to me, once again thank you for your help

ADD REPLY
0
Entering edit mode

So you just want all the lines in the condition A file that are unique (i.e. not in file B or C), based on column 1?

ADD REPLY
0
Entering edit mode

Yes, exactly! Thanks

ADD REPLY
2
Entering edit mode

Please refer to my solution below. You can achieve this by grepping A for entries that match none of the first-column values of B and C, which you get by running cut on files B and C and concatenating the results.
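
A minimal sketch of the cut + grep idea, assuming the columns are tab-separated (the default delimiter for cut). The files are rebuilt here as mock data in a scratch directory so the commands can be tried safely; only the first gene ID comes from this thread, the others are made up, and bc_genes.txt is just a hypothetical name for the intermediate list. Note that it filters the first column of A rather than whole lines, so a gene name that happens to appear in another column cannot cause a false match.

```shell
# Mock, tab-separated inputs in a scratch directory.
# Only the first gene ID comes from the thread; the others are made up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'MDP0000303933\tMDP0000303933\tchr1\t-\t4276\t5447\n' >  apple.conditionA
printf 'MDP0000000002\tMDP0000000002\tchr2\t+\t100\t900\n'   >> apple.conditionA
printf 'MDP0000000002\tMDP0000000002\tchr2\t+\t100\t900\n'   >  apple.conditionB
printf 'MDP0000000003\tMDP0000000003\tchr3\t-\t50\t700\n'    >  apple.conditionC

# Gene names seen in B or C, one per line (bc_genes.txt is a hypothetical name).
cut -f1 apple.conditionB apple.conditionC | sort -u > bc_genes.txt

# Keep the names from A that match none of them:
# -v invert the match, -x match whole lines, -F fixed strings, -f patterns from a file.
cut -f1 apple.conditionA | sort -u | grep -vxFf bc_genes.txt > apple.genesA

cat apple.genesA
```

If the real files use spaces instead of tabs, cut would need -d ' ' (or awk could be used instead).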

ADD REPLY
0
Entering edit mode

I think I have it:

$ cut -f1 apple.conditionA | sort -u > compare | cut -f1 apple.conditionB apple.conditionC | sort -u > tocompare
$ comm -23 compare tocompare

This way I need to create two files and use two lines, but I can't think of anything else.

ADD REPLY
1
Entering edit mode

Temporary files work fine, but if you would rather not create files, check out process substitution:

sort -u <(cut -f1 file.txt)

is the same as

cut -f1 file.txt | sort -u

Also, please don't use the command >file | command2 syntax. It may work now because your shell doesn't have MULTIOS (a zsh option) enabled, but with MULTIOS enabled it will send the output of cut -f1 apple.conditionA both to the file compare and down the pipe to cut -f1 apple.conditionB, which can mangle the output and introduce unpredictable bugs, bordering on file corruption, into your pipeline.
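
As a sketch of the safer form: run the two cut commands as separate commands, joined with && instead of |, so each one simply writes its own file and nothing depends on shell options. The inputs below are mock, tab-separated data in a scratch directory; apart from the first gene ID, the IDs are made up.

```shell
# Mock, tab-separated inputs in a scratch directory (IDs other than the first are made up).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'MDP0000303933\tMDP0000303933\tchr1\t-\t4276\t5447\n' >  apple.conditionA
printf 'MDP0000000002\tMDP0000000002\tchr2\t+\t100\t900\n'   >> apple.conditionA
printf 'MDP0000000002\tMDP0000000002\tchr2\t+\t100\t900\n'   >  apple.conditionB
printf 'MDP0000000003\tMDP0000000003\tchr3\t-\t50\t700\n'    >  apple.conditionC

# Two independent commands: the second starts only if the first succeeded,
# and no data flows between them.
cut -f1 apple.conditionA > compare &&
cut -f1 apple.conditionB apple.conditionC > tocompare

# Show what each temporary file contains.
cat compare
cat tocompare
```

A plain newline or ; between the two commands would work as well; && just stops early if the first cut fails.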

ADD REPLY
0
Entering edit mode

Ok, so finally I got

$ comm -23 <(cut -f1 apple.conditionA | sort -u) <(cut -f1 apple.conditionB apple.conditionC | sort -u)

I didn't know process substitution existed, thank you very, very much!!
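
Two details about comm are worth checking on a small example: it expects both inputs to be sorted, and its flags suppress columns, so -12 leaves the names common to both inputs while -23 leaves the names found only in the first. The sketch below uses mock, tab-separated files in a scratch directory (all gene IDs except the first are made up, and shared_genes.txt is a hypothetical name). Process substitution needs bash, zsh or ksh rather than plain sh.

```shell
# Mock, tab-separated inputs in a scratch directory (IDs other than the first are made up).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'MDP0000303933\tMDP0000303933\tchr1\t-\t4276\t5447\n' >  apple.conditionA
printf 'MDP0000000002\tMDP0000000002\tchr2\t+\t100\t900\n'   >> apple.conditionA
printf 'MDP0000000005\tMDP0000000005\tchr5\t+\t10\t400\n'    >> apple.conditionA
printf 'MDP0000000002\tMDP0000000002\tchr2\t+\t100\t900\n'   >  apple.conditionB
printf 'MDP0000000003\tMDP0000000003\tchr3\t-\t50\t700\n'    >  apple.conditionC

# comm needs sorted inputs; sort -u also collapses repeated gene names.
# -23 suppresses column 2 (only in B/C) and column 3 (in both),
# leaving column 1: names found only in condition A.
comm -23 <(cut -f1 apple.conditionA | sort -u) \
         <(cut -f1 apple.conditionB apple.conditionC | sort -u) > apple.genesA

# For contrast, -12 leaves only column 3: names shared between A and B/C.
comm -12 <(cut -f1 apple.conditionA | sort -u) \
         <(cut -f1 apple.conditionB apple.conditionC | sort -u) > shared_genes.txt

cat apple.genesA
```

Redirecting to apple.genesA gives the text file of condition-A-only gene names asked for in the question.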

ADD REPLY
1
Entering edit mode

Yep, it's as simple as that!

ADD REPLY
4
Entering edit mode
6.2 years ago
Ram 44k

You could use a combination of cut and diff, cut and comm, or cut and grep to get to your results. Of course, you can also substitute cut with awk or do the entire thing in Python or R.

Given how vague and obfuscated your description is at the moment, this is all the help I can give you.
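
For reference, with the file layout described in the comments (gene name in the first column), the awk variant could look roughly like the sketch below. It assumes tab-separated columns and uses mock data in a scratch directory; all gene IDs except the first are made up. Listing apple.conditionA last lets awk collect the names from B and C before it reaches A.

```shell
# Mock, tab-separated inputs in a scratch directory (IDs other than the first are made up).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'MDP0000303933\tMDP0000303933\tchr1\t-\t4276\t5447\n' >  apple.conditionA
printf 'MDP0000000002\tMDP0000000002\tchr2\t+\t100\t900\n'   >> apple.conditionA
printf 'MDP0000000005\tMDP0000000005\tchr5\t+\t10\t400\n'    >> apple.conditionA
printf 'MDP0000000002\tMDP0000000002\tchr2\t+\t100\t900\n'   >  apple.conditionB
printf 'MDP0000000003\tMDP0000000003\tchr3\t-\t50\t700\n'    >  apple.conditionC

# While reading B and C: remember every gene name.
# While reading A: print a name once if it was never seen in B or C.
awk -F'\t' '
    FILENAME != "apple.conditionA" { seen[$1] = 1; next }
    !($1 in seen) && !printed[$1]++ { print $1 }
' apple.conditionB apple.conditionC apple.conditionA > apple.genesA

cat apple.genesA
```

Unlike the comm approach, this needs no sorting and keeps the names in the order they appear in A.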

ADD COMMENT
