Question

Problem when trying to extract column value using "awk" from txt table

0

7.8 years ago

Wet&DryImmunology ▴ 240

Hi, I have regions of interest (ROIs) generated using ROSE (https://bitbucket.org/young_computation/rose). The ROI information was written to a text file, "H3K27acDP_peaks_AllEnhancers_ENHANCER_TO_GENE.txt".

The contents of the file look like this (only the first 3 rows are shown):

#H3K27acDP_peaks Enhancers  OVERLAP_GENES   PROXIMAL_GENES  CLOSEST_GENE    enhancerRank    isSuper
2_H3K27ac_WTDP_peak_8539_lociStitched   chr6    41482303    41510764    2   26841   100899.9372 3865.0038   1       Ephb6,Prss2 Prss2   1   1
12_H3K27ac_WTDP_peak_8627_lociStitched  chr6    71249202    71328945    12  47488   101791.9395 10342.6671  2   Cd8a,Cd8b1  Krcc1,Smyd1 Cd8b1   2   1

I wanted to extract columns from this table to generate a standard GTF file as input for DESeq2 (an R package for finding differentially enriched regions). For that purpose, I used:

awk '{OFS="\t"; print $2, "DP_enhancers","enhancer", $3, $4, "0.000000","-",".", $12}' H3K27acDP_peaks_AllEnhancers_ENHANCER_TO_GENE.txt > H3K27acDP_enhancers.gff &

But I did not get the GTF file I wanted. Here are the first 4 rows:

chr6    DP_enhancers    enhancer    41482303    41510764    0.000000    -   .   1
chr6    DP_enhancers    enhancer    71249202    71328945    0.000000    -   .   Cd8b1
chr14   DP_enhancers    enhancer    54779797    54858773    0.000000    -   .   Dad1
chr17   DP_enhancers    enhancer    47640970    47694393    0.000000    -   .   Ccnd3

The problem is the first row: awk seems not to recognize that the "OVERLAP_GENES" column is empty, so instead of treating "Prss2" as $12, it extracts "1", which belongs to "enhancerRank", as $12. The other rows seem to be OK. If it were only the first row, I could extract $11 instead of $12, but that would break most of the other rows. If anyone has an idea how to solve this, please let me know.

Thank you very much in advance.

gene • 2.7k views

score 3 · Answer 1 · 2017-02-07

3

7.8 years ago

Asaf 10k

Try adding -F"\t" to the awk command, i.e. awk -F"\t" '{OFS....

0

@Asaf, I don't know what magic you suggested, but it worked perfectly! What does -F"\t" do, and why does it solve the problem?

ADD REPLY • link 7.8 years ago by Wet&DryImmunology ▴ 240

2

By default, awk splits each line into columns on whitespace, and it treats a run of consecutive whitespace characters as a single delimiter. When you define the field separator to be a tab (with -F"\t"), each tab counts as its own delimiter, so two consecutive tabs mark an empty field between them.
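To see the difference directly, here is a small sketch that counts fields (NF) both ways. The row is a made-up three-column example with an empty middle column, not a line from the ROSE output, and the variable names are only for illustration.

```shell
# Hypothetical row: three tab-separated columns, the middle one empty,
# so two tabs sit next to each other.
row="$(printf 'a\t\tc')"

# Default splitting: the run of tabs counts as one delimiter.
default_nf="$(printf '%s\n' "$row" | awk '{print NF}')"

# Tab as the field separator: each tab is a delimiter, the empty field is kept.
tab_nf="$(printf '%s\n' "$row" | awk -F'\t' '{print NF}')"

echo "default: $default_nf fields, tab: $tab_nf fields"
```

With the default separator the empty column disappears and every later column shifts left by one, which is what happened to $12 in the first row of the question.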

ADD REPLY • link 7.8 years ago by Asaf 10k

0

I see. Thanks for the patient explanation!

ADD REPLY • link 7.8 years ago by Wet&DryImmunology ▴ 240

1

You could have also done:

awk '{FS=OFS="\t"; prin..

Field Separator equals Output Field Separator equals..

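In general, it's a good idea to put this kind of assignment into a BEGIN block so that it is executed before any input is read. When FS is set inside the main rule instead, the first line has typically already been split with the default separator: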

awk 'BEGIN{FS=OFS="\t"}{print..}'
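Spelled out for the command in the question, the BEGIN form might look like the sketch below. The input row is the first data row from the question, rebuilt by hand with explicit tabs; the exact tab positions are an assumption (OVERLAP_GENES, field 10, is taken to be empty), and the variable names are only for illustration.

```shell
# First data row from the question, rebuilt with explicit tabs.
# Assumed layout: field 10 (OVERLAP_GENES) is empty, hence the double tab.
row="$(printf '2_H3K27ac_WTDP_peak_8539_lociStitched\tchr6\t41482303\t41510764\t2\t26841\t100899.9372\t3865.0038\t1\t\tEphb6,Prss2\tPrss2\t1\t1')"

# FS and OFS are set in BEGIN, so they already apply to the first record.
gff_line="$(printf '%s\n' "$row" | awk 'BEGIN{FS=OFS="\t"} {print $2, "DP_enhancers", "enhancer", $3, $4, "0.000000", "-", ".", $12}')"

printf '%s\n' "$gff_line"
```

Under that assumed layout, the last column of the printed line is Prss2 rather than 1, because the empty OVERLAP_GENES field is kept as field 10.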

ADD REPLY • link 7.8 years ago by 5heikki 11k

1

I think it tells awk that the input file is tab-separated.

ADD REPLY • link 7.8 years ago by mbk0asis ▴ 700

0

I moved this to an answer so it can get accepted.

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k