Hi, I have regions of interest (ROIs) generated with ROSE (https://bitbucket.org/young_computation/rose). The ROI information was written to a text file, "H3K27acDP_peaks_AllEnhancers_ENHANCER_TO_GENE.txt".
The contents of the file look like this (only the first 3 lines are shown here):
#H3K27acDP_peaks Enhancers OVERLAP_GENES PROXIMAL_GENES CLOSEST_GENE enhancerRank isSuper
2_H3K27ac_WTDP_peak_8539_lociStitched chr6 41482303 41510764 2 26841 100899.9372 3865.0038 1 Ephb6,Prss2 Prss2 1 1
12_H3K27ac_WTDP_peak_8627_lociStitched chr6 71249202 71328945 12 47488 101791.9395 10342.6671 2 Cd8a,Cd8b1 Krcc1,Smyd1 Cd8b1 2 1
I wanted to extract columns from this table to generate a standard GTF file as input for DESeq2 (an R package for finding differentially enriched regions). For that purpose, I used:
awk '{OFS="\t"; print $2, "DP_enhancers","enhancer", $3, $4, "0.000000","-",".", $12}' H3K27acDP_peaks_AllEnhancers_ENHANCER_TO_GENE.txt > H3K27acDP_enhancers.gff &
but I did not get the GTF file I wanted. Here are the first 4 rows of the output:
chr6 DP_enhancers enhancer 41482303 41510764 0.000000 - . 1
chr6 DP_enhancers enhancer 71249202 71328945 0.000000 - . Cd8b1
chr14 DP_enhancers enhancer 54779797 54858773 0.000000 - . Dad1
chr17 DP_enhancers enhancer 47640970 47694393 0.000000 - . Ccnd3
The problem is the first row. awk seems to fail to recognize that the "OVERLAP_GENES" column is empty there, so instead of treating "Prss2" as $12, it extracts "1", which belongs to "enhancerRank". The other rows seem to be OK. If it were only the first row, I could extract $11 instead of $12, but that would break most of the other rows. If anyone has an idea how to solve this, please kindly let me know.
Thank you very much in advance.
@Asaf. I don't know what magic you suggested, but it worked perfectly! What is -F"\t"? Why does it solve the problem?
By default, awk splits each line into columns on any whitespace, and it treats consecutive whitespace characters as a single delimiter. When you define the column separator to be a tab (with -F"\t"), two consecutive tabs are treated as two separators, so the empty field between them is kept.
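To see the difference described above, here is a minimal sketch on a made-up line (not taken from the ROSE output) with three tab-separated columns, the middle one empty. The variable names are only for illustration.

```shell
# A made-up three-column line whose middle field is empty (two tabs in a row).
line=$(printf 'a\t\tc')

# Default splitting: a run of whitespace counts as one delimiter,
# so the empty field disappears and awk sees 2 fields.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# Tab splitting: each tab is its own delimiter,
# so the empty field is kept and awk sees 3 fields.
tab_nf=$(printf '%s\n' "$line" | awk -F'\t' '{ print NF }')

echo "default: $default_nf fields, tab: $tab_nf fields"
# prints: default: 2 fields, tab: 3 fields
```

The same thing happens in the first data row of the question: with default splitting the empty OVERLAP_GENES field vanishes and every later column shifts left by one.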
I see. Thanks for the patient explanation!
You could have also done:
Field Separator equals Output Field Separator equals..
In general it's a good idea to place this kind of setup in a BEGIN block so that it is executed before any input is read:
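The command from that comment is not preserved here, so the following is only a sketch of what a BEGIN-block version of the question's command might look like. The sample row is typed in by hand to mimic the first data row from the question, assuming the real file is tab-separated with an empty OVERLAP_GENES column; `sample_row` and `result` are illustrative names.

```shell
# Hand-built sample modelled on the first data row in the question;
# the two consecutive tabs after the 9th column stand for the empty OVERLAP_GENES field.
sample_row=$(printf '2_H3K27ac_WTDP_peak_8539_lociStitched\tchr6\t41482303\t41510764\t2\t26841\t100899.9372\t3865.0038\t1\t\tEphb6,Prss2\tPrss2\t1\t1')

# BEGIN runs before the first line is read, so FS and OFS are already tabs
# when the first record is split into fields.
result=$(printf '%s\n' "$sample_row" |
  awk 'BEGIN { FS = OFS = "\t" }
       { print $2, "DP_enhancers", "enhancer", $3, $4, "0.000000", "-", ".", $12 }')

printf '%s\n' "$result"
```

With tab splitting the last column of the output is "Prss2" rather than "1". To run it on the real file, replace the printf pipe with the file name as the last argument to awk.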
I think it tells awk that the input file is tab-separated.
I moved this to an answer so it can get accepted.