Making disjoint chromosomal sites contiguous with awk
4.1 years ago
selplat21 ▴ 20

I'm trying to write a loop in awk using the following info from two files:

  • a file with chromosome in the first column and site in the second column
  • a second file with chromosome in the first column and chromosome size in the second column

The sites in the second column run from the first to the last site of a chromosome, and the numbering starts from 1 again on the next chromosome. I need to make the sites in the first file contiguous across chromosomes, so for every chromosome after the first I need to add the size of the preceding chromosomes to each site.

Any help is appreciated!

awk assembly bash sequence • 985 views

Please provide representative input and desired output.


For example, a section of file 1 looks like this (chromosome, site):

Chr2 884860
Chr2 884875
Chr2 884892

The second file looks like this (chromosome, chromosome size):

Chr1    196345723
Chr2    149451176
Chr3    114294425

For every chromosome after Chr1 in the first file, I need to add the size of the preceding chromosome so the coordinates are continuous. Here only Chr1 (196345723) precedes Chr2, so the section of file 1 should look like this:

Chr2 197230583
Chr2 197230598
Chr2 197230615
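
The arithmetic in this example can be checked with a short awk sketch. The offset is passed in by hand here (the Chr1 size from the second file), and the file names `sites_chr2.txt` and `shifted_chr2.txt` are placeholders that do not come from the thread.

```shell
# Recreate the three example rows (file name is hypothetical).
printf 'Chr2 884860\nChr2 884875\nChr2 884892\n' > sites_chr2.txt

# Add the size of Chr1 (196345723) to every Chr2 site.
# %.0f prints the sum as a plain integer; some awk implementations may
# otherwise switch to scientific notation for very large values.
awk -v offset=196345723 '{ printf "%s %.0f\n", $1, $2 + offset }' \
    sites_chr2.txt > shifted_chr2.txt

cat shifted_chr2.txt
```

This only covers a single chromosome with a hard-coded offset; reading the offsets from the size file is a separate step.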

Your question seems unclear about the exact operation being performed, and your output looks suspect (duplicate rows in the output, but the first file contains different "sites").

Can you please simplify the question and double-check what the input and output should look like?


I apologize, one of the sites was accidentally duplicated there. I edited it.

File 1, Column 1 = Chromosome

File 1, Column 2 = Site

Example:

Chr1    1
Chr1    3
Chr1    5
...
Chr2    3
Chr2    6
Chr2    7
...
Chr3    4
Chr3    6
Chr3    8
...

File 2, Column 1 = Chromosome

File 2, Column 2 = Chromosome Size

Example:

Chr1    196345723
Chr2    149451176
Chr3    114294425
...

Desired Output:

File 1 lists sites for each chromosome, with positions between 1 and that chromosome's total size. Some sites are missing because they were filtered out, but no site is below 1 or above the chromosome size. The desired output makes file 1 contiguous across chromosomes, so that chromosome 2 starts where chromosome 1 ends. To do this, I would add the total size of Chr1 to all site values of Chr2 in the first file, and so on for each subsequent chromosome.
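
The stated assumption (every site lies between 1 and the size of its chromosome) could be verified before any shifting is done. The following is a minimal sketch: the file names are hypothetical, and the last row of the sample site file is deliberately made up to be out of range so that the check has something to report.

```shell
# Small stand-in files (names and the out-of-range row are illustrative only).
printf 'Chr1\t1\nChr1\t3\nChr2\t3\nChr2\t999999999\n' > sites_check.txt
printf 'Chr1\t196345723\nChr2\t149451176\n' > sizes_check.txt

awk '
    # First file: remember the size of each chromosome as a number.
    NR == FNR { size[$1] = $2 + 0; next }

    # Second file: report rows whose chromosome is unknown or whose
    # site falls outside 1..size.
    !($1 in size) || $2 + 0 < 1 || $2 + 0 > size[$1] {
        print FILENAME ":" FNR ": " $0
    }
' sizes_check.txt sites_check.txt > out_of_range.txt

cat out_of_range.txt
```

An empty `out_of_range.txt` would mean every site is consistent with the size file.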


Put simply, I want to add a value from file 2, column 2 to each site in file 1, column 2, where the value added comes from the preceding chromosome rather than the site's own chromosome. Chr1 is therefore left unchanged, and every later chromosome needs the sizes of all preceding chromosomes added to its site values.
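
One possible way to do this in awk is to read the size file first, store a running total as the offset for each chromosome, and then add that offset while reading the site file. This is a sketch under a few assumptions: both files are whitespace-delimited, the size file lists chromosomes in the order they should be laid end to end, and the file names (`sites.txt`, `chrom_sizes.txt`, `sites_contiguous.txt`) are placeholders.

```shell
# Tiny stand-ins for the two files, using the example values above.
printf 'Chr1\t1\nChr1\t3\nChr1\t5\nChr2\t3\nChr2\t6\nChr2\t7\nChr3\t4\nChr3\t6\nChr3\t8\n' > sites.txt
printf 'Chr1\t196345723\nChr2\t149451176\nChr3\t114294425\n' > chrom_sizes.txt

awk '
    # First file on the command line (chrom_sizes.txt): for each chromosome,
    # record the summed size of every chromosome listed before it.
    NR == FNR { offset[$1] = total + 0; total += $2; next }

    # Second file (sites.txt): stop if a chromosome has no size entry.
    !($1 in offset) { print "no size for " $1 > "/dev/stderr"; exit 1 }

    # Otherwise shift the site by the offset of its chromosome.
    # Chr1 has offset 0, so its sites are unchanged.
    { printf "%s\t%.0f\n", $1, $2 + offset[$1] }
' chrom_sizes.txt sites.txt > sites_contiguous.txt

cat sites_contiguous.txt
```

With these inputs the Chr2 sites are shifted by 196345723 and the Chr3 sites by 345796899 (the sum of the Chr1 and Chr2 sizes). Because the offsets follow the row order of the size file, that file would need to be sorted in the intended chromosome order first.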
