Remove observations from txt file
2.6 years ago
gubrins ▴ 350

Hi,

I have a simple question that I cannot manage to solve. I have a txt file with genome-wide heterozygosity records for one individual, like this:

NC_018723.3 50001   305 39182   0.00778419
NC_018723.3 150000  644 78927   0.00815944
NC_018723.3 250000  28  83487   0.000335382
NC_018723.3 350000  43  84221   0.000510561
NC_018723.3 450000  56  73332   0.00076365
NC_018723.3 550000  52  77842   0.00066802

The first column is the chromosome, the second is the genomic coordinate (I used non-overlapping windows of 100 kb, and the second column holds the midpoint of each window), the third is the number of SNPs, the fourth is the number of called bases, and the fifth is the ratio SNPs / called bases.

Then I have a second file with bed coordinates of the regions that I want to first include and later exclude when re-calculating the heterozygosity. So what I want is: if a sliding window falls within one of the regions in my bed file, write one file including all such windows and a second file excluding all of them. How can I do this? It does not have to be done in bash!

the second file where I have the bed coordinates is like this:

NC_018723.3 203270  441160
NC_018723.3 624960  695520
NC_018723.3 756696  977820
NC_018723.3 1005429 1221086
NC_018723.3 1240095 1705853
NC_018723.3 1747839 1964846
NC_018723.3 1975644 2136144
NC_018723.3 2169657 2651377

and the expected output file would be this:

NC_018723.3 250000  28  83487   0.000335382
NC_018723.3 350000  43  84221   0.000510561
NC_018723.3 450000  56  73332   0.00076365

These three entries are kept because their windows overlap the first region in the bed file (203270 to 441160).

Thanks in advance!


It would be easier to understand the issue if you posted example input files and the expected output instead of only describing the problem.


Sorry about that. I already posted part of the input file; the second file, with the bed coordinates, looks like this:

NC_018723.3 203270  441160
NC_018723.3 624960  695520
NC_018723.3 756696  977820
NC_018723.3 1005429 1221086
NC_018723.3 1240095 1705853
NC_018723.3 1747839 1964846
NC_018723.3 1975644 2136144
NC_018723.3 2169657 2651377

and the expected output file would be this:

NC_018723.3 250000  28  83487   0.000335382
NC_018723.3 350000  43  84221   0.000510561
NC_018723.3 450000  56  73332   0.00076365

These three entries are kept because their windows overlap the first region in the bed file (203270 to 441160). Does this help?


Thank you. I added this information to OP, for others to understand the post.

2.6 years ago

This sounds like a job for bedtools intersect, but first you need to convert your file into a proper bed file with a start and an end column for each window (bed start coordinates are 0-based and end coordinates are exclusive), like this:

NC_018723.3 0 100000   305 39182   0.00778419
NC_018723.3 100000 200000  644 78927   0.00815944
NC_018723.3 200000 300000  28  83487   0.000335382
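
The conversion step could be sketched with awk as follows. This is a minimal sketch, assuming the windows are exactly 100 kb (as stated in the question) and that column 2 holds a position inside each window. The file names het.txt and het.bed and the workdir variable are placeholders, not names from the original post. It writes 0-based, half-open intervals, which is the bed convention, so the first window comes out as 0 to 100000.

```shell
# Scratch directory and example input copied from the question
# (column 2 is the window midpoint).
workdir=$(mktemp -d)
cat > "$workdir/het.txt" <<'EOF'
NC_018723.3 50001 305 39182 0.00778419
NC_018723.3 150000 644 78927 0.00815944
NC_018723.3 250000 28 83487 0.000335382
NC_018723.3 350000 43 84221 0.000510561
NC_018723.3 450000 56 73332 0.00076365
NC_018723.3 550000 52 77842 0.00066802
EOF

# win is the assumed window size (100 kb). The window index is
# floor(midpoint / win), so the 0-based start is index * win and the
# exclusive end is start + win. printf with %d keeps the coordinates as
# plain integers instead of scientific notation.
awk -v win=100000 '{
    start = int($2 / win) * win
    printf "%s\t%d\t%d\t%s\t%s\t%s\n", $1, start, start + win, $3, $4, $5
}' "$workdir/het.txt" > "$workdir/het.bed"
```

The resulting het.bed is tab-separated and can be passed to bedtools intersect as file1.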

then

bedtools intersect -a file1 -b file2 -wa > file1_overlapping_file2.txt

You can add option -v to get the entries in file1 not overlapping regions in file2.
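
If bedtools is not at hand, the same include/exclude split can be approximated in plain awk. This is only a sketch under stated assumptions: the windows have already been converted to bed intervals (0-based start, exclusive end), the region list is small enough to scan for every window, and the file names (windows.bed, regions.bed, included.txt, excluded.txt) are placeholders. It is not a replacement for bedtools on large files.

```shell
workdir=$(mktemp -d)

# Windows already converted to bed intervals (values from the question).
cat > "$workdir/windows.bed" <<'EOF'
NC_018723.3 0 100000 305 39182 0.00778419
NC_018723.3 100000 200000 644 78927 0.00815944
NC_018723.3 200000 300000 28 83487 0.000335382
NC_018723.3 300000 400000 43 84221 0.000510561
NC_018723.3 400000 500000 56 73332 0.00076365
NC_018723.3 500000 600000 52 77842 0.00066802
EOF

# First two regions from the question's bed file.
cat > "$workdir/regions.bed" <<'EOF'
NC_018723.3 203270 441160
NC_018723.3 624960 695520
EOF

# Create both outputs up front so they exist even if one stays empty.
: > "$workdir/included.txt"
: > "$workdir/excluded.txt"

awk -v inc="$workdir/included.txt" -v exc="$workdir/excluded.txt" '
    # First file (regions): store chromosome, start and end of each region.
    NR == FNR { n++; chrom[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; next }
    # Second file (windows): a window overlaps a region when both are on
    # the same chromosome and each one starts before the other ends.
    {
        hit = 0
        for (i = 1; i <= n; i++)
            if ($1 == chrom[i] && ($2 + 0) < e[i] && ($3 + 0) > s[i]) { hit = 1; break }
        print > (hit ? inc : exc)
    }
' "$workdir/regions.bed" "$workdir/windows.bed"
```

With the example data, included.txt receives the three windows starting at 200000, 300000 and 400000, which are the three entries in the question's expected output, and excluded.txt receives the other three.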


Thanks, that is exactly what I needed. One related bedtools question, in case you know the answer: I am getting this error: ***** ERROR: illegal number "1.1e+07". Exiting... It seems that bedtools does not accept numbers in this format. Is there an easy fix that does not involve changing the number format? Thanks!


I think you have no choice but to convert the scientific notation to plain integer numbers.
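
One way to do that conversion is to rewrite the coordinate columns with awk. This is a minimal sketch, assuming a tab-separated file whose columns 2 and 3 hold the start and end coordinates. The input row below is made up for illustration, and sci.bed and plain.bed are placeholder names.

```shell
workdir=$(mktemp -d)

# Made-up example row with coordinates in scientific notation.
printf 'NC_018723.3\t1e+07\t1.1e+07\t12\t80000\t0.00015\n' > "$workdir/sci.bed"

# sprintf("%d", ...) turns values such as 1.1e+07 into plain integers;
# the remaining columns are passed through unchanged.
awk 'BEGIN { FS = OFS = "\t" } {
    $2 = sprintf("%d", $2)
    $3 = sprintf("%d", $3)
    print
}' "$workdir/sci.bed" > "$workdir/plain.bed"
```

If the scientific notation comes from an earlier awk or R step, writing the coordinates as integers at that stage avoids the problem at its source.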
