Question

Remove observations from txt file

3.4 years ago

gubrins ▴ 350

Hi,

I have a simple question that I can't manage to solve. I have a txt file with genome-wide heterozygosity records for one individual, like this:

NC_018723.3 50001   305 39182   0.00778419
NC_018723.3 150000  644 78927   0.00815944
NC_018723.3 250000  28  83487   0.000335382
NC_018723.3 350000  43  84221   0.000510561
NC_018723.3 450000  56  73332   0.00076365
NC_018723.3 550000  52  77842   0.00066802

The first column is the chromosome and the second is the genomic coordinate. I used non-overlapping windows of 100 kb, and the second column holds the midpoint of each window. The third column is the number of SNPs, the fourth is the number of called bases, and the fifth is the ratio SNPs / called bases.

I also have a second file with BED coordinates of regions that I want first to include and later to exclude when recalculating heterozygosity. So what I want is: if a window falls within one of the regions in my BED file, write it to one file, and write all the remaining windows to a second file. How can I do this? It does not have to be done in bash!

The second file, with the BED coordinates, looks like this:

NC_018723.3 203270  441160
NC_018723.3 624960  695520
NC_018723.3 756696  977820
NC_018723.3 1005429 1221086
NC_018723.3 1240095 1705853
NC_018723.3 1747839 1964846
NC_018723.3 1975644 2136144
NC_018723.3 2169657 2651377

and the expected output file would be this:

NC_018723.3 250000  28  83487   0.000335382
NC_018723.3 350000  43  84221   0.000510561
NC_018723.3 450000  56  73332   0.00076365

These three entries fall within the first region of the BED file.

Thanks in advance!

bash • 1.6k views

ADD COMMENT • link updated 3.4 years ago by cpad0112 21k • written 3.4 years ago by gubrins ▴ 350

It would be easier to understand the issue if you posted the expected output and example input files rather than only describing the problem.

ADD REPLY • link 3.4 years ago by cpad0112 21k

Sorry about that. I already posted part of the input file; the second file, with the BED coordinates, looks like this:

NC_018723.3 203270  441160
NC_018723.3 624960  695520
NC_018723.3 756696  977820
NC_018723.3 1005429 1221086
NC_018723.3 1240095 1705853
NC_018723.3 1747839 1964846
NC_018723.3 1975644 2136144
NC_018723.3 2169657 2651377

and the expected output file would be this:

NC_018723.3 250000  28  83487   0.000335382
NC_018723.3 350000  43  84221   0.000510561
NC_018723.3 450000  56  73332   0.00076365

These three entries fall within the first region of the BED file. Does this help?

ADD REPLY • link 3.4 years ago by gubrins ▴ 350

Thank you. I added this information to the original post so that others can follow it.

ADD REPLY • link 3.4 years ago by cpad0112 21k

score 2 · Accepted Answer · 2022-04-12

3.4 years ago

Carlo Yague 9.0k

Sounds like a job for bedtools intersect, but first you need to convert your file into a proper BED file (chromosome, start, end, then your other columns), like this:

NC_018723.3 0 100000   305 39182   0.00778419
NC_018723.3 100000 200000  644 78927   0.00815944
NC_018723.3 200000 300000  28  83487   0.000335382
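
For reference, here is a minimal Python sketch of that conversion. It assumes the 100 kb window size stated in the question, that column 2 is the window midpoint, and BED's 0-based, half-open convention, so the first window becomes 0 to 100000. `midpoint_to_bed` is an illustrative name, not part of any library.

```python
WINDOW = 100_000  # window size stated in the question (assumption: all windows are full-sized)


def midpoint_to_bed(line, window=WINDOW):
    """Turn 'chrom mid snps called het' into a tab-separated BED-style row."""
    chrom, mid, *rest = line.split()
    # Windows do not overlap, so integer division of the midpoint gives the window index.
    index = int(mid) // window
    start = index * window   # BED start is 0-based
    end = start + window     # BED end is exclusive
    return "\t".join([chrom, str(start), str(end), *rest])


rows = [
    "NC_018723.3 50001   305 39182   0.00778419",
    "NC_018723.3 150000  644 78927   0.00815944",
    "NC_018723.3 250000  28  83487   0.000335382",
]
bed_rows = [midpoint_to_bed(r) for r in rows]
print("\n".join(bed_rows))
```

Write the result to a tab-separated file before passing it to bedtools. The last window on a chromosome may be shorter than 100 kb, in which case the computed end would need clamping to the chromosome length.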

then

bedtools intersect -a file1 -b file2 -wa > file1_overlapping_file2.txt

You can add the -v option to get the entries in file1 that do not overlap any region in file2.
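
If bedtools is not available, the same include/exclude split can be sketched in plain Python. This is only an illustration under stated assumptions, not how bedtools is implemented: intervals are treated as 0-based and half-open, each window is reported once however many regions it hits, and `split_by_overlap` is a made-up helper name. The windows are the six rows from the question converted to 100 kb intervals, and the regions are the first two lines of the BED file.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def split_by_overlap(windows, regions):
    """Return (windows overlapping any region, windows overlapping none)."""
    inside, outside = [], []
    for window in windows:
        chrom, start, end = window[:3]
        hit = any(
            chrom == r_chrom and overlaps(start, end, r_start, r_end)
            for r_chrom, r_start, r_end in regions
        )
        (inside if hit else outside).append(window)
    return inside, outside


windows = [
    ("NC_018723.3", 0, 100000, 305, 39182, 0.00778419),
    ("NC_018723.3", 100000, 200000, 644, 78927, 0.00815944),
    ("NC_018723.3", 200000, 300000, 28, 83487, 0.000335382),
    ("NC_018723.3", 300000, 400000, 43, 84221, 0.000510561),
    ("NC_018723.3", 400000, 500000, 56, 73332, 0.00076365),
    ("NC_018723.3", 500000, 600000, 52, 77842, 0.00066802),
]
regions = [
    ("NC_018723.3", 203270, 441160),
    ("NC_018723.3", 624960, 695520),
]

inside, outside = split_by_overlap(windows, regions)
for row in inside:
    print(*row, sep="\t")
```

On this data the overlapping windows are the ones starting at 200000, 300000 and 400000, which correspond to the three rows in the expected output of the question. The nested loop is fine for a few thousand windows; for genome-scale inputs a sorted sweep or bedtools itself would be the better choice.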

ADD COMMENT • link 3.4 years ago by Carlo Yague 9.0k

Thanks, that is exactly what I needed. One related bedtools question, in case you know the answer. I am getting this error: ***** ERROR: illegal number "1.1e+07". Exiting... It seems bedtools does not accept numbers in scientific notation. Is there an easy fix that does not involve changing the number format? Thanks!

ADD REPLY • link 3.4 years ago by gubrins ▴ 350

I think you have no choice but to convert the scientific notation to plain integer numbers.
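
One way to do that conversion, as a sketch rather than anything prescribed in this thread: rewrite only the coordinate columns as plain integers and leave the other columns untouched. The sketch assumes a tab-separated file with start and end in columns 2 and 3; `plain_int` and `fix_coordinates` are hypothetical helper names, and the sample row is invented for illustration.

```python
from decimal import Decimal


def plain_int(text):
    """Convert '1.1e+07' to '11000000'; refuse values that are not whole numbers."""
    value = Decimal(text)
    if value != value.to_integral_value():
        raise ValueError(f"not a whole number: {text!r}")
    return str(int(value))


def fix_coordinates(line, columns=(1, 2)):
    """Rewrite the given 0-based column indices of a tab-separated row as plain integers."""
    fields = line.rstrip("\n").split("\t")
    for i in columns:
        fields[i] = plain_int(fields[i])
    return "\t".join(fields)


# Invented example row: start and end written in scientific notation.
fixed = fix_coordinates("NC_018723.3\t1.1e+07\t1.11e+07\t28\t83487\t0.000335382")
print(fixed)
```

`Decimal` is used instead of `float` so that the digits are parsed exactly. Note that if the tool that wrote the file rounded a coordinate when switching to scientific notation, the original value cannot be recovered this way, and the cleaner fix is to stop that tool from emitting scientific notation in the first place.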

ADD REPLY • link 3.4 years ago by Carlo Yague 9.0k