Extract only rows with main chromosomes (1-22, X, Y) in the first column?
6.3 years ago
star ▴ 350

I have a table like the one below. It is a BED file of genome coordinates, and I would like to keep only the rows with numbers.

Input:

1           141009669   141009952
9           141016322   141016973
GL000195.1  81719   82468
GL000195.1  142613  142923
GL000220.1  119445  119746
HG115_PATCH 101957832   101958132
HG1308_PATCH 130205069  130205369
HG1308_PATCH 130205406  130205773
HG748_PATCH  77577953   77578264
X            200983 202660
y         205180    205702

Output:

1           141009669   141009952
9           141016322   141016973
X            200983 202660
y            205180 205702

Thanks in advance!

linux grep R script • 7.4k views

x and y are not numbers but I get what you are asking for. You only want to keep main chromosomes?


yes, exactly, I want just main chromosomes.


Tell us what you have tried so far, if you are interested in fixing your own attempt.

6.3 years ago

Command line (-i so that the lowercase "y" in the example is kept as well):

grep -Ei "^[0-9XY]" file

EDIT: originally, this used a regex anchored with a dollar sign (^[0-9XY]$), as in the R-based example. That cannot work here because each line of the text file continues after the chromosome name (with the coordinates), so the dollar sign never matches at that position. Thanks to Alex for pointing this out.
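To illustrate the point of the edit, here is a minimal sketch on a few made-up lines (the sample data and the variable names are assumptions, not part of the original post). It shows that the dollar-anchored pattern matches nothing once coordinates follow the chromosome name, whereas requiring whitespace after the first field does. Note that the stricter pattern accepts any one- or two-digit name, not only 1-22.

```shell
# Hypothetical sample: chromosome name, then start and end, tab-separated.
sample=$(printf '1\t100\t200\n11\t100\t200\nGL000195.1\t100\t200\ny\t100\t200\n')

# '$' must match right after the single character, so no multi-column line passes.
# grep exits non-zero when nothing matches, hence the '|| true'.
anchored=$(printf '%s\n' "$sample" | grep -Ec '^[0-9XY]$' || true)

# Require one or two digits, or X/Y, followed by whitespace; -i also accepts x/y.
kept=$(printf '%s\n' "$sample" \
  | grep -Ei '^([0-9]{1,2}|[XY])[[:space:]]' \
  | cut -f1 | paste -sd ' ' -)

echo "lines matched by the anchored pattern: $anchored"
echo "names kept by the stricter pattern: $kept"
```

Anchoring on the whitespace after the first field also avoids keeping names that merely start with a digit, X or Y.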

R (because this post is tagged with it):

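# assuming your data frame with the coordinates looks like this
df <- data.frame(chr = c("1", "2", "X", "Y", "GL000220.1"),
                 start = c(1, 20, 30, 40, 50),
                 end = c(11, 21, 31, 41, 51)
)

# "^[0-9XY]$" alone would only match single-character names (it misses 10-22),
# so allow one or two digits; ignore.case = TRUE also keeps "x" and "y"
subset(df, grepl("^([0-9]{1,2}|[XY])$", chr, ignore.case = TRUE))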

I don't think this works? For example:

$ echo -e '1\n11\n22\n2' | grep -E "^[0-9XY]$"
1
2

Perhaps you might want the following:

$ echo -e '1\n11\n22\nP\nXY\n2\nZ\nX' | grep -E "^[0-9]{1,2}$|^[XY]$"
1
11
22
2
X

Yes, you're right: the dollar sign in the command I originally posted was wrong (it only makes sense for the R-based command).

6.3 years ago
$ sed -n '/^[0-9XY]/Ip' test.txt

or

$ sed -n '/^[^A-WZ]/Ip' test.txt

1   141009669   141009952
9   141016322   141016973
X   200983  202660
y   205180  205702

with tsv-utils:

$ tsv-filter --iregex '1:^[^A-WZ]' test.txt

1   141009669   141009952
9   141016322   141016973
X   200983  202660
y   205180  205702
6.3 years ago
ewre ▴ 250

A straightforward way:

mainChr = c(as.character(1:22), 'x', 'X', 'y', 'Y')
data = read.delim('your.bed', stringsAsFactors = FALSE, header = FALSE)
data_with_mainChr = data[data$V1 %in% mainChr, ]

If you want to use readr and dplyr, which are more efficient when dealing with big files:

library(dplyr); library(readr)
mainChr = c(as.character(1:22), 'x', 'X', 'y', 'Y')
data = read_tsv('your_bed_file', col_names = FALSE)
data_with_mainChr = dplyr::filter(data, X1 %in% mainChr)
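For comparison, the same whitelist idea can be sketched on the command line with awk. This is an illustrative equivalent rather than part of the answer above; the sample lines and variable names are made up.

```shell
# Hypothetical sample lines (whitespace-separated, as in the question).
sample=$(printf '1 141009669 141009952\n22 10 20\n23 10 20\nGL000195.1 81719 82468\ny 205180 205702\n')

# Build the whitelist 1..22, X, Y once, then keep rows whose first field
# (upper-cased, so x/y also pass) is in it.
filtered=$(printf '%s\n' "$sample" | awk '
  BEGIN { for (i = 1; i <= 22; i++) keep[i]; keep["X"]; keep["Y"] }
  toupper($1) in keep
')

printf '%s\n' "$filtered"
```

Unlike a regex on the start of the line, an exact-match whitelist rejects names such as 23 or anything that merely begins with a digit.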
6.3 years ago

You can do the opposite. List the distinct values in the first column with

awk '{print $1}' file | sort -u

to find the names you want to exclude, and then exclude those rows with grep:

grep -v -e 'pattern1' -e 'pattern2' file
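As a concrete sketch of these two steps, assuming a few made-up lines in place of the real file (the sample data, the variable names and the two prefixes are illustrative choices, not taken from the answer):

```shell
# Hypothetical sample lines standing in for the BED file.
sample=$(printf '1 10 20\nGL000195.1 81719 82468\nHG115_PATCH 101957832 101958132\nX 200983 202660\n')

# Step 1: list the distinct names in the first column.
names=$(printf '%s\n' "$sample" | awk '{print $1}' | sort -u)
printf '%s\n' "$names"

# Step 2: drop the unwanted ones; these two prefixes were picked from the listing.
remaining=$(printf '%s\n' "$sample" | grep -v -e '^GL' -e '^HG')
printf '%s\n' "$remaining"
```

Anchoring each pattern with ^ keeps it from accidentally matching inside the coordinate columns.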
6.3 years ago

You can almost do this in one line with R and GenomicFeatures + rtracklayer

library(GenomicFeatures)
library(rtracklayer)

## read your table as a random text file
keepStandardChromosomes(GRanges(read.table('your_table.bed',col.names=c('chr','start','stop'))),pruning.mode='coarse')

## read your table in as a bed - if it really is a bona fide .bed then you can simplify a bit
keepStandardChromosomes(import.bed('your_table.bed'),pruning.mode='coarse')

## do the same thing as above but simultaneously save it as a new bed file named "your_table_subset.bed"
export.bed(keepStandardChromosomes(import.bed('your_table.bed'),pruning.mode='coarse'),file='your_table_subset.bed')