Fetch particular columns from CSV files
5
0
Entering edit mode
8.2 years ago

I have a file of 10,000 rows containing values like those below. I need to fetch only the columns that consist of an rs ID and a value (0-9). The problem is that the rs entry is not always in the first column; its position differs from row to row. How can I solve this? Please help me out.

rs1 0.5(0.3-1.5) 0.08
rs2 1.2(1.5-5.0) 0.05
snp1 rs6 12 54 1.8(0.8-8.1) 0.02
ght gt rs7 1.2(1.0-2.8) 0.04

Thank you, Anitha
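[Editor's note: not part of the original thread.] A minimal Python sketch of one way to do this, assuming whitespace-delimited fields and that "value(0-9)" means entries like 0.5(0.3-1.5); the sample rows are copied from the question:

```python
import re

# Field patterns (assumptions based on the sample rows):
# an rs ID is "rs" followed by digits; a value looks like 0.5(0.3-1.5)
rs_re = re.compile(r"^rs\d+$")
val_re = re.compile(r"^\d+(?:\.\d+)?\(\d+(?:\.\d+)?-\d+(?:\.\d+)?\)$")

rows = [
    "rs1 0.5(0.3-1.5) 0.08",
    "rs2 1.2(1.5-5.0) 0.05",
    "snp1 rs6 12 54 1.8(0.8-8.1) 0.02",
    "ght gt rs7 1.2(1.0-2.8) 0.04",
]

# Test every field of every row, so the column position does not matter
for row in rows:
    wanted = [f for f in row.split() if rs_re.match(f) or val_re.match(f)]
    print("\t".join(wanted))
```

This prints only the rs ID and the value(range) field from each row, e.g. `rs6	1.8(0.8-8.1)` for the third line; plain numbers such as `12`, `54`, and `0.02` are skipped because they match neither pattern.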

perl python • 2.8k views
ADD COMMENT
1
Entering edit mode

what have you tried ?

ADD REPLY
0
Entering edit mode

Thank you for your patience.

I tried this regex to match a column that has value(value): "\d.\d+[([]\d.\d+\–\d.\d+[)]]". This query fetches whole rows, but I need only the columns that match my query.

ADD REPLY
0
Entering edit mode

You should check out grep's options. -o is what you need here. Also, please double check your regex.

ADD REPLY
1
Entering edit mode

To me it's not clear what you want to obtain; could you post the desired result, based on your small example?

Also, as Pierre asks, it is good practice to show what you have tried and the effort you put in when asking questions on online fora.

ADD REPLY
0
Entering edit mode

You say

rs is not always in first column

and the number of 'columns' in a row may not be the same. Then it is not columnar data, and the format is not optimal. Try to change that if possible.

If the data can only be like that, check out regular expressions and use them.

ADD REPLY
3
Entering edit mode
8.2 years ago

Suppose your data is stored in a file called temp.txt. Use grep to fetch the rs entries:

grep -oP 'rs\d+' temp.txt

where

-P: use a Perl-compatible regex, in which \d matches a digit.
-o: print only the matched string.

Hope this solves your problem.

Thanks, Persistent LABS, SanGeniX

ADD COMMENT
0
Entering edit mode

I'm removing the link to your company. Please do not use posts to advertise your service. If you would like to do that, I'd recommend creating a new thread instead, and the mods and admins can decide if it's of value to the site. Thank you.

ADD REPLY
1
Entering edit mode

Thank you Ram :)

ADD REPLY
0
Entering edit mode

This answer does not in fact ensure that the values matched are the complete contents of a "column"; however, I expect that it _probably_ achieves the intent of the original poster, so, +1 from me.

ADD REPLY
1
Entering edit mode

Word/delimiter boundaries. That's what this needs.
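[Editor's note: not part of the original thread.] A sketch of what adding word boundaries could look like, assuming GNU grep; temp.txt stands in for the real file and is built from two of the question's sample rows:

```shell
# Create a small test file mirroring the question's data
printf '%s\n' 'rs1 0.5(0.3-1.5) 0.08' 'snp1 rs6 12 54 1.8(0.8-8.1) 0.02' > temp.txt

# -w anchors the match at word boundaries, so text like "xrs1" or "rs1abc"
# would not match; -o prints only the matched string
grep -owE 'rs[0-9]+' temp.txt
```

This prints `rs1` and `rs6`, one per line.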

ADD REPLY
1
Entering edit mode

Agreed. It is not a perfect solution, but it will do what they expect :) Thanks

ADD REPLY
2
Entering edit mode
8.2 years ago
Ram 44k

+1 to Pierre and Wouter. I'd recommend grep coupled with word boundaries or separator-based boundaries in your case. You have delimited data that is not necessarily structured, and the one thing you have going for you is the consistent format of dbSNP rs numbers. I'd recommend cleaning the dataset if you wish to get anything more out of it.

ADD COMMENT
1
Entering edit mode
8.2 years ago
Eric T. ★ 2.8k

If the values of interest are not all in the same column, you don't need a tool for extracting individual columns (e.g. the Unix command cut). Instead, you need to search every line for matching strings. You can use the Unix commands grep and sed for this, and/or awk if you're familiar with it.
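[Editor's note: not part of the original thread.] For instance, an awk sketch that checks every field of every line rather than a fixed column; the rs pattern and the sample rows in temp.txt are assumptions taken from the question:

```shell
# Sample data in place of the real file
printf '%s\n' 'rs2 1.2(1.5-5.0) 0.05' 'ght gt rs7 1.2(1.0-2.8) 0.04' > temp.txt

# Loop over every whitespace-separated field and print those that are rs IDs
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^rs[0-9]+$/) print $i }' temp.txt
```

This prints `rs2` and `rs7` even though the rs ID sits in column 1 of the first row and column 3 of the second.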

ADD COMMENT
0
Entering edit mode
8.2 years ago
natasha.sernova ★ 4.0k

Try this link:

csv-columns

or this one if you use Python:

the_same_with_Python

ADD COMMENT
0
Entering edit mode

awk might work in this case, but I never recommend it to people for parsing CSV: it is insensitive to quoted separators, so you can end up with an inconsistent number of fields across rows. The csv modules in Python/R are the best options for CSV parsing in general.
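[Editor's note: not part of the original thread.] A minimal sketch of that approach with Python's csv module; the delimiter and sample rows are assumptions, since the question's data looks space-delimited rather than comma-delimited:

```python
import csv
import io
import re

# Sample mirroring the question's space-delimited rows
data = "rs1 0.5(0.3-1.5) 0.08\nsnp1 rs6 12 54 1.8(0.8-8.1) 0.02\n"

rs_re = re.compile(r"^rs\d+$")
rs_ids = []
# csv.reader handles quoted separators correctly, unlike a naive split
for row in csv.reader(io.StringIO(data), delimiter=" "):
    rs_ids.extend(field for field in row if rs_re.match(field))

print(rs_ids)  # ['rs1', 'rs6']
```

For a real comma-delimited file you would open the file and drop the `delimiter=" "` argument, since comma is the default.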

ADD REPLY
0
Entering edit mode
8.0 years ago

Thank you for your answers

ADD COMMENT
