Question

ChipSeq data analysis

0

10.0 years ago

mfahim ▴ 10

I have downloaded ChIP-seq data from NCBI datasets. Here is the format:

chr1    3       F
chr1    62      R
chr1    131     F
chr1    582     F
chr1    586     R
chr1    1604    F
chr1    2252    F
chr1    2374    F
chr1    2728    F
chr1    2965    R
chr1    2965    R
chr1    3649    F
chr1    3649    F
chr1    3746    R
chr1    3918    F
chr1    3918    F
chr1    3918    F

How do I convert it and read it?

What is this format?

I am using seqmonk and have access to IGV.

Thanks for your time

Chip-seq • 2.6k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 10.0 years ago by mfahim ▴ 10

0

Hmm.. what a weird format :/

mfahim, could you try something out for me? Could you try running the following from the command line:

awk '$3 == "R"' ./the/name/of/the.file | wc -l
awk '$3 == "F"' ./the/name/of/the.file | wc -l

If they come to the same number, that suggests that the Fs and Rs are likely to represent the start and end of ... something.

If they don't, they probably mean forward and reverse strand, and each row is 'independent', like SNP data or something.
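
A one-pass variant of the same check, as a rough sketch: it tallies every distinct value in the third column, so it would also show anything other than F and R. The function name `count_flags` is made up for illustration, and it assumes the file is whitespace-separated with the flag in column 3.

```shell
# Hypothetical helper: count how often each value appears in column 3.
# Assumes whitespace-separated columns with the F/R flag in the third one.
count_flags() {
    awk '{ n[$3]++ } END { for (k in n) print k "\t" n[k] }' "$@" | sort
}
```

Call it as `count_flags ./the/name/of/the.file`. It prints one line per distinct value, followed by its count.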

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 10.0 years ago by John 13k

1

Ah right, I didn't think about F/R as being paired. That just helps make your point that guessing is almost always a bad move!

ADD REPLY • link updated 2.3 years ago by Ram 45k • written 10.0 years ago by Ryan Dale 5.0k

Ram · Answer 1 · 2015-05-20

3

10.0 years ago

Ryan Dale 5.0k

This is not a standard format (such as BED, BAM, or FASTQ). Data repositories typically don't enforce the use of a standard format, and this sort of problem happens all the time.

In this case, the meaning of each field depends entirely on the computational methods used in the study. You'll have to read the paper's methods carefully to figure out what the data format is. Often the paper doesn't give enough information, so you may have to contact the authors.

Failing that, you have to guess. Chromosome is obvious, but what is the number? If there are millions of lines in the file, that would suggest that they represent reads somehow. Maybe the number is the 5' end of a read. Maybe F/R is forward/reverse? Assuming all those things are true, maybe you could supply some fake seqs and quality scores and CIGAR strings and make sort of a BAM file that you could use.
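
If all you want is a quick look in IGV, a lighter alternative to faking a BAM is to write the positions out as BED. The sketch below is only that: it assumes, without any confirmation from the authors, that the number is a 1-based 5' read end and that F/R mean forward/reverse strand. The function name `fr_to_bed` is made up, and each row becomes a 1-bp BED6 record because the read length is unknown.

```shell
# Hypothetical helper: "chrom pos F|R" lines -> 1-bp BED6 records.
# ASSUMPTIONS (unverified guesses, not the authors' documented format):
#   - column 2 is a 1-based coordinate of the read's 5' end
#   - column 3 is the strand: F = forward (+), R = reverse (-)
# BED is 0-based and half-open, so position p becomes start p-1, end p.
# Rows whose third column is neither F nor R are silently skipped.
fr_to_bed() {
    awk 'BEGIN { OFS = "\t" }
         $3 == "F" || $3 == "R" {
             strand = ($3 == "F") ? "+" : "-"
             print $1, $2 - 1, $2, ".", 0, strand
         }' "$@"
}
```

Something like `fr_to_bed ./the/name/of/the.file > guessed.bed` would give a file IGV can load. Treat anything you see in it as exploratory until the format is confirmed.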

edit:

Another trick for getting an author response is to submit an issue with the data repository. For example, if you contact GEO and indicate the issue, they can independently contact the author to try and get more information.

You might get lucky and be able to find the raw data. A lot of the time I end up re-running the entire analysis from scratch. Annoying, but at least then I know how the results were generated.

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 10.0 years ago by Ryan Dale 5.0k

0

Ditto what Ryan said. I would especially caution against guessing what the format means. Read the paper, find the description. If it is still not clear, contact the authors. If they do not respond, contact the authors again until they do.

And I'm speaking from experience here. Every time I made a guess I was proven wrong, even on the most obvious things.

ADD REPLY • link updated 2.3 years ago by Ram 45k • written 10.0 years ago by Saulius Lukauskas ▴ 540

0

Totally agree. If guessing is the only option, and you're doing more than just exploratory work, it's better to just pretend the data don't exist.

ADD REPLY • link updated 2.3 years ago by Ram 45k • written 10.0 years ago by Ryan Dale 5.0k

0

I totally agree with both of you, and admit my advice of guessing was bad - assumptions lead to mistakes!

Still, when you come across a weird thing on the road, you can't help but poke it :)

ADD REPLY • link updated 2.3 years ago by Ram 45k • written 10.0 years ago by John 13k

0

Good morning from South Korea, and thank you everyone for your time and input.

I am sorry I failed to mention that this is an alignment file, according to the authors.

The validity of the data is obvious. Here is the link:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE33120

As you scroll down, you will find mapped reads with a txt.gz extension. I pulled them out and it boggled my mind, and I guess yours as well.

Ryan, John and Saulius, thank you.

F

ADD REPLY • link updated 2.3 years ago by Ram 45k • written 10.0 years ago by mfahim ▴ 10