How to merge chr+strand+coordinate+score from multiple files?

0

Entering edit mode

7.0 years ago

Rae • 0

I have many files (~50) with the same format of four columns: chr, strand, coordinate, score

Each file has about 1 Million lines. Now I want to merge these files by coordinates (strand specific) while preserve the score value of each sample. So the final file will be: chr, strand, coord, scoreTotal, score1, score2, ...., scoreN

For example:

File001:
chr strand  coord   score
chr1    +   100 1
chr1    -   200 2

File002:            
chr1    +   100 3
chr1    +   101 4
chr1    -   200 5
chr1    -   201 6

Merged_file001_file001:

chr strand  coord   scoreSum    score1  score2
chr1    +   100 4   1   3
chr1    +   101 4   0   4
chr1    -   200 7   2   5
chr1    -   201 6   0   6

I'm thinking of using R package like dplyr to group by coordinates, but for so many files with so many lines, it may not be efficient. I'm also thinking of converting each file to BED format, and then use tool like multiIntersectBed to merge files, but it seems that it can not preserve the score column.

Is there any tool can do this merging quickly?

Thank you very much!

genome RNA-Seq • 3.8k views

ADD COMMENT • link updated 7.0 years ago by finswimmer 16k • written 7.0 years ago by Rae • 0

0

Entering edit mode

You can use "bedtools groupby" command. Show some example files, will behelpfulr to get precise answer.

ADD REPLY • link 7.0 years ago by Chirag Nepal ★ 2.4k

0

Entering edit mode

Thank you, but I checked the "bedtools groupby" command, it seems that it is not suitable for my case.

ADD REPLY • link 7.0 years ago by Rae • 0

0

Entering edit mode

Hello,

are the coordinates in every file the same or do they differ?

fin swimmer

ADD REPLY • link 7.0 years ago by finswimmer 16k

0

Entering edit mode

I have added example tables. Coordinates in every file are not exactly the same.

ADD REPLY • link 7.0 years ago by Rae • 0

0

Entering edit mode

I'm thinking of using R package like dplyr to group by coordinates, but for so many files with so many lines, it may not be efficient.

50 million records isn't that big a data set. How many zeros are you expecting once you spread() you data into a matrix?

ADD REPLY • link 7.0 years ago by d-cameron ★ 2.9k

0

Entering edit mode

There will not be many zeroes. Because coordinates in different files are largely the same with some difference.

ADD REPLY • link 7.0 years ago by Rae • 0

1

Entering edit mode

7.0 years ago

finswimmer 16k

Hello,

my experiences with the pandas modul for python is limited. So I didn't know if it can handle your amount of data. But you should give it a try.

Run the following script like this:

$ python join.py File* > result.csv

	import glob
	import sys
	import pandas


	def read_filenames(names):
	for arg in names:
	if "*" in arg:
	for file in glob.glob(arg):
	yield file
	else:
	yield arg


	data = []

	for sample in read_filenames(sys.argv[1:]):
	frame = pandas.read_csv(
	sample,
	sep="\t",
	header=None,
	names=["chr", "strand", "coord", sample],
	index_col=[0, 1, 2]
	)
	data.append(frame)

	data_joined = data[0].join(data[1:], how='outer')

	sum = data_joined[data_joined.columns].sum(axis=1)
	data_joined.insert(0, "scoreSum", sum)

	print(data_joined.to_csv(
	sep="\t",
	na_rep=0,
	float_format='%.0f'
	))

view raw join.py hosted with ❤ by GitHub

EDIT:

Based on the suggestion by cpad0112 I changed the way how to specify the files. One can use the wildcard format and/or just list all files that should be used.

I also adress the info by the OP that there are no headers in the file.

fin swimmer

ADD COMMENT • link 7.0 years ago by finswimmer 16k

1

Entering edit mode

Code works absolutely fine. Use of ls can be totally avoided as one can list files (with pattern..for eg files starting with File*) with glob (IMO). Some thing like files=glob.glob ("File*.). Rest of the code works as expected. User simply has to run the code: python join.py > result.csv

ADD REPLY • link 7.0 years ago by cpad0112 21k

0

Entering edit mode

You always find something to improve my code ;)

Thanks, I edited it.

fin swimmer

ADD REPLY • link 7.0 years ago by finswimmer 16k

0

Entering edit mode

I am learning from your code. I am not good at python @finswimmer

ADD REPLY • link 7.0 years ago by cpad0112 21k

0

Entering edit mode

Thank you! I'm not familiar with python, I will have a try!

ADD REPLY • link 7.0 years ago by Rae • 0

0

Entering edit mode

Hello Rae,

than it's a good point to get familiar with python ;) The easiest way of installing pandas modul after you have install python is to use pip.

In another comment you wrote that your files doesn't contain a header. If so you have to change my code to work. Change header=0 in header=None.

fin swimmer

ADD REPLY • link 7.0 years ago by finswimmer 16k

0

Entering edit mode

7.0 years ago

Alex Reynolds 36k

Make forward-stranded BED for each of N (~50) files:

$ awk -vOFS="\t" '($2 == "+"){ print $1, $3-1, $3, ".", $4 }' file001.txt | sort-bed - > file001.for.bed
...

Make reverse-stranded BED for each of N files:

$ awk -vOFS="\t" '($2 == "-"){ print $1, $3-1, $3, ".", $4 }' file001.txt | sort-bed - > file001.rev.bed
...

Map forward-stranded elements:

$ bedmap --echo --sum --echo-map-score <(bedops -i file*.for.bed) <(bedops -u file*.for.bed) > answer.for.bed

Map reverse-stranded elements:

$ bedmap --echo --sum --echo-map-score <(bedops -i file*.rev.bed) <(bedops -u file*.rev.bed) > answer.rev.bed

The result of --echo-map-score will be delimited by semi-colons. Add --multidelim "\t" to this option to replace semi-colons with tabs.

Use awk to add strand back to answer.*.bed, if needed.

ADD COMMENT • link 7.0 years ago by Alex Reynolds 36k

0

Entering edit mode

Thank you! But I tested the above two example tables, but no result was returned.

I put data in file01.txt and file02.txt, and then run your commands:

To generate rev and forward strand for each file:

for file in *.txt;
do
  echo $file
  awk -vOFS="\t" '($2 == "+"){ print $1, $3-1, $3, ".", $4 }' "$file" | sort-bed - > "$file".for.bed
done

for file in *.txt;
do
  echo $file
  awk -vOFS="\t" '($2 == "-"){ print $1, $3-1, $3, ".", $4 }' "$file" | sort-bed - > "$file".rev.bed
done

Then run bedopt (here I also added the --exact option to see difference):

bedmap --echo --sum --echo-map-score --multidelim "\t"  <(bedops -i file0*.txt.for.bed) <(bedops -u file0*.txt.for.bed) > file.for.bed
bedmap --echo --sum --echo-map-score --multidelim "\t" --exact <(bedops -i file0*.txt.rev.bed) <(bedops -u file0*.txt.rev.bed) > file.rev.bed

These two commands can return the union or intersect bed regions:

bedops -u file0*.txt.for.bed
bedops -i file0*.txt.for.bed

But the output file.for.bed and file.rev.bed are both empty. I don't know why?

ADD REPLY • link 7.0 years ago by Rae • 0

0

Entering edit mode

Can you remove the column headers from your input files? I'd be curious to see if that fixes the issue for you.

ADD REPLY • link 7.0 years ago by Alex Reynolds 36k

0

Entering edit mode

There is no column header in my input files.

ADD REPLY • link 7.0 years ago by Rae • 0

0

Entering edit mode

Sorry, I'm just looking at what you posted. Feel free to ignore.

ADD REPLY • link 7.0 years ago by Alex Reynolds 36k

0

Entering edit mode

7.0 years ago

cpad0112 21k

with individual scores in a single column:

$ cat test1.txt <(tail -n+2 test2.txt) | datamash  -sH -g 1,2,3  sum 4 collapse 4
GroupBy(chr)    GroupBy(strand) GroupBy(coord)  sum(score)  collapse(score)
chr1    +   100 4   1,3
chr1    +   101 4   4
chr1    -   200 7   2,5
chr1    -   201 6   6

input:

$ cat test1.txt test2.txt 
chr strand  coord   score
chr1    +   100 1
chr1    -   200 2
chr strand  coord   score
chr1    +   100 3
chr1    +   101 4
chr1    -   200 5
chr1    -   201 6

ADD COMMENT • link 7.0 years ago by cpad0112 21k

Login before adding your answer.