Hello Everyone,
I am running BSMAP's methratio.py on hadoop using hadoop streaming, but I get an error when I run the command below.
Hadoop command:
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar -libjars /usr/local/hadoop-bam/hadoop-bam-7.0.0-jar-with-dependencies.jar -D org.apache.hadoop.mapreduce.lib.input.FileInputFormat=org.seqdoop.hadoop_bam.BAMInputFormat -file './mad.cmd' -file '../fadata/test.fa' -mapper './mad.cmd' -input ./wgEncodeSydhRnaSeqK562Ifna6hPolyaAln.bam -output ./outfile
The mad.cmd script contains:
python methratio.py --ref=../fadata/test.fa -r -g --out=bsmap_out_sample1.txt ./wgEncodeSydhRnaSeqK562Ifna6hPolyaAln.bam
The error I get:
15/01/22 15:52:17 INFO mapreduce.Job: Job job_1418762215449_0033 running in uber mode : false
15/01/22 15:52:17 INFO mapreduce.Job: map 0% reduce 0%
15/01/22 15:52:23 INFO mapreduce.Job: Task Id : attempt_1418762215449_0033_m_000016_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
15/01/22 15:52:23 INFO mapreduce.Job: Task Id : attempt_1418762215449_0033_m_000011_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
(the same stack trace as above repeats for this and the other failed task attempts)
Can someone tell me what I am doing wrong here?
Hi Devon,
Can you please tell me what my options are to achieve this? I am trying to execute methratio.py on a 32-node hadoop cluster. Is it possible to wrap it with hadoop, or are there any alternatives? Please advise me on this.
The simple options are:
(1) Skip hadoop entirely and just run methratio.py on each sample's BAM file, scheduling the jobs across your nodes.
(2) Split each BAM file into pieces, run methratio.py on each piece, and then merge the per-piece results.
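A minimal sketch of option (1), assuming passwordless ssh to hypothetical nodes node1..node4 and BAMs sitting on shared storage (the node names, paths, and file names are all placeholders):
# launch one methratio.py run per node in the background; no hadoop involved
i=0
for bam in sample1.bam sample2.bam sample3.bam sample4.bam; do
    i=$((i+1))
    ssh "node$i" "cd /shared/project && python methratio.py --ref=test.fa -r -g --out=${bam%.bam}_methratio.txt $bam" &
done
wait   # block until every remote job has finished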
BTW, I recall that methratio.py is somewhat limited in what it can do. If this is actually important data, then use something that can deal with methylation bias, like PileOMeth (before you ask: no, we never wrote that to interact with hadoop) or Bis-SNP (it doesn't use hadoop either, and it's weaker at handling methylation bias).
So there is no way to use Hadoop for this? When you say split the BAM file in (2), did you mean load it into HDFS?
Is there any other way to do this with MapReduce?
Well, there are ways to do this with MapReduce, but it's more trouble than it's worth, since you'll have to write the code to make it happen yourself.
Loading something into hdfs just means "copy it to the file system using hdfs"; there's nothing special there. Splitting a BAM file literally means splitting it into multiple files. Of course, in the time it takes to do the splitting in (2), method (1) would have mostly completed.
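For example, copying your BAM into HDFS is just (the target directory is a placeholder):
hdfs dfs -mkdir -p /user/yourname/bams                                        # create a directory in HDFS
hdfs dfs -put wgEncodeSydhRnaSeqK562Ifna6hPolyaAln.bam /user/yourname/bams/  # copy the local BAM into it
hdfs dfs -ls /user/yourname/bams                                              # confirm it's there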
Out of curiosity, why do you want to use hadoop for this? Your cluster will be perfectly happy without it, and you're unlikely to get the results any faster on a local cluster.
I'm using hadoop for learning purposes, to see how biology and computer science can come together in this project.
So you are saying to just load the .bam file into HDFS and run methratio.py separately on each machine with some scheduler like Oozie (from the Hadoop ecosystem)? Am I correct?
If I do so, then I also need to reduce the output it produces, am I correct?
If you split the BAM file first then yes, you'll need to reduce the split output to produce a consolidated output. If you run individual samples/files on different (or even the same) node then there's nothing to really reduce, given that the target is per-file output. Granted, you then have to process those output files, but you probably want to have a look at them first before proceeding.
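As a rough sketch of that reduce step, assuming the pieces were split by region (so positions don't overlap between them) and that each per-piece methratio.py output is tab-delimited text with a single header line (the chunk file names are placeholders):
head -n 1 chunk_1.txt > merged_methratio.txt     # keep one copy of the header
for f in chunk_*.txt; do
    tail -n +2 "$f" >> merged_methratio.txt     # append each chunk minus its header
done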
It depends on how they're split. You can't just arbitrarily chop up a BAM file and have it work.
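One split that does work is by reference sequence, since every read then ends up in a valid BAM with its header intact. A sketch, assuming samtools is installed and in.bam (a placeholder name) is coordinate-sorted:
samtools index in.bam                                          # region queries need an index
for chrom in $(samtools idxstats in.bam | cut -f1 | grep -v '^\*$'); do
    samtools view -b in.bam "$chrom" > "in.$chrom.bam"         # one BAM per chromosome
done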
Presumably Hadoop-Bam provides an API for that. Just so you know, anything you do like this with hadoop is going to involve at least some amount of programming on your part.
You're going to have to figure this out for yourself; I'm not familiar with the inner workings of hadoop-bam.
Currently we are executing it without hadoop on a single node, but we are trying to use a hadoop cluster to save processing time and also for learning purposes.