Question

Why You Need Perl/Python If You Know R/Shell [NGS Data Analysis]

8

12.7 years ago

Sukhi Singh 11k

Hello everyone,

Today I spent 4 hours on a Python script that executes a shell command built from user inputs, and the result was still not what I wanted. I achieved the same thing in R in under 30 minutes. I am under the impression that knowing one of the scripting languages is very important. I use R, the shell, and terminal utilities for most of my tasks.
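For context, the task described above usually looks something like the following in Python. This is only a minimal sketch: the helper names `build_command` and `run` are made up for illustration, and the actual script discussed here was not posted.

```python
import shlex
import subprocess


def build_command(tool, options, path):
    """Assemble an argument list from user-supplied pieces (hypothetical helper)."""
    # shlex.split keeps quoted option values together, e.g. -c 'some text'
    return [tool, *shlex.split(options), path]


def run(tool, options, path):
    """Run the command without invoking a shell and return its stdout as text."""
    cmd = build_command(tool, options, path)
    # check=True raises CalledProcessError if the command exits with a non-zero status
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing a list to `subprocess.run` instead of a single string avoids most quoting problems with user input, which is often where the hours go.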

So my argument is that I have not yet come across a task in exploratory next-generation sequencing data analysis that specifically requires a scripting language like Perl, Python, or Ruby.

Thus, if one is not writing new tools from scratch, how important is it to learn one of these scripting languages? I started with Perl, was then persuaded by Python-using co-workers that Python is better and more extensive, and am now lost in both. But I am confident in R.

Could someone also elaborate on the kinds of work where you cannot get anywhere without one of these scripting languages? One can also argue about the audience (Perl, Python, and R users at the consumer end) if you are making software, but what if the results and tasks are only for oneself? Or is it just a matter of taste?

Thanks for your input

python next-gen • 28k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 12.7 years ago by Sukhi Singh 11k

34

12.7 years ago

Zev.Kronenberg 12k

From your profile picture I take it you like beer? Do you ever drink wine? My choice of intoxicant has to do with what I am trying to accomplish (getting drunk or enjoying a drink; not mutually exclusive).

Do you see where I am going with this? There are programming tasks that "I" would never use R for. Same goes for Perl. Both languages can technically do almost anything, but some tasks are easier in one language than another. Bottom line, it is nice having options.

Have you pattern matched a lot in R? It's not pretty.

Have you tried plotting in Perl? What a pain.

When I started in bioinformatics I only used R; however, I grew into Perl out of necessity.

Keep plugging away at scripting!!!

ADD COMMENT • link 12.7 years ago by Zev.Kronenberg 12k

2

neat answer - :)

ADD REPLY • link 12.7 years ago by Sukhi Singh 11k

1

Coffee/tea would have been a better comparison, since his avatar shows a coffee bean... But anyway, good answer :)

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by madkitty ▴ 690

1

That is a good point, but for reference, the avatar used to include a beer. I guess that is why this particular answer might seem a little confusing now.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.2 years ago by SES 8.6k

4

It is funny reading old posts.

ADD REPLY • link 10.2 years ago by Zev.Kronenberg 12k

16

12.7 years ago

lh3 33k

While R is the best for statistics, it is not well designed as a programming language, and it has the slowest for() loop and pattern matching, nearly 10X slower than Perl/Python (I have benchmarked this myself). Though in principle you can write R code that processes a data stream, R together with many of its packages encourages loading all data into memory, which does not fit the NGS world. Perl/Python are easier to write and faster to run. They are better suited to processing large-scale data. Note that I am known on Biostar as an R hater, so take my words with a grain of salt.
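To illustrate the streaming point, here is a minimal Python sketch that walks through a FASTQ stream one record at a time instead of loading the whole file. It assumes strict 4-line records with no blank lines; the function names `read_fastq` and `mean_read_length` are illustrative, not from any library.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) one record at a time from a 4-line FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def mean_read_length(handle):
    """Compute the mean read length while holding only one record in memory."""
    total = n = 0
    for _name, seq, _qual in read_fastq(handle):
        total += len(seq)
        n += 1
    return total / n if n else 0.0


# Tiny in-memory example; a real run would pass an open file or sys.stdin instead.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
print(mean_read_length(demo))  # prints 5.0
```

Memory use stays flat regardless of file size, because only the running totals and the current record are kept.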

As for shell, I used to write a mixture of shell and awk. It was a mistake you should avoid. For most tasks beyond a few lines of shell/awk commands, Perl/Python scripts are easier, cleaner, and more maintainable for long-term use. Shell+Perl/Python+R is pretty much enough for most NGS analysts.

ADD COMMENT • link 12.7 years ago by lh3 33k

0

You mean to tell us that you aren't porting samtools to R? JK.

ADD REPLY • link 12.7 years ago by Zev.Kronenberg 12k

2

See Rsamtools, which focuses mostly on exposing I/O (of bam, tabix, b/vcf) facilities of samtools. Rsamtools plays well with the 'GRanges' concept in R / Bioconductor, allowing (multiple) ranged-based queries on local or remote bam files, for instance, and a recent addition is a stream interface to iterate through bam / etc. files in chunks (not single records, which would be inefficient in R).

ADD REPLY • link 12.7 years ago by Martin Morgan ★ 1.6k

0

I use Rsamtools. I use pretty much all the Bioconductor packages. I don't know if you remember me, but we met in Idaho. I was working with Matt Settles on some projects. Your team and Matt turned me into a huge [R] nut.

ADD REPLY • link 12.7 years ago by Zev.Kronenberg 12k

5

12.7 years ago

Aaronquinlan 12k

This is sort of like asking a fine craftsman why he or she needs all the tools in his or her toolbox when all you *need* to build furniture is a saw, hammer, wood, and nails.

By analogy, an important skill for computational scientists of any ilk is knowing which tool is best suited to the task at hand. The shell and R each have their strengths: file manipulation and statistical analysis, respectively (among others). However, working with large datasets, exploring binary data, writing complicated (yet still relatively fast) loops and conditional logic, and other such tasks are much better suited to better-designed programming languages. Python and Perl are great options for this. However, when speed and memory efficiency are critical, even these languages fail to offer the performance of C, C++, Fortran, Java, etc.

For example, most of the tools you adore (or occasionally abhor) using in the shell were written in C or C++.

ADD COMMENT • link 12.7 years ago by Aaronquinlan 12k

0

I agree with you as well Aaron! For building new memory efficient tools, we need scripting and other procedural languages.

ADD REPLY • link 12.7 years ago by Sukhi Singh 11k

4

12.7 years ago

madkitty ▴ 690

I personally think that NGS data should be sent to databases so we wouldn't be in pain anymore trying to understand Python and R and all these useful-but-complicated tools. It reminds me of the time when my dad was trying to use Windows 3.11 while I had just installed Windows 7 on my netbook.

We should create a database with a GRAPHICAL INTERFACE and establish a standard for handling large NGS data files, and never, NEVER EVER again use anything that requires lines of code in Unix, Python, R, and friends. There are just too many tools out there whose inner workings we don't understand; we can't even use them 100% because it's far too complicated for a computer-trained biochemist to understand all of them, and keeping up with all the newest-better-than-the-others tools is even more of a pain.

 mysql> Who wants to start a new project with me ?;

ADD COMMENT • link 12.7 years ago by madkitty ▴ 690

0

+1 only to cancel the negative vote, even though I agree with the downvoter.

ADD REPLY • link 12.7 years ago by lh3 33k

0

Why do you agree with the downvoter?

ADD REPLY • link 12.7 years ago by madkitty ▴ 690

0

You therefore propose that all analysis be done using SQL? In my opinion, this is a bad idea. There are many examples of analyses that could just not be done, and even those that could would require horrifically slow and difficult to understand code.

ADD REPLY • link 12.7 years ago by Aaronquinlan 12k

3

12.7 years ago

Bioinfosm ▴ 620

Well, never forget good old Unix stuff! For instance, here is an awk example that summarizes a FASTQ file pretty nicely in a matter of minutes... obviating any scripting at all!! http://gettinggeneticsdone.blogspot.com/2012/04/awk-command-to-count-total-unique-and.html

awk '(NR - 2) % 4 == 0 {    # sequence line of each 4-line FASTQ record
         total++; count[$1]++
     }
     END {
         for (read in count) {
             if (count[read] > max) { max = count[read]; maxRead = read }
             if (count[read] == 1) unique++
         }
         print total, unique, unique * 100 / total,
               maxRead, count[maxRead], count[maxRead] * 100 / total
     }' myfile.fq

The output would look something like this for some RNA-seq data downloaded from the Galaxy RNA-seq tutorial:

99115 60567 61.1078 ACCTCAGGA 354 0.357161

This is telling you:

The total number of reads (99,115).
The number of unique reads, i.e. sequences that occur exactly once (60,567).
The frequency of unique reads as a proportion of the total (61%).
The most abundant sequence (useful for finding adapters, linkers, etc).
The number of times that sequence is present (354).
The frequency of that sequence as a proportion of the total number of reads (about 0.36%).
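For comparison, roughly the same summary can be sketched in Python. This is an illustrative equivalent rather than part of the original post; the function name `summarize_fastq` is made up, and it assumes well-formed 4-line FASTQ records. Like the awk version, it keeps every distinct sequence in memory.

```python
from collections import Counter


def summarize_fastq(lines):
    """Return (total, unique, pct_unique, top_read, top_count, pct_top) for FASTQ lines."""
    # The sequence is the 2nd line of every 4-line record (indices 1, 5, 9, ...).
    counts = Counter(line.strip() for i, line in enumerate(lines) if i % 4 == 1)
    total = sum(counts.values())
    if total == 0:
        return 0, 0, 0.0, None, 0, 0.0
    unique = sum(1 for c in counts.values() if c == 1)  # sequences seen exactly once
    top_read, top_count = counts.most_common(1)[0]
    return (total, unique, unique * 100 / total,
            top_read, top_count, top_count * 100 / total)
```

Any iterable of lines works as input, so an open file handle for `myfile.fq` can be passed directly and the six values printed in the same order as the awk output.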

ADD COMMENT • link updated 12.7 years ago by Aaronquinlan 12k • written 12.7 years ago by Bioinfosm ▴ 620

1

I know I'm about 7 years late to the party here, but I had a chuckle at this "obviating any scripting at all!!" when the post is literally an AWK script.

ADD REPLY • link 5.8 years ago by groverj3 ▴ 20

0

hahahah and I am having a laugh after 7 years :D

ADD REPLY • link 5.8 years ago by Sukhi Singh 11k

2

12.7 years ago

SES 8.6k

In my opinion, you must learn Perl/Python/Ruby/(add your favorite here) to stay ahead of the data deluge that never ends if you work in a large lab. Sure, there is a learning curve initially, but you don't usually have to ask questions like, "How can I read in my sequences and do .....?" or "How can I analyze my alignments to .....?" or "How can I do a simulation with my trees to test if .....?" In all of these cases, you would start with a search for the Bio* package containing the methods you need, and then decide if you have to extend that functionality or write something from scratch if it doesn't exist. I know a lot of this can be done in R, but for non-model species and large data sets, you will be much more efficient with a modern scripting language.

The main reasons for learning one of these scripting languages (in a few words) are that you have direct programmatic access to local and remote databases and analysis tools, and you have a large user community that has solved most of the common tasks. Also, these languages are the basis of some of the most widely used modern web frameworks (which may not be important to you at this point). People always ask which one to learn, or which one is better. Just think about the tasks you have to perform and try to figure out which language offers the most support for that area. You can always get help some way, so no matter which way you go, it will be worth the investment.

ADD COMMENT • link 12.7 years ago by SES 8.6k

2

this reads like you have never actually used R or Bioconductor

ADD REPLY • link 12.7 years ago by Jeremy Leipzig 22k

1

Then maybe you should read it again :). I actually said you can do a lot of sequence analysis in R, but the question is whether you should (i.e., is it the best approach). In my experience, with large data sets and for non-model species, the answer is, "No. R is not the best approach." I use R almost every day, and nothing in my post contradicts what others have said. What part do you disagree with?

ADD REPLY • link 12.7 years ago by SES 8.6k

2

I see R and Bioconductor as the most efficient means of leveraging the most from other people's work, not getting bogged down in implementation but exploring summary data interactively, as well as where I would point people to solutions for the three questions you posed. But I would agree that learning a scripting language (and a build tool) is an absolute necessity. R can only provide the final steps of an analysis, and it sucks for the web.

ADD REPLY • link 12.7 years ago by Jeremy Leipzig 22k

2

Okay, I still don't see the point of the snarky comment. You say that learning a scripting language is an absolute necessity (I said it is a must) and that R excels at some tasks, which I agree with. I use R almost exclusively for polishing phylogenetic trees/figures, but you have to do a lot of scripting to get to that point. In relation to the topic of this thread (NGS data analysis), I would not start an R session and try to put together an assembly or annotation pipeline interactively. I use C/Perl/Python/Bash for all the heavy lifting during the genome assembly and annotation steps, but I use R a lot in the final stages for statistical analyses and graphing, which sounds like it is in line with your usage of R and scripting. I think people may have different perspectives about the utility of R/Bioconductor based on the type of work they do (e.g., analyzing microarrays from humans vs. assembling a non-model plant genome), but regardless, the point is simply that learning a scripting language will be helpful for many reasons.

ADD REPLY • link 12.7 years ago by SES 8.6k

2

12.7 years ago

Sukhi Singh 11k

Thanks everyone for the input. So the major points raised are:

1) In R, files need to be read into memory, which can be a problem with big files. I acknowledge this, though there are packages like bigmemory and ff that help.

2) Amount of support for R as compared to perl/python (non-biologists can also help)

3) Developing web applications in R is far-fetched; one should stick to Perl CGI, Catalyst, Django, RoR, or something like that.

It's true, and I would assume the support will increase in the future as well. R started as a statistical language, but with advancements we got functions like system() to run shell commands and Rscript to run a script with user-supplied arguments. Though most of my work can be done with these, R is still not the best choice for pulling very large datasets from a database, analysing them, and storing them. This is why I started using the shell for ChIP-Seq: with files > 10 GB, just getting coverage out of them is a pain. All in all, it's a matter of taste as well. I think I should revise my Perl to have a sort of programming balance, as in future I might have to work on developing multi-threaded tools with database administration and strictly no plotting involved.

Cheers

ADD COMMENT • link 12.7 years ago by Sukhi Singh 11k

1

12.7 years ago

Ying W ★ 4.3k

If you are familiar with awk in the shell, I would think of Perl as something that evolved from shell. So why use Perl instead of shell? Some reasons might be that it is cleaner to read and can do more things. I find that I mainly use Perl when I am doing more system administration things (managing files) and parsing output. Imagine that you find an interesting result and decide to run an analysis in parallel across all cell lines with publicly available data, and then you want to merge the results back together. Merging and filtering the results back together (especially if it is a giant file in the range of several GB) might be done best in Perl instead of R, since reading the file into R would put all of the data (several GB) into memory.

Python I find I use when I am really looking for efficiency. If I have a giant thing to compute, it's not matrices, and I would need to use the code multiple times, I would write it in Python.
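The merge-and-filter step described above can be sketched in a few lines of Python. This is a hedged illustration only: the function name `merge_filtered`, the tab-delimited layout, and the score column and threshold are all assumptions, not details from the post.

```python
import io


def merge_filtered(sources, out, min_score, score_col=2):
    """Stream tab-delimited rows from several inputs into one output, keeping high scores.

    sources maps a label (e.g. a cell line name) to an open text handle.
    Only one line is held in memory at a time.
    """
    kept = 0
    for label, handle in sources.items():
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if float(fields[score_col]) >= min_score:
                # Prefix each kept row with its source label
                out.write("\t".join([label, *fields]) + "\n")
                kept += 1
    return kept


# In-memory stand-ins for per-cell-line result files (made-up data)
sources = {
    "HeLa": io.StringIO("chr1\t100\t0.9\nchr1\t200\t0.1\n"),
    "K562": io.StringIO("chr2\t50\t0.5\n"),
}
merged = io.StringIO()
print(merge_filtered(sources, merged, min_score=0.5))  # prints 2
```

With real files, the handles would come from `open()`, and the memory footprint would not grow with file size.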

ADD COMMENT • link 12.7 years ago by Ying W ★ 4.3k

score 10 · Accepted Answer · 2012-04-19

From a certain point of view you are right: you don't need to learn Python or Perl programming, and you can do a lot of work just in R. CRAN and Bioconductor contain a lot of libraries, and R allows you to create good plots, which is one of the most important skills for a bioinformatician. I know people who mostly use only R, and they are very good at their work, so you don't have to be afraid if you don't want to learn other programming languages.

However, consider that learning at least the basics of other languages does not take too much time, adds skills to your curriculum, and allows you to learn new approaches to programming. R is a language for data analysis, centered on the data.frame structure, and is suited mostly to analyzing data organized in tabular form. That's a good paradigm for analysing data, but as soon as you have to solve other problems R becomes more clunky and less useful. Web programming is an example, but data manipulation (converting one format to another, automating tasks on the shell, etc.) is also important. If you know the philosophy under which other programming languages are developed, and what tasks they are typically used for, your programming skills will be stronger, and you will be able to choose among a larger range of approaches when you have to solve a certain programming problem.
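As one small example of the format-conversion tasks mentioned above, here is a minimal Python sketch that converts FASTQ to FASTA while streaming. The function name `fastq_to_fasta` is illustrative, and the code assumes strict 4-line FASTQ records.

```python
import io


def fastq_to_fasta(fastq, fasta):
    """Convert 4-line FASTQ records to FASTA, streaming; returns the record count."""
    n = 0
    for i, line in enumerate(fastq):
        if i % 4 == 0:      # header line: '@name' becomes '>name'
            fasta.write(">" + line[1:])
            n += 1
        elif i % 4 == 1:    # sequence line is copied unchanged
            fasta.write(line)
        # the '+' separator and quality lines are dropped
    return n


# In-memory demonstration with made-up reads
src = io.StringIO("@r1 desc\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
dst = io.StringIO()
fastq_to_fasta(src, dst)
print(dst.getvalue(), end="")
```

The same function works on real files by passing handles from `open()`; nothing is loaded beyond the current line.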