Question

Forum: Good Habit for Bioinformatics Analyst or Scientist

74

8.8 years ago

Shicheng Guo ★ 9.6k

Hey colleagues,

Let's summarize some good habits for our research. Bad habits have hurt my projects badly, so here are the ones I try to follow:

1. Record everything about a project in one systematic place, such as a wiki or Evernote, so that you can check it easily. Don't rely on memory, and don't scatter notes everywhere.
2. Save all the data you used to make each figure, since a boxplot may have to become a violin plot, heatmap, or bee swarm plot. You never know which one your boss or a reviewer will prefer. If you don't save the data, you may need to rebuild it.
3. Always keep figures as PDF. JPEG, TIFF, and PNG are not what you need for publication.
4. Use Adobe Illustrator. Never, never, never use Photoshop.
5. Learn ggplot2. Once you master it, preparing figures is faster than with base R plotting.
6. Build your own function libraries/packages (Perl, R, Python). Compile them and reuse them next time. Don't write the same code again and again.
7. Upload your code to GitHub or GitLab to share it with yourself and others.
8. Record all methods, ideas, processes, procedures, and pipelines in MediaWiki and share them with your lab mates.
9. Deposit FASTQ files in SRA/GEO, or wig files at UCSC, so that we don't need to spend extra money after the project is complete.
10. Code or scripts written by non-professional staff/students can be horrible, and most will have some bugs. Be careful; asking colleagues for code review is a good habit.
11. How to prepare your manuscript efficiently: link: the best habit to prepare manuscript
12. Time Management Strategies and Advice for Bioinformaticians: Link here
13. Build your own bioinformatics server and assemble all the platforms you need along with your own pipelines.
14. Use Arial for fonts in figures at 8 pt, never use a red-green combination, and never use a rainbow color scale.
15. Never let a script run for 12 hours (especially under PBS); split it into pieces that each finish within 2 hours. Your boss will be in trouble if you hit bugs several times.
16. Try the Anaconda data science platform and assemble the tools you prefer into one uniform platform.
17. Fork the software you use frequently on GitHub and help make it more powerful.
18. Check positive and negative controls for each computational analysis so that you find bugs at the beginning.
19. Maintain your blog, and make an md5sum label for each of your own databases.
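
As a rough illustration of the "split long jobs into short pieces" habit above, here is a minimal Python sketch. It assumes you already have a per-sample runtime estimate; the function name `split_into_batches`, the sample names, and the runtimes are all made up for the example, and the greedy packing is only one possible strategy.

```python
from typing import Dict, List


def split_into_batches(minutes_per_sample: Dict[str, float],
                       max_minutes: float = 120.0) -> List[List[str]]:
    """Greedily pack samples into batches whose estimated runtime stays under max_minutes."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0.0
    for sample, minutes in minutes_per_sample.items():
        # Start a new batch when adding this sample would exceed the limit.
        if current and used + minutes > max_minutes:
            batches.append(current)
            current, used = [], 0.0
        current.append(sample)
        used += minutes
    if current:
        batches.append(current)
    return batches


# Hypothetical runtime estimates in minutes per sample.
estimates = {"s1": 50, "s2": 50, "s3": 50, "s4": 90, "s5": 20}
for i, batch in enumerate(split_into_batches(estimates), start=1):
    print(f"job {i}: {batch}")
```

Each batch can then be submitted as its own scheduler job, so a bug costs at most one short job instead of a 12-hour run. Note that a single sample longer than the limit still gets a batch of its own.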

More suggestions?

Analyst Scientist • 11k views

ADD COMMENT • link updated 22 months ago by Ram 44k • written 8.8 years ago by Shicheng Guo ★ 9.6k

3

Curious why you dislike Photoshop? I do most of my figure creation in GIMP, so it isn't vector based like AI is. But I've never had any problems with it.

ADD REPLY • link 8.8 years ago by Sinji ★ 3.2k

4

Photoshop is really only appropriate for editing images of gels, or things like that. For generating or editing other types of plots, which should be scalable and vector-based, Illustrator (or Inkscape, etc) is the right tool.

ADD REPLY • link 8.8 years ago by Chris Miller 22k

3

Yep. I generate almost all my plots in R, including very complex ones. But for final polishing, required figure dimensions, dpi, color profile - I also use Photoshop.

ADD REPLY • link 8.8 years ago by Biomonika (Noolean) 3.2k

3

Isn't Illustrator better for editing vector graphics like PDF? Just curious why Photoshop...

ADD REPLY • link 8.8 years ago by fanli.gcb ▴ 730

3

Photoshop and Illustrator are both $29.99 a month. Meanwhile, e.g. GIMP and ImageMagick are FOSS.

ADD REPLY • link 8.8 years ago by 5heikki 11k

3

and Inkscape as a direct Illustrator alternative!

ADD REPLY • link 8.8 years ago by Daniel ★ 4.0k

2

If you're a student, it's 20 bucks for everything :)

I've been on that deal for what seems like most of my adult life :P As great as GIMP and ImageMagick are (and ImageMagick is particularly good with the command + extensions like montage), once you learn where everything is in Photoshop and Illustrator, there's really no competition. I mean, GIMP and IM are really good considering they're totally free - but I think you get what you pay for with Adobe's Creative Cloud. You even get cloud storage and some other perks (like your username/password dumped online every now and again...heh).

But the best thing about going Adobe is that there are online guides/tutorials for just about everything. I had a particularly tricky issue the other day involving intersecting two SVG heatmaps, which I could 'solve' in Illustrator in about 10 minutes thanks to a guide someone made in 2001 :P

ADD REPLY • link 8.8 years ago by John 13k

score 18 · Answer 1 · 2016-05-06

8.8 years ago

Devon Ryan 105k

Use a literate programming approach, such as with R-markdown or Jupyter/ipython.
Version control everything. That annotation you got from Ensembl? Yeah, you better write down which release, because the next one might produce different results.
Remember backups? Yeah, make sure you have those.
Clean up after yourself. Don't be the guy/gal that occupies an excessive amount of space on the "very-expensive-overly-priced-poorly-performing-storage-array" (TM).

BTW, I would skip your number 2. You need to save primary unprocessed files (e.g., compressed fastq files or BAM/CRAM files after using bamHash or similar) and anything that takes an absurd amount of time to reproduce. You also need to save annotations and anything else that will likely be different if you download it again. However, don't save the results of every step, that's just going to blow up your storage costs and make it impossible to find anything. This happens to be identical to the common practice on the wet-lab side, where freezer/-80 space is always shared and HIGHLY limited.

3

I think the OP means to save the data that was used to plot a final published figure

This is a good point - it is easy to forget to save this data or at least document it very clearly.

Usually in the heat of the moment as we are focused on the data analysis we end up with many data inputs all from the same original set but these may be filtered one way or another, and we are going back and forth between them. Two months later when the reviews come back it is not so easy to figure out which data was plotted where.

ADD REPLY • link 8.8 years ago by Istvan Albert 102k

1

Sure, you need to document how each version of each figure is made. Ideally you just extend whatever analysis you have by creating a new file (with a new name, or with some version or date associated to it that's then represented in your documentation/code).
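
One possible way to sketch this "new file per figure version" idea in Python, using only the standard library. The function name `save_figure_data`, the figure ID, the version label, and the values are all hypothetical; the point is only that the table behind each figure version is written once and never silently overwritten.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def save_figure_data(rows: List[Dict[str, object]], out_dir: Path,
                     figure_id: str, version: str) -> Path:
    """Write the exact table behind a figure to <figure_id>_<version>.tsv."""
    out_path = out_dir / f"{figure_id}_{version}.tsv"
    if out_path.exists():
        # Refuse to overwrite: a changed figure should get a new version label.
        raise FileExistsError(f"{out_path} already exists; bump the version")
    with out_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
    return out_path


# Made-up values standing in for the data behind a boxplot.
rows = [
    {"sample": "A", "group": "control", "value": 1.2},
    {"sample": "B", "group": "treated", "value": 3.4},
]
out_dir = Path(tempfile.mkdtemp())
path = save_figure_data(rows, out_dir, figure_id="fig2a", version="v1")
print(path.read_text())
```

With the table saved this way, switching from a boxplot to a violin plot only requires re-reading one small file rather than re-running the analysis.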

ADD REPLY • link 8.8 years ago by Devon Ryan 105k

0

This happens to be identical to the common practice on the wet-lab side, where freezer/-80 space is always shared and HIGHLY limited.

Great analogy.

ADD REPLY • link 8.6 years ago by Gjain 5.8k

score 13 · Answer 2 · 2016-05-06

In addition to Devon's answers above:

1. Sanity check. If you filter a dataset, check that this actually happened! Especially in projects where steps are linked together by scripts, it is easy for errors and omissions to go unnoticed.
2. Don't reinvent the wheel. Chances are whatever you want to do has been done before. Biostars, stackoverflow, and seqanswers are all great places to search first.
3. Take a step back and look at the big picture. What hypotheses are we trying to (dis)prove? What conclusions can we draw from the data, and what potential impact could they have?
4. In line with #3, it's really important to keep up with the scientific and technical literature. Bleeding edge approaches are great, and having a feel for where to apply them is great as well.
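
A small Python sketch of the sanity check in point 1, assuming a simple gene-to-read-count dictionary. The function name `filter_by_min_count`, the gene names, and the threshold are invented for illustration. The checks look trivial here, but the same pattern pays off when the filtering is done by an external tool or a long script.

```python
from typing import Dict


def filter_by_min_count(counts: Dict[str, int], min_count: int) -> Dict[str, int]:
    """Keep genes with at least min_count reads, then verify the filter worked."""
    kept = {gene: n for gene, n in counts.items() if n >= min_count}
    # Sanity checks: nothing below the threshold survived, nothing new appeared.
    assert all(n >= min_count for n in kept.values()), "filter left low counts"
    assert set(kept) <= set(counts), "filter introduced unknown genes"
    removed = len(counts) - len(kept)
    print(f"kept {len(kept)} of {len(counts)} genes, removed {removed}")
    return kept


# Hypothetical read counts per gene.
counts = {"geneA": 120, "geneB": 3, "geneC": 0, "geneD": 45}
filtered = filter_by_min_count(counts, min_count=10)
```

Printing the before/after counts makes it obvious when a filter removed nothing (or everything), which is the kind of silent failure point 1 warns about.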

score 11 · Answer 3 · 2016-05-06

8.8 years ago

John 13k

Things not to do:

Make Quality Control plots but never really look at them.
Make Quality Control plots and look at them for too long.

2

Or maybe keep looking at them until you have the published version of them!

ADD REPLY • link 8.8 years ago by MAPK ★ 2.1k

2

John. Don't tell us it is a joke. I love this joke.

ADD REPLY • link 8.8 years ago by Shicheng Guo ★ 9.6k

2

I think watching QC plots is overrated and a thing of the past. I just got a dataset with over 150 samples; because of technical replicates it comes to over 600 fastq files, each with a FastQC-generated PDF (thankfully). Should I open all 600 PDFs lying around on the server by copying them to my computer, double-clicking, and viewing each one, just to find that the read quality is maybe OK for most and that all of the samples fail the sequence composition by position filter, like all samples before? If I need only 30 seconds per file, I will be doing this for 5 hours straight. I could imagine it is worth spending at most 10 seconds per file, which I could do by putting all PDFs in one folder and using the macOS gallery view to look only at the first page of each PDF, opening a file only when I spot a problem. A tabular overview would be much better.

ADD REPLY • link 8.8 years ago by Michael 55k

10

This is a perfect application for MultiQC by Phil Ewels.

ADD REPLY • link 8.8 years ago by GenoMax 149k

4

AfterQC is another great QC tool for fastq.

ADD REPLY • link 8.8 years ago by biomaster ▴ 180

3

I think watching QC plots is overrated and a thing of the past.

Hi, I think this is a bit harsh. Rather, I would say QC tools should produce output in a form that is easy to tabulate, so that looking at hundreds of QC reports is not difficult. In your case the problem was that PDFs are pretty much the opposite of "easy to tabulate", but if you had the raw output of fastqc you could fairly easily parse the text file containing the QC metrics.
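
To make the "parse the text file" suggestion concrete, here is a rough Python sketch. It assumes each unpacked FastQC report contains a `summary.txt` with three tab-separated columns (status, module name, input file name); the directory layout, file names, and the function `tabulate_fastqc` are made up so the example runs without real data. In practice a tool like MultiQC, mentioned above, covers this use case.

```python
import tempfile
from pathlib import Path
from typing import Dict


def tabulate_fastqc(root: Path) -> Dict[str, Dict[str, str]]:
    """Map each sample file to {module name: PASS/WARN/FAIL} from summary.txt files."""
    table: Dict[str, Dict[str, str]] = {}
    for summary in sorted(root.rglob("summary.txt")):
        for line in summary.read_text().splitlines():
            if not line.strip():
                continue
            status, module, filename = line.split("\t")
            table.setdefault(filename, {})[module] = status
    return table


# Build two tiny made-up reports so the sketch runs without real data.
root = Path(tempfile.mkdtemp())
for name, quality in [("s1.fastq", "PASS"), ("s2.fastq", "FAIL")]:
    report_dir = root / f"{name}_fastqc"
    report_dir.mkdir()
    (report_dir / "summary.txt").write_text(
        f"PASS\tBasic Statistics\t{name}\n"
        f"{quality}\tPer base sequence quality\t{name}\n"
    )

table = tabulate_fastqc(root)
# Print only the samples that need a closer look.
for filename, modules in table.items():
    failed = [m for m, status in modules.items() if status == "FAIL"]
    if failed:
        print(filename, "FAIL:", ", ".join(failed))
```

With 600 reports, this reduces the manual work to opening only the handful of files that fail a module you care about.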

ADD REPLY • link 8.8 years ago by dariober 15k

score 10 · Answer 4 · 2016-05-07

Plan and manage projects as modules: for example data cleanup/QC, database management, analytics, predictive modeling, machine learning, statistical inference, data visualization, and biological/clinical inference. This helps in the long run with plug-and-play, and makes it easier to build, test, and deploy analytic pipelines.
Assess the task: think before you spend countless hours coding that slick function. Someone may have already made an open bio-* package for the bioinformatics task.
Backup: document, version control, and back up everything (including the Linux/Unix command line, using history). Bioinformatics is an applied practice that often contributes to scientific inference and clinical impact, so reproducibility is incredibly important. Tracking data provenance can help with reproducibility.
Collaborate: co-create and code-review.
Design thinking: spend time solving the task creatively. You can choose to turn the bioinformatics task into a simple script or into a package that many others could use.
Engineer, don't just code. Understanding the technical details and knowing how to scale a system from one data set to 100 or 1000 data sets is key.
Future-proof the infrastructure: code can crack and pipelines can break, so it is good to have a mechanism to maintain and support the bioinformatics infrastructure.
Give back to the community: share code, analytics, or blog posts. This gives more visibility and helps take the tool/paper to a larger user base.
Happy bioinformatics: Enjoy.

score 8 · Answer 5 · 2016-05-06

8.8 years ago

Sean Davis 27k

Do as much of your work in the public eye as possible. Github and the like have changed the way that I think and work.

score 8 · Answer 6 · 2016-05-07

8.8 years ago

biomaster ▴ 180

mine:

Talk less, code more!

5

That might work as a punchline, but a good bioinformatician discusses with collaborators, checks for existing tools and "codes" only as a last option.

ADD REPLY • link 8.7 years ago by Ram 44k

1

OK but don't think less!

ADD REPLY • link 8.7 years ago by Manu Prestat 4.1k

score 7 · Answer 7 · 2016-05-06

8.8 years ago

TriS ★ 4.7k

1. When you are coding, add comments so that when you go back to the code 3 months from now you remember and understand what you did and why.
2. I use Google Slides to summarize my analysis and add thoughts and plots, so that it's well organized and I can access them quickly, comment, go back to bed :)
3. When using R, save the workspace so that you don't have to re-run the whole code when you need to go back.
4. When possible, use more than one approach to analyze the data. If the results are consistent, great; if not, work out why.
5. Can I mention backups on server(s) again?

2

I think your No. 3 is great. I will use it later.

ADD REPLY • link 8.8 years ago by Shicheng Guo ★ 9.6k

2

I always have an .RData and an .Rhistory file saved, with the workspace being the project directory. The directory itself is part of a project hierarchy, so that structures everything.

Initialize R projects with setwd(), close R sessions with save.image() and savehistory(), reopen them with load() and loadhistory() - that's my routine for every project.

ADD REPLY • link 8.8 years ago by Ram 44k

0

Oh! On point #3: I didn't know this could be that handy. I always select "No" at any exit prompt. :-/ Thank you very much. Will try using it.

And #2 is definitely helpful. I create flow diagrams in PowerPoint to explain pipelines etc. to my supervisor/PI.

ADD REPLY • link 8.8 years ago by Bioinformatics_NewComer ▴ 330

11

I am going to chime in with a disagreement on point 3. In my mind that feature is like the dark side in Star Wars or, as Yoda would say:

Easily they flow, quick to join you when code you write. If once you start down the dark path, forever will it dominate your destiny. Consume you it will.

Basically, reproducible analysis describes our ability to quickly reproduce a result, BUT that needs to happen from the raw data, not from some intermediate state that we don't quite remember how we got to.

Don't get me wrong, .RData and .Rhistory are very useful, but as with many good things one needs to use them in moderation and understand their pitfalls.

ADD REPLY • link 8.8 years ago by Istvan Albert 102k

score 7 · Answer 8 · 2016-05-06

8.8 years ago

Ryan Dale 5.0k

If using data from other sources, keep track of where it came from. This can be as easy as a shell script with a bunch of wget or curl lines, but such a small thing can make a big difference in a few months when you forget where you got those files.

3

I also store the md5sum and other details about the file, like "date downloaded" and "# sequences included". Public datasets like uniprot_sprot.fasta keep changing, and it's easier to compare with collaborators if you have md5sums.
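
A minimal Python sketch of such a record, using only the standard library. The function name `describe_fasta`, the field names, and the example URL are assumptions for illustration, not an established format; the file is streamed line by line so that large FASTA files do not have to fit in memory.

```python
import hashlib
import json
import tempfile
from datetime import date
from pathlib import Path


def describe_fasta(path: Path, source_url: str) -> dict:
    """Return a manifest entry: source, md5, date recorded, number of sequences."""
    md5 = hashlib.md5()
    n_sequences = 0
    with path.open("rb") as handle:
        for line in handle:
            md5.update(line)
            # Each FASTA record starts with a '>' header line.
            if line.startswith(b">"):
                n_sequences += 1
    return {
        "file": path.name,
        "source_url": source_url,
        "md5": md5.hexdigest(),
        "date_recorded": date.today().isoformat(),
        "n_sequences": n_sequences,
    }


# Tiny made-up FASTA file and a placeholder URL for the demo.
fasta = Path(tempfile.mkdtemp()) / "example.fasta"
fasta.write_text(">seq1\nMKT\n>seq2\nMAA\n")
entry = describe_fasta(fasta, source_url="https://example.org/example.fasta")
print(json.dumps(entry, indent=2))
```

Saving one such entry per downloaded file next to the download script gives collaborators something concrete to compare against.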

ADD REPLY • link 8.8 years ago by Philipp Bayer 8.8k

Sean Davis · Answer 9 · 2016-05-07

Here are some of my tips:
1, sharing: turn your code into libraries and share them on GitHub
2, visualization: always visualize your data
3, noise: keep in mind that data always contains noise; filter and clean it before use
4, git: use git to track all your code, manuscripts, and slides
5, toolchain: maintain the tools you usually use as a toolchain
6, testing: always test your pipelines/algorithms/tools with benchmark data
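
As a toy illustration of tip 6, the following Python sketch checks a small function against inputs whose correct answers are known in advance, plus one input that must be rejected. The function and the benchmark cases are made up; a real benchmark would use a curated dataset for the pipeline in question.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


# Positive controls: inputs whose correct answer is known in advance.
benchmark = {
    "ATGC": "GCAT",
    "AAAA": "TTTT",
    "GATTACA": "TGTAATC",
    "": "",
}
for seq, expected in benchmark.items():
    assert reverse_complement(seq) == expected, seq

# Negative control: invalid input must be rejected, not silently processed.
try:
    reverse_complement("ATXG")
except KeyError:
    print("invalid base rejected as expected")
else:
    raise AssertionError("invalid base was accepted")
```

Running checks like these every time the code changes catches regressions before they reach real data.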

score 4 · Answer 10 · 2016-10-04

8.4 years ago

Israel Barrantes ▴ 790

Keep a command line history log for every program you installed/compiled, including version numbers of its dependencies. This will be really helpful not only for reproducibility, but also in case of moving up to new servers.
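
One way to sketch part of this in Python is to ask each tool for its version and log the answer with a timestamp. The helper `record_versions` is hypothetical, and it assumes each tool accepts some version flag; only the Python interpreter is queried here so that the example runs anywhere.

```python
import subprocess
import sys
from datetime import datetime
from typing import Dict, List


def record_versions(commands: Dict[str, List[str]]) -> Dict[str, str]:
    """Run each version command and keep the first line of its output."""
    versions: Dict[str, str] = {}
    for tool, command in commands.items():
        try:
            result = subprocess.run(command, capture_output=True, text=True, check=False)
        except FileNotFoundError:
            versions[tool] = "NOT FOUND"
            continue
        # Some tools print their version to stderr, so check both streams.
        output = (result.stdout or result.stderr).strip()
        versions[tool] = output.splitlines()[0] if output else "UNKNOWN"
    return versions


# Only the Python interpreter is queried here; add the tools a project uses.
versions = record_versions({"python": [sys.executable, "--version"]})
print(datetime.now().isoformat(timespec="seconds"), versions)
```

Appending this output to a log file at install time, and again whenever a tool is upgraded, gives a record that can be replayed when moving to a new server.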

score 3 · Answer 11 · 2016-05-08

8.8 years ago

Asaf 10k

Make your projects reproducible.

score 3 · Answer 12 · 2016-05-17

Slightly unusual, but tarball and archive all your work directories after you are done with a large project to save money on storage. Tape archives, especially the ones maintained by large genome centers, are quite cheap, and data can be retrieved in a day or two. On the other hand, you get billed heavily for data storage on file systems.
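
A minimal Python sketch of the tarball step, using the standard `tarfile` module. The directory and file names are invented for the demo, and the function `archive_project` is hypothetical; the archive is re-opened after writing so that you know it is readable before removing the original directory.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_project(project_dir: Path, archive_path: Path) -> int:
    """Pack project_dir into a gzip-compressed tarball and return the number of members."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(project_dir, arcname=project_dir.name)
    # Re-open the archive to confirm it is readable before deleting anything.
    with tarfile.open(archive_path, "r:gz") as tar:
        return len(tar.getnames())


# Made-up project directory with two small files.
workdir = Path(tempfile.mkdtemp())
project = workdir / "finished_project"
project.mkdir()
(project / "notes.txt").write_text("analysis notes\n")
(project / "results.tsv").write_text("gene\tvalue\n")
n_members = archive_project(project, workdir / "finished_project.tar.gz")
print(f"archived {n_members} entries")
```

The member count includes the top-level directory entry itself, so two files give three entries.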