2.1 years ago
shinyjj
Hi biostars,
I want to generate a histogram of the reference transcripts available here (https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml#:~:text=gff3-,RefSeq%20Transcripts,-Fasta).
Can anyone suggest a tool that can generate a histogram of the isoform lengths in this file? Ideally, the x-axis would be the isoform length and the y-axis would be the number of isoforms counted.
Use bioawk and then pass the output of bioawk to a simple hist() in R.

Thank you! I am unfamiliar with bioawk. Do you know what kind of command line I should use to generate the output? What kind of output do I get when I use bioawk?
Are you familiar with awk? Bioawk is awk customized to work with common bioinformatics formats. For example, (if memory serves me right) the preset "fastx" uses @ and > as record separators instead of the usual newline. You can use awk's functions/variables to get what you want once you understand the underlying concepts.

See the manual: https://github.com/lh3/bioawk
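For readers without bioawk at hand, the name-and-length table described in this thread can be sketched in plain Python. With bioawk itself, the commonly used form is bioawk -c fastx '{print $name, length($seq)}' followed by the FASTA file. The sketch below is an illustration under stated assumptions, not the method anyone in the thread used: the function names fasta_lengths and write_two_column are made up here, and the input is assumed to be uncompressed FASTA text.

```python
import io
import sys


def fasta_lengths(handle):
    """Yield (name, length) for each record in a FASTA text stream.

    The name is taken as the first whitespace-delimited token of the
    header line; sequence lines are summed until the next header.
    """
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            header = line[1:].split()
            name = header[0] if header else ""
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def write_two_column(records, out):
    """Write tab-separated 'name<TAB>length' lines."""
    for name, length in records:
        out.write(f"{name}\t{length}\n")


if __name__ == "__main__":
    # Tiny in-memory demo; for a real file, pass an open file handle
    # instead (e.g. open("transcripts.fa"), a hypothetical path).
    demo = io.StringIO(">tx1 demo transcript\nACGT\nAC\n>tx2\nGGG\n")
    write_two_column(fasta_lengths(demo), sys.stdout)
```

Only the second column is needed for the histogram; the name column is kept so individual transcripts can be checked by hand.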
Experiment with it: generate a 2-column output with transcript name and transcript length (although you'd only need the second column for the histogram). In R, run ?hist to understand how to plot a histogram. It is trivial; it simply needs a vector of numbers.

Maybe the solution suggested in "How to generate sequence length distribution from Fasta file" could work? Once you have the lengths, you could plot them in R, Python, or your language of choice.

Thanks everyone! Now I have a file with the transcript name on the left and its length on the right. It contains 177816 transcripts. What would be a good tool to plot this in R?
Just read the file in R (read.table...) and plot it using hist(), as Ram suggested. Maybe good to try it a bit yourself first, see this. If you get into trouble, just feel free to come back and ask.

I got the result I wanted. I am pretty new to R. Thanks Ram and iraun :)
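The same read-then-plot steps suggested for R (read the table, then call hist() on the length column) can also be sketched in Python, which was mentioned in the thread as an alternative. This is a minimal sketch under assumptions: the input is a whitespace-separated file with the name first and the length last, with no header line; read_lengths, bin_counts and plot_histogram are hypothetical helper names; and matplotlib is a third-party package assumed to be installed for the plotting step only.

```python
import io
from collections import Counter


def read_lengths(handle):
    """Return the last column of each non-empty line as an int."""
    lengths = []
    for line in handle:
        fields = line.split()
        if len(fields) >= 2:
            lengths.append(int(fields[-1]))
    return lengths


def bin_counts(lengths, bin_width):
    """Count lengths per bin; keys are the lower edge of each bin."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))


def plot_histogram(lengths, bins=100):
    """Draw the histogram: x = transcript length, y = number of transcripts."""
    import matplotlib.pyplot as plt  # assumed installed; imported lazily

    plt.hist(lengths, bins=bins)
    plt.xlabel("Transcript length (nt)")
    plt.ylabel("Number of transcripts")
    plt.show()


if __name__ == "__main__":
    # In-memory demo table; replace with an open file handle for real data.
    demo = io.StringIO("tx1\t6\ntx2\t3\ntx3\t1500\n")
    lengths = read_lengths(demo)
    print(bin_counts(lengths, bin_width=1000))
```

The bin_counts helper gives a quick text check of the distribution before plotting. Transcript lengths tend to have a long right tail, so it may be worth trying a log-scaled x-axis or a larger number of bins.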