Hello,
I am trying to do two things, and I will try to make this as clear as possible.
1. I have aligned and downloaded about 500 sequences from BLAST. However, in my FASTA file I want each header to show only the accession number, not the GI number.
So convert this:
>gi|2182117|gb|U95551.1|
into this:
>U95551.1|
Is there a way to do this? I could write a script in Python, but I don't want to re-invent the wheel.
2. From my alignment, I generated a sequence similarity matrix in a program called MacVector. It assigns each sequence a score based on how similar it is to the others. I then plotted these scores in Excel as a histogram.
It looks like this:
Each bar in the histogram is supposed to be a single sequence, and the x-axis should show the accession number or identifier for that sequence. As you can probably tell, the x-axis is missing a lot of labels (there should be 500).
I have had this problem before when trying to display relatively large data sets cleanly in R. Usually I just edited the picture by hand, but for 500 sequences that is too much work. I am sure someone has run into this before. Is there a way to clean this up?
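For question 2, one possible approach is to turn the chart sideways and let the figure grow with the number of sequences, so that each of the 500 bars gets its own readable label. The sketch below assumes matplotlib is installed; the function names, the 0.125 inches per bar, and the font size are illustrative choices, not anything taken from MacVector or Excel.

```python
def figure_height(n_bars, inches_per_bar=0.125, minimum=4.0):
    """Figure height in inches, growing with the number of bars."""
    return max(minimum, n_bars * inches_per_bar)


def plot_scores(labels, scores, out_path):
    """Save a horizontal bar chart with one labelled bar per sequence."""
    # Imported here so figure_height() works without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    n = len(labels)
    fig, ax = plt.subplots(figsize=(8, figure_height(n)))
    positions = list(range(n))
    ax.barh(positions, scores)
    ax.set_yticks(positions)                # one tick per sequence
    ax.set_yticklabels(labels, fontsize=6)  # accession numbers as labels
    ax.set_ylim(n - 0.5, -0.5)              # first sequence at the top
    ax.set_xlabel("Similarity score")
    fig.tight_layout()
    fig.savefig(out_path, dpi=100)
    plt.close(fig)
```

Calling `plot_scores` with a list of accession numbers, a list of scores in the same order, and an output file name writes a tall image that can be scrolled or printed across pages.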
Any advice or pointing me in the right direction would be thoroughly appreciated.
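For question 1, a minimal Python sketch, assuming every header follows the `gi|number|database|accession|` layout shown in the example. The names `GI_HEADER`, `strip_gi`, and `convert` are made up for illustration. The trailing `|` is kept because the desired output in the question keeps it.

```python
import re

# Matches legacy NCBI headers such as ">gi|2182117|gb|U95551.1|".
# Assumption: the layout is always >gi|<number>|<db>|<accession>|
GI_HEADER = re.compile(r"^>gi\|\d+\|[^|]+\|([^|]+)\|?")


def strip_gi(line):
    """Reduce a GI-style header to its accession; leave other lines unchanged."""
    return GI_HEADER.sub(r">\1|", line)


def convert(in_path, out_path):
    """Copy a FASTA file, rewriting only the header lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(strip_gi(line))
```

Sequence lines and headers in any other format do not match the pattern, so they are written out as they were. Any description text after the accession is also left in place.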
Wow, thank you so much. One last question: where can I begin to learn this kind of text file manipulation? I was going to spend a lot of time writing something in Python, but this seems so usable and useful. Thank you again.
I learned by simply googling "how to do X in unix/bash" and by reading the man pages of awk, grep, sed, comm, join, cat, rev, tr, head, tail, cut, sort, etc. Every Bioinformatics 101 course should really start with an introduction to GNU Coreutils and Bash scripting.