BLAST command line scripting
19 months ago
Ali ▴ 10

I am on Windows and would like help writing a few Bash scripts to run some blastp queries. The first one is as follows: for all 6-amino-acid peptides, built from every combination of the 20 human amino acids (AAs), I need to know whether there is any human protein that does NOT align well with one of those peptides. That means first enumerating all possible 6-AA peptides (made from the 20 human AAs). Those peptides, in FASTA format, are then submitted to NCBI's nr database (with Homo sapiens as the organism) and a blastp is run. The e-value under blastp's 'algorithm parameters' should be set as high as 20000 so that all possible alignments/misalignments can be seen. My assumption is that, for a 6-AA peptide, it is very unlikely to find a human peptide that aligns at FEWER than 4 amino acids (the position of the AAs doesn't matter in my research). I would highly appreciate directions on how to run this script in Windows. This is actually part of a HUGE academic research project.

BASH • 2.3k views

What do you need help with: writing the scripts, or writing them for Windows? We could point you to resources for the latter, but for the former you'll need to read the manuals and try writing them on your own; we do not provide ready-to-use code.


Whatever help you can provide is appreciated. I do need to learn how to write the script and I don't have that much time but I'll try. Yes, I use Windows 11 64-bit and need to know how to download the ncbi nr database, how to navigate through the command line to find the right directory for blastp, and then hopefully there's a help command line to get me where I need to go (i.e., segmenting peptides and running queries).


how to download the ncbi nr database

Since you mention using Windows 11, a word of caution: the pre-made NCBI nr database indexes run to hundreds of GB (over 600). You will not be able to search against this on a desktop.

19 months ago
Mensur Dlakic ★ 28k

What you need to do is a non-traditional and non-trivial task, even for someone who knows how to use local BLAST. The platform you plan to use (Windows) further complicates things. It is kind of like driving down a narrow road on a steep mountain, and then someone adds ice on the road and strong winds to the challenge.

I am going to try and save you some time. This is not a project for someone with your background on a short deadline. That HUGE academic effort you mention should be able to find someone who knows how to do it, or find people locally who will support you. Short of someone on this website writing a step-by-step 10-page manual for you (which is very unlikely to happen), I suggest you go to your boss(es) and tell them that domain knowledge doesn't grow on trees, and it isn't something that can be acquired on command.


Is there any software that can generate all possible peptides of length X from the 20 amino acids, in FASTA format? For length 6 that is 20 to the power of 6 possibilities. It just needs a mathematical combination generator. If I have that list, I will submit it directly to the NCBI webpage, and that way I won't have to deal with the wrong OS or a lack of hard drive space.


Are you sure web based BLAST can accept 20^6 sequences? Like Mensur said, this is not a project for a beginner.
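To put a number on that, here is a back-of-the-envelope calculation. The record layout (a short `>p` header with a zero-padded index) is only an assumption for illustration; any real naming scheme will change the byte count somewhat.

```python
# Rough size estimate for a FASTA file holding every possible 6-mer peptide.
ALPHABET_SIZE = 20  # standard amino acids
K = 6               # peptide length

n_peptides = ALPHABET_SIZE ** K

# Assumed record layout (hypothetical): ">p" + zero-padded index + newline,
# followed by the peptide itself + newline.
header_bytes = len(">p") + len(str(n_peptides - 1)) + 1
record_bytes = header_bytes + K + 1

print(f"{n_peptides:,} peptides")
print(f"about {n_peptides * record_bytes / 1e9:.2f} GB of FASTA")
```

That is 64,000,000 query sequences, and on the order of a gigabyte of text under this layout, before a single search has been run.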


Yes, a few lines of python (or any language) could generate all possible kmers of length 6 with all 20 amino acids. There are also methods for k-merizing sequences from fasta files, so if you had a human proteome fasta, you could potentially convert this problem to a kmer analysis problem. Why are you talking about downloading NCBI nr if all you need is a comparison against human proteins?
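As a minimal sketch of the "few lines of python" idea, assuming the 20 standard one-letter amino acid codes and using the peptide itself as the FASTA record name (both are choices made here for illustration, not requirements):

```python
import itertools

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def all_peptides(k, alphabet=AMINO_ACIDS):
    """Yield every length-k string over the alphabet, one at a time."""
    for letters in itertools.product(alphabet, repeat=k):
        yield "".join(letters)


def to_fasta(peptides):
    """Yield one FASTA record per peptide, named after the peptide itself."""
    for pep in peptides:
        yield f">{pep}\n{pep}\n"


# Preview the first three of the 20**6 records without building them all.
preview = itertools.islice(all_peptides(6), 3)
print("".join(to_fasta(preview)), end="")
```

To write the full set, something like `handle.writelines(to_fasta(all_peptides(6)))` on an open text file would do it. Because both functions are generators, the 64 million peptides are never held in memory at once.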


Why are you talking about downloading NCBI nr if all you need is a comparison against human proteins?

This quote from OP's description may help to answer your question. It may be something to keep in mind when making suggestions.

need to know how to download the ncbi nr database, how to navigate through the command line to find the right directory for blastp, and then hopefully there's a help command line to get me where I need to go


Thank you. I get that, but this line from the OP

...is there any human protein that does NOT align well with one of those peptide.

suggests they are only interested in comparing their peptides to human proteins, thus I'm querying why they want to download all of nr. Perhaps they want it for some other purpose. Otherwise I'm trying to clarify their thinking: if they are interested specifically in comparison to human peptides, why is nr required? (or maybe I'm missing something and someone can enlighten me why nr would be required for human-only comparisons).


So it seems I do need to write a few lines of code, and there's no software that can output the peptide possibilities for me. As for the nr database, do you suggest UniProt as an alternative? It seems to me that the reference databases have very few Igs; otherwise you could imagine they would fill up with millions of different sequences, all of them related. They belong in their own database. I need the entire human proteome as a database to run my peptide sequences against. You're right though: it's not likely that I can submit my sequences to the webpage, and after all, viewing millions of sequences one by one is impossible, so I have to use the command line and screen out the sequences that don't have at least 2 mismatches with human sequences.
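For the local route, a hedged sketch of what the two BLAST+ calls might look like, built as argument lists in Python so the same script works on Windows. The file names (`human_proteome.fasta`, `peptides.fasta`, `hits.tsv`) and the database name `human_prot` are placeholders, and BLAST+ is assumed to be installed and on the PATH. The `blastp-short` task is the BLAST+ preset intended for short queries; whether it returns useful hits for 6-mers is something to check rather than assume.

```python
import shlex


def makeblastdb_cmd(proteome_fasta, db_name):
    """Command to build a protein BLAST database from a proteome FASTA."""
    return ["makeblastdb", "-in", proteome_fasta, "-dbtype", "prot", "-out", db_name]


def blastp_short_cmd(query_fasta, db_name, out_tsv, evalue=20000):
    """Command to search short peptides, writing tabular (outfmt 6) results."""
    return [
        "blastp", "-task", "blastp-short",
        "-query", query_fasta,
        "-db", db_name,
        "-evalue", str(evalue),
        "-outfmt", "6",
        "-out", out_tsv,
    ]


# Placeholder file names; print the commands instead of running them.
commands = [
    makeblastdb_cmd("human_proteome.fasta", "human_prot"),
    blastp_short_cmd("peptides.fasta", "human_prot", "hits.tsv"),
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the input files exist. The tabular output can then be filtered by script rather than read by eye.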


Writing the code to generate the peptides is the easy bit; see the post "Finding 16 mer not present in GRCh38".
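The same idea carried over to proteins can be sketched as follows. This is not the code from that post, just a minimal exact-match version: collect every k-mer that occurs in a proteome FASTA, then list the peptides that never occur. The toy records below stand in for a real human proteome file, and exact matching does not cover the "allow some mismatches" part of the original question.

```python
import itertools

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def read_fasta(lines):
    """Yield one sequence string per FASTA record from an iterable of lines."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line.upper())
    if chunks:
        yield "".join(chunks)


def observed_kmers(sequences, k):
    """Return the set of all length-k substrings found in the sequences."""
    kmers = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


def absent_kmers(sequences, k, alphabet=AMINO_ACIDS):
    """Yield every length-k peptide over the alphabet that never occurs."""
    seen = observed_kmers(sequences, k)
    for letters in itertools.product(alphabet, repeat=k):
        pep = "".join(letters)
        if pep not in seen:
            yield pep


# Toy input (hypothetical records) with k=2 so the numbers stay small.
toy = [">toy1", "ACDA", ">toy2", "CC"]
missing = list(absent_kmers(read_fasta(toy), 2))
print(len(missing), "of", 20 ** 2, "dipeptides are absent from the toy input")
```

For a real file, `read_fasta(open(path))` would replace the toy list. The set of observed k-mers can hold at most one entry per residue position in the proteome, so it stays far smaller than the full 20^6 space.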
