Does anyone know of a simple scripting solution to take a list of accessions and pull out the gene name from the header of FASTA sequences?
For instance, given the accession XP_016469325.1
and the FASTA entry:
>XP_016469325.1 Nicotiana tabacum|C3H|C3H family protein
MEEELLKRNTDCVYFLASPLTCKKGIECEYRHSEIARLNPRDCWYWLAESCLNPTCAFRH
PPLESHAETSSESAPPQHKSAVPVNKTNVPCYFYFNGYCIKGERCSFLHGPDDGTTTWKS
SKIASGVPDGPTAEKKTSVGSETGPASVEKPSNSSETGSKAAAHEYIKSQVDLISMTNDV
GEQSASHETSGSPSEEATAVRLDSLVPAEGFTQGGSDLSPDWSSDEEVEDNVEREEWLES
SPGFDVLVDDRIEGWSHKDDHSYLLQHDRECDERFAGYDFENNLEYDPAYPDMRIVSDEE
LDDSYYSKVESHEVNEYAREIVIPAHGRQSIPHKRKFPREPGFCARGNVDLRDLLKKRRV
IESDPPNYLSRRLDLSRFNAREQCRDRHRPQGSRWMPQSLASKLESNSSFSSGFVDATRL
EGANQLKKLRQSHRSSYRQQHFKDRRRGRSQPFANETPRRMASRQRSTEVPKIFGGPKTL
AQIREEKIKGREDGNSFERTVPSGGSEREDFSGPKPLSEILKDKRRLSSVVNFSN
I would like the output to be the gene name "C3H", which is the field delimited by the "|" characters.
This script, which I modified from a previous post, can grab the gene names. However, I'm not sure how to get only the gene names that correspond to a separate list of accessions (accessions.list).
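with open('PlantTFDB_ALL_TF_pep.fas', 'r') as f:
    for line in f:
        # FASTA header lines start with '>'
        if line.startswith('>'):
            # header fields are separated by '|'; the gene name is the second field
            fields = line.strip().split('|')
            print(fields[1])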
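One possible way to restrict the output to the accessions in a separate list is to index the FASTA headers by accession first, and then look up each accession in the order it appears in the list, so duplicates are retained. The following is only a minimal sketch. It assumes every header has the layout ">ACCESSION species|GENE|description" as in the example entry, and that the accession file has one accession per line. The function names and the 'NA' placeholder for accessions that are not found are made up for illustration.

```python
def index_gene_names(fasta_lines):
    """Build a dict mapping accession -> gene name from FASTA header lines.

    Assumes headers look like '>ACCESSION species|GENE|description'.
    """
    genes = {}
    for line in fasta_lines:
        if not line.startswith('>'):
            continue  # skip sequence lines
        fields = line[1:].strip().split('|')
        first = fields[0].split()
        if len(fields) < 2 or not first:
            continue  # header does not follow the assumed layout
        # the accession is the first whitespace-delimited token of the header
        genes[first[0]] = fields[1]
    return genes


def lookup_gene_names(accessions, genes, missing='NA'):
    """Return (accession, gene) pairs in input order, keeping duplicates."""
    return [(acc, genes.get(acc, missing)) for acc in accessions]


def print_gene_names(fasta_path, accessions_path):
    """Read both files and print one 'accession<TAB>gene' line per accession."""
    with open(fasta_path) as fasta:
        genes = index_gene_names(fasta)
    with open(accessions_path) as acc_file:
        accessions = [line.strip() for line in acc_file if line.strip()]
    for acc, gene in lookup_gene_names(accessions, genes):
        print(acc, gene, sep='\t')
```

With the file names from the question, this would be called as print_gene_names('PlantTFDB_ALL_TF_pep.fas', 'accessions.list'). The FASTA file is read only once, and each accession is then a dictionary lookup, so repeated accessions in the list simply produce repeated output lines.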
Thank you for this awk approach. Unfortunately, I need each line of the accession file to be searched one at a time against the FASTA database, retaining duplicate instances. The idea is to take the accessions and get back the gene name for each. For example, given an acc.txt file that looks like this:
I would expect to see this:
The output of this awk command however is this:
which looks like it is just extracting the gene name for each FASTA entry regardless of the acc.txt: