Question

Retaining parts of the header file in multiple fasta files.

1

2.8 years ago

nitinra ▴ 50

Hello,

I have ~300 genome FASTA files from NCBI whose headers differ in format and whitespace. I want to simplify each header so that it retains only the accession number and the scaffold identity. How can I do this in one pass for all ~300 genomes?

Here are three examples of the genome headers I have:

>CM025345.1 Vespula pensylvanica isolate Volc-1 chromosome 10, whole genome shotgun sequence

>KK868930.1 Zootermopsis nevadensis unplaced genomic scaffold scaffold14, whole genome shotgun sequence

>CM000285.3 Tribolium castaneum strain Georgia GA2 linkage group LG10, whole genome shotgun sequence

I want it changed to:

>CM025345.1_chromosome_10
>KK868930.1_scaffold14
>CM000285.3_LG10

How do I go about doing it?

I am currently running sed on each file individually to change the headers, but it is taking forever.

gnu awk fasta • 1.4k views

updated 2.1 years ago by LauferVA 4.7k • written 2.8 years ago by nitinra ▴ 50

0

Can you explain to a computer how to pick these "scaffold identity" terms? Will it always be one of the following: the word "chromosome" followed by a space and a string that ends at a comma; the word "scaffold" followed by a scaffold name (an alphanumeric string that ends at a word boundary); or the phrase "linkage group" followed by the required identifier, which ends at a comma? If you need to eyeball each header to pick the identifier, the task cannot be automated.

2.8 years ago by Ram 45k

0

This isn't so much a bioinformatics problem as a general regex problem.

2.8 years ago by swbarnes2 15k

0

I don't think this is necessarily true. In fact, precisely because it IS a bioinformatics problem, a clean solution is possible...

2.8 years ago by LauferVA 4.7k

score 3 · Accepted Answer · 2022-08-23

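Find:

^>([A-Z][A-Z][0-9].*?) .*(chromosome) ([0-9]+).*

Replace:

>$1_$2_$3

Note that the leading ">" is matched explicitly, since FASTA headers begin with it, and that this pattern covers only the "chromosome" style of header.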
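The find/replace idea above can be extended to the other two header styles in the question and applied file by file. The following is a minimal sketch, assuming the headers only ever use the three forms shown ("chromosome N", "linkage group X", "scaffold X"); the names `simplify_header` and `rewrite_fasta` are hypothetical helpers, not part of any library.

```python
import re

# Patterns are tried in order; each captures the part to keep after the
# accession. They cover only the three header styles shown in the question.
_PATTERNS = [
    (re.compile(r"\bchromosome (\w+)"), "chromosome_{}"),
    (re.compile(r"\blinkage group (\w+)"), "{}"),
    (re.compile(r"\bscaffold (\w+)"), "{}"),
]


def simplify_header(line):
    """Return a shortened header, or the line unchanged if nothing matches."""
    if not line.startswith(">"):
        return line  # sequence line, leave untouched
    accession, _, description = line[1:].partition(" ")
    for pattern, template in _PATTERNS:
        match = pattern.search(description)
        if match:
            return ">{}_{}".format(accession, template.format(match.group(1)))
    return line  # unknown header style: keep it rather than guess


def rewrite_fasta(src, dst):
    """Stream src to dst line by line, rewriting header lines only."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(simplify_header(line.rstrip("\n")) + "\n")
```

To cover all genomes, `rewrite_fasta` could be called once per input file, for example while iterating over `pathlib.Path(some_dir).rglob("*.fasta")`, writing each result to a new file rather than editing in place. Headers that match none of the patterns are left unchanged so they can be inspected afterwards.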

Apologies, I think I misunderstood your post the first time around. You don't have 300 lines; you have 300 organisms, each with many lines.

Here, there are two approaches I can think of ...

First, I think this issue can be considered a bioinformatics problem, which is in part why it is easily solvable.

1. Obtain a list of unique identifiers (UIDs) that exactly match those found in ALL your genome FASTA files, using $1 from a regex like the one in my original answer above. For instance, CM10030304.1, CM42994.1, and so on.
2. Go to a large database like RefSeq. For every UID in your list, pull the associated records from nuccore. This may be done, for instance, using eFetch.
3. Using whichever method of accessing nuccore programmatically you prefer (such as eFetch), obtain the records linked to each unique identifier.
4. Finally, process the output of the previous step so that you have a hashtable-like object in which the keys are the UIDs and the values are the linked records. For instance, in Python 3, a dict comprehension would work, such that you have something like:

Key:

UID=CM025336.1

Values:

/bp:1..19704315
/organism="Vespula pensylvanica"
/mol_type="genomic DNA"
/isolate="Volc-1"
/db_xref="taxon:30213"
/chromosome="1"
/sex="male"
/tissue_type="Whole adult"
/country="USA: Volcano, HI"
/lat_lon="19.43 N 155.21 W"
/collection_date="2017-08"
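As a rough illustration of the lookup table described above, here is a minimal sketch that builds an EFetch request URL for one accession and parses `/key="value"` qualifier lines from the returned text into a dict. The helper names `efetch_url` and `parse_qualifiers` are hypothetical, and the sketch assumes the record text has already been downloaded as a GenBank flat file; it does not perform the network request itself.

```python
import re
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities EFetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build an EFetch URL requesting a GenBank flat-file record as text."""
    query = urlencode({"db": "nuccore", "id": accession,
                       "rettype": "gb", "retmode": "text"})
    return EFETCH + "?" + query


# Matches qualifier lines such as:   /organism="Vespula pensylvanica"
QUALIFIER = re.compile(r'^\s*/(\w+)="([^"]*)"', re.MULTILINE)


def parse_qualifiers(record_text):
    """Collect /key="value" qualifiers into a dict (last duplicate wins)."""
    return dict(QUALIFIER.findall(record_text))


# Assumed input: accession -> record text, however it was obtained.
records = {
    "CM025336.1": '/organism="Vespula pensylvanica"\n'
                  '/isolate="Volc-1"\n'
                  '/chromosome="1"\n',
}
allAttributeDict = {uid: parse_qualifiers(text) for uid, text in records.items()}
```

With this layout, `allAttributeDict["CM025336.1"].values()` yields the attribute strings for that accession. Repeated qualifiers such as `db_xref` keep only their last value here; a list-valued dict would be needed to retain all of them.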

Then, simply loop over all lines in all 300 genome FASTA files. For instance, in Python 3, we could write:

#!/usr/bin/env python3
import re

# Assumed to exist already: fastas (list of FASTA file paths) and
# allAttributeDict (accession -> list of nuccore attribute values).
uid_re = re.compile(r"^>([A-Z][A-Z][0-9]\S*)")  # captures $1 from the regex above

for fasta in fastas:
    with open(fasta) as fh:
        for ls in fh:
            ls = ls.strip()
            m = uid_re.match(ls)
            if m is None:
                continue  # sequence line, not a header
            nuccoreUID = m.group(1)
            shrinkingLine = ls
            for nuccoreAttribute in allAttributeDict[nuccoreUID]:
                shrinkingLine = shrinkingLine.replace(nuccoreAttribute, "")

Let me know if you have further questions, but this should more or less take care of everything. shrinkingLine should now be a string containing ONLY the parts of the header that were hardest to write a pattern for, as you note in your original post (i.e., LG10, a chromosome label, a scaffold name, etc.). At this point, your computer should be holding in memory everything you need to do what you propose in your post easily.
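To make the stripping step concrete, here is a small worked sketch on one header from the question. It is an illustration under stated assumptions rather than a finished tool: `strip_known` is a hypothetical helper, the attribute values are typed in by hand, and values shorter than `min_len` characters are skipped, because removing a value such as "1" or "10" would also delete digits from the accession and the chromosome label.

```python
import re


def strip_known(header, values, min_len=3):
    """Remove known attribute values from a header and tidy the whitespace."""
    remaining = header
    # Longest values first, so a short value cannot break up a longer one.
    for value in sorted(values, key=len, reverse=True):
        if len(value) >= min_len:
            remaining = remaining.replace(value, "")
    # Collapse the gaps left behind by the removed values.
    return re.sub(r"\s+", " ", remaining).strip()


header = (">CM025345.1 Vespula pensylvanica isolate Volc-1 "
          "chromosome 10, whole genome shotgun sequence")
# Hand-typed stand-ins for the values held in the lookup table.
values = ["Vespula pensylvanica", "genomic DNA", "Volc-1", "10"]
residual = strip_known(header, values)
```

The residual still contains fixed phrases such as "isolate" and "whole genome shotgun sequence", which never appear as attribute values; those would need to be removed separately, for example with a short list of known boilerplate strings.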

To close, I think there are other approaches to this problem that could be used, for instance one based on information entropy. I do think those approaches would make this less a bioinformatics problem and more a statistical-learning one.