If the company has not reverse-complemented the sequences:

Question

search text in one file and then replace with text from another file

0

Entering edit mode

7.1 years ago

mforthman ▴ 50

I have two DNA sequence files, one generated by a company based off a data file I sent them. The company's file has different sequence headers for some sequences than the data file I sent, and it's important that all of it conforms to the format I sent for certain programs to use it. Is there a way that I can search a part or all of the sequence header in the company files and then replace the entire line with the corresponding header from my original data file? Additional notes: 1) the company reverse complemented some of the sequences, and this was necessary. Thus, I do not want to alter the sequences from the company file, just get the headers looking like those from the original one. 2) Those sequences that were reverse complemented have an _rc appended to the end of the headers.

For example: Company's header

>uce-265_p7-|design:hemiptera-v1,designer:faircloth_rc 
TGAGTATTCAATATTCCCTGCGCAATATTCAATGGACATACATGGCTATGTTCTTGTTTATTCTATTACATCACTTAAGTCATTCGAAGTTGTGCAGGTCATTTATGAAAAGTTGCTCGA

Original file

>uce-265_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-265,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold391,probes-global-start:41871,probes-global-end:41991,probes-local-start:0,probes-local-end:120
TCGAGCAACTTTTCATAAATGACCTGCACAACTTCGAATGACTTAAGTGATGTAATAGAATAAACAAGAACATAGCCATGTATGTCCATTGAATATTGCGCAGGGAATATTGAATACTCA

The company's headers should look like the original. My initial thought is an if/then loop with grep, but I'm having trouble imaging how this would work in this case.

search replace multiple files • 4.0k views

ADD COMMENT • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

Correct, sequences were reverse complemented by the company as well in many cases.

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

What was the reason for reverse complementing the sequences?

ADD REPLY • link 7.1 years ago by GenoMax 147k

0

Entering edit mode

prevent cross-hybridization potential

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

Not sure what you are referring to?

ADD REPLY • link 7.1 years ago by GenoMax 147k

0

Entering edit mode

Posted a new solution to my initial answer, after you gave extra information - thanks!

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

0

Entering edit mode

7.1 years ago

5heikki 11k

Turn your files into

HEADER<TAB>SEQUENCE

format and use join.

edit. this was assuming that the sequences were the same. I guess that's not the case.

ADD COMMENT • link 7.1 years ago by 5heikki 11k

0

Entering edit mode

Correct, sequences were reverse complemented by the company as well in many cases.

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

7.1 years ago

Alex Reynolds 36k

Read the sequence header of the original file into a hash table (also called an associative array or dictionary), and process the company's result based on that table.

Untested, but I think this should work or at least get you close. This assumes your inputs look like your example, and that your FASTA files are single-line (i.e. header on one line, sequence on the next line, not split across lines).

#!/usr/bin/env python

import sys

original_fn = sys.argv[1]
company_fn = sys.argv[2]

map = {}

with open(original_fn, "r") as original_fh:
    for line in original_fh:
        if line.startswith('>'):
            (k, v) = line.strip().split('|')
            # remove trailing space from key
            k = k[:-1]
            map[k] = v

with open(company_fn, "r") as company_fh:
    skip = False
    for line in company_fh:
        if line.startswith('>'):
            (k, v) = line.strip().split('|')
            # remove trailing character from key
            k = k[:-1]
            if k not in map:
                skip = True
            else:
                sys.stdout.write("%s |%s\n" % (k, map[k]))
        else:
            if not skip:
                sys.stdout.write("%s" % (line))
            skip = False

Then you'd run it something like so:

$ ./reprocess.py original.fa company.fa > reprocessed_company.fa

ADD COMMENT • link 7.1 years ago by Alex Reynolds 36k

0

Entering edit mode

I just attempted and got the following traceback error message:

Traceback (most recent call last): File "./reprocess.py", line 22, in <module> (k, v) = line.strip().split('|') ValueError: need more than 1 value to unpack

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

1

Entering edit mode

Maybe check that your input looks like your examples. Print debug messages around here and see what is failing, e.g.:

try:
    (k, v) = line.strip().split('|')
except ValueError as err:
    sys.stderr.write("Error: %s" % (line))
    sys.exit(-1)

You need the input to have a consistent format, in order to parse it in a way where you can reliably pull out record headers.

ADD REPLY • link 7.1 years ago by Alex Reynolds 36k

0

Entering edit mode

Regarding the header format, some have _rc at the end while others do not.

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

I think that suffix would not throw the exception you reported in your comment above. Can you run the code with the addition of the try..except code block I posted above? That should highlight the line that fails to split on the pipe character.

ADD REPLY • link 7.1 years ago by Alex Reynolds 36k

0

Entering edit mode

It fails becomes some sequence headers do not include the pipe character. There are other types of sequences from different sources, so different header formats exist. I'm just trying to get the bulk of them taken care of quickly, which are the ones that follow the format above. I should have prefaced with this in my initial post. For example:

>Clavigralla_tomentosicollis_gi_512427358_gb_GAJX01007276.1_0_rc

>Anasa_tristis_comp3229_c0_seq1_0_rc

>Anasa_tristis_comp7367_c0_seq1_78

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

Are these the "company" headers or your "original" headers? It would help to see examples of both. You can use the try..except block to deal with cases like these, but we'd need to see both types of headers to figure out how to parse both varieties.

ADD REPLY • link 7.1 years ago by Alex Reynolds 36k

0

Entering edit mode

Example original headers from one dataset source:

>uce-2015_p10 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-2015,probes-probe:10,probes-source:oncfas1,probes-global-chromo:Scaffold80,probes-global-start:539805,probes-global-end:539925,probes-local-start:40,probes-local-end:160
GATTGCAGCCTACATCATTCCAGAGCCAGTTACTGCCTCCATCCAGGACCACACAGTTTTGTTCACCATTGTAATTATTGGGCTGGTCCTTCCCCCATGCCGGGTGGCTGATTACCTCTC
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AGCGTTGGCTCGCCTTTCGGCAATACCAAGTTGTTCTCGGGCATCATCCCTAGCTCTTTGTTCTTCCTCGAGTGCGGTCTGGACATCCTTAAGTTGTTGTTGGTATTTCTTGATGGATTT
>uce-5416_p11 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-5416,probes-probe:11,probes-source:oncfas1,probes-global-chromo:Scaffold1485,probes-global-start:20950,probes-global-end:21070,probes-local-start:0,probes-local-end:120
ATAGGCCCCATTCCTGGGCCTCCTCCACCATGAGGTATACAAAATGTTTTGTGAAGATTAAGGTGCGATACATCTGACCCATAATGACCTGGGCAGCAAAGCCCTACTTGAGCATTCATA

Example company file with sequences I'm currently targeting (those with 'uce' prefix and that correspond with original file) combined with other sequences from other datasets that I'm not targeting for reformatting at this time:

>uce-2015_p10-|design:hemiptera-v1,designer:faircloth
GATTGCAGCCTACATCATTCCAGAGCCAGTTACTGCCTCCATCCAGGACCACACAGTTTTGTTCACCATTGTAATTATTGGGCTGGTCCTTCCCCCATGCCGGGTGGCTGATTACCTCTC
>uce-3225_p7-|design:hemiptera-v1,designer:faircloth_rc
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>uce-5416_p11-|design:hemiptera-v1,designer:faircloth
ATAGGCCCCATTCCTGGGCCTCCTCCACCATGAGGTATACAAAATGTTTTGTGAAGATTAAGGTGCGATACATCTGACCCATAATGACCTGGGCAGCAAAGCCCTACTTGAGCATTCATA
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_103_rc
CATTGCAGCAACTAACAGAGTTGATATATTAGATCCAGCCCTTCTCCGATCAGGCAGGCTAGACAGAAAAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAAT
>Anasa_tristis_comp3229_c0_seq1_0_rc
AGGGCTTGTGATTCCCTTGAGCACATCGCAAGCCTCTGTTCTAGACAAAACATTCCACATTTGGTCAATAATGCTTTTGGTTTGCAAAGTGCACGTCTCATGCATTTAATTCAAGAGGCT
>Alydus_pilosus_comp24037_c0_seq1_37
TCCAGCACAAATGTACACAACATCGTCCTTTCACCTGCTTTCATTGGCATTTTATGAATCAGAGACGTCGGAGGCCTGTAAGAAAGAGAGATGGATCTTTTAACTACAGCGCTGACAATT

The end goal of this is to get the company file to look like the following, without altering the sequences in the company file (i.e., just the headers should be modified). Note that headers without a 'uce' prefix should remain unmodified. Also note that some sequences are not the same between original and company files (i.e., reverse complemented):

>uce-2015_p10 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-2015,probes-probe:10,probes-source:oncfas1,probes-global-chromo:Scaffold80,probes-global-start:539805,probes-global-end:539925,probes-local-start:40,probes-local-end:160
GATTGCAGCCTACATCATTCCAGAGCCAGTTACTGCCTCCATCCAGGACCACACAGTTTTGTTCACCATTGTAATTATTGGGCTGGTCCTTCCCCCATGCCGGGTGGCTGATTACCTCTC
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>uce-5416_p11 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-5416,probes-probe:11,probes-source:oncfas1,probes-global-chromo:Scaffold1485,probes-global-start:20950,probes-global-end:21070,probes-local-start:0,probes-local-end:120
ATAGGCCCCATTCCTGGGCCTCCTCCACCATGAGGTATACAAAATGTTTTGTGAAGATTAAGGTGCGATACATCTGACCCATAATGACCTGGGCAGCAAAGCCCTACTTGAGCATTCATA
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_103_rc
CATTGCAGCAACTAACAGAGTTGATATATTAGATCCAGCCCTTCTCCGATCAGGCAGGCTAGACAGAAAAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAAT
>Anasa_tristis_comp3229_c0_seq1_0_rc
AGGGCTTGTGATTCCCTTGAGCACATCGCAAGCCTCTGTTCTAGACAAAACATTCCACATTTGGTCAATAATGCTTTTGGTTTGCAAAGTGCACGTCTCATGCATTTAATTCAAGAGGCT
>Alydus_pilosus_comp24037_c0_seq1_37
TCCAGCACAAATGTACACAACATCGTCCTTTCACCTGCTTTCATTGGCATTTTATGAATCAGAGACGTCGGAGGCCTGTAAGAAAGAGAGATGGATCTTTTAACTACAGCGCTGACAATT

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

7.1 years ago

Kevin Blighe 88k

This bash script with grep will do it:

If the company has not reverse-complemented the sequences:

#!/bin/bash

paste company.fasta | while read line
do
{
        if [[ "${line}" != ">"* ]] ;
        then
                grep -B1 -w -e "${line}" original.fasta ;
        fi
}
done

If the company has indeed reverse-complemented the sequences

#!/bin/bash

cat company.fasta | while read L; do  echo $L; read L; echo "$L" | rev | tr "ATGC" "TACG" ; done > company_RevComp.fasta

paste company_RevComp.fasta | while read line
do
{
        if [[ "${line}" != ">"* ]] ;
        then
                grep -B1 -w -e "${line}" original.fasta ;
        fi
}
done

NB - the reverse-complement command is Pierre's: fasta - reverse complement sequence

ADD COMMENT • link 7.1 years ago by Kevin Blighe 88k

0

Entering edit mode

The idea is to not alter the sequences in the company's file itself, but just the headers of these sequences, some of which have _rc at the end (to indicate it is reverse complement form what I sent), while others do not.

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

Absolutely no problem mate. I presume that you're working with immunological clonal sequences.

Try this [original.fasta is your original data; company.fasta is the company's data]:

#Version 1
#!/bin/bash

cat original.fasta | while read L; do  echo $L; read L; echo "$L" | rev | tr "ATGC" "TACG" ; done > original_RevComp.fasta ;

paste company.fasta | while read line
do
{
        if [[ "${line}" != ">"* ]] ;
        then
                grep -B1 -w -e "${line}" original.fasta original_RevComp.fasta | sed 's/^[a-zA-Z0-9_]*.fasta[-:]//g' ;
        fi
}
done



#######################################
#######################################



#Version 2
#!/bin/bash

cat original.fasta | while read L; do  echo $L; read L; echo "$L" | rev | tr "ATGC" "TACG" ; done > original_RevComp.fasta ;

paste company.fasta | while read line
do
{
        if [[ "${line}" != ">"* ]] ;
        then
                grep -B1 -w -e "${line}" original.fasta ;
                grep -B1 -w -e "${line}" original_RevComp.fasta ;
        fi
}
done

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

0

Entering edit mode

I've tried executing Version 1 it has done something interesting. It produced the original_RevComp.fasta, but it has outputted the following:

>uce-5419_p14 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-5419,probes-probe:14,probes-source:pacven1,probes-global-chromo:Scaffold2635,probes-global-start:22761,probes-global-end:22881,probes-local-start:40,probes-local-end:160
> TCAGATGTTGGGAGACCCATTTCTTTTTGTCGTTGGTCATACATCATTTTCTCGACAAGA
> CGAACAGAAAAAACAAAGTGTTCTGCAAAAGTTTATGGAGAGCCATCCCGAAATGGACTT
> 021:dne-lacol-seborp,0:trats-lacol-seborp,092491:dne-labolg-seborp,071491:trats-labolg-seborp,987265LC:omorhc-labolg-seborp,1corpohr:ecruos-seborp,51:eborp-seborp,9145-ecu:sucol-seborp,htolcriaf:rengised,1v-aretpimeh:ngised|
> 51p_9145-ecu>
> AAATCAATCCGGAACCTTCAAAACTCTCAGATCTTGACGGAGAAACCCGTGGGATGGTTG
> GTTTTTCATCCGATGTTGGTAGTCCTAGTTCTTTTTGTCTCTGATCATACATCATCTTTT
> >uce-5419_p16 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-5419,probes-probe:16,probes-source:rhoproc1,probes-global-chromo:GL562789,probes-global-start:194210,probes-global-end:194330,probes-local-start:40,probes-local-end:160
> AGTCCTAGTTCTTTTTGTCTCTGATCATACATCATCTTTTCAACCATCCCACGGGTTTCT
> ACCAACATCGGATGAAAAACAAAAAAGAGATATGCTGGCAAAGTATGTGCTTTAAGTTTT

In every other line, it outputs the header backwards. For the ones in which the header is backwards, the sequence isn't reverse complemented.

The script then seems stuck after completing original_RevComp.fasta with no output or changes observed to the company file.

*Note: just spotted the potential issue, original file had the 120 bps split to 60 bp in two lines instead of having it all on one.

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

Yes, I was worried that some of your sequences were going to be split onto multiple lines.

Execute this on original.fasta before running my script, and save it as a new input FASTA:

awk '/^>/{print "\n"$0}; !/^>/{printf $0}' original.fasta | sed '/^$/d'

That will put all sequences onto single lines.

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

0

Entering edit mode

I did get the sequences onto single lines, separate from the headers and this fixed the problem. Now, the commands are printing the replaced headers to screen. I tried inserting -i in the sed command, but this gives an error -i may not be used with stdin. I was under the impression that -i modifies the file rather than producing new output or printing to screen, but I guess I'm wrong? What's going to screen is definitely correct, just need it to modify the file.

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

I cannot see your screen, so, can't see exactly what's happening. The better version of my code to run is probably Version 2. The code above should change the header's in the company's file with the original header(?).

With sed, you cannot use -i in redirect command. -i is used when you wan to edit a file in place without redirection. e.g. sed -i 's/a/b/g' file.text will convert all 'a' to 'b' and modify the file in place

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

0

Entering edit mode

@Kevin is likely using GNU sed. mforthman : Are you using macOS? Default sed in macOS is different. You could install the GNU version easily though.

ADD REPLY • link 7.1 years ago by GenoMax 147k

0

Entering edit mode

Yes, I use Ubuntu 14.04.

To OP, I would save a new input file for my scripts such as:

awk '/^>/{print "\n"$0}; !/^>/{printf $0}' original.fasta | sed '/^$/d' > original.new.fasta

...and then modify the filenames in my script by changing original.fasta to original.new.fasta

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

0

Entering edit mode

Yes, I'm using macOS. I have and am using gnu-sed version. I have been able to get the sequences all on one line and generate original_RevComp.fasta without issues. It's the next step that is not modifying the company.fasta file (or any file). Both version 1 and 2, the headers are printed to Terminal's screen rather than changing them in the company.fasta file. I can't copy and paste from Terminal into the file because the company.fasta file also includes other (but comparably fewer) sequences that I need to reformat the original headers before replacing the company headers with it (this is from combining different datasets from different sources). These other sequences don't start with 'uce-" or follow any similar format as the ones I'm currently targeting. An example of what gets printed to Terminal's screen:

>uce-5416_p12 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-5416,probes-probe:12,probes-source:oncfas1,probes-global-chromo:Scaffold1485,probes-global-start:20990,probes-global-end:21110,probes-local-start:40,probes-local-end:160
AAAATGTTTTGTGAAGATTAAGGTGCGATACATCTGACCCATAATGACCTGGGCAGCAAAGCCCTACTTGAGCATTCATATTTGCTCCATCCAAATAAACCTTGATAAAAATATCAAATA
>uce-5419_p11 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-5419,probes-probe:11,probes-source:oncfas1,probes-global-chromo:Scaffold51,probes-global-start:350828,probes-global-end:350948,probes-local-start:0,probes-local-end:120
TAAGATCAATCCAGAGCCATCCAAGCTGTCAGACCTCGATGGAGAAACTCGTGGGTTAGTGGAAAAGATGATGTATGATCAAAGACAACGAGAACTGGGCCTTCCAACCTCTGATGAACA
>uce-5419_p12 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-5419,probes-probe:12,probes-source:oncfas1,probes-global-chromo:Scaffold51,probes-global-start:350868,probes-global-end:350988,probes-local-start:40,probes-local-end:160
GGAGAAACTCGTGGGTTAGTGGAAAAGATGATGTATGATCAAAGACAACGAGAACTGGGCCTTCCAACCTCTGATGAACAAAAGAAAAAGGACATGCTACAGAAGTAAGTTTACAGAAAA
Owners-MacBook-Air:MyBaits_prelim_probe_kit_order Forthman$

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

Yes, I programmed it to output to screen.

I recommend running the script (either version 1 or version 2) and re-direct the output to a new FASTA file, like this:

./script.sh > company.new.fasta

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

0

Entering edit mode

Here is an updated solution that does everything (puts sequence data onto single lines and also does everything else). Start with original.fasta (sequences on multiple lines if you wish) and the company.fasta. Keep extensions as fasta and not fa

#Version 1
#!/bin/bash

#Bring sequences on single lines in the input file
awk '/^>/{print "\n"$0}; !/^>/{printf $0}' original.fasta | sed '/^$/d' > original.new.fasta

#Then convert to reverse complement
cat original.new.fasta | while read L; do  echo $L; read L; echo "$L" | rev | tr "ATGC" "TACG" ; done > original_RevComp.fasta ;

paste company.fasta | while read line
do
{
        if [[ "${line}" != ">"* ]] ;
        then
                grep -B1 -w -e "${line}" original.new.fasta original_RevComp.fasta | sed 's/^[a-zA-Z0-9_\\.]*.fasta[-:]//g' ;
    fi
}
done



#######################################
#######################################



#Version 2
#!/bin/bash

#Bring sequences on single lines in the input file
awk '/^>/{print "\n"$0}; !/^>/{printf $0}' original.fasta | sed '/^$/d' > original.new.fasta

#Then convert to reverse complement
cat original.new.fasta | while read L; do  echo $L; read L; echo "$L" | rev | tr "ATGC" "TACG" ; done > original_RevComp.fasta ;

paste company.fasta | while read line
do
{
        if [[ "${line}" != ">"* ]] ;
        then
                grep -B1 -w -e "${line}" original.new.fasta ;
        grep -B1 -w -e "${line}" original_RevComp.fasta ;
        fi
}
done

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

score 2 · Accepted Answer · 2017-10-16

2

Entering edit mode

7.1 years ago

Alex Reynolds 36k

One way to do this may be to attempt (try) to parse on the pipe character, and then just print out the header and sequence as-is if there is no split:

#!/usr/bin/env python

import sys

original_fn = sys.argv[1]
company_fn = sys.argv[2]

map = {}

with open(original_fn, "r") as original_fh:
    for line in original_fh:
        if line.startswith('>'):
            try:
                 (k, v) = line.strip().split('|')
                 # remove trailing space from key
                 k = k[:-1]
                 map[k] = v
            except ValueError as err:
                 k = line.strip()
                 map[k] = None

with open(company_fn, "r") as company_fh:
    for line in company_fh:
        if line.startswith('>'):
            try:
                (k, v) = line.strip().split('|')
                # remove trailing character from key
                k = k[:-1]
            except ValueError as err:
                k = line.strip()
            if k not in map:
                sys.stdout.write("%s\n" % (k))
            else:
                sys.stdout.write("%s |%s\n" % (k, map[k]))
        else:
            sys.stdout.write("%s" % (line))

ADD COMMENT • link 7.1 years ago by Alex Reynolds 36k

0

Entering edit mode

PERFECT! Thank you and everyone else for all of your help.

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

How might this script be modified to deal with other headers that are formatted differently in the company.fasta file? Specifically, I'm targeting the following example headers now:

>Clavigralla_tomentosicollis_gi_512427358_gb_GAJX01007276.1_0_rc
TGAGAAGCTCTCGCAGTCACACTGGCCCACGACCACTCTCAGATATCACCTATATTGATGAAGACCCTGTGAAAACTCCATCGCCACACAATAACTCCCTAAACAATCCACGACTGAATC
>Anasa_tristis_comp3229_c0_seq1_0_rc
AGGGCTTGTGATTCCCTTGAGCACATCGCAAGCCTCTGTTCTAGACAAAACATTCCACATTTGGTCAATAATGCTTTTGGTTTGCAAAGTGCACGTCTCATGCATTTAATTCAAGAGGCT
>Anoplocnemis_curvipes_gi_512406097_gb_GAJV01010223.1_70
ATCCTCTGTCTGACTTGAAATGGAAAGAAGTGAAACGTATGGCCCTTCATGAAATGGTTGAATATGTCACCACATACCATGATGTTATTACTGAAGTCATTTATCCTGAAGCTGTTAATA
>Alydus_pilosus_comp24037_c0_seq1_0
TTAAAAGAATTCAGAGTTGAGCAGTGCTCCTTGTTTCTCCAGCACAAATGTACACAACATCGTCCTTTCACCTGCTTTCATTGGCATTTTATGAATCAGAGACGTCGGAGGCCTGTAAGA

Some similarly formatted headers also include _A_ or _B_ before the last digit in the sequence names. I don't want these headers to be targeted. Example of such headers:

>Boisea_trivittata_comp15501_c0_seq2_A_39
CTGCGGTAACTGGTTCAGGGCCCCAGCCGCCCGCTCAGCCACCTACTTCAACGTCAGCAGACCCTGAGAAAAGGAAGCTCATCCAACAACAGCTCGTCCTCCTCCTGCATGCCCACAAGT
>Boisea_trivittata_comp15501_c0_seq2_A_78
CACCTACTTCAACGTCAGCAGACCCTGAGAAAAGGAAGCTCATCCAACAACAGCTCGTCCTCCTCCTGCATGCCCACAAGTGTCAAAGAAGAGAAACACAATCCAACGGTGAAAATATAC

The corresponding headers in the original.fasta file have the following example formats (these don't match with above examples):

>OFAS000119-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000119-RA-EXON01,probes-probe:,probes-source:Alydus_pilosus_comp16418_c0_seq1
>OFAS000280-RA-EXON05 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000280-RA-EXON05,probes-probe:,probes-source:Boisea_trivittata_comp12981_c0_seq1
>OFAS000975-RA-EXON05 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000975-RA-EXON05,probes-probe:,probes-source:Anoplocnemis_curvipes_gi_512414378_gb_GAJV01001942.1
>OFAS001601-RA-EXON05 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS001601-RA-EXON05,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512430460_gb_GAJX01004174.1

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

You could use a regex (regular expression) that tests for _A_n or _B_n at the end of the header string, where n is any number made of one or more digits.

If the regex test finds a match, then we don't process that header.

The pattern of your headers look like they could be matched with the expression _[AB]_[0-9]+$.

To break down this regex:

1) We start with an _ underscore

2) [AB] is a match with either A or B

3) Another _ underscore

4) [0-9]+ is a match with one or more (+) digits 0 through 9

5) $ anchors the match to the end of the search string — in this case, the end of the company header

Here's some untested code:

#!/usr/bin/env python

import sys
import re

pattern = '_[AB]_[0-9]+$'

header_that_should_not_match = 'Clavigralla_tomentosicollis_gi_512427358_gb_GAJX01007276.1_0_rc'
header_that_should_match = 'Boisea_trivittata_comp15501_c0_seq2_A_78'

if re.search(pattern, header_that_should_not_match):
    sys.stderr.write("1 - Something went wrong!\n")
else:
    sys.stdout.write("1 - Correct!\n")

if re.search(pattern, header_that_should_match):
    sys.stdout.write("2 - Correct!\n")
else:
    sys.stderr.write("2 - Something went wrong!\n")

I can't really copy and paste what I did before, because your headers have changed. But basically you only need to add the import re statement, the pattern, and incorporate the if re.search(...) test into the block of code where you are testing company header matches.

If the test passes, then skip over reformatting the header and just print the line, e.g. something like:

...
with open(company_fn, "r") as company_fh:
    for line in company_fh:
        if line.startswith('>') and not re.search(pattern, line.strip()):
            # reformat header here
            # ...
        else:
            sys.stdout.write("%s" % (line))
...

ADD REPLY • link 7.1 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks for the info. I've modified the regex pattern to also skip over other headers that shouldn't be modified, e.g., like the ones in the original posting (they are all in one file). If I tested correctly, the new pattern is (uce | ENSOFAS | _[AB]_[0-9]+$). Regarding the last part of your reply, I'm not sure I understand what is suppose to go here; I could be very wrong, but it looks like I'm suppose to insert code that formats the company.fasta file headers in a specific way where you say # reformat header here. Instead, I'm wanting to replace them with the ones in the original.fasta file. Also, the code block before the with open(company_fn, "r") as company_fh: seems to split up the headers in the original.fasta file with the | as a delimiter. My guess is that this was to help set up the search/replace between the strings before the | between files. For the new set of headers, the only thing that matches between the files is the taxon/seq IDs. I guess my second question is would I still use line.strip().split() with multiple punctuation delimiters (and how I would do that), and somehow indicate the delimited taxon name as the function to match to between company.fasta and original.fasta (I'm guessing this ultimately is something like the map[k] = v.

I was playing around and thought this might be close, but when I test it, it doens't modify the company.fasta file. Sorry if none of what I attempted makes any sense:

#!/usr/bin/env python

import sys
import re

original_fn = sys.argv[1]
company_fn = sys.argv[2]

pattern1 = '(uce | ENSOFAS | _[AB]_[0-9]+$ | _[AB]_[0-9]_rc+$)'
pattern2 = '(_[0-9]+_rc$ | _[0-9]+$)'
pattern3 = 'probes-source:'

map = {}

with open(original_fn, "r") as original_fh:
    for line in original_fh:
        if line.startswith('>'):
            try:
                 (k, v) = line.strip().split(pattern3)
                 # remove trailing space from key
                 #v = v[:-1]
                 map[v] = k
            except ValueError as err:
                 v = line.strip()
                 map[v] = None

with open(company_fn, "r") as company_fh:
    for line in company_fh:
        if line.startswith('>') and not re.search(pattern1, line.strip()):
            try:
                (k, v) = line.strip().split(pattern2)
                # remove trailing character from key
                #v = v[:-1]
            except ValueError as err:
                k = line.strip()
            if k not in map:
                sys.stdout.write("%s\n" % (k))
            else:
                sys.stdout.write("%s |%s\n" % (k, map[v]))
        else:
            sys.stdout.write("%s" % (line))

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

Your headers and tests have changed since the original question, and I'm a bit lost. Could you please post your (pre-regex, post-header-changes) script and I could look at how you might modify that to add a regex pattern search test?

ADD REPLY • link 7.1 years ago by Alex Reynolds 36k

0

Entering edit mode

Correct, I kept the original question simple in original post. In reality, the company.fasta file had many differently formatted headers. I used the script you had provided to process just the uce headers, which worked. I already had the ENSOFAS headers. This leaves the others to be formatted. To show this, here are the kinds of headers that are present in company.fasta after using your script for processing uce headers:

>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_103_rc
CATTGCAGCAACTAACAGAGTTGATATATTAGATCCAGCCCTTCTCCGATCAGGCAGGCTAGACAGAAAAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAAT
>Anasa_tristis_comp3229_c0_seq1_0_rc
AGGGCTTGTGATTCCCTTGAGCACATCGCAAGCCTCTGTTCTAGACAAAACATTCCACATTTGGTCAATAATGCTTTTGGTTTGCAAAGTGCACGTCTCATGCATTTAATTCAAGAGGCT
>ENSOFAS009761_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS009761,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig5129
TTAAGAATCTCGAGAAAACCCCTCAGGATGATGAATTACTTGAAATATATGCTCTCTATAAACAAGCAACTGTAGGAGACTGTGACACAAGTAAGCCTGGGATGTTTGATTTCAAAGGGA1
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_B_0
TAACGTCCTAGGTTAGGTTTCTGTTTACCAGCTAAAATCTTGAGGGCTGTAGACTTTCCAATGCCATTAGTTCCAACCAGACCTAAAACTTCTCCTGGTCTTGGAATTGGAAGTCTGTGG

I'm working with a different original.fasta, which have the following format:

>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
CATTGCAGCAACTAACAGAGTTGATATATTAGATCCAGCCCTTCTCCGATCAGGCAGGCTAGACAGAAAAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAAT
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1
AGGGCTTGTGATTCCCTTGAGCACATCGCAAGCCTCTGTTCTAGACAAAACATTCCACATTTGGTCAATAATGCTTTTGGTTTGCAAAGTGCACGTCTCATGCATTTAATTCAAGAGGCT

You reply earlier today helps to skip those headers in the company.fasta file that have an _[AB]_[0-9]+$. I took this and inserted into your working script, as you had suggested. However, I also need to make sure it is skipping the uce and ENSOFAS headers and any _[AB]_[0-9] headers that also ended in _rc, so I added these to the regex pattern for this purpose:

#!/usr/bin/env python

import sys
import re

original_fn = sys.argv[1]
company_fn = sys.argv[2]

pattern = (uce | ENSOFAS | _[AB]_[0-9]+$ | _[AB]_[0-9]_rc+$)

map = {}
with open(original_fn, "r") as original_fh:
#remainder of original code...

The code I previously posted was just an example of me playing around in hopes of stumbling upon the answer or having an epiphany, but I'm obviously not getting there. In the end, the modified company file should look like:

>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
CATTGCAGCAACTAACAGAGTTGATATATTAGATCCAGCCCTTCTCCGATCAGGCAGGCTAGACAGAAAAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAAT
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1
AGGGCTTGTGATTCCCTTGAGCACATCGCAAGCCTCTGTTCTAGACAAAACATTCCACATTTGGTCAATAATGCTTTTGGTTTGCAAAGTGCACGTCTCATGCATTTAATTCAAGAGGCT
>ENSOFAS009761_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS009761,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig5129
TTAAGAATCTCGAGAAAACCCCTCAGGATGATGAATTACTTGAAATATATGCTCTCTATAAACAAGCAACTGTAGGAGACTGTGACACAAGTAAGCCTGGGATGTTTGATTTCAAAGGGA1
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_B_0
TAACGTCCTAGGTTAGGTTTCTGTTTACCAGCTAAAATCTTGAGGGCTGTAGACTTTCCAATGCCATTAGTTCCAACCAGACCTAAAACTTCTCCTGGTCTTGGAATTGGAAGTCTGTGG

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

1

Entering edit mode

I'll try to take a look in a little while, but this pattern is probably not what you want:

_[AB]_[0-9]_rc+$

This would find matches to

..._A_2_rcccccc
..._B_9_rcccc

But not:

..._A_6_rc
..._B_2829_rc

Instead, you may want:

_[AB]_[0-9]+_rc$

In this second pattern, by moving the + after the digits, you can match a number with one or more digits.

ADD REPLY • link 7.1 years ago by Alex Reynolds 36k

1

Entering edit mode

You are correct, that was my mistake. Thanks for catching that! And I appreciate you taking a further look at this later.

ADD REPLY • link 7.1 years ago by mforthman ▴ 50

0

Entering edit mode

I've still been tinkering with the script and feel as though I might be getting closer to the solution:

#!/usr/bin/env python

import sys
import re

original_fn = sys.argv[1]
company_fn = sys.argv[2]

pattern = '(uce.+$|ENSOFAS.+$|[AB]_[0-9]+$)'

map = {}

with open(original_fn, "r") as original_fh:
    for line in original_fh:
        if line.startswith('>'):
            try:
                 (k, v) = line.strip().rsplit(':',1)
                 # remove trailing space from key
                 #k = k[:-1]
                 map[k] = v
                 #print k
                 #print v
                 #print map[k]
            except ValueError as err:
                 k = line.strip()
                 map[k] = None

with open(company_fn, "r") as company_fh:
    for line in company_fh:
        if line.startswith('>') and not re.search(pattern, line.strip()):
            try:
                line=line.strip('>')
                (v, k) = line.strip().rsplit('_',1)
                # remove trailing character from key
                #k = k[:-1]
                #print k
                #print v
            except ValueError as err:
                k = line.strip()
            if v not in map:
                sys.stdout.write("%s\n" % (v))
            else:
                sys.stdout.write("%s |%s\n" % (v, map[k]))
        else:
            sys.stdout.write("%s" % (line))

Currently, this will modify the targeted headers, but what it does is delete the > (because I line.striped it in order to get the taxon/seq IDs as the key) and the last underscore and anything beyond it. It doesn't replace the header with the corresponding original.fasta header yet.

ADD REPLY • link 7.1 years ago by mforthman ▴ 50