Question

Add a seq (the outgroup) to the top of its corresponding gene

0

Entering edit mode

10.6 years ago

hosseinv ▴ 20

I have a fasta file containing N sequences (outgroups). On the other hand, I have N fast files each having the orthologous gene sequences of several samples.

The idea is to put each of the outgroups at the top of their corresponding genes.

Example:

The outgroup looks like below, OUTGROUPS.fas

> GENE 1
TTAACTCCTGCTACTTTG
> GENE 2
TCTGTCGACGGCAACTGTGAAACTTATC
> GENE 3
GCACCCTGAGCCGAACTGAATTC
> GENE 4
GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA

Here is the fasta file of one of my genes, called GENE1.fas

>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG

I want to get the following:

> GENE 1
TTAACTCCTGCTACTTTG
>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG

This is done for GENE1, I need same job for the rest of the GENES. As you see, the name of entries in outgroup fasta file (GENE1, GENE2, ...) are the same as their orthologous genes (GENE1.fas, GENE2.fas ...).

I appreciate any help on this post.

Best, Hossein

python fasta concat perl • 4.7k views

ADD COMMENT • link updated 3.2 years ago by Ram 44k • written 10.6 years ago by hosseinv ▴ 20

0

Entering edit mode

Lemme make sure I understand this correctly:

Each gene FASTA file will always have a name (GENE1.fas) which corresponds to its associated header in the OUTGROUPS.fas file (>GENE 1), correct?

Do you want all of the results in a single output file?

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Dan D 7.4k

1

Entering edit mode

10.6 years ago

umer.zeeshan.ijaz ★ 1.8k

Alright, I quickly wrote the solution for you (assuming the FASTA files have the same name as the FASTA IDs and have an extension .fasta.

Say I begin with three files in the current folder:

$ ls
GENE_1.fasta  GENE_2.fasta  OUTGROUPS.fasta

The content of these files are:

$ cat OUTGROUPS.fasta
>GENE_1
TTAACTCCTGCTACTTTG
>GENE_2
TCTGTCGACGGCAACTGTGAAACTTATC
>GENE_3
GCACCCTGAGCCGAACTGAATTC
>GENE_4
GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA

$ cat GENE_1.fasta
>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG

$ cat GENE_2.fasta
>Ind_11
TTAAATCCGTCTGCTTCTTTG
>Ind_22
TTAAATCCTGGTGCTTCTTTG
>Ind_33
TTAAATCCTGCTTCTGTCTTTG

Then you can use the following one-liner to accomplish all this:

$ cat OUTGROUPS.fasta | perl -pe '/^>/?s/^>/\n>/:s/\s*$// if$.>1' | xargs -n 2 | while read -r -a Arr; do ( IFS=$'\n'; echo "${Arr[*]}" ); cat ${Arr[0]:1}".fasta" 2>/dev/null; done
>GENE_1
TTAACTCCTGCTACTTTG
>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG
>GENE_2
TCTGTCGACGGCAACTGTGAAACTTATC
>Ind_11
TTAAATCCGTCTGCTTCTTTG
>Ind_22
TTAAATCCTGGTGCTTCTTTG
>Ind_33
TTAAATCCTGCTTCTGTCTTTG
>GENE_3
GCACCCTGAGCCGAACTGAATTC
>GENE_4
GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA

Please note that perl -pe '/^>/?s/^>/\n>/:s/\s*$// if$.>1' is to linearize the OUTGROUPS.fasta file and if it is already linearized then you can do without this bit of code.

You may redirect the output by adding > GROUPS.fasta at the end of the one-liner to have one output file. Otherwise, use this oneliner to generate separate files (with extension _OUT.fasta):

$ cat OUTGROUPS.fasta | perl -pe '/^>/?s/^>/\n>/:s/\s*$// if$.>1' | xargs -n 2 | while read -r -a Arr; do ( IFS=$'\n'; echo "${Arr[*]}" ) > ${Arr[0]:1}"_OUT.fasta"; cat ${Arr[0]:1}".fasta" 2>/dev/null >> ${Arr[0]:1}"_OUT.fasta"; done

Best Wishes,
Umer

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by umer.zeeshan.ijaz ★ 1.8k

0

Entering edit mode

Dear Umer,

Thank you for your quick reply. I was away from my computer for a few days, that's why I'm writing you late.

I ran the first code with > GROUP.fasta at the end, It was run with no error but the result is not what I wanted.

Here are some more details about my data.

I have same individuals in each single GENE_?.fasta and I want to add the their corresponding outgroup individually too, don't want to combine files.

The number of entries in OUTGROUP.fasta is more than the number of genes. Say, if I have 2000 entries in my OUTGROUPS.fasta as following:

$ cat OUTGROUPS.fasta
>GENE_1
TTAACTCCTGCTACTTTG
>GENE_2
TCTGTCGACGGCAACTGTGAAACTTATC
.
.
.
>GENE_1999
GCACCCTGAGCCGAACTGAATTC
>GENE_2000
GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA

the number of genes is a sporadic subset of that, say 500 GENE_?.fasta, like following

$ ls
GENE_1.fasta  GENE_2.fasta  ... GENE_499.fasta GENE_500.fasta OUTGROUPS.fasta

In terms of being linearized or not, both OUTGROUPS.fasta and GENE_?.fasta look like the way you illustrated for the content. Each entry has two lines: one as the header starting with > and the other one as the sequence.

And, when I ran the second code, it gave me the following errors:

./Umer2.sh: line 3: syntax error near unexpected token `newline'
./Umer2.sh: line 3: `cat OUTGROUPS.fasta | perl -pe '/^>/?s/^>/\n>/:s/\s*$// if$.>1' | xargs -n 2 | while read -r -a Arr; do ( IFS=$'\n'; echo "${Arr[*]}" ) > '

I thank you again for your help and hope to have you feedback on this

Kind regards
Hossein

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by hosseinv ▴ 20

Ram · Accepted Answer · 2014-04-29

2

Entering edit mode

10.6 years ago

Dan D 7.4k

Assuming:

The fasta file for each gene will always have a name which corresponds to its associated header in the OUTGROUPS.fas file, minus the spaces, and with an fas extension
Your fas files are all in the same directory
You want all of the results in a single output file

Then:

Paste the following into a file and save it as any name you choose, followed by the extension .py
Place the file in the same directory with all the other .fas files and run it

seq_dict = {}

def populate_gene(gene):
    gene_file_name = gene + ".fas"
    gfn = open(gene_file_name, 'r')
    gene_file_data = gfn.readlines()
    gfn.close()
    return gene_file_data

with open('OUTGROUPS.fas', 'r') as og:
    for rawline in og:
        line = rawline.strip()
        if line.startswith('>'):
            gene = line[1:].replace(' ', '')
            seq_dict[gene] = {'raw_gene_name': line, 'head_seq': []}

            try:
                seq_dict[gene]['gene_parts'] = populate_gene(gene)

            except Exception:
                print "Wasn't able to open file for {0}.".format(gene)
                pass
        else:
            seq_dict[gene]['head_seq'].append(rawline)

for gene in seq_dict:
    if 'gene_parts' in seq_dict[gene]:
        with open('{0}_OUT.fas'.format(gene), 'w') as r:
            r.write(seq_dict[gene]['raw_gene_name'] + '\n')
            r.writelines(seq_dict[gene]['head_seq'])
            r.writelines(seq_dict[gene]['gene_parts'])

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Dan D 7.4k

0

Entering edit mode

details modified below.

ADD REPLY • link 10.6 years ago by hosseinv ▴ 20

0

Entering edit mode

Run the script like this:

python Deedee.py

Not like this:

./Deedee.py

I also updated the script to ignore missing gene files (since you implied that there wouldn't always be a corresponding file for a gene name.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Dan D 7.4k

0

Entering edit mode

Dear Deedee, I ran python Deedee.py this time with the new script you wrote and now I get the following error:

Traceback (most recent call last):
  File "Deedee2.py", line 22, in <module>
    print "Wasn't able to open file for {}.".format(gene)
ValueError: zero length field name in format

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by hosseinv ▴ 20

0

Entering edit mode

Oops, Sunday is not my best day for looking at code :)

Please try the updated script.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Dan D 7.4k

0

Entering edit mode

Looks we are moving, but still needs some corrections. This time the code was run and output the result.fas, plus some errors. for the sake of consistency I suggest we both use the following example files:

$ ls
GENE_1.fas GENE_3.fas OUTGROUPS.fas

$ cat GENE_1.fas
>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG

$ cat GENE_3.fas
>Ind_1
TTAAATCCGTCTGCTTCTTTG
>Ind_2
TTAAATCCTGGTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTGTCTTTG

$ cat OUTGROUP.fas
>GENE_1
TTAACTCCTGCTACTTTG
>GENE_2
TCTGTCGACGGCAACTGTGAAACTTATC
>GENE_3
GCACCCTGAGCCGAACTGAATTC
>GENE_4
GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA

As you see, OUTGROUPS.fas has four sequences (genes), whereas I have population data only for two genes (GENE_1.fas, and GENE_3.fas).

The important thing is I want the results INDIVIDUALLY for each of the two available genes. like following:

$ ls

GENE_1.fas GENE_3.fas  OUTGROUPS.fas GENE_1_OUT.fas GENE_3_OUT.fas

$ cat GENE_1_OUT.fas
>GENE_1
TTAACTCCTGCTACTTTG
>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG

$ cat GENE_3_OUT.fas
>GENE_3
GCACCCTGAGCCGAACTGAATTC
>Ind_1
TTAAATCCGTCTGCTTCTTTG
>Ind_2
TTAAATCCTGGTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTGTCTTTG

Thank you Deedee for your patience.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by hosseinv ▴ 20

0

Entering edit mode

OK I edited the script. I didn't get a chance to test it but it wasn't a big change.

ADD REPLY • link 10.6 years ago by Dan D 7.4k

0

Entering edit mode

This time the code gave me two separate fasta files, GENE_1_OUT.fas and GENE_2_OUT.fas;

The first one is done perfectly, but the second one contains just the outgroup sequence (>GENE_2), plus I am expecting the second file to be GENE_3_OUT.fas not GENE_2_OUT.fas

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by hosseinv ▴ 20

1

Entering edit mode

Shifting requirements are fun! Code updated.

ADD REPLY • link 10.6 years ago by Dan D 7.4k

0

Entering edit mode

THANKS A LOT Deedee!!

The updated version of the code worked perfectly in my example data set. And I was wondering if I can shift one more time the requirement!!

Well the fact is that my gene name in my real data is not like GENE1 GENE2 ..., but instead they look like below:

EGLOB_v21_A_0010_147754_150523

So, the entry in the OUTGROUP.fas is like:

>EGLOB_v21_A_0010_147754_150523

and, the fasta file will be like:

EGLOB_v21_A_0010_147754_150523.fas

The code presumably does not accept these names, says:

Wasn't able to open file for EGLOB_v21_A_0010_147754_150523

They are thousands of genes so I don't want to change the names as GENE_1 and so forth.

Thanks again for all your help, Deedee.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by hosseinv ▴ 20

0

Entering edit mode

Hmmm, that gene name should be okay.

Two quick questions:

Did it successfully process other files in the same run where you saw this error?
Can you verify that the file it claims is missing actually exists and has exactly the name EGLOB_v21_A_0010_147754_150523.fas?

Thanks!

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Dan D 7.4k

0

Entering edit mode

Yes I checked it, the names exist and are correct.

I think its simply because of the names, as I changed the previous example files with this new name format and it did not work, whereas with names like GENE_1 ..., it works well.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by hosseinv ▴ 20

0

Entering edit mode

The new name format on your genes and files shouldn't make any difference at all, as long as there are no spaces anywhere in the file name. The script removes any spaces it finds in the gene name. The script is trying to find EGLOB_v21_A_0010_147754_150523.fas and is not able to.

The script should be in the same directory as all of your FASTA files, and should be run from that directory. Similarly, all the FASTA files it's looking for, including OUTGROUPS.fas, should be in that directory.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Dan D 7.4k

0

Entering edit mode

Dear Dedee,

Nothing changed unless my gene names (other conditions remained same).

I tried a number of names like below:

EGLOB_v21_A_0010_147754_150523.fas
EGLOB_v21_A_0010.fas
EGLOB_v21_A.fas
EGLOB_v21.fas
EGLOB_v.fas
EGLOBv.fas
EGLOB_1.fas

and I finally found that ((only the last configuration** (EGLOB_1.fas) works, even EGLOBv.fas does not work.

Thank you, Hossein

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by hosseinv ▴ 20

0

Entering edit mode

Hmmm, I'd like to take a look at your actual files. Can you post the compressed files to Dropbox and PM a link to me?

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Dan D 7.4k

0

Entering edit mode

Dear Deedee,

Here is a link to a subset of my data, compressed.

https://www.dropbox.com/s/sj7oqhje452k5z0/Deedee.7z

Thanks again for your time and patience.

Best,
Hossein

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by hosseinv ▴ 20

1

Entering edit mode

Thanks!

For some reason, I've no idea why, it seems that using the format() function in your filename string is causing the problem in the version of python you're using. I tried on three different test boxes and I was able to replicate your problem on the third, which was running an older version of Python 2.6. I updated the script accordingly. Let me know if it doesn't work.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Dan D 7.4k

0

Entering edit mode

Hi Deedee :)

DONE !

The version of mine is 2.6.6, but the updated script did the job perfectly.

I thank you a lot for all your help in the past days.

Best wishes,
Hossein

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by hosseinv ▴ 20