Question

Trimming an alignment around a given sequence

0

Entering edit mode

4.4 years ago

QLFblaireau ▴ 30

Hello, I have a simple question but actually I have troubles finding how to solve it. I have many alignment files. And I want for each of them get rid in the alignment of overhang bases. The twist is that I want this done relatively to a specific sequence, which is always the first of the alignment and is always named "Scaffoldxxxx" (where x are numbers and others). Here is a pic

what needs to be cropped is in the red squares

As you see, I want to trim everything that is upstream and downstream of the start and end of the first sequence. That's easy in a sequence editor such as Jalview. But as I have thousands of alignments, I need to automate it. Surprisingly I don't even know where to start.

Many thanks for any help or insight.

As requested, here is a sample alignment :

>Scaffold_2:57492774-57492872
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-------------------------------cttgagctggggtctggccatggggtaaa
gaagcagcagcagagacagaccaatgccaatgaggattccatactgcacacagtcacaag
catgggtta---------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-----------
>_R_TRINITY_DN28760_c0_g2_i7
tgtttgtgtgtcttgatttacaaaaatgatgcacagtaaatgttgataatttttacgact
gctgaggagatacaaggaacatggtaattgtgtaatgaagacaatgccagcttactaaat
gtattactttctgctgtgtgacaatgatacttacacgggtgcggcataaaactatctcaa
ctcctttccgttccccttccaacaccatgcttcgtataccatgtagactggagaagtcga
ggcccgatatgtggaccatgtccaggatgacggccctggggacgtcctcctccagggcca
ggcccatgaccttgtcctccaggtactctatggctggaaagaacaggccctggtcaggct
ggatcagcaccagccccggctgctccttcaccttgagttgtggccgggccatggggtaga
gcaggagcaacaaggacagcccgataccgatcaggatcccatactcaatgcccacacata
gtgaacccaaaaaggtggccacatgcacaaacagatcccatttattggtgcgccacaaaa
cggggatgattttgtagtcgaccatctgcatgacggccatgatgatgaccgcggccagcg
ctgacttggggatgtagtaacagtagggcaccaggaaggccagtactaacaggatcaggg
accctgtgaaaagaccattcgccggtgttcttacaccgctctgtgagttgacagcagttc
tggaaaaactgccggtgacaggataggaatgaacaaaggaactgagaatgttggcagtac
ctatagctatcaactcttgtgtaggatcaatcttatagttattcacacgagctgcaacat
aaacaaaagtgtcatgaatttcatatcagcgacaaaaacttttacctaataaataaagtt
ttaaaaaggag

I would need to trim this alignment so that its length is the length of the first sequence. I don't want in the second sequence what is upstream and downstream of the bases that match the first sequence.

pairwise script crop alignment • 2.3k views

ADD COMMENT • link 4.4 years ago by QLFblaireau ▴ 30

0

Entering edit mode

it would help better to understand the issue if you post the data instead of images.

ADD REPLY • link 4.4 years ago by cpad0112 21k

0

Entering edit mode

I have updated accordingly.

ADD REPLY • link 4.4 years ago by QLFblaireau ▴ 30

0

Entering edit mode

first sequence is not in the second sequence.

ADD REPLY • link 4.4 years ago by cpad0112 21k

score 3 · Accepted Answer · 2021-05-21

If you knew how much to trim you can do it with any data trimmer, like cutadapt

figuring out how many to strip at start or end is a little trickier to do in a single command-line solution but it can be done with some scripting:

import sys

stream = sys.stdin

# Skip first sequence name.
line = next(stream)

seq = ''

line = next(sys.stdin)

# Collect the first sequence.
while not line.startswith(">"):
    seq += line.strip()
    line = next(stream)

lcount = len(seq) - len(seq.lstrip("-"))
rcount = len(seq) - len(seq.rstrip("-"))

print (lcount, rcount)

run it with:

cat foo.fa | python parse.py

it prints:

391 422

now use cutadapt like so:

 cutadapt -u 391 -u -422  foo.fa > out.fa

produces:

>Scaffold_2:57492774-57492872
cttgagctggggtctggccatggggtaaagaagcagcagcagagacagaccaatgccaatgaggattccatactgcacacagtcacaagcatgggtta
>_R_TRINITY_DN28760_c0_g2_i7
cttgagttgtggccgggccatggggtagagcaggagcaacaaggacagcccgataccgatcaggatcccatactcaatgcccacacatagtgaaccca

score 3 · Accepted Answer · 2021-05-21

It's a two step procedure. Thanks Istvan Albert for making me under stand the query.

$ seqkit head -n 1 test.fa | seqkit locate -Pdip '(N+)' 

seqID   patternName pattern strand  start   end matched
Scaffold_2:57492774-57492872    (N+)    (N+)    +   392 489 cttgagctggggtctggccatggggtaaagaagcagcagcagagacagaccaatgccaatgaggattccatactgcacacagtcacaagcatgggtta

Note down the coordinates: 392-489 here

$ seqkit -w 0 subseq -r 392:489 test.fa

>Scaffold_2:57492774-57492872
cttgagctggggtctggccatggggtaaagaagcagcagcagagacagaccaatgccaatgaggattccatactgcacacagtcacaagcatgggtta
>_R_TRINITY_DN28760_c0_g2_i7
cttgagttgtggccgggccatggggtagagcaggagcaacaaggacagcccgataccgatcaggatcccatactcaatgcccacacatagtgaaccca

Download seqkit from here