How can I count the symbols in a given string/text and, based on that count, remove characters from another string?
4.2 years ago
USER • 0
f = open('Denv4-X-gb_AY947539.txt', 'r')
z = f.read()
count_inicio = sum(map(lambda x : 1 if '-' in x else 0, z)) 
count_fim = sum(map(lambda x : 1 if '-' in x else 0, reversed(z))) 
print(count_inicio, count_fim)
Output:
479 479
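
Both numbers are the same because each expression counts every "-" in the whole file; reversing the text does not change how many dashes it contains. A minimal sketch of counting only the run of "-" at each end of a sequence string, using `itertools.takewhile`. The helper names `count_leading` and `count_trailing` are hypothetical, and the sketch assumes it is given the aligned sequence itself (header removed), not the full file contents:

```python
from itertools import takewhile

def count_leading(text, symbol="-"):
    """Count how many times `symbol` repeats at the very start of `text`."""
    return sum(1 for _ in takewhile(lambda ch: ch == symbol, text))

def count_trailing(text, symbol="-"):
    """Count how many times `symbol` repeats at the very end of `text`."""
    return count_leading(reversed(text), symbol)

aligned = "---AATG-GG----"  # toy sequence from the question, spaces removed
print(aligned.count("-"))                               # all dashes: 8
print(count_leading(aligned), count_trailing(aligned))  # 3 4
```

The internal "-" in AATG-GG is not counted, because `takewhile` stops at the first character that is not a dash.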

file contents:

lcl|NC_002640.1_cds_NP_073286.1_1 [gene=POLY] [locus_tag=DV4_gp1]
     [db_xref=GeneID:5075729] [protein=polyprotein]
     [protein_id=NP_073286.1] [location=102..10265] [gbkey=CDS]
     ------------------------------------------------------------
     ---------------------------------atgaaccaacgaaaaaaggtggttaga
     ccacctttcaatatgctgaaacgcgagagaaaccgcgtatcaacccctcaagggttggtg
     aagagattctcaaccggacttttttctgggaaaggacccttacggatggtgctagcattc
     atcacgtttttgcgagtcctttccatcccaccaacagcagggattctgaagagatgggga
     cagttgaagaaaaataaggccatcaagatactgattggattcaggaaggagataggccgc
     ------------------------------------------------------------ 

gb:AY947539|Organism:Dengue virus 4|Strain
     Name:H241|Segment:null|Subtype:4|Host:Human
     ggtcgtgtggaccgacaaggacagttccaaatcggaagcttgcttaacacagttctaaca
     gtttgtttagatagagagcagatctctggaaaaatgaaccaacgaaaaaaggtggttaga
     ccacctttcaatatgctgaaacgcgagagaaaccgcgtatcaacccctcaagggttggtg
     aagagattctcaaccggacttttttccgggaaaggacccttacggatggtgctagcattc
     atcacgtttttgcgagtcctttccatcccaccaacagcagggattctgaaaagatgggga
     cagttgaagaaaaacaaggccatcaaaatactgactggattcaggaaggagataggccgc
     atgctgaacatcttgaatggaagaaaaaggtcaacaatgacattgctgtgcttgattccc

For example, I need to take the first sequence (lcl|NC_002640.1_cds_NP_073286.1_1), e.g. ---AATG-GG----, and count the number of "-" at its beginning and at its end.

Then I need to trim the second sequence (gb:AY947539|Organism:Dengue virus 4), e.g. GGGAATG-GGAAAA, by that many characters at each end.

With 3 "-" at the start of the first sequence and 4 at the end, the output I want is AATG-GG. But first I need to count the "-" at the beginning and at the end.

How do I count symbols in a given string/text and, based on that count, remove characters from another string/text in the same file?

genome alignment gene sequence software error • 916 views

1) First, understand your format. It looks like some multiple alignment format, so check whether BioPython has a module to read it.

2) If not, you need to read the sequences yourself. The first sequence has a 3-line header (is it not a single line? that would make reading easier) and the second has a 2-line header; after each header come the blocks of nucleotide sequence. Join the blocks of sequence 1 into one string and iterate over it to count the "-".

3) Load sequence 2 into another string and remove the flanking blocks (use string slicing).
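
Steps 2 and 3 could be sketched roughly as follows in plain Python. This is only an illustration under stated assumptions: the input is ordinary aligned FASTA with one ">" header line per record (the file shown in the question has no ">" markers, so it would need adapting), both sequences have the same aligned length, and the names `read_fasta`, `flank_gaps` and `trim_by_reference` are made up for this example.

```python
def read_fasta(lines):
    """Parse FASTA lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence block: drop inner spaces and append to this record.
            chunks.append(line.replace(" ", ""))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def flank_gaps(seq, gap="-"):
    """Return (leading, trailing) counts of `gap` at the ends of `seq`."""
    leading = len(seq) - len(seq.lstrip(gap))
    trailing = len(seq) - len(seq.rstrip(gap))
    return leading, trailing

def trim_by_reference(reference, target, gap="-"):
    """Cut from `target` the columns where `reference` has flanking gaps."""
    leading, trailing = flank_gaps(reference, gap)
    return target[leading:len(target) - trailing]

# Toy alignment taken from the question (assumed headers, spaces removed).
example = """>lcl|NC_002640.1_cds_NP_073286.1_1
---AATG-GG----
>gb:AY947539|Organism:Dengue virus 4
GGGAATG-GGAAAA
""".splitlines()

(ref_header, ref_seq), (tgt_header, tgt_seq) = read_fasta(example)
print(flank_gaps(ref_seq))                  # (3, 4)
print(">" + tgt_header)
print(trim_by_reference(ref_seq, tgt_seq))  # AATG-GG
```

For a real file, `read_fasta(open(path))` would take the place of the toy `example` list. Slicing by length rather than stripping characters from the target keeps any "-" that belongs inside the second sequence.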


The header is on a single line in my FASTA file. I converted the file to .txt because I couldn't read it with Biopython. The alignment was made with MAFFT.

arquivo.fasta.aln or arquivo.aln, converted to .txt

Could you give an example?

Input:

lcl | NC_002640.1_cds_NP_073286.1_1>
 --- AATG-GG ----
gb: AY947539 | Organism: Dengue virus 4 |
GGGAATG-GGAAAA

output:

 gb: AY947539 | Organism: Dengue virus 4 |
 AATG-GG

I need to count the number of "-" at each end of the first sequence and cut that many characters from the corresponding end of the second sequence, so that only the CDS remains in the second sequence.


Please, man, we need an input and the expected output.

For example:

Input:

> header
------AAAA---BBBBB-----
-----------ATGCATGC---
---ATGCATGCCCCC

> GB proteinA proteinB
aactgtgactgcatgcatgactgactg
tacactactgcatgcatgactgactgc

Desired output:

> GB proteinA ----- proteinB
aactgtgactgcatgcatgactgactg
tacactactgcatgcatgactgactgc

ok you can check now
