4.2 years ago
USER
# Attempt to count "-" from the start and from the end of the file contents
with open('Denv4-X-gb_AY947539.txt', 'r') as f:
    z = f.read()
count_inicio = sum(map(lambda x: 1 if '-' in x else 0, z))
count_fim = sum(map(lambda x: 1 if '-' in x else 0, reversed(z)))
print(count_inicio, count_fim)
Output:
479 479
File contents:
lcl|NC_002640.1_cds_NP_073286.1_1 [gene=POLY] [locus_tag=DV4_gp1]
[db_xref=GeneID:5075729] [protein=polyprotein]
[protein_id=NP_073286.1] [location=102..10265] [gbkey=CDS]
------------------------------------------------------------
---------------------------------atgaaccaacgaaaaaaggtggttaga
ccacctttcaatatgctgaaacgcgagagaaaccgcgtatcaacccctcaagggttggtg
aagagattctcaaccggacttttttctgggaaaggacccttacggatggtgctagcattc
atcacgtttttgcgagtcctttccatcccaccaacagcagggattctgaagagatgggga
cagttgaagaaaaataaggccatcaagatactgattggattcaggaaggagataggccgc
------------------------------------------------------------
gb:AY947539|Organism:Dengue virus 4|Strain
Name:H241|Segment:null|Subtype:4|Host:Human
ggtcgtgtggaccgacaaggacagttccaaatcggaagcttgcttaacacagttctaaca
gtttgtttagatagagagcagatctctggaaaaatgaaccaacgaaaaaaggtggttaga
ccacctttcaatatgctgaaacgcgagagaaaccgcgtatcaacccctcaagggttggtg
aagagattctcaaccggacttttttccgggaaaggacccttacggatggtgctagcattc
atcacgtttttgcgagtcctttccatcccaccaacagcagggattctgaaaagatgggga
cagttgaagaaaaacaaggccatcaaaatactgactggattcaggaaggagataggccgc
atgctgaacatcttgaatggaagaaaaaggtcaacaatgacattgctgtgcttgattccc
For example, I need to take the first sequence (lcl|NC_002640.1_cds_NP_073286.1_1), say ---AATG-GG----, and count the number of "-" at the beginning and at the end.
Then I need to cut the second sequence (gb:AY947539|Organism:Dengue virus 4|...), say GGGAATG-GGAAAA, by that number of characters at each end.
Here there are 3 "-" at the start of the first sequence and 4 at the end, so the output I want is AATG-GG. But first I need to get this "-" count at the beginning and at the end.
How do I count symbols in a given string/text and, based on that count, remove characters from another string/text in the same file?
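A note on the attempt above: '-' in x is evaluated once per character, so both sums count every "-" in the whole file, and reversing the text cannot change that total (hence 479 479). A minimal sketch of counting only the leading and trailing gaps, using the toy strings from the question; the function name count_edge_gaps is made up for illustration:

```python
def count_edge_gaps(seq, gap="-"):
    """Return (leading, trailing) counts of `gap` at the two ends of seq."""
    # lstrip/rstrip drop only the run of gap characters at one end,
    # so the length difference is the size of that run.
    leading = len(seq) - len(seq.lstrip(gap))
    trailing = len(seq) - len(seq.rstrip(gap))
    return leading, trailing

reference = "---AATG-GG----"   # toy version of the CDS sequence
target = "GGGAATG-GGAAAA"      # toy version of the genome sequence

lead, trail = count_edge_gaps(reference)
# Slice by position so internal gaps in the target are left untouched.
trimmed = target[lead:len(target) - trail]
print(lead, trail, trimmed)  # 3 4 AATG-GG
```

This assumes both sequences come from the same alignment and therefore have the same length, so the positions line up.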
1) First understand your format. It looks like some multiple alignment format, so you can check whether Biopython has a module to read it.
2) If not, you need to read the sequences yourself. You have a 3-line header in the first sequence (is it not a single line? That would make it easier to read) and a 2-line header in sequence 2; after each header come the blocks of nucleotide sequence. So put sequence 1 in a string and iterate over it to count the "-".
3) Load sequence 2 into another string and remove the blocks (use string slicing).
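The manual route in steps 2 and 3 could look roughly like the sketch below. It assumes a plain FASTA alignment in which every record starts with a single ">" header line followed by one or more sequence lines, and that the first record is the CDS and the second the genome. The helper name read_fasta and the small in-memory example are illustrative only; for a real file, pass an open file handle instead of the StringIO object.

```python
import io

def read_fasta(handle):
    """Return a list of (header, sequence) pairs from FASTA-formatted lines."""
    records, header, chunks = [], None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined into one string
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Stand-in for the alignment file, with sequences wrapped over two lines.
example = io.StringIO(
    ">cds_reference\n---AATG\n-GG----\n"
    ">genome\nGGGAATG\n-GGAAAA\n"
)
(ref_name, ref_seq), (gen_name, gen_seq) = read_fasta(example)

# Count the gap runs at both ends of the reference, then slice the genome.
lead = len(ref_seq) - len(ref_seq.lstrip("-"))
trail = len(ref_seq) - len(ref_seq.rstrip("-"))
trimmed = gen_seq[lead:len(gen_seq) - trail]
print(gen_name, trimmed)
```

Joining the wrapped lines before counting matters: counting per line would treat every all-gap line as both a leading and a trailing run.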
The header is on a single line in my FASTA file. I converted it to txt because I couldn't read it with Biopython. The alignment was done with MAFFT.
arquivo.fasta.aln or arquivo.aln, converted to txt
Could you give an example?
Input:
Output:
I need to count the number of "-" in the first string and cut characters from the second string according to that amount, so that only the CDS is left in the second string.
Please, man, we need an input and the expected output.
For example:
Input:
Desired output:
OK, you can check now.