So I use the following script:
#Cromossomo 1
import csv
fieldnames = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'chrom', 'start', 'end', 'strand']
with open("/mnt/c/Users/Usuário/Desktop/Indels_por_cromossomo/INDELS_chr1.csv", "r") as indel, open("/mnt/c/Users/Usuário/Desktop/miRNA_maduros_por_cromossomo/miRNA_chr1_maduros.csv", 'r') as miRNA, open("/mnt/c/Users/Usuário/Desktop/result_chr1.csv", "w") as output_file:
indel_reader = csv.DictReader(indel, delimiter=',')
miRNA_reader = csv.DictReader(miRNA, delimiter=',')
writer = csv.DictWriter(output_file, fieldnames=fieldnames)
writer.writeheader()
for row in indel_reader:
position = int(row['POS'])
miRNA.seek(0)
next(miRNA_reader) # Skip the header row in the second CSV
for row2 in miRNA_reader:
start_pos = int(row2['start'])
end_pos = int(row2['end'])
if start_pos <= position <= end_pos:
writer.writerow({
'CHROM': row['CHROM'],
'POS': row['POS'],
'ID': row['ID'],
'REF': row['REF'],
'ALT': row['ALT'],
'chrom': row2['chrom'],
'start': row2['start'],
'end': row2['end'],
'strand': row2['strand']
})
print("Fizemos o cromossomo 1. Resultado está em result_chr1.csv")
For each row of the indels file (INDELS_chr1.csv
), it checks whether the indel position (POS
) is within the range of positions of a miRNA (defined by the start and end columns of the miRNA file). If the indel position (POS
) is within the range of a miRNA, the script creates a new row in the output file (result_chr1.csv
), containing information about both the indel and the corresponding miRNA (such as CHROM, POS, ID, REF, ALT, chrom, start, end, and strand
). Therefore, the script is associating indels with overlapping miRNAs, i.e. those that are in the same chromosome region. The final result is a new CSV file that contains combined data of indels and mature miRNAs for chromosome 1, stored in the result_chr1.csv file.
However, the final file, which contains combined data of indels and mature miRNAs for chromosome 1, is coming empty, with only the header. I know that this could mean that on chromosome 1 there was no indel within the position range of a miRNA, however, I asked chatgpt and he gave me the command below to check if this was really the case:
awk -F',' 'NR>1 {print $2}' INDELS_chr1.csv | \
while read pos
do
awk -F',' -v p=$pos 'NR>1 && $2 <= p && $3 >= p {print "Match found: " p " in " $2 "-" $3}' miRNA_chr1_maduros.csv
done
The command gave me several lines as a result. I'm putting some below:
Match found: 60656161 in 17593894-miRNA
Match found: 60656161 in 44445019-miRNA
Match found: 60656161 in 49030671-miRNA
Match found: 60656161 in 17593894-miRNA
Match found: 60656161 in 44445019-miRNA
Match found: 60656161 in 49030671-miRNA
Match found: 60656161 in 17593894-miRNA
Match found: 60656161 in 44445019-miRNA
Match found: 60656161 in 49030671-miRNA
Match found: 60656161 in 17593894-miRNA
Match found: 60656161 in 44445019-miRNA
Match found: 60656161 in 49030671-miRNA
Match found: 60656161 in 17593894-miRNA
Match found: 60656161 in 44445019-miRNA
Match found: 60656161 in 49030671-miRNA
Match found: 60656197 in 17593894-miRNA
Match found: 60656197 in 44445019-miRNA
Match found: 60656197 in 49030671-miRNA
Match found: 60656222 in 17593894-miRNA
Match found: 60656222 in 44445019-miRNA
Match found: 60656222 in 49030671-miRNA
Match found: 60656222 in 17593894-miRNA
Match found: 60656222 in 44445019-miRNA
Match found: 60656222 in 49030671-miRNA
Match found: 60656222 in 17593894-miRNA
Match found: 60656222 in 44445019-miRNA
Match found: 60656222 in 49030671-miRNA
Match found: 60656222 in 17593894-miRNA
Match found: 60656222 in 44445019-miRNA
Match found: 60656222 in 49030671-miRNA
Match found: 60656222 in 17593894-miRNA
Match found: 60656222 in 44445019-miRNA
Match found: 60656222 in 49030671-miRNA
The chat told me that this meant that the awk command would have indicated that there are overlaps between the positions of the INDELS_chr1.csv
and miRNA_chr1_maduros.csv files.
Another command that the chat gave me was to put a print inside the loop, like this:
for row in indel_reader:
position = int(row['POS'])
print(f"Checking indel position: {position}")
miRNA.seek(0)
next(miRNA_reader) # Skip the header row in the second CSV
for row2 in miRNA_reader:
start_pos = int(row2['start'])
end_pos = int(row2['end'])
print(f"Checking miRNA range: {start_pos}-{end_pos}")
if start_pos <= position <= end_pos:
print(f"Match found for position {position} in range {start_pos}-{end_pos}")
writer.writerow({
'CHROM': row['CHROM'],
'POS': row['POS'],
'ID': row['ID'],
'REF': row['REF'],
'ALT': row['ALT'],
'chrom': row2['chrom'],
'start': row2['start'],
'end': row2['end'],
'strand': row2['strand']
})
After executing the loop with the print above, several messages appeared on my terminal during the process. I am putting just some of these messages below:
Checking indel position: 122675497
Checking indel position: 122675592
Checking indel position: 122675592
Checking indel position: 122675592
Checking indel position: 122675592
Checking indel position: 122675592
Checking indel position: 122675592
Checking indel position: 122675595
Checking indel position: 122675597
Checking indel position: 122675666
Checking indel position: 122675695
Checking indel position: 122675761
Checking indel position: 122675763
Checking indel position: 122676144
Checking indel position: 122676976
Checking indel position: 122677294
Checking indel position: 122677381
Checking indel position: 122677726
Checking indel position: 122677727
Checking indel position: 122677727
Checking indel position: 122677727
Checking indel position: 122677727
Checking indel position: 122677727
Checking indel position: 122677728
Checking indel position: 122678249
Checking indel position: 122678548
Checking indel position: 122678589
Checking indel position: 122678728
Checking indel position: 122678761
Checking indel position: 122678762
Checking miRNA range: 44445019-44445043
Checking miRNA range: 49030671-49030693
Checking miRNA range: 71677608-71677630
Checking miRNA range: 71678117-71678137
Checking miRNA range: 71678353-71678371
Checking miRNA range: 75509765-75509787
Checking miRNA range: 86818494-86818515
Checking miRNA range: 93194363-93194382
Checking miRNA range: 97902042-97902066
Checking miRNA range: 97903987-97904008
Checking miRNA range: 98604031-98604051
Checking miRNA range: 103406620-103406641
Checking miRNA range: 103606523-103606544
Checking miRNA range: 104785575-104785598
Checking miRNA range: 105400311-105400332
Checking miRNA range: 105400785-105400806
Checking miRNA range: 105400963-105400984
Checking miRNA range: 106998756-106998777
Checking miRNA range: 107121595-107121619
Checking miRNA range: 109618365-109618387
Checking miRNA range: 109922155-109922176
Checking miRNA range: 113031383-113031407
Checking miRNA range: 122413086-122413106
Fizemos o cromossomo 1. Resultado está em result_chr1.csv
However, the result_chr1.csv
file was empty this time too. The chat said that this probably meant that the match conditions (start <= position <= end
) were not met during the loop.
He gave me several commands to check if the contents of the indels and miRNA files were correct, and in all of them the results showed that both files had no problems. I will share with you the first 10 lines of each file.
Data from the indels files:
CHROM,POS,ID,REF,ALT
chr1,236,.,CT,C
chr1,240,.,TG,T
chr1,265,.,CT,C
chr1,293,.,TG,T
chr1,308,.,C,CAACA
chr1,378,.,CA,C
chr1,399,.,AG,A
chr1,548,.,AAC,A
chr1,550,.,CA,C
miRNA files data:
CHROM,POS,ID,REF,ALT,chrom,start,end,strand
chr1,17593894,miRNA,17593915,.,chr1,17593894,17593915,-
chr1,44445019,miRNA,44445043,.,chr1,44445019,44445043,-
chr1,49030671,miRNA,49030693,.,chr1,49030671,49030693,-
chr1,71677608,miRNA,71677630,.,chr1,71677608,71677630,-
chr1,71678117,miRNA,71678137,.,chr1,71678117,71678137,-
chr1,71678353,miRNA,71678371,.,chr1,71678353,71678371,-
chr1,75509765,miRNA,75509787,.,chr1,75509765,75509787,+
chr1,86818494,miRNA,86818515,.,chr1,86818494,86818515,+
chr1,93194363,miRNA,93194382,.,chr1,93194363,93194382,+
Even after checking that there was nothing wrong with the data in the files corresponding to chromosome 1, I tried to run the script with the data from the files for chromosome 10 and the result was the same: the generated file was empty, containing only the header.
I have been modifying the script in several ways for a week, but the result is always the same: an empty file with only the header. I don't know what to do anymore. What could be happening?
Thank you for your attention
how about 'just' converting both files to bed using awk and 'just' using 'bedtools intersect' ?
Thank you for your answer, but I am new to bioinformatics and do not know the tools you mentioned. Could you explain to me how I could do what you said with my data? Thank you for your attention!
I'm not sure I understand your problem but I feel your basically re-inventing the wheel.
bedtools intersect
is the de-facto tool to find overlapping genomic features https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html. But your data are not in BED format, that is why you need awk to convert your data to BED. For example: