Hi all,
I have two dataframes:
First data frame has three columns: ID, sire.ID, and Dam.ID
The second dataframe has the same three columns plus genotyping calls for 17 markers in 17 columns.
Here is what I need to be done in R:
If sire.ID in the first dataframe exists in the ID column in the second dataframe, then return the sire.ID and its calls from the second dataframe in first dataframe for the relevant offspring, and same for Dam.ID.
Simply, for each ID in the first dataframe I want to get the calls for its sire and dam in front of the ID and in two rows.
Thanks in advance for your help
Hi @HG, The community is not here to provide tailored programming solutions/ tutorials. Please amend your post to show some reproducible data to your question, along with code showing what you've tried.
ID Sire.ID Dam.ID marker1 marker2 marker3 marker4 marker5 marker6 marker7 marker8 marker9
D j K 12 21 11 11 12 22 11 22 11
G l m 12 21 11 11 12 22 11 22 11
H n o 12 21 11 11 12 22 11 22 11
I need this:
ID Sire.ID Dam.ID
A D G D j K 12 21 11 11 12 22 11 22 11
A D G G l m 12 21 11 11 12 22 11 22 11
Split each ID (each row) in the first dataframe into two rows and return the sire and dam calls from the second data frame in first and second splitted rows for each ID.
If the information for the sire or dam doesn't exists in the second dataframe then write NA in the splitted rows.
I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:
Your expected output seems to be incorrect. G and H from Dam.ID of data frame 1 matches with that from G and H from ID of Second frame.
code:
> library(sqldf)
> a=read.csv("testa.txt", sep="\t", strip.white = T, stringsAsFactors = F)
> b=read.csv("testb.txt", sep="\t", strip.white = T, stringsAsFactors = F)
> sqldf ('SELECT * FROM a INNER JOIN b ON a."Sire.ID" = b.ID OR a."Dam.ID" = b.ID')
output:
ID Sire.ID Dam.ID ID..4 Sire.ID..5 Dam.ID..6 marker1 marker2 marker3 marker4 marker5 marker6 marker7
1 A D G D j K 12 21 11 11 12 22 11
2 A D G G l m 12 21 11 11 12 22 11
3 B E H H n o 12 21 11 11 12 22 11
marker8 marker9
1 22 11
2 22 11
3 22 11
input:
> a
ID Sire.ID Dam.ID
1 A D G
2 B E H
3 C F I
> b
ID Sire.ID Dam.ID marker1 marker2 marker3 marker4 marker5 marker6 marker7 marker8 marker9
1 D j K 12 21 11 11 12 22 11 22 11
2 G l m 12 21 11 11 12 22 11 22 11
3 H n o 12 21 11 11 12 22 11 22 11
I have changed your post to "question", a tutorial is when you show how to do stuff.
Your question seems rather R instead of bioinformatics. It is likely someone will close your post for this reason...
Hi @HG, The community is not here to provide tailored programming solutions/ tutorials. Please amend your post to show some reproducible data to your question, along with code showing what you've tried.
Good description of data. But please post some example data and expected output. HG
Here is part of the first dataframe:
part of the Second dataframe:
I need this:
Split each ID (each row) in the first dataframe into two rows and return the sire and dam calls from the second data frame in first and second splitted rows for each ID. If the information for the sire or dam doesn't exists in the second dataframe then write NA in the splitted rows.
Thanks
I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below: