Hello Marcelo,
I’m trying to calculate Gower distances (GD) for a series of data sets consisting only of categorical variables. However, for some data sets with many cells of missing data, such as the one attached here (Dataset.csv), I obtained drastically different GD results between your script, implemented in the gower library for Python, and the GD functions in two distinct R packages: daisy{cluster} and gower.dist{StatMatch}. From what I could assess, these differences seem to stem from the treatment of NaNs:
Compare the final D matrices for Variable 1 × Variable 2: GD = 0 (daisy) but GD = 0.06977 (gower_distance). These two variables have exactly the same data in every cell, with the exception of NaNs. Daisy seems to ignore NaNs when estimating GD, and therefore finds GD = 0, whereas gower_distance seems to include them somehow, so that the output distance is not 0. Could you please clarify the treatment of NaNs in your function?
Python script:
import pandas as pd
import numpy as np
import gower
df = pd.read_csv('Dataset.csv')
df = df.replace(['?'], [None])
df = df.replace(['-'], [None])
X = np.asarray(df)
g_dist = gower.gower_matrix(np.transpose(X))
np.savetxt("Python_GD.csv", g_dist, delimiter=",")
R script:
library(cluster)  # daisy()
library(dplyr)    # %>% and mutate_all()
library(naniar)   # replace_with_na_all()
Dataraw <- read.csv("Dataset.csv")
Data <- as.data.frame(t(Dataraw))
Data <- replace_with_na_all(Data, condition = ~.x %in% c("?", "-"))
Data <- Data %>% mutate_all(as.factor)
###### Create Gower distance matrix from transposed matrix using Daisy
D <- daisy(Data, metric = "gower")
D <- as.matrix(D)
summary(D)
write.csv(D, file="R_daisy_GD.csv")
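To make the question concrete, here is a minimal sketch of the two NaN treatments being contrasted, for categorical data only. The function names and the `skip_missing` flag are hypothetical and used only for illustration. With `skip_missing=True`, pairs where either value is missing are dropped before averaging, which matches the GD = 0 result described above for daisy. The alternative branch, which treats "missing" as a category of its own, is just one possible way a non-zero distance could arise; it is an assumption, not a confirmed description of what the gower library does.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def gower_categorical(a, b, skip_missing=True):
    """Gower distance between two categorical vectors of equal length.

    skip_missing=True : drop every position where either value is missing,
                        then average the mismatches over what remains.
    skip_missing=False: hypothetical alternative that treats "missing" as
                        its own category, so missing-vs-present is a mismatch.
    """
    if skip_missing:
        pairs = [(x, y) for x, y in zip(a, b)
                 if not (is_missing(x) or is_missing(y))]
    else:
        pairs = [(None if is_missing(x) else x, None if is_missing(y) else y)
                 for x, y in zip(a, b)]
    if not pairs:
        return float("nan")  # no comparable positions at all
    return sum(x != y for x, y in pairs) / len(pairs)


# Two vectors that agree wherever both values are present.
v1 = ["a", "b", None, "c", "a"]
v2 = ["a", "b", "b", "c", None]

d_skip = gower_categorical(v1, v2)                      # missing pairs ignored
d_keep = gower_categorical(v1, v2, skip_missing=False)  # missing counts as a category
print(d_skip, d_keep)
```

In this toy case the first call returns 0 because the three comparable positions all match, while the second returns 2/5 because the two positions with a missing value are counted as mismatches.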
Hi Tiago, this code is obsolete, so please don't use it; I'm going to
decommission it. You should have a look at PR #16834 of scikit-learn.
They took over the initial code and implemented Gower in another way. I
think it's almost done, including NaN treatment.
Best regards,
Marcelo Beckmann
On Sun 20 Jun 2021, 21:43 Tiago, tsimoes@users.sourceforge.net wrote:
Related
Tickets: #2
Thanks, Marcelo!