#2 Treatment of NaNs

Milestone: 2.0
Status: open
Labels: None
Updated: 2021-06-21
Created: 2021-06-20
Creator: Tiago
Private: No

Hello Marcelo,

I’m trying to calculate Gower distances (GD) for a series of data sets consisting only of categorical variables. However, for some data sets with a large number of missing cells, such as the one attached here (Dataset.csv), I obtained drastically different GD results between your script, as implemented in the gower library for Python, and the GD functions in two R packages: daisy {cluster} and gower.dist {StatMatch}. As far as I can tell, these differences stem from the treatment of NaNs:

Compare the final D matrices for Variable 1 vs. Variable 2: GD = 0 (daisy) but GD = 0.06977 (gower_distance). These two variables have exactly the same data in every cell, except for the NaNs. daisy seems to ignore NaNs when estimating GD, and therefore finds GD = 0, whereas gower_distance seems to include them somehow, so it does not output a distance of 0. Could you please clarify how your function treats NaNs?
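To make the two behaviours described above concrete, here is a minimal sketch for purely categorical data. It is not taken from either library: `categorical_gower`, `is_missing`, and the `skip_missing` flag are hypothetical names used only for illustration, and "count a missing cell as a mismatch" is just one assumed way a non-zero distance could arise, not a confirmed description of gower_distance.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def categorical_gower(a, b, skip_missing=True):
    """Gower distance between two equal-length categorical vectors.

    skip_missing=True : positions where either value is missing are left out
                        of both numerator and denominator (pairwise deletion).
    skip_missing=False: a position with a missing value counts as a mismatch
                        (an assumption made only for this illustration).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")

    mismatches = 0
    compared = 0
    for x, y in zip(a, b):
        if is_missing(x) or is_missing(y):
            if skip_missing:
                continue  # ignore this position entirely
            mismatches += 1  # missing cell counted as a difference
        elif x != y:
            mismatches += 1
        compared += 1

    # No comparable positions at all: the distance is undefined.
    return mismatches / compared if compared else float("nan")


# Toy data (made up): identical wherever both values are observed.
var1 = ["a", "b", None, "c", "a"]
var2 = ["a", "b", "b", "c", None]

d_skip = categorical_gower(var1, var2, skip_missing=True)
d_count = categorical_gower(var1, var2, skip_missing=False)
print(d_skip, d_count)
```

With pairwise deletion the two toy variables are at distance 0, while counting missing cells as mismatches gives a positive distance, which is the kind of discrepancy reported here.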

Python script:

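import pandas as pd
import numpy as np
import gower

# Load the data and mark '?' and '-' as missing values
df = pd.read_csv('Dataset.csv')
df = df.replace(['?'], [None])
df = df.replace(['-'], [None])

X = np.asarray(df)

# Transpose so that distances are computed between the columns of Dataset.csv
g_dist = gower.gower_matrix(np.transpose(X))

np.savetxt("Python_GD.csv", g_dist, delimiter=",")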

R script:

library(cluster)
library(dplyr)   # %>% and mutate_all
library(naniar)  # replace_with_na_all

Dataraw <- read.csv("Dataset.csv")
Data <- as.data.frame(t(Dataraw))
Data <- replace_with_na_all(Data, condition = ~.x %in% c("?", "-"))
Data <- Data %>% mutate_all(as.factor)

# Create Gower distance matrix from transposed matrix using daisy
D <- daisy(Data, metric = "gower")
D <- as.matrix(D)
summary(D)
write.csv(D, file = "R_daisy_GD.csv")

3 Attachments

Related

Tickets: #2

Discussion

  • Marcelo Beckmann

    Hi Tiago, this code is obsolete, so please don't use it; I'm going to
    decommission it. Have a look at PR #16834 of scikit-learn instead. They
    took over the initial code and implemented Gower in another way. I think
    it's almost done, including NaN treatment.

    Best regards,

    Marcelo Beckmann

    On Sun 20 Jun 2021, 21:43 Tiago, tsimoes@users.sourceforge.net wrote:


    Status: open
    Milestone: 2.0
    Created: Sun Jun 20, 2021 08:43 PM UTC by Tiago
    Last Updated: Sun Jun 20, 2021 08:43 PM UTC
    Owner: Marcelo Beckmann

    • Tiago

      Tiago - 2021-06-21

      Thanks, Marcelo!

       
