Dense pairwise similarity matrix of categorical variables
Source: R/assoc_similarity.R
assoc_similarity.Rd
Computes the full \(p \times p\) matrix of pairwise effect sizes for categorical variables, including pairs with zero association.
Usage
assoc_similarity(
  data,
  method = "cramers_v",
  corrected = FALSE,
  correct = FALSE,
  simulate_p = FALSE,
  B = 2000L,
  alpha = 0.5,
  what = c("effect_size", "p_value", "n", "all")
)
Arguments
- data
A data frame of categorical variables (same requirements as catgraph).
- method
Character. Association metric to use. One of "cramers_v" (default), "cramers_v_corrected", "nmi", "ami", or "bayesian_cramers_v". See build_graph for details.
- corrected
Logical. Deprecated shortcut for method = "cramers_v_corrected". Default FALSE.
- correct
Logical. Yates' continuity correction for chi-square. Default FALSE.
- simulate_p
Logical. Monte Carlo simulation for p-values (affects only the p-value matrix, not the effect-size matrix). Default FALSE.
- B
Integer. Number of Monte Carlo resamples. Default 2000L.
- alpha
Numeric. Dirichlet prior concentration for method = "bayesian_cramers_v". Default 0.5 (Jeffreys prior). Ignored for all other methods.
- what
Character. What to return: "effect_size" (default), "p_value", "n" (pairwise-complete observation count), or "all" (a list of matrices).
Value
A symmetric numeric matrix (or a list of matrices when
what = "all"). Diagonal is NA. Row and column names are
the variable names.
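Because the diagonal is NA, downstream code that expects a complete matrix has to fill it first. The following is a minimal sketch, not part of the package: it hard-codes the rounded effect sizes from the Examples output on this page so that it runs without the package installed, and the conversion to a distance (1 minus similarity) is an assumption made for illustration.

```r
# Effect sizes copied from the Examples output (rounded to 3 decimals).
vars <- c("Class", "Sex", "Age", "Survived")
S <- matrix(
  c(   NA, 0.399, 0.232, 0.294,
    0.399,    NA, 0.111, 0.456,
    0.232, 0.111,    NA, 0.098,
    0.294, 0.456, 0.098,    NA),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Assumed conversion: distance = 1 - similarity.
D <- 1 - S

# The diagonal is NA in S; a variable is at distance 0 from itself.
diag(D) <- 0

# Average-linkage hierarchical clustering on the dense distance matrix.
hc <- hclust(as.dist(D), method = "average")
hc$merge
```

Sex and Survived, the most strongly associated pair, are merged first. The same pattern (fill the diagonal, then pass the matrix on) applies to heatmap() and similar functions.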
Details
The matrix returned by this function is the correct input for
heatmap-style visualisation and for any analysis that requires a dense
similarity matrix. The igraph object returned by catgraph is the correct
input for topology (centrality, clustering, density, community
detection): it represents zero-association pairs as absent edges and
would therefore give misleading heatmaps.
This function performs the same pairwise computation as
build_graph but does not collapse the result into a graph,
so all pairs are represented. In v0.3.0 and earlier, the same output was
extracted from the graph via assoc_matrix(), but because the
graph forced zero-weight pairs to .Machine$double.eps, the
resulting matrix silently conflated "zero association" with "near-zero
association". From v0.4.0 onwards, use assoc_similarity() when you
want the full dense matrix and assoc_matrix() (the graph
extractor) when you want the matrix of actual edges.
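The dense pairwise computation described above can be sketched in base R. This is an illustrative stand-in, not the package's implementation: cramers_v() and S_sketch are hypothetical names, the formula is the textbook uncorrected Cramér's V, and the Titanic table is expanded by hand instead of with expand_table().

```r
# Expand the built-in Titanic contingency table to one row per person
# (base R stand-in for the expand_table() helper used in the Examples).
tab <- as.data.frame(Titanic)
df  <- tab[rep(seq_len(nrow(tab)), tab$Freq),
           c("Class", "Sex", "Age", "Survived")]

# Hypothetical helper: textbook Cramer's V without continuity correction.
cramers_v <- function(x, y) {
  tbl  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tbl, correct = FALSE)$statistic)
  k    <- min(nrow(tbl), ncol(tbl)) - 1
  unname(sqrt(chi2 / (sum(tbl) * k)))
}

# Fill every off-diagonal cell; nothing is dropped, so a pair with a
# statistic of exactly zero stays in the matrix as 0.
vars <- names(df)
p    <- length(vars)
S_sketch <- matrix(NA_real_, p, p, dimnames = list(vars, vars))
for (i in seq_len(p - 1)) {
  for (j in (i + 1):p) {
    S_sketch[i, j] <- S_sketch[j, i] <-
      cramers_v(df[[vars[i]]], df[[vars[j]]])
  }
}
round(S_sketch, 3)
```

If the default method = "cramers_v" follows the same textbook definition, the result should agree with the matrix in the Examples. The point of the sketch is the shape of the output: every pair of variables has a cell, whereas a graph only stores the pairs that became edges.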
Examples
df <- expand_table(Titanic)
S <- assoc_similarity(df)
round(S, 3)
#>          Class   Sex   Age Survived
#> Class       NA 0.399 0.232    0.294
#> Sex      0.399    NA 0.111    0.456
#> Age      0.232 0.111    NA    0.098
#> Survived 0.294 0.456 0.098       NA
# All three components at once
out <- assoc_similarity(df, what = "all")
str(out, max.level = 1)
#> List of 3
#> $ effect_size: num [1:4, 1:4] NA 0.399 0.232 0.294 0.399 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ p_value : num [1:4, 1:4] NA 1.56e-75 1.69e-25 5.00e-41 1.56e-75 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ n : num [1:4, 1:4] NA 2201 2201 2201 2201 ...
#> ..- attr(*, "dimnames")=List of 2