Computes pairwise effect sizes (phi or Cramer's V) for all pairs of
categorical columns in a data frame and returns the underlying
igraph object. This is the lower-level computational engine used by
catgraph. Most users should call catgraph
unless they specifically need direct access to the raw igraph
representation.
Usage
build_graph(
data,
method = "cramers_v",
corrected = FALSE,
correct = FALSE,
simulate_p = FALSE,
B = 2000L,
alpha = 0.5
)
Arguments
- data
A data frame or tibble. All columns are treated as categorical. Non-factor, non-character, non-logical columns are coerced to character with a message. Columns with only one unique observed value (after pairwise deletion) are dropped with a warning.
- method
Character. Association metric used to weight edges. One of: "cramers_v" (default, classical phi / Cramer's V), "cramers_v_corrected" (bias-corrected via Bergsma 2013), "nmi" (Normalised Mutual Information), "ami" (Adjusted Mutual Information, corrects NMI for chance), or "bayesian_cramers_v" (Dirichlet-smoothed Cramer's V).
- corrected
Logical. Deprecated shortcut: if TRUE, overrides method to "cramers_v_corrected". Kept for backward compatibility. Default FALSE.
- correct
Logical. Yates' continuity correction for chi-square. Default FALSE.
- simulate_p
Logical. Use Monte Carlo simulation for p-values. Default FALSE.
- B
Integer. Number of Monte Carlo resamples. Default 2000L.
- alpha
Numeric. Dirichlet prior concentration for method = "bayesian_cramers_v". Default 0.5 (Jeffreys prior). Ignored for all other methods.
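As a rough illustration of what the method, correct, simulate_p, and B arguments control, the sketch below computes the classical and bias-corrected statistics with base R only. It assumes the package follows the standard chi-square definition of Cramer's V and the correction described in Bergsma (2013), and that the p-value options behave like the corresponding stats::chisq.test arguments; neither assumption is confirmed by this page. The helper name cramers_v_sketch is hypothetical and is not part of the package.

```r
# Hypothetical helper (not part of the package): classical and
# bias-corrected Cramer's V for a two-way contingency table.
cramers_v_sketch <- function(tab, corrected = FALSE) {
  n <- sum(tab)
  r <- nrow(tab)
  k <- ncol(tab)
  # Pearson chi-square without Yates' continuity correction
  chi2 <- unname(stats::chisq.test(tab, correct = FALSE)$statistic)
  phi2 <- chi2 / n
  if (!corrected) {
    return(sqrt(phi2 / (min(r, k) - 1)))
  }
  # Bergsma (2013): shrink phi^2 and the effective table dimensions
  phi2_adj <- max(0, phi2 - (r - 1) * (k - 1) / (n - 1))
  r_adj <- r - (r - 1)^2 / (n - 1)
  k_adj <- k - (k - 1)^2 / (n - 1)
  sqrt(phi2_adj / (min(r_adj, k_adj) - 1))
}

tab <- margin.table(HairEyeColor, c(1, 2))  # Hair x Eye counts
v_classic <- cramers_v_sketch(tab)
v_corrected <- cramers_v_sketch(tab, corrected = TRUE)

# Monte Carlo p-value, analogous to simulate_p = TRUE with B = 2000L
# (assumed mapping onto chisq.test, for illustration only)
set.seed(1)
p_sim <- stats::chisq.test(tab, simulate.p.value = TRUE, B = 2000)$p.value
```

For this table the corrected value comes out slightly below the classical one. Yates' correction is left off here because stats::chisq.test applies it only to 2x2 tables.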
Value
An igraph undirected graph. Pairs with a true zero effect
size (no association whatsoever) are represented as absent edges
rather than near-zero edges, so the graph is sparse rather than
structurally complete. The attribute "processed_data" on the
returned graph holds the data frame actually used for estimation (after
coercion and constant-column removal), which downstream functions such
as catgraph_ci use when resampling.
The graph attribute "pair_results" stores the full pairwise
results table before zero-weight edges are omitted.
For ordinary package use, prefer catgraph, which wraps this
graph together with processed data, metadata, and S3 methods.
Vertex and edge attributes:
- Vertices
One per column in data, with vertex attribute name set to the column name. Isolated vertices are preserved even if all their pairs have zero effect size.
- Edge attribute weight
The phi or Cramer's V value.
- Edge attribute metric
"phi" or "cramers_v".
- Edge attribute corrected
Whether bias correction was applied.
- Edge attribute p_value
Chi-square p-value.
- Edge attribute statistic
Chi-square statistic.
- Edge attribute df
Degrees of freedom.
- Edge attribute n
Pairwise-complete observation count for that pair. Values can differ across edges when missingness is present (pairwise deletion).
- Edge attribute type
"2x2" or "RxC".
- Edge attribute estimable
Logical indicating whether the pairwise effect size was estimable before zero-weight omission.
Details
Computes effect sizes (phi or Cramer's V) for all pairs of categorical
columns in a data frame and returns an igraph object whose edge
weights correspond to those effect sizes. This is the main computational
engine used by the top-level catgraph constructor.
Scope. The returned graph represents pairwise marginal
association strength. It is not a conditional-independence graphical model:
an edge between A and B does not imply that the variables
remain dependent after controlling for the other variables in the data.
See the package vignette section "Scope and interpretation" for details.
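The distinction between marginal and conditional association can be made concrete with a small constructed table. In the base R sketch below, the counts are invented for illustration: A and B are exactly independent within each level of a third variable C, yet they are clearly associated once C is ignored. The helper phi_2x2 is hypothetical and is not part of the package.

```r
# Counts for binary A and B within two strata of C. Inside each stratum
# the cell counts factorise, so A and B are exactly independent given C.
counts <- array(
  c(64, 16, 16, 4,    # C = 0
    4, 16, 16, 64),   # C = 1
  dim = c(2, 2, 2),
  dimnames = list(A = c("0", "1"), B = c("0", "1"), C = c("0", "1"))
)

# Hypothetical helper: absolute phi coefficient for a 2x2 table
phi_2x2 <- function(tab) {
  num <- tab[1, 1] * tab[2, 2] - tab[1, 2] * tab[2, 1]
  den <- sqrt(prod(rowSums(tab)) * prod(colSums(tab)))
  abs(num / den)
}

phi_marginal <- phi_2x2(apply(counts, c(1, 2), sum))  # collapses over C
phi_given_c0 <- phi_2x2(counts[, , 1])                # within C = 0
phi_given_c1 <- phi_2x2(counts[, , 2])                # within C = 1
```

Here phi_marginal is 0.36 while both conditional values are 0, so a graph of marginal associations would draw an A-B edge even though the dependence is fully explained by C.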
All variable pairs with non-zero effect size are included by default. Use
prune_edges to remove edges below a weight or adjusted-p
threshold after construction.
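The kind of filtering described here can be approximated with igraph alone. The sketch below is not the implementation of prune_edges, whose signature is not shown on this page; it builds a toy graph with invented weight and p_value edge attributes, and the Benjamini-Hochberg adjustment and both thresholds are assumptions chosen for illustration.

```r
library(igraph)

# Toy edge list with invented weights and p-values
edges <- data.frame(
  from    = c("A", "A", "B"),
  to      = c("B", "C", "C"),
  weight  = c(0.45, 0.05, 0.20),
  p_value = c(0.001, 0.60, 0.03)
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Adjust p-values across all edges (BH is an assumed choice)
E(g)$p_adj <- p.adjust(E(g)$p_value, method = "BH")

# Drop edges that are weak or not significant after adjustment
weak <- E(g)[weight < 0.10 | p_adj > 0.05]
g_pruned <- delete_edges(g, weak)

n_edges_left <- ecount(g_pruned)
n_vertices_left <- vcount(g_pruned)
```

delete_edges removes only edges, so all three vertices remain in the pruned graph.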
Note on zero-weight pairs. In earlier versions of the package
(<= 0.3.0), zero-weight pairs were stored as edges with weight
.Machine$double.eps to guarantee a fully connected graph. This
made the graph structurally complete and silently inflated density-based
measures. From 0.4.0 onwards, zero-weight pairs are absent edges; a dense
similarity matrix suitable for heatmaps is available separately via
assoc_similarity or assoc_matrix.
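The effect on density-based measures can be seen on a toy example built directly with igraph. The three variables and the single non-zero weight below are invented; the sketch only mimics the two storage conventions described above and does not call the package.

```r
library(igraph)

vars <- data.frame(name = c("A", "B", "C"))

# Current convention: only the non-zero pair becomes an edge
sparse_edges <- data.frame(from = "A", to = "B", weight = 0.4)
g_sparse <- graph_from_data_frame(sparse_edges, directed = FALSE,
                                  vertices = vars)

# Older convention: zero pairs padded with .Machine$double.eps
padded_edges <- data.frame(
  from   = c("A", "A", "B"),
  to     = c("B", "C", "C"),
  weight = c(0.4, .Machine$double.eps, .Machine$double.eps)
)
g_padded <- graph_from_data_frame(padded_edges, directed = FALSE,
                                  vertices = vars)

dens_sparse <- edge_density(g_sparse)  # 1 of 3 possible edges
dens_padded <- edge_density(g_padded)  # all 3 possible edges
isolated <- names(which(degree(g_sparse) == 0))  # vertices with no edges
```

With padding, the density is 1 regardless of the data; without it, the density is 1/3 and C remains in the graph as an isolated vertex.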
References
Bergsma, W. (2013). A bias-correction for Cramer's V and Tschuprow's T. Journal of the Korean Statistical Society, 42(3), 323–328. doi:10.1016/j.jkss.2012.10.002
Good, I. J. (1965). The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT Press.
Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. doi:10.1002/047174882X
Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalisation and correction for chance. Journal of Machine Learning Research, 11, 2837–2854. https://jmlr.org/papers/v11/vinh10a.html
Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695. https://igraph.org
Examples
data(HairEyeColor)
df <- expand_table(HairEyeColor)
g <- build_graph(df[, c("Hair", "Eye")])
igraph::E(g)$weight
#> [1] 0.2790446