In many cases, the connected families that we have to integrate over may be large but sparsely connected. As an example, consider the following three generation family:
# the data set we use
dat <- data.frame(
id = 1:48,
mom = c(NA, NA, 2L, 2L, 2L, NA, NA, 7L, 7L, 7L, 3L, 3L, 3L, 3L, NA, 15L, 15L, 43L, 18L, NA, NA, 21L, 21L, 9L, 9L, 9L, 9L, NA, NA, 29L, 29L, 29L, 30L, 30L, NA, NA, 36L, 36L, 36L, 38L, 38L, NA, NA, 43L, 43L, 43L, 32L, 32L),
dad = c(NA, NA, 1L, 1L, 1L, NA, NA, 6L, 6L, 6L, 8L, 8L, 8L, 8L, NA, 4L, 4L, 42L, 5L, NA, NA, 20L, 20L, 22L, 22L, 22L, 22L, NA, NA, 28L, 28L, 28L, 23L, 23L, NA, NA, 35L, 35L, 35L, 31L, 31L, NA, NA, 42L, 42L, 42L, 45L, 45L),
sex = c(1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L))
# plot the pedigree
library(kinship2, quietly = TRUE)
ped <- with(dat, pedigree(id = id, dadid = dad, momid = mom, sex = sex))
par(mar = c(1, 1, 1, 1))
plot(ped)
In the plot, circles are females and squares are males. An individual may be repeated in which case duplicate entries are illustrated by a dashed line. If we observe outcomes for all 48 members of the family then we have to do 48 dimensional as everybody are marginally dependent if we have an additive genetic effect or similar effects. We assume that this is true in the rest of this vignette. Though it is possible also use the same method we describe here in other cases if one specifies a suitable graph, like the one we will get to soon, for the random effect structure.
To see that everybody are connected, notice that
The relationships can also be represented as graph. For this purposes, we assign the functions below to create a list of edges between parents and their children. You can likely skip the functions.
# returns the mothers id.
#
# Args:
# pedigree: the pedigree object.
get_momid <- function(peddat)
with(peddat,
vapply(mindex, function(x) if(x > 0L) id[x] else NA_integer_, 1L))
# returns the fathers id.
#
# Args:
# pedigree: the pedigree object.
get_dadid <- function(peddat)
with(peddat,
vapply(findex, function(x) if(x > 0L) id[x] else NA_integer_, 1L))
# creates an edge list to pass to igraph. An edge is included between children
# and parents.
#
# Args:
# pedigree: the pedigree object.
create_igraph_input <- function(peddat){
id <- peddat$id
father <- get_dadid(peddat)
mother <- get_momid(peddat)
# TODO: this is O(n^2)
stopifnot(anyDuplicated(id) < 1)
out <- lapply(id, function(x){
# find the children
children_idx <- which(x == father | x == mother)
children <- if(length(children_idx) > 0)
id[children_idx] else NULL
# get the correct order (smallest first) and return
is_larger <- x > children
cbind(
ifelse(is_larger, children, x ),
ifelse(is_larger, x , children))
})
out <- do.call(rbind, out)
as.data.frame(out[!duplicated(out), ])
}
The graph version of the family looks like this:
# For some reason, kinship2::pedigree requires that we provide both a father
# and mother or none. Therefore, we create a mock object. You can skip this
get_pedigree_mock <- function(id, dadid, momid, sex){
if(is.factor(sex))
sex <- as.integer(sex)
# checks
n <- length(id)
stopifnot(n > 0, length(dadid) == n, length(momid) == n, length(sex) == n,
all(is.finite(sex)), all(sex %in% 1:2),
all(is.na(dadid) | dadid %in% id),
all(is.na(momid) | momid %in% id),
all(is.finite(id)))
# create objects to return
findex <- match(dadid, id, nomatch = 0L)
mindex <- match(momid, id, nomatch = 0L)
structure(
list(famid = rep(1L, n), id = id, findex = findex, mindex = mindex,
sex = factor(sex, levels = 1:2, labels = c("male", "famle"))),
class = "pedigree")
}
# assign function to plot the pedigree as a graph
do_graph_plot <- function(dat){
ped <- with(dat, get_pedigree_mock(
id = id, dadid = dad, momid = mom, sex = sex))
library(igraph, quietly = TRUE)
g_dat <- create_igraph_input(ped)
graph_fam <- graph.data.frame(g_dat, directed = FALSE)
par(mar = c(1, 1, 1, 1))
plot(graph_fam, vertex.size = 12, vertex.color = "gray",
vertex.label.cex = .75, layout = layout_with_kk)
}
do_graph_plot(dat)
A node/vertex in the graph represents an individual and an edge indicates a parent-child relation. The graph have the property that if we split (find a connected partition of) the above into two sub-families (subgraphs) by removing child-parent relations (removing edges) then we only have to do two smaller integrals rather than one large. Having said that, the 48 dimensional integral that we show here is not a problem but higher dimensional integrals may be.
The above suggests the following procedure for simplifying the computational problem:
We have implemented procedures to do just this. It also turns out that it is fine to not satisfy number 1. That is, to not get two connected sets in the partition. We will return to this later and to what we have implemented but first provide an example. We start by splitting the above family in way that only satisfies 1.:
# get the partition
library(pedmod)
only_balanced <- max_balanced_partition_pedigree(
id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L)
#> Exited early. Balance criterion is 4
#> Exited early. Balance criterion is 7
#> Cannot improve balance criterion further
#> Found split. Balance criterion is 24
#> Found perfect balanced connected partition early. Balance criterion is 24
#> Found approximately balanced connected partition. Balance criterion is 24
only_balanced$balance_criterion # the smallest of the two sets
#> [1] 24
removed_edges <- only_balanced$removed_edges
removed_edges # the relationships we have removed
#> from to
#> [1,] 6 8
#> [2,] 6 10
#> [3,] 7 9
#> [4,] 32 47
#> [5,] 32 48
# assign function to get the new data. You can likely skip this
get_reduced_data <- function(dat, removed_edges){
for(i in 1:nrow(removed_edges)){
. <- function(child, parent){
idx <- which(dat$id == removed_edges[i, child])
idx_m <- which(dat$mom[idx] == removed_edges[i, parent])
if(length(idx_m > 0)){
dat[idx[idx_m], "mom"] <<- NA_integer_
return(TRUE)
}
idx_d <- which(dat$dad[idx] == removed_edges[i, parent])
if(length(idx_d > 0)){
dat[idx[idx_d], "dad"] <<- NA_integer_
return(TRUE)
}
FALSE
}
if(.(1L, 2L))
next
.(2L, 1L)
}
dat
}
new_dat <- get_reduced_data(dat, removed_edges)
# redo the plot
do_graph_plot(new_dat)
The above shows the family after we remove the 5 relationships (edges) that our algorithm finds. To illustrate where the cut is in the original graph, we can color the vertices according to which set they are in:
# assign function to show the split in the original graph
show_split <- function(dat, partition_obj){
ped <- with(dat, pedigree(id = id, dadid = dad, momid = mom, sex = sex))
g_dat <- create_igraph_input(ped)
graph_fam <- graph.data.frame(g_dat, directed = FALSE)
nam <- vertex_attr(graph_fam)$name
V(graph_fam)$color[nam %in% partition_obj$set_1] <- "lightblue"
V(graph_fam)$color[nam %in% partition_obj$set_2] <- "lightgray"
par(mar = c(1, 1, 1, 1))
plot(graph_fam, vertex.size = 12, vertex.label.cex = .75,
layout = layout_with_kk)
}
show_split(dat, only_balanced)
Next, we use the slack
argument to reduce the number of relationships we remove (the number of edges we cut):
also_cut <- max_balanced_partition_pedigree(
id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L,
slack = .1)
#> Exited early. Balance criterion is 4
#> Exited early. Balance criterion is 7
#> Cannot improve balance criterion further
#> Found split. Balance criterion is 24
#> Found perfect balanced connected partition early. Balance criterion is 24
#> Found approximately balanced connected partition. Balance criterion is 24
#> Starting to reduce the cost of the cut in a block of size 38. The partition in the block consists of two sets of size 17 and 21. The cut cost is 5
#> Starting iteration 1
#> Found 6 vertices to move
#> Keept 2 moves with a gain of 1 and a balance criterion of 24
#> The cut cost is 4
#> Starting iteration 2
#> Found 6 vertices to move
#> Keept 0 moves
#> The cut cost is 4
also_cut$balance_criterion # the smallest of the two sets
#> [1] 24
removed_edges <- also_cut$removed_edges
removed_edges # the relationships we have removed
#> from to
#> [1,] 6 9
#> [2,] 7 9
#> [3,] 32 48
#> [4,] 45 47
# redo the plot
show_split(dat, also_cut)
We only removed 4 relationships (edges) this time.
In many applications, we only observe some individuals. Therefore, we want the to split the family into equal sizes in terms of the individuals that we actually observe. We can do this by providing individual weights (vertex weights).
As an example, we will take the family we have been working with and assume that we only observe individuals in the final generation. Thus, we will set the weight of these individuals to a one and the remaining to some small positive number (say \(10^{-5}\)):
# add the weights
is_final <- c(16:17, 11:14, 24:27, 33:34, 40:41, 47:48, 19L)
dat$id_weight <- ifelse(dat$id %in% is_final, 1., 1e-5)
weighted_partition <- max_balanced_partition_pedigree(
id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L,
slack = .1, id_weight = dat$id_weight)
#> Exited early. Balance criterion is 4e-05
#> Exited early. Balance criterion is 2.00005
#> Cannot improve balance criterion further
#> Found split. Balance criterion is 8.00019
#> Exited early. Balance criterion is 2.00001
#> Found approximately balanced connected partition. Balance criterion is 8.00019
#> Starting to reduce the cost of the cut in a block of size 38. The partition in the block consists of two sets of size 20 and 18. The cut cost is 6
#> Starting iteration 1
#> Found 8 vertices to move
#> Keept 1 moves with a gain of 2 and a balance criterion of 8.00018
#> The cut cost is 4
#> Starting iteration 2
#> Found 7 vertices to move
#> Keept 0 moves
#> The cut cost is 4
weighted_partition$balance_criterion # the smallest of the two sets
#> [1] 8
removed_edges <- weighted_partition$removed_edges
removed_edges # the relationships we have removed
#> from to
#> [1,] 6 8
#> [2,] 7 8
#> [3,] 32 47
#> [4,] 32 48
# plot the new graph
plot_weighted <- function(dat, removed_edges){
new_dat <- get_reduced_data(dat, removed_edges)
ped <- with(new_dat, get_pedigree_mock(
id = id, dadid = dad, momid = mom, sex = sex))
g_dat <- create_igraph_input(ped)
graph_fam <- graph.data.frame(g_dat, directed = FALSE)
V(graph_fam)$color <- ifelse(vertex_attr(graph_fam)$name %in% is_final,
"gray", "white")
par(mar = c(1, 1, 1, 1))
plot(graph_fam, vertex.size = 12, vertex.label.cex = .75,
layout = layout_with_kk)}
plot_weighted(dat, removed_edges)
show_split(dat, weighted_partition)
We have colored vertices which we observe in the first graph to make it easy to see that there is a as close as possible to an equal number in each sub-family.
We can notice that the previous solution removes relations between the individuals we observe and their immediate ancestors. We can add a larger weight to these relationships to avoid or discourage this. To do so, we need to use the father_weight
and mother_weight
arguments as shown below:
# add the weights
is_final <- c(16:17, 11:14, 24:27, 33:34, 40:41, 47:48, 19L)
dat$id_weight <- ifelse(dat$id %in% is_final, 1., 1e-5)
# add the edge weights
dat$father_weight <- dat$mother_weight <- ifelse(dat$id %in% is_final, 10., 1.)
# find the partition
weighted_partition <- max_balanced_partition_pedigree(
id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L,
slack = .1, id_weight = dat$id_weight, father_weight = dat$father_weight,
mother_weight = dat$mother_weight)
#> Exited early. Balance criterion is 4e-05
#> Exited early. Balance criterion is 2.00005
#> Cannot improve balance criterion further
#> Found split. Balance criterion is 8.00019
#> Exited early. Balance criterion is 2.00001
#> Found approximately balanced connected partition. Balance criterion is 8.00019
#> Starting to reduce the cost of the cut in a block of size 38. The partition in the block consists of two sets of size 20 and 18. The cut cost is 60
#> Starting iteration 1
#> Found 8 vertices to move
#> Keept 2 moves with a gain of 56 and a balance criterion of 8.00017
#> The cut cost is 4
#> Starting iteration 2
#> Found 6 vertices to move
#> Keept 0 moves
#> The cut cost is 4
weighted_partition$balance_criterion # the smallest of the two sets
#> [1] 8
removed_edges <- weighted_partition$removed_edges
removed_edges # the relationships we have removed
#> from to
#> [1,] 6 8
#> [2,] 7 8
#> [3,] 28 32
#> [4,] 29 32
# plot the new graph
plot_weighted(dat, removed_edges)
show_split(dat, weighted_partition)
We end up cutting the link to the grandparents instead.
We may not require that the two sets in the partition is connected. The unconnected_partition_pedigree
function does not impose this constraint. An example is given below:
# without weights
partition <- unconnected_partition_pedigree(
id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L,
slack = .1)
#> Starting from an initial paratition with a balance criterion of 0 with a total vertex weight of 48
#> The cut cost is 0
#> Starting to build the first partition that satisfy the balance criterion
#> Starting from an initial paratition with a balance criterion of 20 and a cut cost of 10
#> Starting iteration 1
#> Found 47 vertices to move
#> Keept 14 moves with a gain of 3 and a balance criterion of 24
#> The cut cost is 7
#> Starting iteration 2
#> Found 47 vertices to move
#> Keept 15 moves with a gain of 1 and a balance criterion of 21
#> The cut cost is 6
#> Starting iteration 3
#> Found 47 vertices to move
#> Keept 3 moves with a gain of 0 and a balance criterion of 24
#> The cut cost is 6
#> Starting iteration 4
#> Found 47 vertices to move
#> Keept 0 moves
#> The cut cost is 6
partition$removed_edges # the relationships we have removed
#> from to
#> [1,] 1 3
#> [2,] 2 3
#> [3,] 28 31
#> [4,] 29 31
#> [5,] 32 47
#> [6,] 32 48
# show the partition
show_split(dat, partition)
# with weights
weighted_partition <- unconnected_partition_pedigree(
id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L,
slack = .1, id_weight = dat$id_weight, father_weight = dat$father_weight,
mother_weight = dat$mother_weight)
#> Starting from an initial paratition with a balance criterion of 0 with a total vertex weight of 17.0003
#> The cut cost is 0
#> Starting to build the first partition that satisfy the balance criterion
#> Starting from an initial paratition with a balance criterion of 7.00025 and a cut cost of 10
#> Starting iteration 1
#> Found 47 vertices to move
#> Keept 5 moves with a gain of 6 and a balance criterion of 7.0002
#> The cut cost is 4
#> Starting iteration 2
#> Found 47 vertices to move
#> Keept 0 moves
#> The cut cost is 4
weighted_partition$removed_edges # the relationships we have removed
#> from to
#> [1,] 1 5
#> [2,] 2 5
#> [3,] 20 22
#> [4,] 21 22
# show the partition
plot_weighted(dat, weighted_partition$removed_edges)
show_split(dat, weighted_partition)
We can also start from a connected partition like below:
# get the connected partition
connected_partition <- max_balanced_partition_pedigree(
id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L,
slack = .1, id_weight = dat$id_weight, father_weight = dat$father_weight,
mother_weight = dat$mother_weight)
#> Exited early. Balance criterion is 4e-05
#> Exited early. Balance criterion is 2.00005
#> Cannot improve balance criterion further
#> Found split. Balance criterion is 8.00019
#> Exited early. Balance criterion is 2.00001
#> Found approximately balanced connected partition. Balance criterion is 8.00019
#> Starting to reduce the cost of the cut in a block of size 38. The partition in the block consists of two sets of size 20 and 18. The cut cost is 60
#> Starting iteration 1
#> Found 8 vertices to move
#> Keept 2 moves with a gain of 56 and a balance criterion of 8.00017
#> The cut cost is 4
#> Starting iteration 2
#> Found 6 vertices to move
#> Keept 0 moves
#> The cut cost is 4
# get the plausibly unconnected partition
partition <- unconnected_partition_pedigree(
id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L,
slack = .1, id_weight = dat$id_weight, father_weight = dat$father_weight,
mother_weight = dat$mother_weight, init = connected_partition$set_2)
#> Starting from an initial paratition with a balance criterion of 8.00017 with a total vertex weight of 17.0003
#> The cut cost is 4
#> Starting to build the first partition that satisfy the balance criterion
#> Starting from an initial paratition with a balance criterion of 8.00017 and a cut cost of 4
#> Starting iteration 1
#> Found 47 vertices to move
#> Keept 0 moves
#> The cut cost is 4
# show the partition
show_split(dat, partition)
We will go through the details of the implemented methods in this section. All the methods we have implemented are for general graphs so we will cast the problem for a general graph \(G = (V, E)\) with vertices \(V\) and edges \(E\) with some vertex weight function \(w\). The first problem we need to solve is to implement a method to approximately solve the maximally balanced connected partition problem. That is, we want to find a partition (\(V_1\), \(V_2\)) of the graph such that \(V_1\) is connected and \(V_2\) is connected. The partition we find needs to maximize \(\min (w(V_1), w(V_2))\).
We have implemented the method suggested by Chlebíková (1996). This methods requires that one can find the cut vertices (articulation points) and can construct the block-cut tree. Thus, we have implemented the method suggested by Hopcroft and Tarjan (1973). The method suggested by Hopcroft and Tarjan (1973) can be accessed through the biconnected_components
and block_cut_tree
functions. We illustrate this below on a simulated graph:
# simulates a connected graph of a given size
sim_graph <- function(size){
stopifnot(size > 2)
w_all <- (500:1)^(-2)
out <- lapply(3:size, function(n){
n_links <- sample.int(4L, 1L, prob = c(.5, .4, .05, .05))
n_links <- min(n - 1L, n_links)
idx_start <- max(1L, n - 500L)
idx_end <- n - 1L
prob <- if(n > 502L) w_all else (idx_end:idx_start)^(-2)
cbind(from = n, to = sample(idx_start:idx_end, n_links, prob = prob))
})
out <- c(list(cbind(1L, 2L)), out)
out <- do.call(rbind, out)
ids <- sample.int(2L * size, size)
out[] <- ids[out]
out <- t(apply(out, 1L, sort))
setNames(as.data.frame(out), c("from", "to"))
}
# simulates a data set and illustrate it
set.seed(45L)
sim_dat <- sim_graph(50L)
par(mar = c(1, 1, 1, 1))
plot(graph.data.frame(sim_dat, directed = FALSE), layout = layout_with_kk,
vertex.size = 12, vertex.color = "gray", vertex.label.cex = .75)
# get the biconnected components
cut_pts <- biconnected_components(from = sim_dat$from, to = sim_dat$to)
length(cut_pts) # number of biconnected components
#> [1] 4
The algorithm suggested by Chlebíková (1996) can be used through the max_balanced_partition
function with slack == 0
:
# find the approximately balanced connected partition
max_part <- max_balanced_partition(from = sim_dat$from, to = sim_dat$to,
trace = 2L)
#> Exited early. Balance criterion is 2
#> Exited early. Balance criterion is 3
#> Cannot improve balance criterion further
#> Found split. Balance criterion is 25
#> Found perfect balanced connected partition early. Balance criterion is 25
#> Found approximately balanced connected partition. Balance criterion is 25
max_part$balance_criterion # the balance criterion
#> [1] 25
# the removed edges
removed_edges <- max_part$removed_edges
removed_edges
#> from to
#> [1,] 3 6
#> [2,] 3 64
#> [3,] 3 68
#> [4,] 5 68
#> [5,] 5 77
#> [6,] 6 51
#> [7,] 9 44
#> [8,] 11 47
#> [9,] 11 68
#> [10,] 11 73
#> [11,] 11 77
#> [12,] 28 29
#> [13,] 29 32
#> [14,] 32 37
#> [15,] 32 78
#> [16,] 32 85
#> [17,] 33 69
#> [18,] 33 96
#> [19,] 34 56
#> [20,] 34 68
#> [21,] 34 96
#> [22,] 51 64
#> [23,] 62 68
# assign function to plot the partition
do_plot <- function(dat, partition_obj){
removed_edges <- partition_obj$removed_edges
to_remove <- apply(removed_edges, 1L, function(x)
which(dat$from == x[1] & dat$to == x[2]))
new_dat <- dat[-to_remove, ]
par(mar = c(1, 1, 1, 1))
plot(graph.data.frame(new_dat, directed = FALSE), layout = layout_with_kk,
vertex.size = 12, vertex.color = "gray", vertex.label.cex = .75)
}
do_plot(sim_dat, max_part)
# assign function to show the split in the original graph
show_split <- function(dat, partition_obj, vertex.size = 12,
vertex.label = NULL){
graph_fam <- graph.data.frame(dat, directed = FALSE)
nam <- vertex_attr(graph_fam)$name
V(graph_fam)$color[nam %in% partition_obj$set_1] <- "lightblue"
V(graph_fam)$color[nam %in% partition_obj$set_2] <- "lightgray"
par(mar = c(1, 1, 1, 1))
plot(graph_fam, vertex.size = vertex.size, vertex.label.cex = .75,
layout = layout_with_kk, vertex.label = vertex.label)
}
show_split(sim_dat, max_part)
Our implementation of the method by Chlebíková (1996) is outlined at Implementation Details. Currently, we re-compute the cut vertices (articulation points) at every iteration but it may be possible to update these which presumably will be faster and reduce the overall computation time. We do find a perfectly balanced partition above where each set is of size 25. However, we cut 23 edges to do this. Thus, we implemented a greedy method to reduce the number of cut edges while maintaining a roughly balanced connected partition which we outline next.
Given an approximately balanced connected partition \((V_1, V_2)\) and vertex weight function \(w_H\) as defined by Chlebíková (1996), let \(B\) be the block in which the partition is split, \(\tilde V_1 = V_1 \cap B\) and \(\tilde V_2 = V_2 \cap B\), and let \(I_1 \subseteq \tilde V_2\) be the vertices that are connected with any vertex in \(\tilde V_1\) and define \(I_2\) similarly. Then
The method is very similar to the original Kernighan–Lin algorithm. It differs by the constraint that the two sets in the partition should be connected, that we only move one vertex in every move (like Fiduccia and Mattheyses 1982), and that the partition is only approximately balanced (like Fiduccia and Mattheyses 1982). We also limit the number of moves we make as Karypis and Kumar (1998). Our implementations to find a potentially unconnected partition (unconnected_partition
and unconnected_partition_pedigree
) proceeds in a similar manner as above except that we do not require that the sets are connected. This substantially simplifies the algorithm.
\(\epsilon\) is set with the slack
argument of max_balanced_partition
and the maximum number of moves to make in each iteration is set by the max_kl_it_inner
argument of max_balanced_partition
. We illustrate the method below:
# do the same but with also with the refinement step
max_part <- max_balanced_partition(
from = sim_dat$from, to = sim_dat$to, slack = .1, max_kl_it_inner = 50L,
max_kl_it = 1000L, trace = 1L)
#> Found perfect balanced connected partition early. Balance criterion is 25
#> Found approximately balanced connected partition. Balance criterion is 25
#> Starting to reduce the cost of the cut in a block of size 46. The partition in the block consists of two sets of size 22 and 24. The cut cost is 23
#> Starting iteration 1
#> Found 46 vertices to move
#> Keept 6 moves with a gain of 9 and a balance criterion of 25
#> The cut cost is 14
#> Starting iteration 2
#> Found 46 vertices to move
#> Keept 15 moves with a gain of 6 and a balance criterion of 21
#> The cut cost is 8
#> Starting iteration 3
#> Found 46 vertices to move
#> Keept 6 moves with a gain of 3 and a balance criterion of 25
#> The cut cost is 5
#> Starting iteration 4
#> Found 46 vertices to move
#> Keept 0 moves
#> The cut cost is 5
max_part$balance_criterion # the balance criterion
#> [1] 25
removed_edges <- max_part$removed_edges
removed_edges
#> from to
#> [1,] 3 68
#> [2,] 4 86
#> [3,] 6 68
#> [4,] 30 35
#> [5,] 34 68
# redo the plot
do_plot(sim_dat, max_part)
show_split(sim_dat, max_part)
We will take a larger example where we also focus on the computation time. The example is shown below:
# simulate a data set with 1000 observations
set.seed(45L)
sim_dat <- sim_graph(1000L)
# get the biconnected components
cut_pts <- biconnected_components(from = sim_dat$from, to = sim_dat$to)
length(cut_pts) # number of biconnected components
#> [1] 56
# find the approximately balanced connected partition
system.time(
max_part <- max_balanced_partition(from = sim_dat$from, to = sim_dat$to,
trace = 1L))
#> Found perfect balanced connected partition early. Balance criterion is 500
#> Found approximately balanced connected partition. Balance criterion is 500
#> user system elapsed
#> 0.088 0.000 0.088
max_part$balance_criterion # the balance criterion
#> [1] 500
NROW(max_part$removed_edges) # number of removed edges
#> [1] 282
# improve on the number of cut vertices
system.time(
max_part <- max_balanced_partition(from = sim_dat$from, to = sim_dat$to,
trace = 1L, slack = .1,
max_kl_it_inner = 1000L))
#> Found perfect balanced connected partition early. Balance criterion is 500
#> Found approximately balanced connected partition. Balance criterion is 500
#> Starting to reduce the cost of the cut in a block of size 940. The partition in the block consists of two sets of size 494 and 446. The cut cost is 282
#> Starting iteration 1
#> Found 876 vertices to move
#> Keept 207 moves with a gain of 143 and a balance criterion of 496
#> The cut cost is 139
#> Starting iteration 2
#> Found 713 vertices to move
#> Keept 69 moves with a gain of 34 and a balance criterion of 500
#> The cut cost is 105
#> Starting iteration 3
#> Found 806 vertices to move
#> Keept 30 moves with a gain of 6 and a balance criterion of 500
#> The cut cost is 99
#> Starting iteration 4
#> Found 845 vertices to move
#> Keept 0 moves
#> The cut cost is 99
#> user system elapsed
#> 0.886 0.000 0.887
max_part$balance_criterion # the balance criterion
#> [1] 500
NROW(max_part$removed_edges) # number of removed edges
#> [1] 99
# show the split
show_split(sim_dat, max_part, vertex.size = 3, vertex.label = NA)
We can also use the method for the potentially unconnected partition:
# assign a variable with the connected partition
connected_partition <- max_part
# get the partition which may be unconnected
system.time(
max_part <- unconnected_partition(from = sim_dat$from, to = sim_dat$to,
trace = 1L, slack = .1,
max_kl_it_inner = 1000L))
#> Starting from an initial paratition with a balance criterion of 0 with a total vertex weight of 1000
#> The cut cost is 0
#> Starting to build the first partition that satisfy the balance criterion
#> Starting from an initial paratition with a balance criterion of 400 and a cut cost of 75
#> Starting iteration 1
#> Found 999 vertices to move
#> Keept 135 moves with a gain of 40 and a balance criterion of 403
#> The cut cost is 35
#> Starting iteration 2
#> Found 999 vertices to move
#> Keept 10 moves with a gain of 5 and a balance criterion of 413
#> The cut cost is 30
#> Starting iteration 3
#> Found 999 vertices to move
#> Keept 975 moves with a gain of 11 and a balance criterion of 438
#> The cut cost is 19
#> Starting iteration 4
#> Found 999 vertices to move
#> Keept 0 moves
#> The cut cost is 19
#> user system elapsed
#> 0.006 0.000 0.007
max_part$balance_criterion # the balance criterion
#> [1] 438
NROW(max_part$removed_edges) # number of removed edges
#> [1] 19
# show the split
show_split(sim_dat, max_part, vertex.size = 3, vertex.label = NA)
# we can start with random initialization
init <- unique(c(sim_dat$from, sim_dat$to))
init <- sample(init, ceiling(length(init) / 2))
system.time(
max_part <- unconnected_partition(from = sim_dat$from, to = sim_dat$to,
trace = 1L, slack = .1,
max_kl_it_inner = 1000L,
init = init))
#> Starting from an initial paratition with a balance criterion of 500 with a total vertex weight of 1000
#> The cut cost is 856
#> Starting to build the first partition that satisfy the balance criterion
#> Starting from an initial paratition with a balance criterion of 500 and a cut cost of 856
#> Starting iteration 1
#> Found 999 vertices to move
#> Keept 415 moves with a gain of 692 and a balance criterion of 491
#> The cut cost is 164
#> Starting iteration 2
#> Found 999 vertices to move
#> Keept 81 moves with a gain of 52 and a balance criterion of 500
#> The cut cost is 112
#> Starting iteration 3
#> Found 999 vertices to move
#> Keept 44 moves with a gain of 8 and a balance criterion of 492
#> The cut cost is 104
#> Starting iteration 4
#> Found 999 vertices to move
#> Keept 981 moves with a gain of 1 and a balance criterion of 473
#> The cut cost is 103
#> Starting iteration 5
#> Found 999 vertices to move
#> Keept 23 moves with a gain of 12 and a balance criterion of 480
#> The cut cost is 91
#> Starting iteration 6
#> Found 999 vertices to move
#> Keept 38 moves with a gain of 1 and a balance criterion of 442
#> The cut cost is 90
#> Starting iteration 7
#> Found 999 vertices to move
#> Keept 996 moves with a gain of 2 and a balance criterion of 438
#> The cut cost is 88
#> Starting iteration 8
#> Found 999 vertices to move
#> Keept 996 moves with a gain of 4 and a balance criterion of 434
#> The cut cost is 84
#> Starting iteration 9
#> Found 999 vertices to move
#> Keept 30 moves with a gain of 7 and a balance criterion of 452
#> The cut cost is 77
#> Starting iteration 10
#> Found 999 vertices to move
#> Keept 0 moves
#> The cut cost is 77
#> user system elapsed
#> 0.016 0.000 0.016
max_part$balance_criterion # the balance criterion
#> [1] 452
NROW(max_part$removed_edges) # number of removed edges
#> [1] 77
# show the split
show_split(sim_dat, max_part, vertex.size = 3, vertex.label = NA)
# we can also start from the connected partition
system.time(
max_part <- unconnected_partition(from = sim_dat$from, to = sim_dat$to,
trace = 1L, slack = .1,
max_kl_it_inner = 1000L,
init = connected_partition$set_1))
#> Starting from an initial paratition with a balance criterion of 500 with a total vertex weight of 1000
#> The cut cost is 99
#> Starting to build the first partition that satisfy the balance criterion
#> Starting from an initial paratition with a balance criterion of 500 and a cut cost of 99
#> Starting iteration 1
#> Found 999 vertices to move
#> Keept 79 moves with a gain of 30 and a balance criterion of 495
#> The cut cost is 69
#> Starting iteration 2
#> Found 999 vertices to move
#> Keept 990 moves with a gain of 4 and a balance criterion of 495
#> The cut cost is 65
#> Starting iteration 3
#> Found 999 vertices to move
#> Keept 17 moves with a gain of 2 and a balance criterion of 500
#> The cut cost is 63
#> Starting iteration 4
#> Found 999 vertices to move
#> Keept 52 moves with a gain of 6 and a balance criterion of 494
#> The cut cost is 57
#> Starting iteration 5
#> Found 999 vertices to move
#> Keept 8 moves with a gain of 9 and a balance criterion of 500
#> The cut cost is 48
#> Starting iteration 6
#> Found 999 vertices to move
#> Keept 0 moves
#> The cut cost is 48
#> user system elapsed
#> 0.01 0.00 0.01
max_part$balance_criterion # the balance criterion
#> [1] 500
NROW(max_part$removed_edges) # number of removed edges
#> [1] 48
# compare with
NROW(connected_partition$removed_edges)
#> [1] 99
show_split(sim_dat, max_part, vertex.size = 3, vertex.label = NA)
It is clear that the method is quite sensitive to the initialization.
Here is an even larger example:
# simulate a data set with 100000 observations
set.seed(45L)
sim_dat <- sim_graph(100000L)
# get the biconnected components
cut_pts <- biconnected_components(from = sim_dat$from, to = sim_dat$to)
length(cut_pts) # number of biconnected components
#> [1] 5694
# find the approximately balanced connected partition
system.time(
max_part <- max_balanced_partition(from = sim_dat$from, to = sim_dat$to,
trace = 1L))
#> Found perfect balanced connected partition early. Balance criterion is 50000
#> Found approximately balanced connected partition. Balance criterion is 50000
#> user system elapsed
#> 5.088 0.000 5.088
max_part$balance_criterion # the balance criterion
#> [1] 50000
NROW(max_part$removed_edges) # number of removed edges
#> [1] 890
# improve on the number of cut vertices
system.time(
max_part <- max_balanced_partition(from = sim_dat$from, to = sim_dat$to,
trace = 1L, slack = .1,
max_kl_it_inner = 1000L))
#> Found perfect balanced connected partition early. Balance criterion is 50000
#> Found approximately balanced connected partition. Balance criterion is 50000
#> Starting to reduce the cost of the cut in a block of size 7270. The partition in the block consists of two sets of size 2484 and 4786. The cut cost is 890
#> Starting iteration 1
#> Found 1000 vertices to move
#> Keept 530 moves with a gain of 414 and a balance criterion of 49895
#> The cut cost is 476
#> Starting iteration 2
#> Found 1000 vertices to move
#> Keept 223 moves with a gain of 95 and a balance criterion of 50000
#> The cut cost is 381
#> Starting iteration 3
#> Found 1000 vertices to move
#> Keept 121 moves with a gain of 5 and a balance criterion of 49983
#> The cut cost is 376
#> Starting iteration 4
#> Found 1000 vertices to move
#> Keept 19 moves with a gain of 1 and a balance criterion of 50000
#> The cut cost is 375
#> Starting iteration 5
#> Found 1000 vertices to move
#> Keept 315 moves with a gain of 6 and a balance criterion of 49998
#> The cut cost is 369
#> Starting iteration 6
#> Found 1000 vertices to move
#> Keept 204 moves with a gain of 45 and a balance criterion of 49996
#> The cut cost is 324
#> Starting iteration 7
#> Found 1000 vertices to move
#> Keept 91 moves with a gain of 7 and a balance criterion of 49975
#> The cut cost is 317
#> Starting iteration 8
#> Found 1000 vertices to move
#> Keept 21 moves with a gain of 0 and a balance criterion of 50000
#> The cut cost is 317
#> Starting iteration 9
#> Found 1000 vertices to move
#> Keept 0 moves
#> The cut cost is 317
#> user system elapsed
#> 25.37 0.00 25.37
max_part$balance_criterion # the balance criterion
#> [1] 50000
NROW(max_part$removed_edges) # number of removed edges
#> [1] 317
The computation time may depend on the ordering of the data set. It may be advantageous to allow our implementation to re-order the vertices e.g. when they are completely randomly ordered. We can see this my shuffling the data:
# shuffle the data
set.seed(1L)
sim_dat <- sim_dat[sample.int(NROW(sim_dat)), ]
# re-compute the partition without re-ordering the vertices
system.time(
max_part_shuffled <- max_balanced_partition(
from = sim_dat$from, to = sim_dat$to, trace = 1L, slack = .1,
max_kl_it_inner = 1000L, do_reorder = FALSE))
#> Found perfect balanced connected partition early. Balance criterion is 50000
#> Found approximately balanced connected partition. Balance criterion is 50000
#> Starting to reduce the cost of the cut in a block of size 7270. The partition in the block consists of two sets of size 2491 and 4779. The cut cost is 967
#> Starting iteration 1
#> Found 1000 vertices to move
#> Keept 604 moves with a gain of 463 and a balance criterion of 49934
#> The cut cost is 504
#> Starting iteration 2
#> Found 1000 vertices to move
#> Keept 395 moves with a gain of 94 and a balance criterion of 49999
#> The cut cost is 410
#> Starting iteration 3
#> Found 1000 vertices to move
#> Keept 199 moves with a gain of 47 and a balance criterion of 49994
#> The cut cost is 363
#> Starting iteration 4
#> Found 1000 vertices to move
#> Keept 24 moves with a gain of 7 and a balance criterion of 50000
#> The cut cost is 356
#> Starting iteration 5
#> Found 1000 vertices to move
#> Keept 0 moves
#> The cut cost is 356
#> user system elapsed
#> 23.25 0.00 23.25
max_part_shuffled$balance_criterion # the balance criterion
#> [1] 50000
NROW(max_part_shuffled$removed_edges) # number of removed edges
#> [1] 356
# re-compute the partition with re-ordering the vertices
system.time(
max_part_shuffled_and_ordered <- max_balanced_partition(
from = sim_dat$from, to = sim_dat$to, trace = 1L, slack = .1,
max_kl_it_inner = 1000L, do_reorder = TRUE))
#> Found perfect balanced connected partition early. Balance criterion is 50000
#> Found approximately balanced connected partition. Balance criterion is 50000
#> Starting to reduce the cost of the cut in a block of size 7270. The partition in the block consists of two sets of size 2491 and 4779. The cut cost is 967
#> Starting iteration 1
#> Found 1000 vertices to move
#> Keept 604 moves with a gain of 463 and a balance criterion of 49934
#> The cut cost is 504
#> Starting iteration 2
#> Found 1000 vertices to move
#> Keept 395 moves with a gain of 94 and a balance criterion of 49999
#> The cut cost is 410
#> Starting iteration 3
#> Found 1000 vertices to move
#> Keept 199 moves with a gain of 47 and a balance criterion of 49994
#> The cut cost is 363
#> Starting iteration 4
#> Found 1000 vertices to move
#> Keept 24 moves with a gain of 7 and a balance criterion of 50000
#> The cut cost is 356
#> Starting iteration 5
#> Found 1000 vertices to move
#> Keept 0 moves
#> The cut cost is 356
#> user system elapsed
#> 22.13 0.00 22.13
# we got the same
all.equal(max_part_shuffled_and_ordered, max_part_shuffled)
#> [1] TRUE
# may not match!
all.equal(max_part, max_part_shuffled)
#> [1] "Component \"removed_edges\": Attributes: < Component \"dim\": Mean relative difference: 0.123 >"
#> [2] "Component \"removed_edges\": Numeric: lengths (634, 712) differ"
#> [3] "Component \"set_1\": Mean relative difference: 0.0006784"
#> [4] "Component \"set_2\": Mean relative difference: 0.0006786"
The solution will possibly not match when you reorder the data you pass to max_balanced_partition
.
We can also compare the above with what we would have gotten if we had not required that the partition is connected.
# assign a variable with the connected partition
connected_partition <- max_part_shuffled
# get potentially unconnected partition
system.time(
max_part <- unconnected_partition(from = sim_dat$from, to = sim_dat$to,
trace = 1L, slack = .1,
max_kl_it_inner = 1000L))
#> Starting from an initial paratition with a balance criterion of 0 with a total vertex weight of 100000
#> The cut cost is 0
#> Starting to build the first partition that satisfy the balance criterion
#> Starting from an initial paratition with a balance criterion of 40000 and a cut cost of 7565
#> Starting iteration 1
#> Found 1000 vertices to move
#> Keept 1000 moves with a gain of 516 and a balance criterion of 40000
#> The cut cost is 7049
#> Starting iteration 2
#> Found 1000 vertices to move
#> Keept 1000 moves with a gain of 41 and a balance criterion of 40000
#> The cut cost is 7008
#> Starting iteration 3
#> Found 1000 vertices to move
#> Keept 6 moves with a gain of 2 and a balance criterion of 40000
#> The cut cost is 7006
#> Starting iteration 4
#> Found 1000 vertices to move
#> Keept 12 moves with a gain of 4 and a balance criterion of 40000
#> The cut cost is 7002
#> Starting iteration 5
#> Found 1000 vertices to move
#> Keept 432 moves with a gain of 9 and a balance criterion of 40000
#> The cut cost is 6993
#> Starting iteration 6
#> Found 1000 vertices to move
#> Keept 30 moves with a gain of 4 and a balance criterion of 40000
#> The cut cost is 6989
#> Starting iteration 7
#> Found 1000 vertices to move
#> Keept 32 moves with a gain of 15 and a balance criterion of 40000
#> The cut cost is 6974
#> Starting iteration 8
#> Found 1000 vertices to move
#> Keept 474 moves with a gain of 16 and a balance criterion of 40000
#> The cut cost is 6958
#> Starting iteration 9
#> Found 1000 vertices to move
#> Keept 112 moves with a gain of 16 and a balance criterion of 40000
#> The cut cost is 6942
#> Starting iteration 10
#> Found 1000 vertices to move
#> Keept 160 moves with a gain of 22 and a balance criterion of 40000
#> The cut cost is 6920
#> Starting iteration 11
#> Found 1000 vertices to move
#> Keept 462 moves with a gain of 13 and a balance criterion of 40000
#> The cut cost is 6907
#> Starting iteration 12
#> Found 1000 vertices to move
#> Keept 246 moves with a gain of 29 and a balance criterion of 40000
#> The cut cost is 6878
#> Starting iteration 13
#> Found 1000 vertices to move
#> Keept 292 moves with a gain of 24 and a balance criterion of 40000
#> The cut cost is 6854
#> Starting iteration 14
#> Found 1000 vertices to move
#> Keept 350 moves with a gain of 41 and a balance criterion of 40000
#> The cut cost is 6813
#> Starting iteration 15
#> Found 1000 vertices to move
#> Keept 370 moves with a gain of 7 and a balance criterion of 40000
#> The cut cost is 6806
#> Starting iteration 16
#> Found 1000 vertices to move
#> Keept 390 moves with a gain of 6 and a balance criterion of 40000
#> The cut cost is 6800
#> Starting iteration 17
#> Found 1000 vertices to move
#> Keept 750 moves with a gain of 24 and a balance criterion of 40000
#> The cut cost is 6776
#> Starting iteration 18
#> Found 1000 vertices to move
#> Keept 456 moves with a gain of 19 and a balance criterion of 40000
#> The cut cost is 6757
#> Starting iteration 19
#> Found 1000 vertices to move
#> Keept 466 moves with a gain of 3 and a balance criterion of 40000
#> The cut cost is 6754
#> Starting iteration 20
#> Found 1000 vertices to move
#> Keept 484 moves with a gain of 13 and a balance criterion of 40000
#> The cut cost is 6741
#> Starting iteration 21
#> Found 1000 vertices to move
#> Keept 504 moves with a gain of 9 and a balance criterion of 40000
#> The cut cost is 6732
#> Starting iteration 22
#> Found 1000 vertices to move
#> Keept 518 moves with a gain of 5 and a balance criterion of 40000
#> The cut cost is 6727
#> Starting iteration 23
#> Found 1000 vertices to move
#> Keept 530 moves with a gain of 8 and a balance criterion of 40000
#> The cut cost is 6719
#> Starting iteration 24
#> Found 1000 vertices to move
#> Keept 538 moves with a gain of 2 and a balance criterion of 40000
#> The cut cost is 6717
#> Starting iteration 25
#> Found 1000 vertices to move
#> Keept 0 moves
#> The cut cost is 6717
#> user system elapsed
#> 2.692 0.000 2.691
max_part$balance_criterion # the balance criterion
#> [1] 40000
NROW(max_part$removed_edges) # number of removed edges
#> [1] 6717
# improve on the connected partition
system.time(
max_part <- unconnected_partition(from = sim_dat$from, to = sim_dat$to,
trace = 1L, slack = .1,
max_kl_it_inner = 1000L,
init = connected_partition$set_1))
#> Starting from an initial paratition with a balance criterion of 50000 with a total vertex weight of 100000
#> The cut cost is 356
#> Starting to build the first partition that satisfy the balance criterion
#> Starting from an initial paratition with a balance criterion of 50000 and a cut cost of 356
#> Starting iteration 1
#> Found 1000 vertices to move
#> Keept 162 moves with a gain of 73 and a balance criterion of 49910
#> The cut cost is 283
#> Starting iteration 2
#> Found 1000 vertices to move
#> Keept 90 moves with a gain of 11 and a balance criterion of 49926
#> The cut cost is 272
#> Starting iteration 3
#> Found 1000 vertices to move
#> Keept 27 moves with a gain of 0 and a balance criterion of 49953
#> The cut cost is 272
#> Starting iteration 4
#> Found 1000 vertices to move
#> Keept 92 moves with a gain of 3 and a balance criterion of 49861
#> The cut cost is 269
#> Starting iteration 5
#> Found 1000 vertices to move
#> Keept 82 moves with a gain of 0 and a balance criterion of 49943
#> The cut cost is 269
#> Starting iteration 6
#> Found 1000 vertices to move
#> Keept 0 moves
#> The cut cost is 269
#> user system elapsed
#> 0.613 0.000 0.613
max_part$balance_criterion # the balance criterion
#> [1] 49943
NROW(max_part$removed_edges) # number of removed edges
#> [1] 269
The method completely failed this time except when we started from the balanced connected partition.
The algorithm by Chlebíková (1996) is given below. Given a block (2-connected graph) \(G=(V,E)\) and a vertex weight function \(w\):
The \(V_1 \cup \{u\}\) condition in step 2. is easy to deal with as one can keep track of vertices that have neighbors in the other set. We cannot remove \(u\in V_2\) in 2. if \(u\) is an articulation point (cut vertex) of \(V_2\) as then \(V_2\) is no longer connected. Therefore, an implementation may look as follows:
We can find the articulation points in \(\mathcal O(\lvert V_2\rvert)\) using the method suggested by Hopcroft and Tarjan (1973) and we do not need to do anything for every vertex \(u\in I\). Perhaps we can update \(A\) quicker then this though.
Chlebíková, Janka. 1996. “Approximating the Maximally Balanced Connected Partition Problem in Graphs.” Information Processing Letters 60 (5): 225–30. https://doi.org/10.1016/S0020-0190(96)00175-5.
Fiduccia, C.M., and R.M. Mattheyses. 1982. “A Linear-Time Heuristic for Improving Network Partitions.” In 19th Design Automation Conference, 175–81. https://doi.org/10.1109/DAC.1982.1585498.
Hopcroft, John, and Robert Tarjan. 1973. “Algorithm 447: Efficient Algorithms for Graph Manipulation.” Commun. ACM 16 (6): 372–78. https://doi.org/10.1145/362248.362272.
Karypis, George, and Vipin Kumar. 1998. “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs.” SIAM Journal on Scientific Computing 20 (1): 359–92. https://doi.org/10.1137/S1064827595287997.
Kernighan, B. W., and S. Lin. 1970. “An Efficient Heuristic Procedure for Partitioning Graphs.” The Bell System Technical Journal 49 (2): 291–307. https://doi.org/10.1002/j.1538-7305.1970.tb01770.x.