Title: | Longitudinal Consensus Clustering with 'flexmix' |
---|---|
Description: | An adaptation of the consensus clustering approach from 'ConsensusClusterPlus' for longitudinal data. The longitudinal data is clustered with flexible mixture models from 'flexmix', while the consensus matrices are hierarchically clustered as in 'ConsensusClusterPlus'. By using the flexibility of 'flexmix' and 'FactoMineR', one can use mixed data types for the clustering. |
Authors: | Jonas Hagenberg [aut, cre] , Matt Wilkerson [aut, cph], Peter Waltman [aut, cph], Max Planck Institute of Psychiatry [cph] |
Maintainer: | Jonas Hagenberg <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.2.0 |
Built: | 2024-11-12 06:13:34 UTC |
Source: | https://github.com/cellmapslab/longmixr |
This function uses the ConsensusClusterPlus function from the package with the same name, with defaults for clustering data with categorical variables. As the distance function, the Gower distance is used.
crosssectional_consensus_cluster(
  data,
  reps = 1000,
  finalLinkage = "ward.D2",
  innerLinkage = "ward.D2",
  ...
)
data |
a matrix or data.frame containing variables that should be used
for computing the distance. This argument is passed to |
reps |
number of repetitions, same as in |
finalLinkage |
linkage method for final clustering,
same as in |
innerLinkage |
linkage method for clustering steps,
same as in |
... |
other arguments passed to |
data can take all input data types that gower.dist can handle, i.e. numeric, character/factor, ordered and logical.
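To make the idea of a distance for mixed data types more concrete, the following is a minimal base R sketch of a Gower-style distance. It is a hand-rolled illustration and not the gower.dist implementation: the function name gower_sketch and the toy data are hypothetical, and ordered factors are simply treated as unordered categories here, which may differ from what gower.dist does.

```r
# Simplified Gower-style distance, for illustration only
gower_sketch <- function(df) {
  n <- nrow(df)
  d <- matrix(0, n, n)
  for (col in names(df)) {
    x <- df[[col]]
    if (is.numeric(x)) {
      # numeric variable: absolute difference scaled by the variable's range
      rng <- diff(range(x))
      part <- if (rng > 0) abs(outer(x, x, "-")) / rng else matrix(0, n, n)
    } else {
      # categorical or logical variable: 0 if equal, 1 otherwise
      x <- as.character(x)
      part <- 1 * outer(x, x, "!=")
    }
    d <- d + part
  }
  # average the per-variable contributions
  d / ncol(df)
}

# hypothetical toy data with one numeric and one categorical variable
toy <- data.frame(
  height = c(150, 160, 170),
  smoker = factor(c("yes", "no", "yes"))
)
d_toy <- gower_sketch(toy)
round(d_toy, 2)
```

Every variable contributes a value between 0 and 1, so the resulting distance is also bounded by 0 and 1 regardless of the scale of the numeric variables.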
The output is produced by ConsensusClusterPlus
dc <- mtcars
# scale continuous variables
dc <- sapply(mtcars[, 1:7], scale)
# code factor variables
dc <- cbind(as.data.frame(dc),
            vs = as.factor(mtcars$vs),
            am = as.factor(mtcars$am),
            gear = as.factor(mtcars$gear),
            carb = as.factor(mtcars$carb))
cc <- crosssectional_consensus_cluster(
  data = dc,
  reps = 10,
  seed = 1
)
A simulated data set containing observations of 100 individuals at four time points. The data was simulated in two groups (50 individuals each) and contains two questionnaires with five items each, one questionnaire with five continuous variables and one additional cross-sectional continuous variable. In this data set the group variable from the simulation is included. You typically don't have this group variable in your data.
fake_questionnaire_data
A data frame with 400 rows and 20 variables:
patient ID
time point of the observation
the simulated group to which the observation belongs
age of the patient at time point 1
a cross-sectional continuous variable, i.e. there is only one unique value per individual
the first item of questionnaire A with categories 1 to 5
the second item of questionnaire A with categories 1 to 5
the third item of questionnaire A with categories 1 to 5
the fourth item of questionnaire A with categories 1 to 5
the fifth item of questionnaire A with categories 1 to 5
the first item of questionnaire B with categories 1 to 5
the second item of questionnaire B with categories 1 to 5
the third item of questionnaire B with categories 1 to 5
the fourth item of questionnaire B with categories 1 to 5
the fifth item of questionnaire B with categories 1 to 5
the first continuous variable of questionnaire C
the second continuous variable of questionnaire C
the third continuous variable of questionnaire C
the fourth continuous variable of questionnaire C
the fifth continuous variable of questionnaire C
simulated data
This function extracts the cluster assignments from an lcc object. One can specify for which number of clusters the assignments should be returned.
get_clusters(cluster_solution, number_clusters = NULL)
cluster_solution |
an |
number_clusters |
default is |
a data.frame with an ID column (the name of the ID column was specified by the user when calling the longitudinal_consensus_cluster function) and one column with cluster assignments for every specified number of clusters. Only the assignments included in number_clusters are returned, in the form of columns with the names assignment_num_clus_x.
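As an illustration of how such a result can be used downstream, the following sketch builds a mock data.frame with the column layout described above and joins it back to long-format data. The assignment values and the toy data are made up for illustration and are not produced by get_clusters.

```r
# mock of the described return value (values are invented)
assignments <- data.frame(
  patient_id = 1:6,
  assignment_num_clus_2 = c(1, 1, 1, 2, 2, 2),
  assignment_num_clus_3 = c(1, 1, 2, 3, 3, 2)
)

# hypothetical long-format data with two visits per patient
long_data <- data.frame(
  patient_id = rep(1:6, each = 2),
  visit = rep(1:2, times = 6)
)

# attach the cluster assignments to every observation of a patient
merged <- merge(long_data, assignments, by = "patient_id")

# compare how patients move between the 2- and 3-cluster solutions
cross_tab <- table(
  two_clusters = assignments$assignment_num_clus_2,
  three_clusters = assignments$assignment_num_clus_3
)
cross_tab
```

Because the assignments are stored per individual, a merge on the ID column repeats them for every time point of that individual.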
# not run
set.seed(5)
test_data <- data.frame(
  patient_id = rep(1:10, each = 4),
  visit = rep(1:4, 10),
  var_1 = c(rnorm(20, -1), rnorm(20, 3)) +
    rep(seq(from = 0, to = 1.5, length.out = 4), 10),
  var_2 = c(rnorm(20, 0.5, 1.5), rnorm(20, -2, 0.3)) +
    rep(seq(from = 1.5, to = 0, length.out = 4), 10)
)
model_list <- list(
  flexmix::FLXMRmgcv(as.formula("var_1 ~ .")),
  flexmix::FLXMRmgcv(as.formula("var_2 ~ ."))
)
clustering <- longitudinal_consensus_cluster(
  data = test_data,
  id_column = "patient_id",
  max_k = 2,
  reps = 3,
  model_list = model_list,
  flexmix_formula = as.formula("~s(visit, k = 4) | patient_id")
)
cluster_assignments <- get_clusters(clustering, number_clusters = 2)
# end not run
This function performs longitudinal clustering with flexmix. To get robust results, the data is repeatedly subsampled and the clustering is performed on each subsample. The results are combined in a consensus matrix and a final hierarchical clustering step is performed on this matrix. In this, it follows the approach of the ConsensusClusterPlus package.
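The subsample, cluster, combine and re-cluster steps described above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's implementation: kmeans stands in for the flexmix models, the data are cross-sectional toy data, and all variable names are hypothetical.

```r
set.seed(1)
# toy data: two well separated groups of 20 items each
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
n <- nrow(x)
reps <- 20     # number of subsampling repetitions
p_item <- 0.8  # fraction of items in every subsample
k <- 2         # number of clusters

together <- matrix(0, n, n)  # how often two items were clustered together
sampled <- matrix(0, n, n)   # how often two items were in the same subsample

for (r in seq_len(reps)) {
  idx <- sample(n, size = floor(p_item * n))
  # stand-in clustering step (the package uses flexmix here)
  cl <- kmeans(x[idx, , drop = FALSE], centers = k)$cluster
  sampled[idx, idx] <- sampled[idx, idx] + 1
  together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
}

# consensus value: share of joint subsamples in which a pair was co-clustered
consensus <- ifelse(sampled > 0, together / sampled, 0)

# final hierarchical clustering on the consensus matrix
tree <- hclust(as.dist(1 - consensus), method = "average")
final_class <- cutree(tree, k = k)
table(final_class)
```

Pairs of items that are almost always assigned to the same cluster get a consensus value close to 1, so using 1 - consensus as the distance lets the hierarchical step group them together.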
longitudinal_consensus_cluster(
  data = NULL,
  id_column = NULL,
  max_k = 3,
  reps = 10,
  p_item = 0.8,
  model_list = NULL,
  flexmix_formula = as.formula("~s(visit, k = 4) | patient_id"),
  title = "untitled_consensus_cluster",
  final_linkage = c("average", "ward.D", "ward.D2", "single", "complete",
                    "mcquitty", "median", "centroid"),
  seed = 3794,
  verbose = FALSE
)
data |
a |
id_column |
name (character vector) of the ID column in |
max_k |
maximum number of clusters, default is |
reps |
number of repetitions, default is |
p_item |
fraction of samples contained in subsampled sample, default is
|
model_list |
either one |
flexmix_formula |
a |
title |
name of the clustering; used if |
final_linkage |
linkage used for the last hierarchical clustering step on
the consensus matrix; has to be |
seed |
seed for reproducibility |
verbose |
|
The data types longitudinal_consensus_cluster can handle depend on how the flexmix models are set up. In principle, all data types are supported for which there is a flexmix driver with the desired outcome variable.
If you follow the dimension reduction approach outlined in vignette("Example clustering analysis", package = "longmixr"), the input data types depend on what FAMD from the FactoMineR package can handle. FAMD accepts numeric variables and treats all other variables as factor variables, which it can handle as well.
An object (list) of class lcc with length max_k. The first entry general_information contains the entries:
consensus_matrices |
a list of all consensus matrices (for all specified clusters) |
cluster_assignments |
a data.frame with an ID column named after id_column and a column for every specified number of clusters, e.g. assignment_num_clus_2 |
call |
the call, i.e. all arguments with which longitudinal_consensus_cluster was called
|
The other entries correspond to the number of specified clusters (e.g. the second entry corresponds to 2 specified clusters) and each contains a list with the following entries:
consensus_matrix |
the consensus matrix |
consensus_tree |
the result of the hierarchical clustering on the consensus matrix |
consensus_class |
the resulting class for every observation |
found_flexmix_clusters |
a vector of the number of clusters actually found by flexmix (which can deviate from the specified number)
|
set.seed(5)
test_data <- data.frame(
  patient_id = rep(1:10, each = 4),
  visit = rep(1:4, 10),
  var_1 = c(rnorm(20, -1), rnorm(20, 3)) +
    rep(seq(from = 0, to = 1.5, length.out = 4), 10),
  var_2 = c(rnorm(20, 0.5, 1.5), rnorm(20, -2, 0.3)) +
    rep(seq(from = 1.5, to = 0, length.out = 4), 10)
)
model_list <- list(
  flexmix::FLXMRmgcv(as.formula("var_1 ~ .")),
  flexmix::FLXMRmgcv(as.formula("var_2 ~ ."))
)
clustering <- longitudinal_consensus_cluster(
  data = test_data,
  id_column = "patient_id",
  max_k = 2,
  reps = 3,
  model_list = model_list,
  flexmix_formula = as.formula("~s(visit, k = 4) | patient_id")
)
# not run
# plot(clustering)
# end not run
A helper function to plot alluvial plots of a categorical variable separated by the clusters found by longmixr. You need to have ggalluvial installed to use this function.
plot_alluvial(
  model,
  data,
  variable_name,
  time_variable,
  number_of_clusters = 2
)
model |
model |
data |
a |
variable_name |
name of the categorical variable to be plotted as character |
time_variable |
the name of the variable that depicts the time point of the measurements |
number_of_clusters |
the number of clusters that should be plotted, the
default is |
a ggplot object that is plotted
library(ggalluvial)
set.seed(5)
test_data <- data.frame(
  patient_id = rep(1:10, each = 4),
  visit = rep(1:4, 10),
  var_1 = c(rnorm(20, -1), rnorm(20, 3)) +
    rep(seq(from = 0, to = 1.5, length.out = 4), 10),
  var_2 = c(rnorm(20, 0.5, 1.5), rnorm(20, -2, 0.3)) +
    rep(seq(from = 1.5, to = 0, length.out = 4), 10)
)
model_list <- list(
  flexmix::FLXMRmgcv(as.formula("var_1 ~ .")),
  flexmix::FLXMRmgcv(as.formula("var_2 ~ ."))
)
clustering <- longitudinal_consensus_cluster(
  data = test_data,
  id_column = "patient_id",
  max_k = 2,
  reps = 3,
  model_list = model_list,
  flexmix_formula = as.formula("~s(visit, k = 4) | patient_id")
)
# add categorical variable for test plotting
test_data$cat <- sample(LETTERS[1:3], 40, replace = TRUE)
plot_alluvial(
  model = clustering,
  data = test_data,
  variable_name = "cat",
  time_variable = "visit"
)
A helper function to plot spaghetti plots of continuous variables separated by the clusters found by longmixr.
plot_spaghetti(
  model,
  data,
  variable_names,
  time_variable,
  show_mean_sd_ribbon = TRUE,
  number_of_clusters = 2,
  scales = "fixed"
)
model |
|
data |
a |
variable_names |
character vector of the continuous variables to be plotted |
time_variable |
the name of the variable that depicts the time point of the measurements |
show_mean_sd_ribbon |
|
number_of_clusters |
the number of clusters that should be plotted, the
default is |
scales |
|
The spaghetti plot shows the longitudinal trajectory (defined by time_variable) of continuous variables separated by the clusters found by longitudinal_consensus_cluster. The provided data.frame for data can either be the same as used in the clustering with longitudinal_consensus_cluster or needs to contain the same id_column as in the clustering and a time_variable.
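The kind of per-cluster summary that can accompany such trajectories may be sketched in base R as follows. This assumes that the ribbon enabled by show_mean_sd_ribbon summarises the mean and standard deviation per cluster and time point, as the argument name suggests; the toy data, the cluster column and all variable names are hypothetical.

```r
# hypothetical long-format data with a known cluster per patient
toy <- data.frame(
  patient_id = rep(1:4, each = 3),
  visit = rep(1:3, times = 4),
  var_1 = c(1, 2, 3,  2, 3, 4,  10, 9, 8,  11, 10, 9),
  cluster = rep(c(1, 1, 2, 2), each = 3)
)

# mean and standard deviation per cluster and time point
means <- aggregate(var_1 ~ cluster + visit, data = toy, FUN = mean)
sds <- aggregate(var_1 ~ cluster + visit, data = toy, FUN = sd)
summary_df <- merge(means, sds, by = c("cluster", "visit"),
                    suffixes = c("_mean", "_sd"))

# bounds of a mean +/- one standard deviation band
summary_df$lower <- summary_df$var_1_mean - summary_df$var_1_sd
summary_df$upper <- summary_df$var_1_mean + summary_df$var_1_sd
summary_df
```

Each row of summary_df describes one cluster at one time point, while the individual lines of a spaghetti plot correspond to the rows of toy grouped by patient_id.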
a ggplot object that is plotted
set.seed(5)
test_data <- data.frame(
  patient_id = rep(1:10, each = 4),
  visit = rep(1:4, 10),
  var_1 = c(rnorm(20, -1), rnorm(20, 3)) +
    rep(seq(from = 0, to = 1.5, length.out = 4), 10),
  var_2 = c(rnorm(20, 0.5, 1.5), rnorm(20, -2, 0.3)) +
    rep(seq(from = 1.5, to = 0, length.out = 4), 10)
)
model_list <- list(
  flexmix::FLXMRmgcv(as.formula("var_1 ~ .")),
  flexmix::FLXMRmgcv(as.formula("var_2 ~ ."))
)
clustering <- longitudinal_consensus_cluster(
  data = test_data,
  id_column = "patient_id",
  max_k = 2,
  reps = 3,
  model_list = model_list,
  flexmix_formula = as.formula("~s(visit, k = 4) | patient_id")
)
plot_spaghetti(
  model = clustering,
  data = test_data,
  variable_names = "var_1",
  time_variable = "visit"
)
Plot a longitudinal consensus clustering
## S3 method for class 'lcc'
plot(x, color_palette = NULL, which_plots = "all", n_item_consensus = 3, ...)
x |
|
color_palette |
optional character vector of colors for consensus matrix |
which_plots |
determine which plots should be plotted; the default is |
n_item_consensus |
determines how many item consensus plots are plotted
together in one plot before a new plot is used; the default is |
... |
additional parameters for plotting; currently not used |
Plots the following plots (when selected):
consensus matrix legend |
the legend for the following consensus matrix plots (select with "consensusmatrix_legend" ) |
consensus matrix plot |
for every specified number of clusters, a heatmap of the consensus matrix and the result of the final clustering is shown (select with "consensusmatrix_x" where x is replaced by the corresponding number of clusters) |
consensus CDF |
a line plot of the CDFs for all different specified numbers of clusters (select with "CDF" ) |
Delta area |
elbow plot of the difference in the CDFs between the different numbers of clusters (select with "delta" ) |
tracking plot |
cluster assignment of the subjects throughout the different cluster solutions (select with "cluster_tracking" ) |
item-consensus |
for every item (subject), calculate the average consensus value with all items that are assigned to one consensus cluster. This is repeated for every cluster and for all different numbers of clusters (select with "item_consensus" ) |
cluster-consensus |
every bar represents the average pair-wise item-consensus within one consensus cluster (select with "cluster_consensus" )
|
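The quantities behind the consensus CDF and the cluster-consensus plots can be illustrated on a small hand-made consensus matrix. This is a sketch under stated assumptions, not the code used by the plot method: the matrix values, the class vector and the variable names are invented for illustration.

```r
# hypothetical consensus matrix for four items
consensus <- matrix(
  c(1.0, 0.9, 0.1, 0.2,
    0.9, 1.0, 0.0, 0.1,
    0.1, 0.0, 1.0, 0.8,
    0.2, 0.1, 0.8, 1.0),
  nrow = 4, byrow = TRUE
)
# hypothetical final cluster assignment of the four items
classes <- c(1, 1, 2, 2)

# empirical CDF of the pairwise consensus values (upper triangle only)
pair_values <- consensus[upper.tri(consensus)]
consensus_cdf <- ecdf(pair_values)
cdf_at_half <- consensus_cdf(0.5)

# cluster consensus: average pairwise consensus within every cluster
cluster_consensus <- sapply(sort(unique(classes)), function(cl) {
  members <- which(classes == cl)
  sub <- consensus[members, members, drop = FALSE]
  mean(sub[upper.tri(sub)])
})
cluster_consensus
```

For a clean clustering, most pairwise consensus values are close to 0 or 1, so the CDF is flat in the middle; the cluster consensus of a cluster with a single member is undefined (NaN) in this sketch.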
In the final step, the consensus clustering performs a hierarchical clustering step on the consensus matrix. This function tries out different linkage methods and returns the corresponding clusterings. The outputs can be plotted like the results from longitudinal_consensus_cluster.
test_clustering_methods(
  results,
  use_methods = c("average", "ward.D", "ward.D2", "single", "complete",
                  "mcquitty", "median", "centroid")
)
results |
clustering result of class |
use_methods |
character vector of one or several items of |
a list of elements, each element of class lcc. The entries are named after the used linkage method.
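The effect of trying several linkage methods on one consensus matrix can be sketched with hclust from base R. The consensus matrix below is invented for illustration, and the sketch only returns plain class vectors in a list named after the linkage methods, not lcc objects.

```r
# hypothetical consensus matrix for four items
consensus <- matrix(
  c(1.0, 0.9, 0.1, 0.2,
    0.9, 1.0, 0.0, 0.1,
    0.1, 0.0, 1.0, 0.8,
    0.2, 0.1, 0.8, 1.0),
  nrow = 4, byrow = TRUE
)

methods <- c("average", "single", "complete")

# one hierarchical clustering per linkage method, cut into two clusters
by_linkage <- lapply(setNames(methods, methods), function(m) {
  tree <- hclust(as.dist(1 - consensus), method = m)
  cutree(tree, k = 2)
})
by_linkage
```

On such a clear-cut matrix all linkage methods agree; differences between methods typically only appear when the consensus values are less decisive.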
set.seed(5)
test_data <- data.frame(
  patient_id = rep(1:10, each = 4),
  visit = rep(1:4, 10),
  var_1 = c(rnorm(20, -1), rnorm(20, 3)) +
    rep(seq(from = 0, to = 1.5, length.out = 4), 10),
  var_2 = c(rnorm(20, 0.5, 1.5), rnorm(20, -2, 0.3)) +
    rep(seq(from = 1.5, to = 0, length.out = 4), 10)
)
model_list <- list(
  flexmix::FLXMRmgcv(as.formula("var_1 ~ .")),
  flexmix::FLXMRmgcv(as.formula("var_2 ~ ."))
)
clustering <- longitudinal_consensus_cluster(
  data = test_data,
  id_column = "patient_id",
  max_k = 2,
  reps = 3,
  model_list = model_list,
  flexmix_formula = as.formula("~s(visit, k = 4) | patient_id")
)
clustering_linkage <- test_clustering_methods(
  results = clustering,
  use_methods = c("average", "single")
)
# not run
# plot(clustering_linkage[["single"]])
# end not run