3. Create Patient Similarity Networks (or Feature Design)¶

3.1. Overview¶

Defining meaningful patient similarity networks (or features) is the key to building a high-performing classifier. It is also crucial to using netDx for clinical discovery. When designing patient similarity networks, two considerations are the level at which individual variables should be grouped, and the measure used to define patient similarity.

3.2. Grouping Variables: What makes an input network?¶

In a given datatype, not all measured variables are equally predictive. Moreover some groupings of variables may be more informative about mechanism, perhaps because they use prior knowledge. Such variables are more interpretable, as they reflect some process of clinical or biological relevance. For each datatype, the user needs to decide which level of variable-grouping to apply.

This table provides some examples for different types of data. Consider the lung cancer example from the Introduction, and assume we measure: 34 clinical variables; 20,000 genes; 16 metabolites; and genetic mutations from whole-genome sequencing.

Type of data	1 variable per network	All variables in one network	Subset of variables per network
Clinical data	Age or sex or smoking frequency (34 networks)	Clinical data (1 network)	a set of 5 variables measuring lung condition (2-27 networks)
Gene expression	Top 10 known high-risk genes, one gene per network(10 networks)	All gene expression data (1 network)	Genes grouped by pathways (~2,000 networks)
Metabolic data	3 metabolites identified through other statistical analyses(3 networks)	All metabolite data (1 network)	Metabolites grouped by biological process they affect (3-4 networks)
Genetic data	Top genes from GWAS studies (~10 networks)	All genetic data (1 network)	Genes grouped by pathways (~2,000 networks)

3.3. Choosing a similarity metric¶

Patient similarity can be measured in different ways for different types of input data. Here is a set of recommended metrics to start with, depending on what the type of data is and how many variables were grouped to create a given net.

_images/sim_metrics.png sim_metrics.png

3.4. Summary table¶

Type of data	Example	Similarity measure	Example call
Continous, over 5 vars	Gene expression	Pearson correlation	makePSN_NamedMatrix(dat,dat_names, groupList,outDir,writeProfiles=TRUE)
Continous, 2-5 vars	Gene expression	Average normalized difference (custom)	makePSN_NamedMatrix(dat,dat_names,myGroup, outDir,simMetric="custom",customFunc=normDiff2, sparsify=TRUE)
Discrete, mutation data	Gene mutations	Co-occurrence in same unit (e.g. gene or pathway)	makePSN_RangeSets(mutation_GR, pathway_GRList,outDir)

3.4.1. Defining custom similarity functions¶

netDx is agnostic to the choice of similarity metric. Set the simMetric argument of makePSN_NamedMatrix to custom to provide a user-defined similarity metric. Be aware that the choice of similarity metric could increase false positives or false negatives during feature selection. Build controls to guard against these.

3.5. Expression data¶

This situation applies to a table of >5 values with continuous-valued measures. An example is a table of gene expression data, with ~20,000 measures per patient. Another is proteomic data with ~20 measures per patient.

Suggested metric: Pearson correlation. This is the default similarity metric for makePSN_NamedMatrix() so no special specification is required.

Note: Be sure to set writeProfiles=TRUE.

3.6. Fewer than 5 datapoints¶

Pearson correlation is not a stable measure of similarity when there are fewer than 5 variables per patient. An alternate similarity measure in such a situation is the average normalized difference for each variable.

_images/avg_normDiff.png avg_normDiff

#' Similarity by average of normalized difference
#'
#' @details Similarity measure for network with 2-5 variables. 
#' Defined as the average of normalized difference for each of the 
#' member variables
#' @param x (matrix) rows are patients, columns are values for component
#' variables
normDiff_avg <- function(x) {

	# normalized difference for a single variable
	normDiff <- function(x) {
	    nm <- colnames(x)
	    x <- as.numeric(x)
	    n <- length(x)
	    rngX  <- max(x,na.rm=T)-min(x,na.rm=T)
	    
	    out <- matrix(NA,nrow=n,ncol=n);
	    # weight between i and j is
	    # wt(i,j) = 1 - (abs(x[i]-x[j])/(max(x)-min(x)))
	    for (j in 1:n) out[,j] <- 1-(abs((x-x[j])/rngX))
	    rownames(out) <- nm; colnames(out)<- nm
	    out
	}

	sim <- matrix(0,nrow=ncol(x),ncol=ncol(x))
	for (k in 1:nrow(x)) {
		tmp <- normDiff(x[k,,drop=FALSE])
		sim <- sim + tmp
		rownames(sim) <- rownames(tmp)
		colnames(sim) <- colnames(tmp)
	}
	sim <- sim/nrow(x)
	sim
}

Use in netDx: Given

datmatrix with 5 variables (5xN matrix, where N is number of patients)
dat_names vector with variable names
myGroup list with the group name as key and members as variables,

then the call to create networks would be:

makePSN_NamedMatrix(dat,dat_names,myGroup,
			outDir,simMetric="custom",customFunc=normDiff_avg,
			sparsify=TRUE)

Note:

simMetric="custom"
customFunc points to the custom function definition
writeProfiles=FALSE
sparsify=TRUE keep only the strongest edges for efficient memory use

3.7. Range-based data (genetic mutations)¶

Creating patient data from genomic events such as genetic mutations or DNA copy number polymorphisms, requires a different design for creating PSN.

This section coming soon - 170927