R: Discretisation functions

discrete {opm}

R Documentation

Discretisation functions

Description

These are the helper functions called by do_disc, which is the function an opm user would normally apply. discrete converts continuous numeric characters to discrete ones. best_cutoff determines the best cutoff for dividing a numeric matrix into two categories by minimising within-group discrepancies. That is, for each combination of row group and column, it maximises the number of elements that fall into the category containing most of the elements of that combination.

Usage

  ## S4 method for signature 'matrix,character'
best_cutoff(x, y, ...) 
  ## S4 method for signature 'matrix,factor'
best_cutoff(x, y, combined = TRUE, lower = min(x, na.rm = TRUE),
    upper = max(x, na.rm = TRUE), all = FALSE) 

  ## S4 method for signature 'array'
discrete(x, ...) 
  ## S4 method for signature 'data.frame'
discrete(x, ..., as.labels = NULL, sep = " ")
  ## S4 method for signature 'numeric'
discrete(x, range, gap = FALSE,
    output = c("character", "integer", "logical", "factor", "numeric"),
    middle.na = TRUE, states = 32L, ...)

Arguments

`x`	Numeric vector or array object convertible to a numeric vector. The data-frame method first calls `extract`, restricting the columns to the numeric ones. `best_cutoff` only accepts a numeric matrix.
`range`	If a numeric vector, in non-`gap` mode (see next argument) the assumed real range of the data; must contain all elements of `x`, but can be much wider. In `gap` mode, it must, in contrast, lie within the range of `x`. If `range` is set to `TRUE`, the empirical range of `x` is used in non-`gap` mode. In `gap` mode, the range is determined using `run_kmeans` with the number of clusters set to `3` and then applying `borders` to the result. The number of clusters is set to `2` if `range` is `FALSE` in `gap` mode.
`gap`	Logical scalar. If `TRUE`, always convert to binary or ternary characters, ignoring `states`. `range` then indicates a partial range of `x` within which character conversion is ambiguous and has to be treated as either missing information or intermediate character state, depending on `middle.na`. If `FALSE` (the default), apply an equal-width-intervals discretisation with the widths determined from the number of requested `states` and `range`.
`output`	String determining the output mode: ‘character’, ‘integer’, ‘logical’, ‘factor’, or ‘numeric’. ‘numeric’ simply returns `x`, but performs the range checks. One cannot combine ‘logical’ with `TRUE` values for both `gap` and `middle.na`.
`middle.na`	Logical scalar. Only relevant in `gap` mode. In that case, if `TRUE`, the middle value yields `NA` (uncertain whether negative or positive). If `FALSE`, the middle value lies between the left and the right one (i.e., a third character state meaning ‘weak’). This is simply coded as 0-1-2 and thus cannot be combined with ‘logical’ as `output` setting.
`states`	Integer or character vector. Ignored in `gap` mode and if `output` is not ‘character’. Otherwise, the possible values are a single-element character vector, which is split into its elements; a multiple-element character vector which is used directly; an integer vector indicating the elements to pick from the default character states. In the latter case, a single integer is interpreted as the upper bound of an integer vector starting at 1.
`as.labels`	Vector of data-frame indexes. See `extract`. (If given, this argument must be named.)
`sep`	Character scalar. See `extract`. (If given, this argument must be named.)
`y`	Factor or character vector indicating group affiliations. Its length must correspond to the number of rows of `x`.
`combined`	Logical scalar. If `TRUE`, determine a single threshold for the entire matrix. If `FALSE`, determine one threshold for each group of rows of `x` that corresponds to a level of `y`.
`lower`	Numeric scalar. Lower bound for the cutoff values to test.
`upper`	Numeric scalar. Upper bound for the cutoff values to test.
`all`	Logical scalar. If `TRUE`, calculate the score for all possible cutoffs for `x`. This is slow and is only useful for plotting complete optimisation curves.
`...`	Optional arguments passed between the methods or, if requested, to `run_kmeans` (except `object` and `k`, see there).
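
The two modes selected by `gap` can be mimicked in base R. The following is a minimal sketch, not the opm implementation: `discretise_sketch` is a hypothetical name, the use of `cut` with equally spaced breaks is one plausible reading of "equal-width intervals", and counting the borders of `range` as part of the ambiguous zone is an assumption. `states` is restricted here to a single integer.

```r
# Hypothetical base-R sketch of the two discretisation modes; not opm code.
discretise_sketch <- function(x, range, gap = FALSE, middle.na = TRUE,
                              states = 5L) {
  if (gap) {
    # 0 = below the ambiguous zone, 1 = inside it, 2 = above it.
    # Assumption: values exactly on a border belong to the zone.
    code <- (x >= range[1L]) + (x > range[2L])
    if (middle.na)
      c("0", "?", "1")[code + 1L]   # middle state becomes missing ("?")
    else
      as.character(code)            # middle state is a third state: 0-1-2
  } else {
    # Non-gap mode: 'range' must contain all elements of x.
    stopifnot(min(x) >= range[1L], max(x) <= range[2L])
    # 'states' intervals of equal width spanning 'range'.
    breaks <- seq(range[1L], range[2L], length.out = states + 1L)
    as.character(cut(x, breaks, labels = FALSE, include.lowest = TRUE) - 1L)
  }
}

discretise_sketch(1:5, c(3.5, 4.5), gap = TRUE)                    # "0" "0" "0" "?" "1"
discretise_sketch(1:5, c(3.5, 4.5), gap = TRUE, middle.na = FALSE) # "0" "0" "0" "1" "2"
discretise_sketch(1:5, c(1, 10), states = 5L)                      # "0" "0" "1" "1" "2"
```

For these inputs the sketch gives the same codes as the corresponding `discrete` calls in the Examples section, although it does not attach the ‘cutoffs’ attribute.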

Details

One of the uses of discrete is to create character data suitable for phylogenetic studies with programs such as PAUP* and RAxML. These accept only discrete characters with at most 32 states, coded as 0 to 9 followed by A to V. For the full export one additionally needs phylo_data. The matrix method is just a wrapper that takes care of the matrix dimensions, and the data-frame method is a wrapper for that method.
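
The 32 state symbols mentioned above, and the way the `states` argument is described as selecting among them, can be illustrated as follows. This is a sketch under assumptions: `default_states` and `resolve_states` are hypothetical names, not opm objects, and splitting a single string into individual characters is one reading of the argument description.

```r
# The 32 symbols: "0" to "9" followed by "A" to "V".
default_states <- c(0:9, LETTERS[1:22])

# Hypothetical helper mirroring the description of 'states'.
resolve_states <- function(states) {
  if (is.character(states)) {
    if (length(states) == 1L)
      strsplit(states, "", fixed = TRUE)[[1L]]  # split one string into symbols
    else
      states                                    # use the vector directly
  } else {
    if (length(states) == 1L)
      states <- seq_len(states)                 # single integer: 1..states
    default_states[states]                      # pick from the defaults
  }
}

length(default_states)   # 32
resolve_states(5L)       # "0" "1" "2" "3" "4"
resolve_states("abc")    # "a" "b" "c"
```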

The term ‘character’ as used here has no direct connection to the eponymous mode or class of R. Rather, the term is borrowed from taxonomic classification in biology, where, technically, a single ‘character’ is stored in one column of a data matrix if each organism is stored in one row. Characters are the quasi-independent units of evolution on the one hand and of phylogenetic reconstruction (and thus taxonomic classification) on the other hand.

The scoring function to be maximised by best_cutoff is calculated as follows. All values in x are divided into those larger than the cutoff and those at most as large as the cutoff. For each combination of group and matrix column, the frequencies of the two categories are determined, and the respective larger ones are summed up over all combinations. This sum is then divided by the frequency, over the entire matrix, of the more frequent of the two categories. The division avoids trivial solutions at minimal or maximal cutoffs, which would place all values in the same category.
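
The scoring rule can be written down in a few lines of base R. The following is a minimal sketch for a single cutoff: `cutoff_score` is a hypothetical helper written for illustration, not a function of opm, and ignoring `NA` values when counting is an assumption.

```r
# Hypothetical helper, not part of opm: score one cutoff as described above.
cutoff_score <- function(x, y, cutoff) {
  above <- x > cutoff   # TRUE: larger than the cutoff; FALSE: at most as large
  # For each group (level of y), sum the majority counts over all columns.
  per_group <- vapply(split(seq_len(nrow(x)), as.factor(y)), function(rows) {
    sub <- above[rows, , drop = FALSE]
    n_above <- colSums(sub, na.rm = TRUE)
    n_total <- colSums(!is.na(sub))
    sum(pmax(n_above, n_total - n_above))
  }, numeric(1L))
  # Normalise by the size of the more frequent category in the whole matrix.
  n_above_all <- sum(above, na.rm = TRUE)
  n_all <- sum(!is.na(above))
  sum(per_group) / max(n_above_all, n_all - n_above_all)
}

x <- matrix(c(5:2, 1:2, 7:8), ncol = 2)
grps <- c("a", "a", "b", "b")
cutoff_score(x, grps, 3.5)  # 2: every group/column combination is pure
cutoff_score(x, grps, 0)    # 1: trivial cutoff, all values in one category
```

With the matrix and grouping from the Examples section, a cutoff between 3 and 4 separates every group/column combination cleanly (a sum of 8, divided by 4), whereas a cutoff below all values yields only 8 / 8 = 1.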

Value

discrete generates a double, integer, character or logical vector or factor, depending on output. For the matrix method, a matrix composed of a vector as produced by the numeric method, the original dimensions and the original dimnames attributes of x.

If combined is TRUE, best_cutoff yields either a matrix or a vector. If all is TRUE, the result is a two-column matrix with (i) the cutoffs examined and (ii) the resulting scores. If all is FALSE, it is a vector with the entries ‘maximum’ (the best cutoff) and ‘objective’ (the score it achieved). If combined is FALSE, the result is either a list of matrices or a matrix. If all is TRUE, it is a list of matrices, each structured like the single matrix returned if combined is TRUE. If all is FALSE, it is a matrix with two columns called ‘maximum’ and ‘objective’, and one row per level of y.

References

Dougherty, J., Kohavi, R., Sahami, M. 1995 Supervised and unsupervised discretisation of continuous features. In: Prieditis, A., Russell, S. (eds.) Machine Learning: Proceedings of the fifth international conference.

Ventura, D., Martinez, T. R. 1995 An empirical comparison of discretisation methods. Proceedings of the Tenth International Symposium on Computer and Information Sciences, p. 443–450.

Wiley, E. O., Lieberman, B. S. 2011 Phylogenetics: Theory and Practice of Phylogenetic Systematics. Hoboken, New Jersey: Wiley-Blackwell.

Bunuel, L. 1972 Le charme discret de la bourgeoisie. France/Spain, 96 min.

Examples

# Treat everything between 3.5 and 4.5 as ambiguous
(x <- discrete(1:5, range = c(3.5, 4.5), gap = TRUE))

## [1] "0" "0" "0" "?" "1"
## attr(,"cutoffs")
## [1] 3.5 4.5

stopifnot(x == c("0", "0", "0", "?", "1"))

# Treat everything between 3.5 and 4.5 as intermediate
(x <- discrete(1:5, range = c(3.5, 4.5), gap = TRUE, middle.na = FALSE))

## [1] "0" "0" "0" "1" "2"
## attr(,"cutoffs")
## [1] 3.5 4.5

stopifnot(x == c("0", "0", "0", "1", "2"))

# Boring example: real and possible range as well as the number of states
# to code the data have a 1:1 relationship
(x <- discrete(1:5, range = c(1, 5), states = 5))

## [1] "0" "1" "2" "3" "4"

stopifnot(identical(x, as.character(0:4)))

# Now fit the data into a potential range twice as large, and at the
# beginning of it
(x <- discrete(1:5, range = c(1, 10), states = 5))

## [1] "0" "0" "1" "1" "2"

stopifnot(identical(x, as.character(c(0, 0, 1, 1, 2))))

# Matrix and data-frame methods
x <- matrix(as.numeric(1:10), ncol = 2)
(y <- discrete(x, range = c(3.4, 4.5), gap = TRUE))

##      [,1] [,2]
## [1,] "0"  "1" 
## [2,] "0"  "1" 
## [3,] "0"  "1" 
## [4,] "?"  "1" 
## [5,] "1"  "1" 
## attr(,"cutoffs")
## [1] 3.4 4.5

stopifnot(identical(dim(x), dim(y)))
(yy <- discrete(as.data.frame(x), range = c(3.4, 4.5), gap = TRUE))

##      V1  V2 
## [1,] "0" "1"
## [2,] "0" "1"
## [3,] "0" "1"
## [4,] "?" "1"
## [5,] "1" "1"
## attr(,"cutoffs")
## [1] 3.4 4.5

stopifnot(y == yy)

# K-means based discretisation of PM data (prefer do_disc() for this)
x <- extract(vaas_4, as.labels = list("Species", "Strain"),
  in.parens = FALSE)
(y <- discrete(x, range = TRUE, gap = TRUE))[, 1:3]

##                                Negative Control Dextrin D-Maltose
## Escherichia coli DSM18039      "0"              "?"     "0"      
## Escherichia coli DSM30083T     "?"              "1"     "1"      
## Pseudomonas aeruginosa DSM1707 "0"              "0"     "0"      
## Pseudomonas aeruginosa 429SC1  "0"              "0"     "0"

stopifnot(c("0", "?", "1") %in% y)

## best_cutoff()
x <- matrix(c(5:2, 1:2, 7:8), ncol = 2)
grps <- c("a", "a", "b", "b")

# combined optimisation
(y <- best_cutoff(x, grps))

##   maximum objective 
##  3.673825  2.000000

stopifnot(is.numeric(y), length(y) == 2) # two-element numeric vector
stopifnot(y[["maximum"]] < 4, y[["maximum"]] > 3, y[["objective"]] == 2)
plot(best_cutoff(x, grps, all = TRUE), type = "l")

# separate optimisation
(y <- best_cutoff(x, grps, combined = FALSE))

##    maximum objective
## a 2.652523         2
## b 6.347592         2

stopifnot(is.matrix(y), dim(y) == c(2, 2)) # numeric matrix
stopifnot(y["a", "objective"] == 2, y["b", "objective"] == 2)
(y <- best_cutoff(x, grps, combined = FALSE, all = TRUE))

## $a
##      cutoff score
## [1,]    1.5     1
## [2,]    3.0     2
## [3,]    4.5     1
## 
## $b
##      cutoff score
## [1,]    2.5     1
## [2,]    5.0     2
## [3,]    7.5     1

plot(y$a, type = "l")

plot(y$b, type = "l")

[Package opm version 1.3.63 Index]
