R: Regularly split character vectors if possible

separate {opm}

R Documentation

Regularly split character vectors if possible

Description

From a given set of splitting characters select the ones that split a character vector in a regular way, yielding the same number of parts for all vector elements. Then apply these splitting characters to create a matrix. The data frame method applies this to all character vectors (and optionally also all factors) within a data frame.

Usage

  ## S4 method for signature 'character'
separate(object, split = opm_opt("split"),
    simplify = FALSE, keep.const = TRUE, list.wise = FALSE,
    strip.white = list.wise) 
  ## S4 method for signature 'data.frame'
separate(object, split = opm_opt("split"),
    simplify = FALSE, keep.const = TRUE, coerce = TRUE, name.sep = ".", ...) 
  ## S4 method for signature 'factor'
separate(object, split = opm_opt("split"),
    simplify = FALSE, keep.const = TRUE, ...)

Arguments

`object`	Character vector to be split, or data frame in which character vectors (or factors) shall be attempted to be split, or factor.
`split`	Character vector or `TRUE`. If a character vector, used as container of the splitting characters and converted to a vector containing only non-duplicated single-character strings. For instance, the default `split` argument `".-_"` yields `c(".", "-", "_")`. If a vector of only empty strings or `TRUE`, strings with parts representing fixed-width fields are assumed, and splitting is done at whitespace-only columns. Beforehand, equal-length strings are created by padding with spaces at the right. After splitting in fixed-width mode, whitespace characters are trimmed from both ends of the resulting strings.
`simplify`	Logical scalar indicating whether a resulting matrix with one column should be simplified to a vector (or such a data frame to a factor). If so, at least one matrix column is kept, even if `keep.const` is `FALSE`.
`keep.const`	Logical scalar indicating whether constant columns should be kept or removed.
`coerce`	Logical scalar indicating whether factors should be coerced to ‘character’ mode and then also be attempted to be split. The resulting columns will be coerced back to factors.
`name.sep`	Character scalar to be inserted in the constructed column names. If more than one column results from splitting, the names will contain (i) the original column name, (ii) `name.sep` and (iii) their index, thus creating unique column names (if the original ones were unique).
`list.wise`	Logical scalar. Ignored if `split` is `TRUE`. Otherwise, `object` is assumed to contain word lists separated by `split`. The result is a logical matrix in which the columns represent these words and the fields indicate whether or not a word was present in a certain item contained in `object`.
`strip.white`	Logical scalar. Remove whitespace from the ends of each resulting character scalar after splitting? Has an effect on the removal of constant columns. Whitespace is always removed if `split` is `TRUE`.
`...`	Optional arguments passed between the methods.

Details

This function is useful if information coded in the elements of a character vector is to be converted to a matrix or data frame. For instance, file names created by a batch export conducted by a some software are usually more or less regularly structured and contain content at distinct positions. In such situations, the correct splitting approach can be recognised by yielding the same number of fields from each vector element.

Value

Character matrix, its number of rows being equal to the length of object, or data frame with the same number of rows as object but potentially more columns. May be character vector of factor with character or factor input and simplify set to TRUE.

Examples

# Splitting by characters
x <- c("a-b-cc", "d-ff-g")
(y <- separate(x, ".")) # a split character that does not occur

##      [,1]    
## [1,] "a-b-cc"
## [2,] "d-ff-g"

stopifnot(is.matrix(y), y[, 1L] == x)
(y <- separate(x, "-")) # a split character that does occur

##      [,1] [,2] [,3]
## [1,] "a"  "b"  "cc"
## [2,] "d"  "ff" "g"

stopifnot(is.matrix(y), dim(y) == c(2, 3))

# Fixed-with splitting
x <- c("  abd  efgh", " ABCD EFGH ", " xyz")
(y <- separate(x, TRUE))

##      1      2     
## [1,] "abd"  "efgh"
## [2,] "ABCD" "EFGH"
## [3,] "xyz"  ""

stopifnot(is.matrix(y), dim(y) == c(3, 2))

# Applied to factors
xx <- as.factor(x)
(yy <- separate(xx, TRUE))

##      1    2
## 1  abd efgh
## 2 ABCD EFGH
## 3  xyz

stopifnot(identical(yy, as.data.frame(y)))

# List-wise splitting
x <- c("a,b", "c,b", "a,c")
(y <- separate(x, ",", list.wise = TRUE))

##          a     b     c
## [1,]  TRUE  TRUE FALSE
## [2,] FALSE  TRUE  TRUE
## [3,]  TRUE FALSE  TRUE

stopifnot(is.matrix(y), dim(y) == c(3, 3), is.logical(y))

# Data-frame method
x <- data.frame(a = 1:2, b = c("a-b-cc", "a-ff-g"))
(y <- separate(x, coerce = FALSE))

##   a      b
## 1 1 a-b-cc
## 2 2 a-ff-g

stopifnot(identical(x, y))
(y <- separate(x)) # only character/factor columns are split

##   a b.1 b.2 b.3
## 1 1   a   b  cc
## 2 2   a  ff   g

stopifnot(is.data.frame(y), dim(y) == c(2, 4))
stopifnot(sapply(y, class) == c("integer", "factor", "factor", "factor"))
(y <- separate(x, keep.const = FALSE))

##   a b.1 b.2
## 1 1   b  cc
## 2 2  ff   g

stopifnot(is.data.frame(y), dim(y) == c(2, 3))
stopifnot(sapply(y, class) == c("integer", "factor", "factor"))

[Package opm version 1.3.63 Index]