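This article covers how to work with factor data. I discussed factors in detail in an earlier article in this series, 《[R Lang] R Lang與高級醫學統計學(2)》.
Factor data can be handled with the forcats package, which is part of the tidyverse family of packages.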
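Basic functions
The basic functions of the forcats package include:
- fct_count(f, sort = FALSE, prop = FALSE): counts the number of observations in each level; sort = TRUE orders the result by count, and prop = TRUE adds a column of proportions.
- fct_unique(f): returns the unique levels, one value per level.
- fct_c(f1, f2): concatenates two factor objects, combining their different sets of levels.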
library(dplyr)
library(ggplot2)
library(forcats)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.0.3
set.seed(100)
letters[1:5]
## [1] "a" "b" "c" "d" "e"
f <- factor(sample(letters[1:5])[rpois(100, 5)])
table(f)
## f
##  a  b  c  d  e
## 21  2  3 21 15
fct_count(f)
## # A tibble: 6 x 2
##   f         n
##   <fct> <int>
## 1 a        21
## 2 b         2
## 3 c         3
## 4 d        21
## 5 e        15
## 6 <NA>     38
fct_count(f, sort = TRUE)
## # A tibble: 6 x 2
##   f         n
##   <fct> <int>
## 1 <NA>     38
## 2 a        21
## 3 d        21
## 4 e        15
## 5 c         3
## 6 b         2
fct_count(f, sort = TRUE, prop = TRUE)
## # A tibble: 6 x 3
##   f         n     p
##   <fct> <int> <dbl>
## 1 <NA>     38  0.38
## 2 a        21  0.21
## 3 d        21  0.21
## 4 e        15  0.15
## 5 c         3  0.03
## 6 b         2  0.02
##
f1 <- factor(letters[1:3])
f2 <- factor(letters[c(1, 2, 23)])
f1
## [1] a b c
## Levels: a b c
f2
## [1] a b w
## Levels: a b w
fct_c(f1, f2)
## [1] a b c a b w
## Levels: a b c w
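The transcript above does not exercise fct_unique(), so here is a minimal sketch of it. The factor g and its levels are made up for illustration and are not part of the original example; the point is only that the result has one value per level, in level order, including a level that never occurs in the data.

```r
library(forcats)

# Hypothetical example data: "c" is a declared level that never occurs.
g <- factor(c("b", "a", "b", "a"), levels = c("a", "b", "c"))

# unique() follows the order of appearance in the data ...
as.character(unique(g))

# ... while fct_unique() returns one value per level, in level order.
u <- fct_unique(g)
as.character(u)
levels(u)
```

This is handy when you need a lookup table with exactly one row per level, for example to attach labels or colours to each level.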
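Dropping or adding levels
The function fct_drop() removes unused levels, and fct_expand() adds levels to a factor variable. The function fct_explicit_na() turns missing values into an explicit level of their own. (In forcats 1.0.0 and later, fct_explicit_na() is deprecated in favor of fct_na_value_to_level().)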
f <- factor(c("F", "M"), levels = c("F", "M", "Other"))
f
## [1] F M
## Levels: F M Other
fct_drop(f)
## [1] F M
## Levels: F M
# Set only to restrict which levels to drop
fct_drop(f, only = "F")
## [1] F M
## Levels: F M Other
fct_drop(f, only = "Other")
## [1] F M
## Levels: F M
##
fct_expand(f, "B", "T")
## [1] F M
## Levels: F M Other B T
##
f <- factor(c("F", "M", "M", "F", "F", "B", "T", NA, NA))
f
## [1] F    M    M    F    F    B    T    <NA> <NA>
## Levels: B F M T
fct_explicit_na(f)
## [1] F         M         M         F         F         B         T         (Missing)
## [9] (Missing)
## Levels: B F M T (Missing)
fct_explicit_na(f, na_level = "Other")
## [1] F     M     M     F     F     B     T     Other Other
## Levels: B F M T Other
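Functions for changing or lumping levels
The fct_lump() family of functions lumps some of the levels together. The family includes:
- fct_lump_min(): lumps levels whose frequency count is below the specified minimum.
- fct_lump_prop(): lumps levels whose relative frequency is below the specified proportion.
- fct_lump_n(): keeps the n most frequent levels and lumps the rest.
- fct_lump_lowfreq(): lumps the least frequent levels while ensuring that the "other" level remains the smallest.
The function signatures are as follows: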
fct_lump(f, n, prop, w = NULL, other_level = "Other",
         ties.method = c("min", "average", "first", "last", "random", "max"))
fct_lump_min(f, min, w = NULL, other_level = "Other")
fct_lump_prop(f, prop, w = NULL, other_level = "Other")
fct_lump_n(f, n, w = NULL, other_level = "Other",
           ties.method = c("min", "average", "first", "last", "random", "max"))
fct_lump_lowfreq(f, other_level = "Other")
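The arguments are:
- f: a factor vector.
- n: the number of main levels to keep. A positive n keeps the n most frequent levels; a negative n keeps the n least frequent.
- prop: a proportion threshold. A positive value lumps levels that occur less often than prop; a negative value lumps levels that occur more often than -prop.
- w: optional weights for the frequency counts.
- other_level: the name given to the lumped level.
- ties.method: how ties in the frequency ranking are handled.
- min: levels that occur at least min times are kept.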
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
x %>% table()
## .
##  A  B  C  D  E  F  G  H  I
## 40 10  5 27  1  1  1  1  1
x %>% fct_lump_n(3) %>% table()
## .
##     A     B     D Other
##    40    10    27    10
x %>% fct_lump_prop(0.10) %>% table()
## .
##     A     B     D Other
##    40    10    27    10
x %>% fct_lump_min(5) %>% table()
## .
##     A     B     C     D Other
##    40    10     5    27     5
x %>% fct_lump_lowfreq() %>% table()
## .
##     A     D Other
##    40    27    20
##
set.seed(123)
x <- factor(letters[rpois(50, 5)])
x
##  [1] d g d h i b e h e e i e f e b h c b d i h f f k f f e e d c i h f g a e f c d c c d
## [43] d d c c c e d g
## Levels: a b c d e f g h i k
table(x)
## x
## a b c d e f g h i k
## 1 3 8 9 9 7 3 5 4 1
table(fct_lump_lowfreq(x))
##
##     b     c     d     e     f     g     h     i Other
##     3     8     9     9     7     3     5     4     2
# Use positive values to collapse the rarest
fct_lump_n(x, n = 3)
##  [1] d     Other d     Other Other Other e     Other e     e     Other e     Other e
## [15] Other Other c     Other d     Other Other Other Other Other Other Other e     e
## [29] d     c     Other Other Other Other Other e     Other c     d     c     c     d
## [43] d     d     c     c     c     e     d     Other
## Levels: c d e Other
fct_lump_prop(x, prop = 0.1)
##  [1] d     Other d     Other Other Other e     Other e     e     Other e     f     e
## [15] Other Other c     Other d     Other Other f     f     Other f     f     e     e
## [29] d     c     Other Other f     Other Other e     f     c     d     c     c     d
## [43] d     d     c     c     c     e     d     Other
## Levels: c d e f Other
# Use negative values to collapse the most common
fct_lump_n(x, n = -3)
##  [1] Other g     Other Other Other b     Other Other Other Other Other Other Other Other
## [15] b     Other Other b     Other Other Other Other Other k     Other Other Other Other
## [29] Other Other Other Other Other g     a     Other Other Other Other Other Other Other
## [43] Other Other Other Other Other Other Other g
## Levels: a b g k Other
fct_lump_prop(x, prop = -0.1)
##  [1] Other g     Other h     i     b     Other h     Other Other i     Other Other Other
## [15] b     h     Other b     Other i     h     Other Other k     Other Other Other Other
## [29] Other Other i     h     Other g     a     Other Other Other Other Other Other Other
## [43] Other Other Other Other Other Other Other g
## Levels: a b g h i k Other
# Use weighted frequencies
w <- c(rep(2, 25), rep(1, 25))
fct_lump_n(x, n = 5, w = w)
##  [1] d     Other d     h     Other Other e     h     e     e     Other e     f     e
## [15] Other h     c     Other d     Other h     f     f     Other f     f     e     e
## [29] d     c     Other h     f     Other Other e     f     c     d     c     c     d
## [43] d     d     c     c     c     e     d     Other
## Levels: c d e f h Other
fct_lump_n(x, n = 6)
##  [1] d     Other d     h     i     Other e     h     e     e     i     e     f     e
## [15] Other h     c     Other d     i     h     f     f     Other f     f     e     e
## [29] d     c     i     h     f     Other Other e     f     c     d     c     c     d
## [43] d     d     c     c     c     e     d     Other
## Levels: c d e f h i Other
fct_lump_n(x, n = 6, ties.method = "max")
##  [1] d     Other d     h     i     Other e     h     e     e     i     e     f     e
## [15] Other h     c     Other d     i     h     f     f     Other f     f     e     e
## [29] d     c     i     h     f     Other Other e     f     c     d     c     c     d
## [43] d     d     c     c     c     e     d     Other
## Levels: c d e f h i Other
# Use fct_lump_min() to lump together all levels with fewer than `n` values
table(fct_lump_min(x, min = 10))
##
## Other
##    50
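The other_level argument is described in this section but never changed from its default in the transcript. The following is a small sketch of it on made-up data (the factor x and the label "Rare" are illustrative assumptions, not taken from the original): the two most frequent levels are kept and everything else is collected under a custom name.

```r
library(forcats)

# Hypothetical data: two common levels (A, B) and two rare ones (C, D).
x <- factor(rep(c("A", "B", "C", "D"), times = c(5, 3, 1, 1)))

# Keep the 2 most frequent levels; lump the rest under a custom name.
lumped <- fct_lump_n(x, n = 2, other_level = "Rare")

levels(lumped)  # the lumped level is placed last
table(lumped)   # counts of C and D are added together
```

A custom name is useful when "Other" is already a genuine category in the data, as it was in the fct_drop() example earlier.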
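Functions for reordering levels
The functions for reordering the levels of a factor are:
- fct_rev(f): reverses the existing order of the levels.
- fct_shuffle(f, n = 1L): randomly permutes the existing order of the levels.
- fct_reorder(.f, .x, .fun = median, ..., .desc = FALSE)
- fct_reorder2(.f, .x, .y, .fun = last2, ..., .desc = TRUE)
- first2(.x, .y)
- last2(.x, .y)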
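fct_reorder() reorders the levels of the factor .f according to a summary of another variable .x. fct_reorder2() reorders the levels using two other variables, .x and .y; with the default .fun = last2, each level is ranked by the .y value that belongs to its largest .x value. The helper functions first2(.x, .y) and last2(.x, .y) return the first and the last value of .y, respectively, once .y has been sorted by .x.
Arguments:
- .f: the main factor variable.
- .x, .y: the other variables.
- .fun: the summary function.
- .desc = FALSE: sort in ascending order.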
f <- factor(c("a", "b", "c"))
fct_rev(f)
## [1] a b c
## Levels: c b a
fct_shuffle(f)
## [1] a b c
## Levels: a c b
fct_shuffle(f)
## [1] a b c
## Levels: c b a
##
df <- tibble::tribble(
  ~color,   ~a, ~b,
  "blue",    1,  2,
  "green",   6,  2,
  "purple",  3,  3,
  "red",     2,  3,
  "yellow",  5,  1
)
df$color <- factor(df$color)
##
fct_reorder(df$color, df$a, min)
## [1] blue   green  purple red    yellow
## Levels: blue red purple yellow green
fct_reorder2(df$color, df$a, df$b)
## [1] blue   green  purple red    yellow
## Levels: purple red blue green yellow
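The transcript above does not show the .desc argument or call first2() and last2() directly, so the sketch below does both. The color and a vectors repeat the values from the example; the short x and y vectors are made up purely to show what the two helpers return.

```r
library(forcats)

# Same values as the color / a columns in the example above.
color <- factor(c("blue", "green", "purple", "red", "yellow"))
a <- c(1, 6, 3, 2, 5)

# .desc = TRUE puts the level with the largest summary of `a` first.
desc_levels <- levels(fct_reorder(color, a, .desc = TRUE))
desc_levels

# Hypothetical vectors: sorted by x, y reads 20, 30, 10.
x <- c(3, 1, 2)
y <- c(10, 20, 30)
first_y <- first2(x, y)  # y at the smallest x
last_y <- last2(x, y)    # y at the largest x
c(first_y, last_y)
```

last2() is the default .fun of fct_reorder2() because, in a line chart, ordering levels by the value at the right-hand end of each line makes the legend order match the line order.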