Uncle Chen's Blog (陳叔叔的部落格)

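This article covers the handling of factor data. I described factors in detail earlier in this series, in "[R Lang] R Lang and Advanced Medical Statistics (2)".

Factor data can be handled with the forcats package, which is part of the tidyverse.

Basic functions

The basic functions of the forcats package include:

fct_count(f, sort = FALSE, prop = FALSE): counts the observations at each level; sort = TRUE orders the result by count, and prop = TRUE adds a column of proportions.
fct_unique(f): returns each level of the factor once.
fct_c(f1, f2): concatenates 2 factor objects, combining their levels even when the levels differ.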

library(dplyr)
library(ggplot2)
library(forcats)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.0.3
set.seed(100)
letters[1:5]
## [1] "a" "b" "c" "d" "e"
f <- factor(sample(letters[1:5])[rpois(100, 5)])
table(f)
## f
##  a  b  c  d  e
## 21  2  3 21 15
fct_count(f)
## # A tibble: 6 x 2
##   f         n
##   <fct> <int>
## 1 a        21
## 2 b         2
## 3 c         3
## 4 d        21
## 5 e        15
## 6 <NA>     38
fct_count(f, sort = TRUE)
## # A tibble: 6 x 2
##   f         n
##   <fct> <int>
## 1 <NA>     38
## 2 a        21
## 3 d        21
## 4 e        15
## 5 c         3
## 6 b         2
fct_count(f, sort = TRUE, prop = TRUE)
## # A tibble: 6 x 3
##   f         n     p
##   <fct> <int> <dbl>
## 1 <NA>     38  0.38
## 2 a        21  0.21
## 3 d        21  0.21
## 4 e        15  0.15
## 5 c         3  0.03
## 6 b         2  0.02
##
f1 <- factor(letters[1:3])
f2 <- factor(letters[c(1, 2, 23)])
f1
## [1] a b c
## Levels: a b c
f2
## [1] a b w
## Levels: a b w
fct_c(f1, f2)
## [1] a b c a b w
## Levels: a b c w
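
The session above does not call fct_unique(), so here is a minimal sketch of the three basic functions on a small vector whose counts can be checked by hand. The blood-type data and the variable names are made up for illustration.

```r
library(forcats)

# Hypothetical toy data: blood types recorded for six patients
blood <- factor(c("A", "O", "B", "O", "A", "O"))

# fct_count() returns a tibble with one row per level and its count in column n
counts <- fct_count(blood, sort = TRUE)
counts

# fct_unique() returns each level once, in level order
types <- fct_unique(blood)
types

# fct_c() concatenates factors; the levels of the result are the union
# of the input levels, in order of first appearance
extra <- factor(c("AB", "A"))
combined <- fct_c(blood, extra)
levels(combined)
```

Because the inputs are tiny, the sorted counts (O: 3, A: 2, B: 1) are easy to confirm by eye.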

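Dropping or adding levels

fct_drop() removes unused levels from a factor, and fct_expand() adds levels to it. fct_explicit_na() turns missing values into an explicit level of their own. Note that forcats 1.0.0 deprecates fct_explicit_na() in favor of fct_na_value_to_level().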

f <- factor(c("F", "M"), levels = c("F", "M", "Other"))
f
## [1] F M
## Levels: F M Other
fct_drop(f)
## [1] F M
## Levels: F M
# Set only to restrict which levels to drop
fct_drop(f, only = "F")
## [1] F M
## Levels: F M Other
fct_drop(f, only = "Other")
## [1] F M
## Levels: F M
##
fct_expand(f, "B", "T")
## [1] F M
## Levels: F M Other B T
##
f <- factor(c("F", "M", "M", "F", "F", "B", "T", NA, NA))
f
## [1] F    M    M    F    F    B    T    <NA> <NA>
## Levels: B F M T
fct_explicit_na(f)
## [1] F         M         M         F         F         B         T         (Missing)
## [9] (Missing)
## Levels: B F M T (Missing)
fct_explicit_na(f, na_level = "Other")
## [1] F     M     M     F     F     B     T     Other Other
## Levels: B F M T Other
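
As a small sketch of what "dropping" and "expanding" mean here: fct_drop() only removes levels that have no observations, and fct_expand() only changes the set of levels, never the data. The survey data below is hypothetical.

```r
library(forcats)

# Hypothetical survey answers: "maybe" is a declared level that nobody chose
answer <- factor(c("yes", "no", "yes"), levels = c("yes", "no", "maybe"))

# fct_drop() removes only levels with zero observations,
# which matches what base droplevels() does for this factor
dropped <- fct_drop(answer)
levels(dropped)
identical(levels(dropped), levels(droplevels(answer)))

# fct_expand() appends new levels; the observed values are unchanged
expanded <- fct_expand(answer, "unsure")
levels(expanded)
table(expanded)
```

Expanding levels ahead of time can be handy when several data sets need to share the same set of categories before they are combined or tabulated.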

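Functions for changing or lumping levels

The fct_lump() family lumps some of a factor's levels together. The family includes:

fct_lump_min(): lumps levels whose count is below the minimum min.
fct_lump_prop(): lumps levels whose relative frequency is below prop.
fct_lump_n(): keeps the n most frequent levels and lumps the rest.
fct_lump_lowfreq(): lumps the least frequent levels while ensuring that the "Other" level remains the smallest.

The function signatures are: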

fct_lump(f, n, prop, w = NULL, other_level = "Other",
         ties.method = c("min", "average", "first", "last", "random", "max"))
fct_lump_min(f, min, w = NULL, other_level = "Other")
fct_lump_prop(f, prop, w = NULL, other_level = "Other")
fct_lump_n(f, n, w = NULL, other_level = "Other",
  ties.method = c("min", "average", "first", "last", "random", "max"))
fct_lump_lowfreq(f, other_level = "Other")

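Here f is a factor vector. n sets how many levels to keep: a positive n keeps the n most frequent levels, and a negative n keeps the -n least frequent ones. prop sets a proportion threshold: a positive prop lumps levels whose relative frequency is below prop, and a negative prop lumps levels whose relative frequency is above -prop. w is an optional vector of weights. other_level is the name given to the lumped level. ties.method controls how ties in rank are handled. min keeps the levels that appear at least min times.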

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
x %>% table()
## .
##  A  B  C  D  E  F  G  H  I
## 40 10  5 27  1  1  1  1  1
x %>% fct_lump_n(3) %>% table()
## .
##     A     B     D Other
##    40    10    27    10
x %>% fct_lump_prop(0.10) %>% table()
## .
##     A     B     D Other
##    40    10    27    10
x %>% fct_lump_min(5) %>% table()
## .
##     A     B     C     D Other
##    40    10     5    27     5
x %>% fct_lump_lowfreq() %>% table()
## .
##     A     D Other
##    40    27    20
##
set.seed(123)
x <- factor(letters[rpois(50, 5)])
x
##  [1] d g d h i b e h e e i e f e b h c b d i h f f k f f e e d c i h f g a e f c d c c d
## [43] d d c c c e d g
## Levels: a b c d e f g h i k
table(x)
## x
## a b c d e f g h i k
## 1 3 8 9 9 7 3 5 4 1
table(fct_lump_lowfreq(x))
##
##     b     c     d     e     f     g     h     i Other
##     3     8     9     9     7     3     5     4     2
## Use positive values to collapse the rarest
fct_lump_n(x, n = 3)
##  [1] d     Other d     Other Other Other e     Other e     e     Other e     Other e
## [15] Other Other c     Other d     Other Other Other Other Other Other Other e     e
## [29] d     c     Other Other Other Other Other e     Other c     d     c     c     d
## [43] d     d     c     c     c     e     d     Other
## Levels: c d e Other
fct_lump_prop(x, prop = 0.1)
##  [1] d     Other d     Other Other Other e     Other e     e     Other e     f     e
## [15] Other Other c     Other d     Other Other f     f     Other f     f     e     e
## [29] d     c     Other Other f     Other Other e     f     c     d     c     c     d
## [43] d     d     c     c     c     e     d     Other
## Levels: c d e f Other
## Use negative values to collapse the most common
fct_lump_n(x, n = -3)
##  [1] Other g     Other Other Other b     Other Other Other Other Other Other Other Other
## [15] b     Other Other b     Other Other Other Other Other k     Other Other Other Other
## [29] Other Other Other Other Other g     a     Other Other Other Other Other Other Other
## [43] Other Other Other Other Other Other Other g
## Levels: a b g k Other
fct_lump_prop(x, prop = -0.1)
##  [1] Other g     Other h     i     b     Other h     Other Other i     Other Other Other
## [15] b     h     Other b     Other i     h     Other Other k     Other Other Other Other
## [29] Other Other i     h     Other g     a     Other Other Other Other Other Other Other
## [43] Other Other Other Other Other Other Other g
## Levels: a b g h i k Other
## Use weighted frequencies
w <- c(rep(2, 25), rep(1, 25))
fct_lump_n(x, n = 5, w = w)
##  [1] d     Other d     h     Other Other e     h     e     e     Other e     f     e
## [15] Other h     c     Other d     Other h     f     f     Other f     f     e     e
## [29] d     c     Other h     f     Other Other e     f     c     d     c     c     d
## [43] d     d     c     c     c     e     d     Other
## Levels: c d e f h Other
fct_lump_n(x, n = 6)
##  [1] d     Other d     h     i     Other e     h     e     e     i     e     f     e
## [15] Other h     c     Other d     i     h     f     f     Other f     f     e     e
## [29] d     c     i     h     f     Other Other e     f     c     d     c     c     d
## [43] d     d     c     c     c     e     d     Other
## Levels: c d e f h i Other
fct_lump_n(x, n = 6, ties.method = "max")
##  [1] d     Other d     h     i     Other e     h     e     e     i     e     f     e
## [15] Other h     c     Other d     i     h     f     f     Other f     f     e     e
## [29] d     c     i     h     f     Other Other e     f     c     d     c     c     d
## [43] d     d     c     c     c     e     d     Other
## Levels: c d e f h i Other
## Use fct_lump_min() to lump together all levels with fewer than `n` values
table(fct_lump_min(x, min = 10))
##
## Other
##    50
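
The random samples above make it hard to verify the effect of w and other_level by hand, so the following is a minimal sketch on six observations. The diagnosis codes, the weights, and the label "rest" are all made up for illustration.

```r
library(forcats)

# Hypothetical diagnosis codes: "a" x3, "b" x2, "c" x1
dx <- factor(c("a", "a", "a", "b", "b", "c"))

# Keep the single most frequent level and name the lumped level "rest"
lumped <- fct_lump_n(dx, n = 1, other_level = "rest")
table(lumped)

# With weights, the lone "c" observation counts as 10 (versus 3 for "a"
# and 2 for "b"), so "c" becomes the level that is kept
w <- c(1, 1, 1, 1, 1, 10)
weighted <- fct_lump_n(dx, n = 1, w = w, other_level = "rest")
table(weighted)
```

Note that table() still reports unweighted counts: the weights only influence which levels are kept, not how many observations each level holds.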

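Functions for reordering levels

The functions for reordering the levels of a factor are:

fct_rev(f): reverses the order of the levels.
fct_shuffle(f, n = 1L): randomly permutes the order of the levels.
fct_reorder(.f, .x, .fun = median, …, .desc = FALSE)
fct_reorder2(.f, .x, .y, .fun = last2, …, .desc = TRUE)
first2(.x, .y)
last2(.x, .y)

fct_reorder() reorders the levels of the factor .f by a summary of another variable .x. fct_reorder2() reorders the levels by a summary of two variables, .x and .y; with the default .fun = last2, the summary is each level's .y value at its largest .x. The helpers first2(.x, .y) and last2(.x, .y) sort .y by .x and return the first and the last .y value, respectively.

Arguments:

.f: the factor to reorder.
.x, .y: the other variables used for ordering.
.fun: the summary function.
.desc = FALSE: sort in ascending order (smallest first).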

f <- factor(c("a", "b", "c"))
fct_rev(f)
## [1] a b c
## Levels: c b a
fct_shuffle(f)
## [1] a b c
## Levels: a c b
fct_shuffle(f)
## [1] a b c
## Levels: c b a
##
df <- tibble::tribble(
  ~color,     ~a, ~b,
  "blue",      1,  2,
  "green",     6,  2,
  "purple",    3,  3,
  "red",       2,  3,
  "yellow",    5,  1
)
df$color <- factor(df$color)
##
fct_reorder(df$color, df$a, min)
## [1] blue   green  purple red    yellow
## Levels: blue red purple yellow green
fct_reorder2(df$color, df$a, df$b)
## [1] blue   green  purple red    yellow
## Levels: purple red blue green yellow
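
To make the first2()/last2() helpers and the .desc argument more concrete, here is a small sketch with values chosen so the result can be worked out by hand. The site names and measurements are hypothetical.

```r
library(forcats)

# first2()/last2(): sort y by x, then take the first or the last y value
x <- c(3, 1, 2)
y <- c(10, 20, 30)
first2(x, y)  # y at the smallest x (x = 1)
last2(x, y)   # y at the largest x (x = 3)

# Hypothetical data: one measurement per study site
site <- factor(c("north", "south", "east"))
value <- c(2, 5, 9)

# Default: levels ordered from the smallest to the largest summary value
asc <- fct_reorder(site, value)
levels(asc)

# .desc = TRUE: the level with the largest summary value comes first
desc <- fct_reorder(site, value, .desc = TRUE)
levels(desc)
```

Sorting y = (10, 20, 30) by x = (3, 1, 2) gives (20, 30, 10), so first2() picks 20 and last2() picks 10. With one measurement per site, the default median summary is simply that measurement.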

[R Lang] R Lang and Advanced Medical Statistics (10)

https://blog.kynix.tw/posts/1731068651415/

Author

Adrian Chen

Created

April 13, 2024

License

BY-NC-SA 4.0

Attribution-NonCommercial-ShareAlike 4.0 International