[R Lang] R Lang 與高級醫學統計學(10)

2024 年 4 月 13 日
文章摘要
FakeGPT
加載中...
此內容由人工不智慧生成。

這篇文章講因子資料的處理。關於因子資料(Factor),我已經在本系列文章《[R Lang] R Lang與高級醫學統計學(2)》中詳細闡述。

對於因子資料的處理,可以使用 tidyverse 套件系統中的 forcats 套件進行處理。

基本函式

forcats 套件的基本函式包含:

  • fct_count(f, sort = FALSE, prop = FALSE):計算類別水準數目。
  • fct_unique(f):呈現專一類別水準名稱。
  • fct_c(f1, f2):合併不同類別水準的 2 個因子物件。
library(dplyr)
library(ggplot2)
library(forcats)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.0.3
set.seed(100)
letters[1:5]
## [1] "a" "b" "c" "d" "e"
f <- factor(sample(letters[1:5])[rpois(100, 5)])
table(f)
## f
## a b c d e
## 21 2 3 21 15
fct_count(f)
## # A tibble: 6 x 2
## f n
## <fct> <int>
## 1 a 21
## 2 b 2
## 3 c 3
## 4 d 21
## 5 e 15
## 6 <NA> 38
fct_count(f, sort = TRUE)
## # A tibble: 6 x 2
## f n
## <fct> <int>
## 1 <NA> 38
## 2 a 21
## 3 d 21
## 4 e 15
## 5 c 3
## 6 b 2
fct_count(f, sort = TRUE, prop = TRUE)
## # A tibble: 6 x 3
## f n p
## <fct> <int> <dbl>
## 1 <NA> 38 0.38
## 2 a 21 0.21
## 3 d 21 0.21
## 4 e 15 0.15
## 5 c 3 0.03
## 6 b 2 0.02
##
f1 <- factor(letters[1:3])
f2 <- factor(letters[c(1, 2, 23)])
f1
## [1] a b c
## Levels: a b c
f2
## [1] a b w
## Levels: a b w
fct_c(f1, f2)
## [1] a b c a b w
## Levels: a b c w

移除或增加類別水準

函式 fct_drop() 可以移除部分類別水準,函式 fct_expand() 可以增加因子變數中的類別水準。函式 fct_explicit_na 可以明確設性缺失值為 1 項類別水準。

f <- factor(c("F", "M"), levels = c("F", "M", "Other"))
f
## [1] F M
## Levels: F M Other
fct_drop(f)
## [1] F M
## Levels: F M
# Set only to restrict which levels to drop
fct_drop(f, only = "F")
## [1] F M
## Levels: F M Other
fct_drop(f, only = "Other")
## [1] F M
## Levels: F M
##
fct_expand(f, "B", "T")
## [1] F M
## Levels: F M Other B T
##
f <- factor(c("F", "M", "M", "F", "F", "B", "T", NA, NA))
f
## [1] F M M F F B T <NA> <NA>
## Levels: B F M T
fct_explicit_na(f)
## [1] F M M F F B T (Missing)
## [9] (Missing)
## Levels: B F M T (Missing)
fct_explicit_na(f, na_level = "Other")
## [1] F M M F F B T Other Other
## Levels: B F M T Other

改變或合併類別水準函式

系列函式 fct_lump() 可將部分類別水準合併。這個系列包含:

  • fct_lump_min():合併類別水準頻率計數低於設定的最小值。
  • fct_lump_prop():合併類別水準相對頻率低於設定的最小值。
  • fct_lump_n():合併類別水準最多 n 種主要類別。
  • fct_lump_lowfreq():合併類別水準,且確保 other 類別的頻率仍是最低。

函式宣告如下:

fct_lump(f, n, prop, w = NULL, other_level = "Other",
ties.method = c("min", "average", "first", "last", "random", "max"))
fct_lump_min(f, min, w = NULL, other_level = "Other")
fct_lump_prop(f, prop, w = NULL, other_level = "Other")
fct_lump_n(f, n, w = NULL, other_level = "Other",
ties.method = c("min", "average", "first", "last", "random", "max"))
fct_lump_lowfreq(f, other_level = "Other")

其中引數 f 為因子向量,n 設定最多 n 種主要類別,prop 設定正值百分率,合併小於 prop 的類別,設定負值百分率,合併大於 prop 的類別。w 設定權重。other_level 設定合併後的類別名稱。ties.method 處理相同排序方式。min 保留至少出現 min 次類別。

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
x %>% table()
## .
## A B C D E F G H I
## 40 10 5 27 1 1 1 1 1
x %>% fct_lump_n(3) %>% table()
## .
## A B D Other
## 40 10 27 10
x %>% fct_lump_prop(0.10) %>% table()
## .
## A B D Other
## 40 10 27 10
x %>% fct_lump_min(5) %>% table()
## .
## A B C D Other
## 40 10 5 27 5
x %>% fct_lump_lowfreq() %>% table()
## .
## A D Other
## 40 27 20
##
set.seed(123)
x <- factor(letters[rpois(50, 5)])
x
## [1] d g d h i b e h e e i e f e b h c b d i h f f k f f e e d c i h f g a e f c d c c d
## [43] d d c c c e d g
## Levels: a b c d e f g h i k
table(x)
## x
## a b c d e f g h i k
## 1 3 8 9 9 7 3 5 4 1
table(fct_lump_lowfreq(x))
##
## b c d e f g h i Other
## 3 8 9 9 7 3 5 4 2
## Use positive values to collapse the rarest
fct_lump_n(x, n = 3)
## [1] d Other d Other Other Other e Other e e Other e Other e
## [15] Other Other c Other d Other Other Other Other Other Other Other e e
## [29] d c Other Other Other Other Other e Other c d c c d
## [43] d d c c c e d Other
## Levels: c d e Other
fct_lump_prop(x, prop = 0.1)
## [1] d Other d Other Other Other e Other e e Other e f e
## [15] Other Other c Other d Other Other f f Other f f e e
## [29] d c Other Other f Other Other e f c d c c d
## [43] d d c c c e d Other
## Levels: c d e f Other
## Use negative values to collapse the most common
fct_lump_n(x, n = -3)
## [1] Other g Other Other Other b Other Other Other Other Other Other Other Other
## [15] b Other Other b Other Other Other Other Other k Other Other Other Other
## [29] Other Other Other Other Other g a Other Other Other Other Other Other Other
## [43] Other Other Other Other Other Other Other g
## Levels: a b g k Other
fct_lump_prop(x, prop = -0.1)
## [1] Other g Other h i b Other h Other Other i Other Other Other
## [15] b h Other b Other i h Other Other k Other Other Other Other
## [29] Other Other i h Other g a Other Other Other Other Other Other Other
## [43] Other Other Other Other Other Other Other g
## Levels: a b g h i k Other
## Use weighted frequencies
w <- c(rep(2, 25), rep(1, 25))
fct_lump_n(x, n = 5, w = w)
## [1] d Other d h Other Other e h e e Other e f e
## [15] Other h c Other d Other h f f Other f f e e
## [29] d c Other h f Other Other e f c d c c d
## [43] d d c c c e d Other
## Levels: c d e f h Other
fct_lump_n(x, n = 6)
## [1] d Other d h i Other e h e e i e f e
## [15] Other h c Other d i h f f Other f f e e
## [29] d c i h f Other Other e f c d c c d
## [43] d d c c c e d Other
## Levels: c d e f h i Other
fct_lump_n(x, n = 6, ties.method = "max")
## [1] d Other d h i Other e h e e i e f e
## [15] Other h c Other d i h f f Other f f e e
## [29] d c i h f Other Other e f c d c c d
## [43] d d c c c e d Other
## Levels: c d e f h i Other
## Use fct_lump_min() to lump together all levels with fewer than `n` values
table(fct_lump_min(x, min = 10))
##
## Other
## 50

類別重新排列函式

因子資料的類別重新排序函式有:

  • fct_rev(f):將反轉原有類別出現的排列順序。
  • fct_shuffle(f, n = 1L):將原有類別出現的排列順序隨機變更。
  • fct_reorder(.f, .x, .fun = median, …, .desc = FALSE)
  • fct_reorder2(.f, .x, .y, .fun = last2, …, .desc = TRUE)
  • first2(.x, .y)
  • last2(.x, .y)

fct_reorder() 將因子 f 類別出現的排列順序依照其他變數更動,fct_reorder2() 保留因子 f 原有類別出現的排列順序。當 y 變數依照 x 變數排序,函式 first2(.x, .y)last2(.x, .y) 可尋找 y 變數的最前與最後的 2 個數值。

引數解釋:

  • .f:為主要因子變數。
  • .x, .y:為其他變數。
  • .fun:為摘要函式。
  • .desc = FALSE:由小到大排序。
f <- factor(c("a", "b", "c"))
fct_rev(f)
## [1] a b c
## Levels: c b a
fct_shuffle(f)
## [1] a b c
## Levels: a c b
fct_shuffle(f)
## [1] a b c
## Levels: c b a
##
df <- tibble::tribble(
~color, ~a, ~b,
"blue", 1, 2,
"green", 6, 2,
"purple", 3, 3,
"red", 2, 3,
"yellow", 5, 1
)
df$color <- factor(df$color)
##
fct_reorder(df$color, df$a, min)
## [1] blue green purple red yellow
## Levels: blue red purple yellow green
fct_reorder2(df$color, df$a, df$b)
## [1] blue green purple red yellow
## Levels: purple red blue green yellow
[R Lang] R Lang 與高級醫學統計學(10)
https://blog.kynix.tw/posts/1731068651415/
作者
Adrian Chen
建檔時間
2024 年 4 月 13 日
協議
BY-NC-SA 4.0
姓名標示-非商業性-相同方式分享 4.0 國際