Uncle Chen's Blog


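This post covers basic data handling in R, mainly with the tidyverse collection of packages, which provides a set of practical functions for data science.

First, install the tidyverse.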

install.packages("tidyverse")

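The first time you install a package, R asks you to choose a CRAN repository mirror; just pick one near you. I chose the National Taiwan University (NTU) mirror. If you want to switch mirrors later, the steps in PyCharm are as follows:

Open the R Tools side bar and switch to Packages.
Click the "+" at the top.
Click Manage Repositories.
In the dialog that appears, click @CRAN@, then click the pencil button above it.
In the next dialog, choose a repository.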


For RStudio, the steps are easy to find with a web search.
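Independently of the IDE, the mirror can also be set from the R console. The snippet below is a minimal sketch for the current session only; the URL is the general-purpose cloud mirror and is used here as an assumption, not the mirror chosen in this post.

```r
# Set the CRAN mirror for the current R session only.
# The URL is an example; substitute the mirror closest to you.
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Check which repository install.packages() will now use.
getOption("repos")
```

To make the choice permanent, the same options() call can be placed in a startup file such as .Rprofile.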

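Data input

The data input package readr

readr is a tidyverse package dedicated to reading data. read_csv() reads .csv files and read_delim() reads files with other delimiters. Excel files are read with read_excel(), which comes from the separate readxl package rather than from readr.

read_delim() is defined as follows (this is the readr 1.x signature; later versions add more arguments):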

read_delim(
  file,
  delim,
  quote = "\"",
  escape_backslash = FALSE,
  escape_double = TRUE,
  col_names = TRUE,
  col_types = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  quoted_na = TRUE,
  comment = "",
  trim_ws = FALSE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  progress = show_progress(),
  skip_empty_rows = TRUE
)

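The arguments are:

file: path and file name
delim: the field delimiter
quote: the character used to quote strings (text values are often wrapped in double quotes)
escape_backslash: default FALSE; whether backslashes are used to escape special characters
escape_double: default TRUE; whether a quote inside a quoted string is escaped by doubling it
col_names: TRUE to take the variable names from the first row, FALSE to generate them automatically, or a character vector of names
col_types: the type of each variable
na: the strings to treat as missing values (NA)
comment: the comment marker; text after it is not read
trim_ws: whether to trim leading and trailing whitespace from each value
skip: number of lines to skip before reading data
n_max: maximum number of rows to read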

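library(tidyverse)
library(readr)    # already attached by library(tidyverse); shown for clarity
## read a .csv file
dd <- readr::read_csv("./Data/DMTKAInfMo.csv")

## readxl is installed with the tidyverse but has to be loaded separately
library(readxl)
## read an Excel file
dd <- readxl::read_excel("C:/RData/DMTKAInfMo.xls")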
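To see a few of the read_delim() arguments in action without a file on disk, here is a minimal sketch. It assumes readr 2.0 or later, where literal data can be passed by wrapping a string in I(); the column names and values are made up for illustration.

```r
library(readr)

# Literal, semicolon-delimited data (hypothetical values).
raw <- I("id;name;score\n1;Alice;90\n2;Bob;NA\n")

scores <- read_delim(
  raw,
  delim = ";",                 # fields are separated by semicolons
  na = c("", "NA"),            # these strings become NA
  col_types = cols(            # declare the types instead of guessing them
    id = col_integer(),
    name = col_character(),
    score = col_double()
  )
)

scores
```

Declaring col_types explicitly avoids relying on type guessing, which only inspects the first guess_max rows.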

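Tidy data

Tidy data can be thought of as a standard for data that will be processed in R. Its basic requirements are:

Each variable forms its own column.
Each row is a single observation (one measurement occasion).
Each file holds a single data table (sheet).
Each column holds exactly one variable and has a clear variable name.
If the full dataset spans several tables, each table needs an index or identifier variable (id) so that the tables can be linked and joined.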

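Tibble and Data Frame

Data read with readr is stored as a tibble. A tibble is almost the same as a data.frame; it only adds a few properties that make it convenient to work with in the tidyverse.

as.data.frame() converts a tibble to a data.frame, and as_tibble() converts a data.frame to a tibble.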
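The round trip between the two classes can be sketched as follows. The toy data frame is made up; the last two lines show one of the small behavioural differences, namely that single-column subsetting with [ keeps a tibble as a tibble but drops a data.frame to a vector.

```r
library(tibble)

# A small, hypothetical data.frame.
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

tb <- as_tibble(df)          # data.frame -> tibble
back <- as.data.frame(tb)    # tibble -> data.frame

class(tb)     # a tibble is still a data.frame underneath
class(back)

# Single-column subsetting behaves differently:
is.data.frame(tb[, "id"])    # TRUE: still a tibble
is.data.frame(df[, "id"])    # FALSE: dropped to a plain vector
```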

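The pipe operator

The operator is %>%, called the pipe. It is provided by the magrittr package, which is part of the tidyverse.

The left-hand side is usually a data object, such as a data frame, matrix, or vector, and the right-hand side is usually a function. The object on the left is passed automatically as the first argument of the function on the right.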

library(magrittr)
c(1:10) %>% mean() %>% log()
## [1] 1.705
## equivalent to log(mean(c(1:10)))
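The "first argument" rule also holds when the right-hand function takes further arguments. A small sketch with made-up numbers:

```r
library(magrittr)

# The left-hand side becomes the first argument of the right-hand call.
a <- c(1, 4, 9) %>% sqrt() %>% sum()   # same as sum(sqrt(c(1, 4, 9)))
b <- 3.14159 %>% round(2)              # same as round(3.14159, 2)

a
b
```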

Inspecting data with glimpse()

After reading data, inspect it to confirm that it was read correctly. The tibble package in the tidyverse provides glimpse() for this; it is similar to str() in base R.

The output looks like this:

glimpse(dd)
## Rows: 137
## Columns: 8
## $ treat    <fct> placebo, placebo, placebo, placebo, placebo, placebo, placebo, pla...
## $ cellcode <fct> squamous, squamous, squamous, squamous, squamous, squamous, squamo...
## $ time     <int> 72, 411, 228, 126, 118, 10, 82, 110, 314, 100, 42, 8, 144, 25, 11,...
## $ censor   <fct> dead, dead, dead, dead, dead, dead, dead, dead, dead, survival, de...
## $ diagtime <int> 60, 70, 60, 60, 70, 20, 40, 80, 50, 70, 60, 40, 30, 80, 70, 60, 60...
## $ kps      <int> 7, 5, 3, 9, 11, 5, 10, 29, 18, 6, 4, 58, 4, 9, 11, 3, 9, 2, 4, 4, ...
## $ age      <int> 69, 64, 38, 63, 65, 49, 69, 68, 43, 70, 81, 63, 63, 52, 48, 61, 42...
## $ prior    <fct> no, yes, no, yes, yes, no, yes, no, no, no, no, yes, no, yes, yes,...

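Data processing

Data processing is done mainly with dplyr, the tidyverse package for manipulating and summarising data once it has been read.

Selecting rows: filter()

As the name suggests, this function acts as a filter, much like filter in languages such as JS and Java. Excel offers a similar way to filter by condition.

For example, to select the rows of the survVATrial.csv data shown above where treat is placebo and cellcode is large: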

dd.a <- dd %>% filter(treat == 'placebo', cellcode == 'large')

Sorting by variable values: arrange()

Sorting is ascending by default; wrap a variable in desc() for descending order.

## sort by age, ascending
dd.s <- dd %>% arrange(age)

## sort by time, descending
dd.s <- dd %>% arrange(desc(time))

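Selecting variables (columns): select()

Selecting columns creates a subset that holds only the variables needed for the analysis, which keeps the working data smaller and can speed up later steps.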

## select the three variables treat, cellcode, and censor from dd
dd.c <- dd %>% select(treat, cellcode, censor)

Transforming variables: mutate()

mutate() transforms existing variables and stores each result as a new column.

dd.a <- dd %>% mutate(
    log_age = log(age),
    diag_age = diagtime * age / 100
)
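The four verbs introduced so far are usually chained in a single pipeline. The sketch below uses a small made-up tibble (toy) standing in for dd, so the column values and the derived variable time_per_age are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data standing in for `dd`.
toy <- tibble(
  treat = c("placebo", "test", "placebo", "test"),
  age   = c(69, 64, 38, 63),
  time  = c(72, 411, 228, 126)
)

result <- toy %>%
  filter(treat == "placebo") %>%          # keep placebo rows only
  mutate(time_per_age = time / age) %>%   # add a derived column
  select(treat, age, time_per_age) %>%    # keep three columns
  arrange(desc(age))                      # oldest first

result
```

Each step receives the output of the previous one, so the order matters: select() here runs after mutate() because the new column needs time, which is dropped afterwards.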

Ternary-style function: if_else()

This function resembles the ternary operator ?: in C-family languages. It is defined as follows:

if_else(condition, true, false, missing = NULL)

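Explanation: for each element, if condition is TRUE the value of true is returned, otherwise the value of false. The missing argument (NULL by default) sets the value to use where condition is NA.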

x <- -10:10
if_else(x > 0, 0, 1)
## [1] 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
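The role of missing is easier to see with an input that contains NA. A minimal sketch with made-up values:

```r
library(dplyr)

x <- c(-2, 0, 3, NA)

# Where the condition is NA, the `missing` value is used.
labels <- if_else(x > 0, "positive", "non-positive", missing = "unknown")
labels
## [1] "non-positive" "non-positive" "positive"     "unknown"
```

Note that true, false, and missing must be of compatible types; mixing, say, a string and a number raises an error, which is stricter than base ifelse().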

Renaming variables: rename()

Variables can be renamed:

## syntax: new_name = old_name
dd.new <- dd %>% rename(drug = treat)

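Removing missing data: drop_na()

drop_na() from the tidyr package removes rows that contain missing values. Note that the whole row is removed: by default, if any variable in a row is NA, that row is dropped entirely.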

## remove every row that contains a missing value
dd.mis %>% drop_na()

## remove only the rows whose age is missing
dd.mis %>% drop_na(age)
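The difference between the two calls can be checked on a small example. The tibble toy_mis below is hypothetical and only stands in for a dataset with missing values such as dd.mis.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with missing values in two different columns.
toy_mis <- tibble(
  id  = 1:4,
  age = c(69, NA, 38, 63),
  kps = c(7, 5, NA, 9)
)

complete_rows <- toy_mis %>% drop_na()      # drops id 2 and id 3
age_known     <- toy_mis %>% drop_na(age)   # drops id 2 only

complete_rows
age_known
```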

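Random sampling: sample_n() and sample_frac()

These two functions draw a random sample of rows. Their arguments are:

size = k: the number of rows to draw (sample_n()) or the fraction of rows to draw (sample_frac()).
weight: the sampling weights for the rows; if not set, all rows are equally likely.
replace = FALSE: whether to sample with replacement.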

dd %>% sample_n(size = 5, replace = FALSE)
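For completeness, here is a sketch of fractional sampling and of sampling with replacement, on a made-up ten-row tibble. slice_sample() is the newer dplyr interface for the same task; using it here is a choice of this example.

```r
library(dplyr)

set.seed(1)                      # make the random draws reproducible
toy <- tibble(id = 1:10)         # hypothetical data

half <- toy %>% sample_frac(size = 0.5)                 # 50% of the rows
boot <- toy %>% slice_sample(n = 10, replace = TRUE)    # bootstrap-style resample

nrow(half)
nrow(boot)
```

With replace = TRUE the same row may appear more than once, which is why the resample can have as many rows as the original.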

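Distinct rows: distinct() and n_distinct()

For example, suppose a data frame holds student records for a school, with columns such as name, sex, and grade. If several students share the same name, distinct() quickly finds the unique, non-repeating rows of the data frame. n_distinct() counts the unique values instead of returning them.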
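The student scenario can be sketched as follows; the names and grades are invented for illustration.

```r
library(dplyr)

# Hypothetical student records; row 3 is an exact duplicate of row 1,
# and "Ben" appears in two different grades.
students <- tibble(
  name  = c("Amy", "Ben", "Amy", "Cal", "Ben"),
  grade = c(1, 2, 1, 3, 3)
)

unique_rows  <- students %>% distinct()                         # drop exact duplicate rows
unique_names <- students %>% distinct(name)                     # unique values of one column
first_each   <- students %>% distinct(name, .keep_all = TRUE)   # first row per name, all columns
n_names      <- n_distinct(students$name)                       # number of unique names

unique_rows
first_each
n_names
```

distinct() with no arguments compares whole rows, so the two "Ben" records both survive because their grades differ.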

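Selecting rows by position: slice()

slice() is a family of functions that select rows by row index.

slice(): selects rows by their position
slice_head(): selects the first rows
slice_tail(): selects the last rows
slice_min(): selects the rows with the smallest values of a variable
slice_max(): selects the rows with the largest values of a variable
slice_sample(): selects rows at random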

## the first row
dd %>% slice(1)

## rows 1 to 3
dd %>% slice(1:3)

## row 101 to the last row
dd %>% slice(101:n())

## all rows except rows 1 to 100
dd %>% slice(-c(1:100))

## the first three rows
dd %>% slice_head(n = 3)

## the last three rows
dd %>% slice_tail(n = 3)

## the three rows with the smallest time
dd %>% slice_min(time, n = 3)

## the three rows with the largest time
dd %>% slice_max(time, n = 3)

## three rows chosen at random
set.seed(1)
dd %>% slice_sample(n = 3)

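Computing summary statistics: summarise()

summarise() computes common statistics such as counts, means, and variances. It returns them as a new data frame with one column per statistic (and one row per group), rather than adding them to the original data.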

dd %>%
  summarise(
    count = n(),
    age_mean = mean(age, na.rm = TRUE),
    age_sd = sd(age, na.rm = TRUE)
  )

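Grouped operations: group_by()

Analyses often need to split the data by a categorical variable and then operate on, or summarise, each group separately. Pass the categorical variable to group_by(), and the steps that follow are carried out within each group.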

dd %>%
  group_by(treat) %>%
  summarise(
    diagtime_mean = mean(diagtime, na.rm = TRUE),
    diagtime_sd = sd(diagtime, na.rm = TRUE)
  )
## # A tibble: 2 x 3
##   treat   diagtime_mean diagtime_sd
##   <fct>           <dbl>       <dbl>
## 1 placebo          59.2        18.7
## 2 test             57.9        21.4

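Summarising many variables: summarise_all()

With summarise(), each statistic has to be written out one variable at a time. To apply the same operation to many variables at once, use the following functions:

summarise_all(): applies the same operation to every variable
summarise_each(): applies the same operation to every variable, or to the variables named (deprecated)
summarise_at(): applies the same operation to selected variables, which must be named
summarise_if(): applies the same operation to the variables that meet a condition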

## summarise_all(): mean of every selected variable
con.mean <- dd %>%
  select(time, diagtime, kps, age) %>%
  summarise_all(mean, na.rm = TRUE)
con.mean
## # A tibble: 1 x 4
##    time diagtime   kps   age
##   <dbl>    <dbl> <dbl> <dbl>
## 1  122.     58.6  8.77  58.3

## standard deviation of every selected variable
con.sd <- dd %>%
  select(time, diagtime, kps, age) %>%
  summarise_all(sd, na.rm = TRUE)
con.sd
## # A tibble: 1 x 4
##    time diagtime   kps   age
##   <dbl>    <dbl> <dbl> <dbl>
## 1  158.     20.0  10.6  10.5

## stack the two results and label the rows
id <- c("mean", "sd")
comb <- rbind(con.mean, con.sd)
comb <- cbind(id, comb)
comb
##     id  time diagtime    kps   age
## 1 mean 121.6    58.57  8.774 58.31
## 2   sd 157.8    20.04 10.612 10.54
## several functions at once: an unnamed list() gives _fn1, _fn2 suffixes
dd %>% select(time, diagtime, kps, age) %>%
  summarise_all(list(mean, sd), na.rm = TRUE)
## # A tibble: 1 x 8
##   time_fn1 diagtime_fn1 kps_fn1 age_fn1 time_fn2 diagtime_fn2 kps_fn2 age_fn2
##      <dbl>        <dbl>   <dbl>   <dbl>    <dbl>        <dbl>   <dbl>   <dbl>
## 1     122.         58.6    8.77    58.3     158.         20.0    10.6    10.5

## lst() names the functions automatically, giving _mean and _sd suffixes
dd %>% select(time, diagtime, kps, age) %>%
  summarise_all(lst(mean, sd), na.rm = TRUE)
## # A tibble: 1 x 8
##   time_mean diagtime_mean kps_mean age_mean time_sd diagtime_sd kps_sd age_sd
##       <dbl>         <dbl>    <dbl>    <dbl>   <dbl>       <dbl>  <dbl>  <dbl>
## 1      122.          58.6     8.77     58.3    158.        20.0   10.6   10.5
## summarise_each(): deprecated, shown for reference only
dd %>%
  summarise_each(list(mean, sd), time, age) # not so useful
## Warning: `summarise_each_()` is deprecated as of dplyr 0.7.0.
## Please use `across()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 1 x 4
##   time_fn1 age_fn1 time_fn2 age_fn2
##      <dbl>   <dbl>    <dbl>   <dbl>
## 1     122.    58.3     158.    10.5
dd %>%
  summarise_each(lst(mean, sd), time, age) # not so useful
## # A tibble: 1 x 4
##   time_mean age_mean time_sd age_sd
##       <dbl>    <dbl>   <dbl>  <dbl>
## 1      122.     58.3    158.   10.5
## summarise_at(): variables given as a character vector
dd %>%
  summarise_at(c("time", "age"), mean, na.rm = TRUE)
## # A tibble: 1 x 2
##    time   age
##   <dbl> <dbl>
## 1  122.  58.3

## variables given with vars()
dd %>%
  summarise_at(.vars = vars(time, age), mean, na.rm = TRUE)
## # A tibble: 1 x 2
##    time   age
##   <dbl> <dbl>
## 1  122.  58.3

## named functions set the column suffixes
dd %>%
  summarise_at(.vars = vars(time, age),
               .funs = c(Mean = "mean", SD = "sd"), na.rm = TRUE)
## # A tibble: 1 x 4
##   time_Mean age_Mean time_SD age_SD
##       <dbl>    <dbl>   <dbl>  <dbl>
## 1      122.     58.3    158.   10.5
## summarise_if(): every numeric variable
dd %>%
  summarise_if(is.numeric, list(mean, sd), na.rm = TRUE)
## # A tibble: 1 x 8
##   time_fn1 diagtime_fn1 kps_fn1 age_fn1 time_fn2 diagtime_fn2 kps_fn2 age_fn2
##      <dbl>        <dbl>   <dbl>   <dbl>    <dbl>        <dbl>   <dbl>   <dbl>
## 1     122.         58.6    8.77    58.3     158.         20.0    10.6    10.5
dd %>%
  summarise_if(is.numeric, lst(mean, sd), na.rm = TRUE)
## # A tibble: 1 x 8
##   time_mean diagtime_mean kps_mean age_mean time_sd diagtime_sd kps_sd age_sd
##       <dbl>         <dbl>    <dbl>    <dbl>   <dbl>       <dbl>  <dbl>  <dbl>
## 1      122.          58.6     8.77     58.3    158.        20.0   10.6   10.5
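The deprecation warning above points to across(), which newer dplyr versions (1.0.0 and later) offer as a single replacement for the _all, _at, and _if variants. The following is a minimal sketch on a made-up tibble rather than on dd; the data and the .names pattern are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for `dd`.
toy <- tibble(
  treat = c("placebo", "placebo", "test", "test"),
  time  = c(72, 411, 228, 126),
  age   = c(69, 64, 38, 63)
)

# Like summarise_at(): chosen variables, several named functions.
by_name <- toy %>%
  summarise(across(
    c(time, age),
    list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE)),
    .names = "{.col}_{.fn}"
  ))

# Like summarise_if(): every numeric variable.
numeric_means <- toy %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

by_name
numeric_means
```

Extra arguments such as na.rm = TRUE are written inside the lambda (~ mean(.x, na.rm = TRUE)) rather than passed after the function name, which is the style recent dplyr versions recommend.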

Set operations: union and intersection

intersect(): intersection
union(): union
setdiff(): difference

tibble1 <- tibble(
  id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Charlie", "David", "Emily")
)
tibble2 <- tibble(
  id = c(3, 4, 5, 6, 7),
  name = c("Charlie", "David", "Emily", "Frank", "George")
)

## intersection: rows present in both
intersect(tibble1, tibble2)
## # A tibble: 3 × 2
##      id name
##   <dbl> <chr>
## 1     3 Charlie
## 2     4 David
## 3     5 Emily

## union: rows present in either
union(tibble1, tibble2)
## # A tibble: 7 × 2
##      id name
##   <dbl> <chr>
## 1     1 Alice
## 2     2 Bob
## 3     3 Charlie
## 4     4 David
## 5     5 Emily
## 6     6 Frank
## 7     7 George

## difference: rows in tibble1 that are not in tibble2
setdiff(tibble1, tibble2)
## # A tibble: 2 × 2
##      id name
##   <dbl> <chr>
## 1     1 Alice
## 2     2 Bob

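Joining data

Data is often stored in separate files, for example outpatient, inpatient, and laboratory files, and the records of one individual usually have to be merged or cleaned using an identifier (id) or a name (names). The identifier or variable used to link records across datasets is called the "key" or "index key".

The tidyverse provides several functions for this kind of merge.

inner_join(x, y): keeps only the rows that have a match in both x and y, with the variables of both
left_join(x, y): keeps all rows and variables of x, adding the variables of y where a match exists
right_join(x, y): keeps all rows and variables of y, adding the variables of x where a match exists
full_join(x, y): keeps all rows and variables of both x and y
semi_join(x, y): keeps the rows of x that have a match in y, with the variables of x only
anti_join(x, y): keeps the rows of x that have no match in y, with the variables of x only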

set.seed(1)
## keep five variables, add an id, and keep the first 10 individuals
df <- dd %>%
  select(treat, cellcode, time, censor, age) %>%
  mutate(id = 1:n()) %>%
  filter(id <= 10)
## x and y each hold a random 7 of the 10 individuals, with different variables
x <- df %>%
  select(id, treat, time, age) %>%
  sample_n(size = 7, replace = FALSE) %>%
  arrange(id)
y <- df %>%
  select(id, cellcode, censor, age) %>%
  sample_n(size = 7, replace = FALSE)  %>%
  arrange(id)
x
## # A tibble: 7 x 4
##      id treat    time   age
##   <int> <fct>   <int> <int>
## 1     1 placebo    72    69
## 2     2 placebo   411    64
## 3     3 placebo   228    38
## 4     4 placebo   126    63
## 5     5 placebo   118    65
## 6     7 placebo    82    69
## 7     9 placebo   314    43
y
## # A tibble: 7 x 4
##      id cellcode censor     age
##   <int> <fct>    <fct>    <int>
## 1     1 squamous dead        69
## 2     2 squamous dead        64
## 3     3 squamous dead        38
## 4     5 squamous dead        65
## 5     6 squamous dead        49
## 6     7 squamous dead        69
## 7    10 squamous survival    70
## with no `by` argument, the join uses every shared column (here id and age)
inner_join(x, y)
## # A tibble: 5 x 6
##      id treat    time   age cellcode censor
##   <int> <fct>   <int> <int> <fct>    <fct>
## 1     1 placebo    72    69 squamous dead
## 2     2 placebo   411    64 squamous dead
## 3     3 placebo   228    38 squamous dead
## 4     5 placebo   118    65 squamous dead
## 5     7 placebo    82    69 squamous dead
left_join(x, y)
## # A tibble: 7 x 6
##      id treat    time   age cellcode censor
##   <int> <fct>   <int> <int> <fct>    <fct>
## 1     1 placebo    72    69 squamous dead
## 2     2 placebo   411    64 squamous dead
## 3     3 placebo   228    38 squamous dead
## 4     4 placebo   126    63 <NA>     <NA>
## 5     5 placebo   118    65 squamous dead
## 6     7 placebo    82    69 squamous dead
## 7     9 placebo   314    43 <NA>     <NA>
right_join(x, y)
## # A tibble: 7 x 6
##      id treat    time   age cellcode censor
##   <int> <fct>   <int> <int> <fct>    <fct>
## 1     1 placebo    72    69 squamous dead
## 2     2 placebo   411    64 squamous dead
## 3     3 placebo   228    38 squamous dead
## 4     5 placebo   118    65 squamous dead
## 5     7 placebo    82    69 squamous dead
## 6     6 <NA>       NA    49 squamous dead
## 7    10 <NA>       NA    70 squamous survival
full_join(x, y)
## # A tibble: 9 x 6
##      id treat    time   age cellcode censor
##   <int> <fct>   <int> <int> <fct>    <fct>
## 1     1 placebo    72    69 squamous dead
## 2     2 placebo   411    64 squamous dead
## 3     3 placebo   228    38 squamous dead
## 4     4 placebo   126    63 <NA>     <NA>
## 5     5 placebo   118    65 squamous dead
## 6     7 placebo    82    69 squamous dead
## 7     9 placebo   314    43 <NA>     <NA>
## 8     6 <NA>       NA    49 squamous dead
## 9    10 <NA>       NA    70 squamous survival
semi_join(x, y)
## # A tibble: 5 x 4
##      id treat    time   age
##   <int> <fct>   <int> <int>
## 1     1 placebo    72    69
## 2     2 placebo   411    64
## 3     3 placebo   228    38
## 4     5 placebo   118    65
## 5     7 placebo    82    69
anti_join(x, y)
## # A tibble: 2 x 4
##      id treat    time   age
##   <int> <fct>   <int> <int>
## 1     4 placebo   126    63
## 2     9 placebo   314    43
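When the key has a different name in each table, or when only some of the shared columns should be used for matching, the key can be stated explicitly with the by argument. The sketch below uses two made-up tables (patients and labs); their names, columns, and values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical tables whose key columns have different names.
patients <- tibble(id = c(1, 2, 3), age = c(69, 64, 38))
labs     <- tibble(patient_id = c(2, 3, 4), kps = c(5, 3, 9))

# Match patients$id against labs$patient_id.
left  <- left_join(patients, labs, by = c("id" = "patient_id"))
inner <- inner_join(patients, labs, by = c("id" = "patient_id"))
only_patients <- anti_join(patients, labs, by = c("id" = "patient_id"))

left
inner
only_patients
```

Stating by explicitly also suppresses the message dplyr prints when it has to guess the key, and it guards against an accidental match on a column that merely shares a name.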

[R Lang] R Lang and Advanced Medical Statistics (8)

https://blog.kynix.tw/posts/1731066650156/

Author

Adrian Chen

Created

April 12, 2024

License

BY-NC-SA 4.0

Attribution-NonCommercial-ShareAlike 4.0 International