Author: Baayen
Year: 2013
Title: Analyzing Linguistic Data: A Practical Introduction to Statistics Using R

CH1: An introduction to R

Factor: A non-numeric predictor (p. 9)
Levels: Values of a factor (p. 9)
Contingency table (crosstab): (p. 13)
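
A contingency table counts how often each combination of factor levels occurs. The sketch below illustrates the idea in Python rather than the book's R (where xtabs() does this job); the toy data and the `crosstab` helper are hypothetical and serve only as illustration.

```python
from collections import Counter

# Hypothetical toy data: each observation pairs the levels of two factors.
observations = [
    ("animate", "NP"), ("animate", "PP"), ("inanimate", "NP"),
    ("animate", "NP"), ("inanimate", "PP"), ("inanimate", "PP"),
]

def crosstab(pairs):
    """Count how often each (row level, column level) combination occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

table = crosstab(observations)
for row, cells in table.items():
    print(row, cells)
```

Each cell of the table holds a count, and the cells together add up to the number of observations.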

CH2: Graphical data exploration

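2.1 What is a random variable?

The outcome of an experiment is a random variable.

Example 1: In a coin-tossing experiment, the tossed coin is a random variable with two possible outcomes.
Example 2: A die-rolling experiment involves one random variable with six possible outcomes.
Example 3: In a lexical decision experiment, the participant presses one of two buttons, "correct" or "incorrect", within a time limit. This experiment involves two random variables: the first is the response time, which takes continuous values; the second is correctness, which has two outcomes, correct and incorrect.

"Random" means that we do not know in advance which outcome will occur. Every random variable is associated with a probability distribution, which describes how likely each of its outcomes is.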
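
The die example can be made concrete with a small Python sketch (Python is used here for illustration; the book works in R). It assumes a fair die, and the seed and number of rolls are arbitrary choices.

```python
from fractions import Fraction
import random

# Probability distribution of a fair six-sided die (fairness is assumed).
die = {face: Fraction(1, 6) for face in range(1, 7)}

# The probabilities of all possible outcomes sum to 1.
total = sum(die.values())

# Simulate the experiment: no single roll is known in advance, but over
# many rolls the relative frequencies approach the probabilities.
rng = random.Random(1)  # fixed seed so the run is repeatable
rolls = [rng.choice(list(die)) for _ in range(60_000)]
rel_freq = {face: rolls.count(face) / len(rolls) for face in die}

print(total)
print(rel_freq)
```

The distribution describes what to expect over many repetitions, not the result of any one roll.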

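2.2 Visualizing a single random variable

Bar chart vs. histogram

Bar charts are used for discrete variables or factors.
Histograms are used for continuous variables; when drawn on the density scale, the areas of the bars sum to 1.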
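
On the density scale, the height of each bar is the proportion of observations in its bin divided by the bin width, which is why the bar areas add up to 1. A minimal Python sketch of that calculation, using made-up reaction times and a hypothetical `density_histogram` helper:

```python
def density_histogram(values, n_bins):
    """Return (bin_width, densities) for equal-width bins over the data range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Keep the maximum value in the last bin rather than one past the end.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    # density = proportion of observations in the bin / bin width
    densities = [c / (len(values) * width) for c in counts]
    return width, densities

# Hypothetical reaction times in milliseconds.
rts = [412, 455, 470, 498, 503, 521, 540, 577, 610, 689, 742, 905]
width, densities = density_histogram(rts, n_bins=5)

# Area of each bar = height * width; summed over all bars.
area = sum(d * width for d in densities)
print(area)
```

A histogram of raw counts does not have this property; only the rescaled (density) version does.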

Why is a log transformation of the scale so often needed? (bottom of p. 31)

It reduces the skewness of the random variable's distribution.
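
The effect can be checked numerically. The Python sketch below computes a simple moment-based skewness before and after taking logs; the frequency values are invented for illustration and the `skewness` helper is not taken from the book.

```python
import math

def skewness(values):
    """Moment-based skewness: mean cubed deviation over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical word frequencies: many rare words, a few very frequent ones.
freqs = [1, 1, 2, 2, 3, 5, 8, 13, 40, 120, 900, 7000]

raw_skew = skewness(freqs)
log_skew = skewness([math.log(f) for f in freqs])
print(raw_skew, log_skew)
```

For data like these, the long right tail is pulled in by the logarithm, so the log-transformed values are noticeably less skewed than the raw ones.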

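2.3 Visualizing two or more variables

mosaic plot:
- first convert the data into a contingency table (xtabs())
- then pass the table to mosaicplot()
scatter plot: shows the relationship between two continuous random variables.
- heteroskedasticity
- correlation
- scatter smoother
pairs plot (scatterplot matrix)
- suited to more than two random variables: compares them pair by pair to show their relationships
  - multicollinearity (p. 37)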
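
The correlation that a scatter plot shows visually can also be expressed as a single number. A minimal Python sketch of the Pearson correlation coefficient follows; the data are invented and `pearson_r` is a hypothetical helper, not a function from the book.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: log word frequency and mean reaction time (ms).
log_freq = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mean_rt = [720, 690, 655, 640, 600, 590]

r = pearson_r(log_freq, mean_rt)
print(round(r, 3))
```

In a pairs plot, predictors whose pairwise correlations are close to 1 or -1 are the ones to watch for multicollinearity.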

2.4 Trellis graphics

CH3: Probability distributions

Many statistical tests exploit the properties of the probability distributions of random variables.

3.1 Distributions

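A random variable's probability distribution specifies how likely each of its outcomes is. Random variables come in two kinds:

Discrete
Continuous

3.2 Discrete distributions

Problem: The CELEX lexical database (Baayen et al., 1995) lists the frequency of each English word in a corpus of 18.6 million words. The function word "the" occurs 1,093,547 times, a relative frequency of 0.05885575 (which can be taken as an estimate of the probability of "the" in English). Now consider the older Brown corpus of 1 million words. Given the probability 0.05885575, we would expect "the" to occur 58,856 times, but it actually occurs 69,971 times. The question is whether the difference between the expected and the observed count is too large.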
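
One rough way to size up this difference is to treat the Brown count as a binomial count and ask how many standard deviations it lies from the expected value. The Python sketch below does this with the figures quoted above. It is a back-of-the-envelope check, not necessarily the procedure the book follows, and it assumes that word tokens are independent trials.

```python
import math

# Figures quoted in the notes.
p = 0.05885575      # estimated probability of "the" (from CELEX)
n = 1_000_000       # size of the Brown corpus in words
observed = 69_971   # occurrences of "the" found in Brown

expected = n * p                  # mean of a binomial(n, p) count
sd = math.sqrt(n * p * (1 - p))   # standard deviation of that count
z = (observed - expected) / sd    # distance from expectation, in SDs

print(round(expected), round(sd, 1), round(z, 1))
```

Under these assumptions the observed count lies more than 40 standard deviations above the expected one, far beyond what chance variation would produce.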

Terminology:

probability of success (p)
probability of failure (1-p)
number of trials (n)
binomially distributed random variable with parameters p and n
We also need to distinguish two things:
- properties of the population
- properties of the sample
Poisson distribution
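
The binomial and Poisson distributions named above can be written out directly from their standard formulas. The Python sketch below does so; the function names and the example parameters are illustrative choices, not the book's.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), where lam is the expected count."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Ten fair coin tosses: probability of exactly five heads.
print(binomial_pmf(5, 10, 0.5))

# For large n and small p, Poisson(n * p) is close to Binomial(n, p).
n, p = 1000, 0.003  # hypothetical values
print(binomial_pmf(2, n, p), poisson_pmf(2, n * p))
```

The second comparison shows why the Poisson distribution is often used for counts of rare events: it needs only the expected count, not n and p separately.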

3.3 Continuous distributions

CH4: Basic statistical methods

The logic underlying statistical tests: a statistical test produces a test statistic whose distribution is known.

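What we want to know is whether this test statistic is extreme, so extreme that chance alone cannot explain it (the opposite of chance being necessity).

The usual approach is to set up a straw-man null hypothesis, under which the test statistic is not expected to be extreme. If the statistical test then yields an extreme test statistic, we reject the null hypothesis.

The book takes the standpoint of frequentist statistical inference rather than Bayesian inference.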
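
The testing logic can be illustrated with an exact binomial test on a coin. The Python sketch below is one possible illustration, not the book's code: the data (60 heads in 100 tosses), the 0.05 threshold, and the helper names are assumptions. The two-sided p-value is taken as the total probability, under the null hypothesis, of all outcomes no more likely than the observed one.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_p_value(observed, n, p0):
    """Two-sided exact p-value under the null hypothesis p = p0."""
    p_obs = binomial_pmf(observed, n, p0)
    # Small relative tolerance so ties are not lost to floating-point noise.
    return sum(
        prob
        for prob in (binomial_pmf(k, n, p0) for k in range(n + 1))
        if prob <= p_obs * (1 + 1e-7)
    )

# Null hypothesis (the straw man): the coin is fair, p0 = 0.5.
# Hypothetical data: 60 heads in 100 tosses.
p_value = binomial_p_value(60, 100, 0.5)
alpha = 0.05  # conventional significance level, assumed here
print(p_value, "reject H0" if p_value < alpha else "do not reject H0")
```

With these numbers the p-value comes out at roughly 0.057, just above 0.05, so the result is not extreme enough to reject the null hypothesis at that threshold.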
