[讀書筆記] Managing gigabytes: Ch.3 Indexing

12月 12, 2020 分類 Indexing

第二章所介紹的壓縮技巧的確大幅度地節省了硬碟空間，但是還是無法解決如何有效率地對資訊做搜索的這個問題。本章就是要透過 index 來解決這個問題。本章將介紹各種索引方法以及索引壓縮 (index compression)。

比較：從一本書的index來查詢某個字出現的地方 v.s. 一頁一頁地找出某個詞 (page-by-page search)

詞彙定義

文檔 (document)
文檔集合 (document collection)
詞條 (term)

設計師必須考慮：

多大的單位算作一個文檔?

一句話？一個段落？一篇文章？數篇文章？

A document will thus be the unit of text that is returned in response to queries. (p. 104, par. 2)

索引的粒度 (The granularity of index, p. 105)

對於一個詞出現在每個文檔中的位置資訊，我們要記錄的多細？有時候我們只需要知道包含「漢堡」的文檔有哪些就好，所以這時候我們只需紀錄：

漢堡：文檔1、文檔3、…

但如果我們想查詢同時出現「漢堡」和「薯條」的句子，那麼我們還必須更詳細地記錄這些詞出現的位置：

漢堡: (文檔1, 句子3), (文檔2, 句子5), …
薯條: (文檔1, 句子5), (文檔2, 句子5), …

如此我們才能比對抽取出哪些句子同時出現漢堡跟薯條。

什麼算作一個詞條?

is跟was算作兩個不同的詞條嗎？

相關議題：transformation (p. 105-6 & Ch. 3.7)

case folding
stemming
stop words
thesaural substitution (同義詞需不需要合併成一個term? 本章不討論)

Ch. 3.1: 文檔集合之樣本 (Sample document collections)

這一節以四個文章集合當作實驗對象，並分別列出一些統計數據：

	Bible	GNUbib	Comact	TREC
文章總數 N	31,101	64,343	261,829	741,856
詞條總數 F	884,994	2,570,906	22,805,920	333,338,738
詞條類別總數 n	8,965	46,488	36,660	535,346
*索引指標 f	701,412	2,226,300	12,976,418	134,994,414
總大小 (MB)	4.33	14.05	131.86	2070.29

*索引指標 (Index pointer) 所有 (文檔, 詞條) 組的數量，也就是索引的總大小。

Ch. 3.2: 索引方法一: Inverted file indexing

An index is a mechanism for locating a given term in a text.

適合作為索引的資料結構：

inverted file (又稱 postings file, 或者 concordance) (本節; 是最合適的)
signature files (Ch. 3.5) (使用時機有限)
bitmaps (Ch. 3.5) (使用時機有限)

在大多數的應用中，比起 signature files 和 bitmaps，inverted files 在索引大小和處理搜索的速度都有較好的表現。(p. 111)

一個 inverted file 包含了 lexicon 當中每一個 term 的 inverted list (又稱為 postings list)，每個 inverted list 都儲存了指向該 term 所出現位置的 list of pointers，每個 pointer (又稱為 postings) 其實指向該詞出現的某篇文檔。

一本書後面的 Index 其實就是一個 inverted file的格式。

每個 inverted file index都需要 lexicon，lexicon就是 a list of all terms that appear in the database。

接著討論又回到了索引的粒度。紀錄的位置資訊越詳細，索引的大小當然就越大。

最後討論了未經壓縮的inverted files的尺寸大小，常常會佔原始文檔集合的 50 ~ 100%。

Ch. 3.3: Inverted file compression

最簡單的方法：d-gaps (p. 114-5)

各種方法分類：（這部分未讀）

Global methods
Local methods

Ch. 3.4: 比較各種索引壓縮方法的表現

未讀。

Ch. 3.5: 其他兩個索引方法

Signature files

Bitmaps

Ch. 3.6: 比較三種索引方法

本節結論（也就是最後一句話）：經過壓縮的inverted files無疑是最好的方法，最適用於巨大的文檔集合。

Ch. 3.7: What should be indexed?

本節處理在 p.106 已帶出的一些議題，到底在一個 inverted file 當中，如何定義一個詞條是什麼？需要全部轉換成大寫嗎？要以詞根形式作為詞條嗎？是否要去除停用詞？

Case folding
Stemming
Effect on index size
Stop words

留言與分享

B-tree

12月 11, 2020

Disk structure

每次對硬碟做讀寫時，總是以Block作為讀寫單位，每個Block的大小通常是512bytes。一個Block的地址可以由(Sector no., Track no.)來表達。
一個block上的每個byte也有自己的位置，稱為offset，從0到511。
所以要存取硬碟上某一個block的某一個byte，我們需要知道以下三個資訊：第幾個Track、第幾個Sector、第幾個offset。

參考資料:

How is data stored and retrieved inside a disk in a computer?

留言與分享

SQLite運作原理筆記

12月 10, 2020 分類資料庫

學習資源

留言與分享

在Colab上讓matplotlib顯示中文(2020/12更新)

12月 03, 2020 分類 Colab

# 先下載台北黑體字型
!wget -O taipei_sans_tc_beta.ttf https://drive.google.com/uc?id=1eGAsTN1HBpJAkeVM57_C7ccp7hbgSz3_&export=download

import matplotlib

# 新增字體
matplotlib.font_manager.fontManager.addfont('taipei_sans_tc_beta.ttf')

# 將 font-family 設為 Taipei Sans TC Beta
# 設定完後，之後的圖表都可以顯示中文了
matplotlib.rc('font', family='Taipei Sans TC Beta')

留言與分享

[讀書筆記] Managing gigabytes: Ch.3 Indexing

詞彙定義

設計師必須考慮：

Ch. 3.1: 文檔集合之樣本 (Sample document collections)

Ch. 3.2: 索引方法一: Inverted file indexing

Ch. 3.3: Inverted file compression

Ch. 3.4: 比較各種索引壓縮方法的表現

Ch. 3.5: 其他兩個索引方法

Signature files

Bitmaps

Ch. 3.6: 比較三種索引方法

Ch. 3.7: What should be indexed?

B-tree

目錄

Disk structure

Disk structure

SQLite運作原理筆記

學習資源

在Colab上讓matplotlib顯示中文(2020/12更新)

puerdon