Ch 1: Meet Lucene

1.5 Understanding the core indexing classes (p. 25)

IndexWriter: creates a new index or opens an existing one, then adds, deletes, or updates Documents in it.
Directory:
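Analyzer: before any text is indexed it must first be "analyzed" (i.e. tokens are extracted from the text) [see Ch. 4]
Document: a collection of Fields
Field: each Field holds a name:value pair, plus options that control how the field is indexed [see 2.4 (p. 43)]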

1.6 Understanding the core searching classes (p. 28)

IndexSearcher
Term
Query (Query itself is just an abstract class)
- BooleanQuery
- PhraseQuery
- PrefixQuery
- PhrasePrefixQuery
- TermRangeQuery
- NumericRangeQuery
- FilteredQuery
- SpanQuery
TermQuery
TopDocs

Ch 2: Building a search index

2.1 How Lucene models content

About term vectors:

What is term vector in Lucene

2.2 Understanding the indexing process

Extract the text and build a Document
Analysis
Add to the index
1. Uses an inverted index (p. 35 has a fairly good explanation)
2. Index segments: a data structure specific to Lucene
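
To make the inverted index idea concrete, here is a minimal sketch in plain Java. It is only a toy model: the class name ToyInvertedIndex, the naive lowercase/whitespace tokenizer, and the in-memory map are all assumptions made for illustration, and none of it reflects how Lucene actually lays out segments on disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index (hypothetical, not Lucene code):
// maps each term to the ascending list of document ids that contain it.
class ToyInvertedIndex {
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // "Analyze" the text with a naive lowercase/whitespace tokenizer,
    // then record one posting per (term, document) pair.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            List<Integer> docs = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Doc ids only grow, so checking the last entry is enough to avoid duplicates.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    // Look up the documents containing a term without scanning any document text.
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), new ArrayList<>());
    }
}
```

The point of the structure is that a search becomes a single map lookup on the term instead of a scan over every document.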

2.3 Basic index operation

Adding a document to the index

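import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// getWriter() is a helper assumed to return an open IndexWriter
IndexWriter writer = getWriter();

// Build the Document first, then add the Fields one by one
Document doc = new Document();
doc.add(new Field("id", "0", Field.Store.YES, Field.Index.NO));
doc.add(new Field("country", "Taiwan", Field.Store.YES, Field.Index.NO));

// The key step
writer.addDocument(doc);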

Deleting documents from the index

writer.deleteDocuments(new Term("ID", documentId)); // a Query can be passed instead of a Term

See p. 40 for more options on deleting documents.

Updating the index

// Signature: updateDocument(Term, Document, Analyzer); the Analyzer argument is optional
writer.updateDocument(new Term("ID", documentId), newDocument);

2.4 Field options [very important!] (everything here concerns Fields with String values)

These options determine how each field is indexed.

2.4.1 options for indexing

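Field.Index.ANALYZED: at index time, use the Analyzer to break the field value into tokens
Field.Index.NOT_ANALYZED: do not run the Analyzer (suitable for searching URLs, IDs, and the like)
Field.Index.ANALYZED_NO_NORMS: like Field.Index.ANALYZED, but norms (used for boosting) are not stored
Field.Index.NOT_ANALYZED_NO_NORMS
Field.Index.NO: do not index this field's value at all, so it cannot be found by searching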
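
The difference between ANALYZED and NOT_ANALYZED is easier to see on a concrete value. The sketch below is a stand-in in plain Java, not Lucene's analysis chain: the class ToyIndexOptions, its two methods, and the split-on-non-alphanumerics rule are assumptions chosen only to show which terms would end up in the index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for two index options; real Analyzers behave differently in detail.
class ToyIndexOptions {
    // Stand-in for ANALYZED: lowercase, then split on anything that is not a letter or digit.
    static List<String> analyzed(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Stand-in for NOT_ANALYZED: the whole value is indexed as one single term.
    static List<String> notAnalyzed(String value) {
        List<String> tokens = new ArrayList<>();
        tokens.add(value);
        return tokens;
    }
}
```

With a made-up value such as "http://example.com/Doc-42", the analyzed form yields several small terms (http, example, com, doc, 42), so an exact lookup of the full URL would not match any single term. The not-analyzed form keeps the URL as one term, which is why it suits URLs and IDs.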

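When building the inverted index, Lucene by default stores everything needed to implement the Vector Space Model, e.g. how many times each term occurs in each document and the position of every occurrence. But if you know a field only needs Boolean searching (i.e. only for filtering), such as a date, you can call Field.setOmitTermFreqAndPositions(true) to save space.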

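2.4.2 options for storing: decide whether the field's full value is stored; if it is, the value can be retrieved again at search time

Field.Store.YES
Field.Store.NO: e.g. if you only need to retrieve a web page's URL and not the full page content, the content does not have to be stored.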

2.4.3 options for using term vectors

On what a term vector is: a mix of a stored field and an indexed field (p. 44).

For what term vectors can be used for, see Section 5.9 (finding similar documents, clustering).

TermVector.YES: stores all unique terms in a Doc and their counts.
TermVector.WITH_POSITIONS: TermVector.YES + the position of every occurrence of each term
TermVector.WITH_OFFSETS: TermVector.YES + the start/end character offsets of every occurrence of each term
TermVector.WITH_POSITIONS_OFFSETS
TermVector.NO
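
As a rough picture of what a term vector holds for one document, here is a plain-Java sketch. It is a toy under stated assumptions: ToyTermVector and build are hypothetical names, tokenization is a naive lowercase/whitespace split, and only token positions are tracked (no character offsets).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-document term vector: term -> token positions of each occurrence.
// The list size is the term frequency (roughly TermVector.YES);
// the list contents are the positions (roughly TermVector.WITH_POSITIONS).
class ToyTermVector {
    static Map<String, List<Integer>> build(String text) {
        Map<String, List<Integer>> vector = new TreeMap<>();
        int position = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            vector.computeIfAbsent(token, k -> new ArrayList<>()).add(position++);
        }
        return vector;
    }
}
```

Unlike the inverted index, which goes from a term to documents, this structure goes from one document to its terms, which is what makes comparing two documents for similarity practical.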

2.4.4 Reader, TokenStream, byte[] field value (using non-String field values)

Reader: if holding the full String in memory is too costly
TokenStream: for preanalyzed fields
binary values: for storing

2.4.5 Field option combination

2.4.6 How to index a field correctly so that it can be sorted on later

For fields with numeric values, remember to use NumericField
For textual fields, remember to…

2.4.7 Multivalued fields

What if the author field has several authors? Just add multiple Fields with the same field name, each with a different field value.

2.5 Boosting docs and fields (weighting; skipped for now)

“controls how important specific fields and documents are during Lucene’s scoring”

2.6 Indexing non-text values (numbers, dates, times)

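Case 1: numbers appear inside the text and you want to search on them later. Pick an Analyzer that does not discard numbers.
Case 2: a field holds a single number and you want to filter, range-search, or sort on it later.
Use NumericField:

doc.add(new NumericField("price").setDoubleValue(19.99));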
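
One way to see why case 2 needs numeric handling rather than plain string terms: strings compare character by character, so their order disagrees with numeric order. The following plain-Java sketch only illustrates that ordering problem. NumericOrderingDemo and the sample prices are made up, and this is not how NumericField is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo: the same price values sorted as text versus as numbers.
class NumericOrderingDemo {
    // Lexicographic order, which is what plain string terms would give.
    static List<String> sortAsStrings(List<String> prices) {
        List<String> copy = new ArrayList<>(prices);
        Collections.sort(copy);
        return copy;
    }

    // Numeric order, which is what range searches and sorting on prices actually need.
    static List<Double> sortAsNumbers(List<String> prices) {
        List<Double> numbers = new ArrayList<>();
        for (String p : prices) {
            numbers.add(Double.parseDouble(p));
        }
        Collections.sort(numbers);
        return numbers;
    }
}
```

For the sample values "9.99", "19.99", and "100.0", the string sort puts "100.0" first and "9.99" last, while the numeric sort gives 9.99, 19.99, 100.0. A range filter built on string comparison would go wrong in the same way.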

Ch 3: Adding search to your application

Lucene in Action notes