在Python自然語言處理套件 nltk 中,在 nltk.util 模組中,有一段用來產生 ngrams 的程式碼,我們先用個例子來看看它的作用是什麼。

作用

以 bigrams(就是2-grams的意思)為例,假設你給他一個string list的輸入,那麽他會輸出一個tuple list,其中每個tuple都是由緊鄰的兩個string所組成,從第一個string到最後一個string。

1
2
3
from nltk.util import bigrams
list(bigrams['I', 'have', 'a', 'dream', '.'])
# 會得到 [('I', 'have'), ('have', 'a'), ('a', 'dream'), ('dream', '.')]

事實上,根據bigrams可以接受任何 Sequence type 或 Iterator type 的值作為輸入。

Python 中的 Sequence type 包含:str, list, tuple, range, bytes, bytearray 等等。
而Python中的 Iterator type 則是指那些有定義 __iter__()__next__() 方法的物件。

分析

大圖像:6個函數

1
2
3
4
5
6
7
8
pad_sequence(  
sequence,
n,
pad_left=False,
pad_right=False,
left_pad_symbol=None,
right_pad_symbol=None
)

指定是否要在輸入的sequence最前或最後加上識別元素,例如可用來標記句首和句尾,再把加工後的 sequence 丟進iter(),以產生一個 Iterator

  • ngrams():
  • bigrams()
  • trigrams()
  • everygrams()
  • skipgrams()

完整原始碼

官網原始碼連結

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
##########################################################################
# Ngram iteration
##########################################################################

def pad_sequence(sequence, n, pad_left=False, pad_right=False,
left_pad_symbol=None, right_pad_symbol=None):
"""
Returns a padded sequence of items before ngram extraction.

>>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
['<s>', 1, 2, 3, 4, 5, '</s>']
>>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
['<s>', 1, 2, 3, 4, 5]
>>> list(pad_sequence([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[1, 2, 3, 4, 5, '</s>']

:param sequence: the source data to be padded
:type sequence: sequence or iter
:param n: the degree of the ngrams
:type n: int
:param pad_left: whether the ngrams should be left-padded
:type pad_left: bool
:param pad_right: whether the ngrams should be right-padded
:type pad_right: bool
:param left_pad_symbol: the symbol to use for left padding (default is None)
:type left_pad_symbol: any
:param right_pad_symbol: the symbol to use for right padding (default is None)
:type right_pad_symbol: any
:rtype: sequence or iter
"""
sequence = iter(sequence)
if pad_left:
sequence = chain((left_pad_symbol,) * (n-1), sequence)
if pad_right:
sequence = chain(sequence, (right_pad_symbol,) * (n-1))
return sequence

# add a flag to pad the sequence so we get peripheral ngrams?

def ngrams(sequence, n, pad_left=False, pad_right=False,
left_pad_symbol=None, right_pad_symbol=None):
"""
Return the ngrams generated from a sequence of items, as an iterator.
For example:

>>> from nltk.util import ngrams
>>> list(ngrams([1,2,3,4,5], 3))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]

Wrap with list for a list version of this function. Set pad_left
or pad_right to true in order to get additional ngrams:

>>> list(ngrams([1,2,3,4,5], 2, pad_right=True))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
>>> list(ngrams([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5)]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]


:param sequence: the source data to be converted into ngrams
:type sequence: sequence or iter
:param n: the degree of the ngrams
:type n: int
:param pad_left: whether the ngrams should be left-padded
:type pad_left: bool
:param pad_right: whether the ngrams should be right-padded
:type pad_right: bool
:param left_pad_symbol: the symbol to use for left padding (default is None)
:type left_pad_symbol: any
:param right_pad_symbol: the symbol to use for right padding (default is None)
:type right_pad_symbol: any
:rtype: sequence or iter
"""
sequence = pad_sequence(sequence, n, pad_left, pad_right,
left_pad_symbol, right_pad_symbol)

history = []
while n > 1:
history.append(next(sequence))
n -= 1
for item in sequence:
history.append(item)
yield tuple(history)
del history[0]

def bigrams(sequence, **kwargs):
"""
Return the bigrams generated from a sequence of items, as an iterator.
For example:

>>> from nltk.util import bigrams
>>> list(bigrams([1,2,3,4,5]))
[(1, 2), (2, 3), (3, 4), (4, 5)]

Use bigrams for a list version of this function.

:param sequence: the source data to be converted into bigrams
:type sequence: sequence or iter
:rtype: iter(tuple)
"""

for item in ngrams(sequence, 2, **kwargs):
yield item

def trigrams(sequence, **kwargs):
"""
Return the trigrams generated from a sequence of items, as an iterator.
For example:

>>> from nltk.util import trigrams
>>> list(trigrams([1,2,3,4,5]))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]

Use trigrams for a list version of this function.

:param sequence: the source data to be converted into trigrams
:type sequence: sequence or iter
:rtype: iter(tuple)
"""

for item in ngrams(sequence, 3, **kwargs):
yield item

def everygrams(sequence, min_len=1, max_len=-1, **kwargs):
"""
Returns all possible ngrams generated from a sequence of items, as an iterator.

>>> sent = 'a b c'.split()
>>> list(everygrams(sent))
[('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
>>> list(everygrams(sent, max_len=2))
[('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c')]

:param sequence: the source data to be converted into trigrams
:type sequence: sequence or iter
:param min_len: minimum length of the ngrams, aka. n-gram order/degree of ngram
:type min_len: int
:param max_len: maximum length of the ngrams (set to length of sequence by default)
:type max_len: int
:rtype: iter(tuple)
"""

if max_len == -1:
max_len = len(sequence)
for n in range(min_len, max_len+1):
for ng in ngrams(sequence, n, **kwargs):
yield ng

def skipgrams(sequence, n, k, **kwargs):
"""
Returns all possible skipgrams generated from a sequence of items, as an iterator.
Skipgrams are ngrams that allows tokens to be skipped.
Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]

:param sequence: the source data to be converted into trigrams
:type sequence: sequence or iter
:param n: the degree of the ngrams
:type n: int
:param k: the skip distance
:type k: int
:rtype: iter(tuple)
"""

# Pads the sequence as desired by **kwargs.
if 'pad_left' in kwargs or 'pad_right' in kwargs:
sequence = pad_sequence(sequence, n, **kwargs)

# Note when iterating through the ngrams, the pad_right here is not
# the **kwargs padding, it's for the algorithm to detect the SENTINEL
# object on the right pad to stop inner loop.
SENTINEL = object()
for ngram in ngrams(sequence, n + k, pad_right=True, right_pad_symbol=SENTINEL):
head = ngram[:1]
tail = ngram[1:]
for skip_tail in combinations(tail, n - 1):
if skip_tail[-1] is SENTINEL:
continue
yield head + skip_tail