matchzoo.processor_units package¶

Submodules¶

matchzoo.processor_units.chain_transform module¶

Wrapper function that organizes a number of transform functions.

matchzoo.processor_units.chain_transform.chain_transform(units)¶: Compose unit transformations into a single function.
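
The composition described above can be illustrated with a small, self-contained sketch. It does not import MatchZoo: `chain_transform_sketch`, `LowercaseSketch` and `DropShortSketch` are hypothetical names, and the sketch assumes that each unit exposes a `transform` method and that units are applied from left to right.

```python
from functools import reduce


class LowercaseSketch:
    """Hypothetical stand-in for a unit exposing ``transform``."""

    def transform(self, tokens):
        return [token.lower() for token in tokens]


class DropShortSketch:
    """Hypothetical unit that drops tokens shorter than two characters."""

    def transform(self, tokens):
        return [token for token in tokens if len(token) >= 2]


def chain_transform_sketch(units):
    """Compose the ``transform`` methods of ``units`` from left to right."""

    def composed(value):
        # Feed the output of each unit into the next one.
        return reduce(lambda acc, unit: unit.transform(acc), units, value)

    return composed


pipeline = chain_transform_sketch([LowercaseSketch(), DropShortSketch()])
result = pipeline(['A', 'Quick', 'TEST'])
print(result)  # ['quick', 'test']
```

Returning a single callable lets a whole pre-processing pipeline be passed around and applied like any other function.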

matchzoo.processor_units.processor_units module¶

MatchZoo toolkit for text pre-processing.

class matchzoo.processor_units.processor_units.DigitRemovalUnit¶

Bases: matchzoo.processor_units.processor_units.ProcessorUnit

Process unit to remove digits.

transform(tokens)¶

Remove digits from a list of tokens.

Parameters:	tokens (`list`) -- list of tokens to be filtered.
Returns:	list of tokens without digits.
Return type:	`list`
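
A minimal sketch of this kind of filter, assuming that "digits" means tokens made up entirely of digit characters; the library may apply a broader numeric test. `remove_digit_tokens` is a hypothetical name, not part of MatchZoo.

```python
def remove_digit_tokens(tokens):
    """Keep only the tokens that are not made up entirely of digits."""
    # Assumption: mixed tokens such as 'b2b' are kept.
    return [token for token in tokens if not token.isdigit()]


filtered = remove_digit_tokens(['room', '42', 'b2b', '7'])
print(filtered)  # ['room', 'b2b']
```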

class matchzoo.processor_units.processor_units.FixedLengthUnit(text_length, pad_value=0, pad_mode='pre', truncate_mode='pre')¶

Bases: matchzoo.processor_units.processor_units.ProcessorUnit

FixedLengthUnit Class.

Process unit to get the fixed length text.

Examples

>>> fixedlen = FixedLengthUnit(3)
>>> fixedlen.transform(range(1, 6)) == [3, 4, 5]
True
>>> fixedlen = FixedLengthUnit(3)
>>> fixedlen.transform(range(1, 3)) == [0, 1, 2]
True

transform(tokens)¶

Transform a list of tokens into fixed-length text.

Parameters:	tokens (`list`) -- list of tokens.
Returns:	list of tokens with a fixed length.
Return type:	`list`
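
The padding and truncation behaviour shown in the examples above can be sketched as a plain function. `fix_length` is a hypothetical helper; the `'pre'` behaviour is inferred from the documented examples, while the `'post'` behaviour is an assumption.

```python
def fix_length(tokens, text_length, pad_value=0,
               pad_mode='pre', truncate_mode='pre'):
    """Pad or truncate ``tokens`` so the result has ``text_length`` items."""
    tokens = list(tokens)
    if len(tokens) > text_length:
        if truncate_mode == 'pre':
            # Drop items from the front, keeping the last ``text_length``.
            return tokens[len(tokens) - text_length:]
        # Assumed 'post' behaviour: drop items from the back.
        return tokens[:text_length]
    padding = [pad_value] * (text_length - len(tokens))
    if pad_mode == 'pre':
        # Pad at the front.
        return padding + tokens
    # Assumed 'post' behaviour: pad at the back.
    return tokens + padding


print(fix_length(range(1, 6), 3))  # [3, 4, 5]
print(fix_length(range(1, 3), 3))  # [0, 1, 2]
```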

class matchzoo.processor_units.processor_units.FrequencyFilterUnit(low=0, high=inf, mode='df')¶

Bases: matchzoo.processor_units.processor_units.StatefulProcessorUnit

Frequency filter unit.

Parameters:
	low -- Lower bound, inclusive.
	high -- Upper bound, exclusive.
	mode -- One of `tf` (term frequency), `df` (document frequency), and `idf` (inverse document frequency).

Examples

>>> import matchzoo as mz

To filter based on term frequency (tf):

>>> tf_filter = mz.processor_units.FrequencyFilterUnit(
...     low=2, mode='tf')
>>> tf_filter.fit([['A', 'B', 'B'], ['C', 'C', 'C']])
>>> tf_filter.transform(['A', 'B', 'C'])
['B', 'C']

To filter based on document frequency (df):

>>> df_filter = mz.processor_units.FrequencyFilterUnit(
...     low=2, mode='df')
>>> df_filter.fit([['A', 'B'], ['B', 'C']])
>>> df_filter.transform(['A', 'B', 'C'])
['B']

To filter based on inverse document frequency (idf):

>>> idf_filter = mz.processor_units.FrequencyFilterUnit(
...     low=1.2, mode='idf')
>>> idf_filter.fit([['A', 'B'], ['B', 'C', 'D']])
>>> idf_filter.transform(['A', 'B', 'C'])
['A', 'C']

fit(list_of_tokens)¶: Fit `list_of_tokens` by computing the statistics required by the selected mode.

transform(tokens)¶

Transform a list of tokens by filtering out unwanted words.

Return type:	`list`
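
The fit-then-filter pattern can be sketched without MatchZoo as follows. `FrequencyFilterSketch` is a hypothetical class that covers only the `tf` and `df` modes; the `idf` mode is left out because the exact formula is not stated on this page.

```python
from collections import Counter


class FrequencyFilterSketch:
    """Keep tokens whose fitted frequency lies in ``[low, high)``."""

    def __init__(self, low=0, high=float('inf'), mode='tf'):
        if mode not in ('tf', 'df'):
            raise ValueError("this sketch only covers 'tf' and 'df'")
        self.low, self.high, self.mode = low, high, mode
        self.kept = set()

    def fit(self, list_of_tokens):
        counts = Counter()
        for tokens in list_of_tokens:
            # tf counts every occurrence; df counts each document once.
            counts.update(tokens if self.mode == 'tf' else set(tokens))
        self.kept = {term for term, n in counts.items()
                     if self.low <= n < self.high}

    def transform(self, tokens):
        return [token for token in tokens if token in self.kept]


tf_sketch = FrequencyFilterSketch(low=2, mode='tf')
tf_sketch.fit([['A', 'B', 'B'], ['C', 'C', 'C']])
print(tf_sketch.transform(['A', 'B', 'C']))  # ['B', 'C']
```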

class matchzoo.processor_units.processor_units.LemmatizationUnit¶

Bases: matchzoo.processor_units.processor_units.ProcessorUnit

Process unit for token lemmatization.

transform(tokens)¶

Lemmatize a sequence of tokens.

Parameters:	tokens (`list`) -- list of tokens to be lemmatized.
Returns:	list of lemmatized tokens.
Return type:	`list`

class matchzoo.processor_units.processor_units.LowercaseUnit¶

Bases: matchzoo.processor_units.processor_units.ProcessorUnit

Process unit for lower-casing text.

transform(tokens)¶

Convert a list of tokens to lower case.

Parameters:	tokens (`list`) -- list of tokens.
Returns:	lower-cased list of tokens.
Return type:	`list`

class matchzoo.processor_units.processor_units.MatchingHistogramUnit(bin_size=30, embedding_matrix=None, normalize=True, mode='LCH')¶

Bases: matchzoo.processor_units.processor_units.ProcessorUnit

MatchingHistogramUnit Class.

Parameters:
	bin_size (`int`) -- The number of bins in the matching histogram.
	embedding_matrix -- The word embedding matrix used to calculate the matching histogram.
	normalize -- Boolean; whether to normalize the embedding.
	mode (`str`) -- The type of the histogram; it should be one of 'CH', 'NH', or 'LCH'.

Examples

>>> import numpy as np
>>> embedding_matrix = np.array([[1.0, -1.0], [1.0, 2.0], [1.0, 3.0]])
>>> text_left = [0, 1]
>>> text_right = [1, 2]
>>> histogram = MatchingHistogramUnit(3, embedding_matrix, True, 'CH')
>>> histogram.transform([text_left, text_right])
[[3.0, 1.0, 1.0], [1.0, 2.0, 2.0]]

transform(text_pair)¶

Transform the input text.

Return type:	`list`
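
The count-based ('CH') histogram can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: embeddings are L2-normalized, each cosine similarity in [-1, 1] is mapped linearly onto a bin index, and every bin starts at 1.0, which is consistent with the output documented above. `l2_normalize` and `count_histogram` are hypothetical names, and zero-length embedding rows are not handled.

```python
import math


def l2_normalize(vector):
    """Scale ``vector`` to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def count_histogram(text_left, text_right, embeddings, bin_size):
    """Build one similarity histogram per term of ``text_left``."""
    vectors = [l2_normalize(row) for row in embeddings]
    histogram = []
    for left in text_left:
        # Assumption: every bin starts at 1.0.
        bins = [1.0] * bin_size
        for right in text_right:
            similarity = sum(
                a * b for a, b in zip(vectors[left], vectors[right]))
            # Map a similarity in [-1, 1] linearly onto a bin index.
            index = int((similarity + 1.0) / 2.0 * (bin_size - 1))
            # Clamp to guard against floating-point overshoot.
            bins[min(max(index, 0), bin_size - 1)] += 1.0
        histogram.append(bins)
    return histogram


embeddings = [[1.0, -1.0], [1.0, 2.0], [1.0, 3.0]]
first_row = count_histogram([0], [1, 2], embeddings, bin_size=3)[0]
print(first_row)  # [3.0, 1.0, 1.0]
```

Under these assumptions each row sums to `bin_size` plus the number of right-hand terms, which gives a quick sanity check on the output.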

class matchzoo.processor_units.processor_units.NgramLetterUnit(ngram=3, reduce_dim=True)¶

Bases: matchzoo.processor_units.processor_units.ProcessorUnit

Process unit for n-letter generation.

Triletter is used in DSSMModel. This processor is expected to execute before Vocab has been created.

Examples

>>> triletter = NgramLetterUnit()
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
9
>>> rv
['#he', 'hel', 'ell', 'llo', 'lo#', '#wo', 'wor', 'ord', 'rd#']
>>> triletter = NgramLetterUnit(reduce_dim=False)
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
2
>>> rv
[['#he', 'hel', 'ell', 'llo', 'lo#'], ['#wo', 'wor', 'ord', 'rd#']]

transform(tokens)¶

Transform tokens into tri-letters.

For example, `word` should be represented as `#wo`, `wor`, `ord` and `rd#`.

Parameters:	tokens (`list`) -- list of tokens to be transformed.
Returns:	generated n_letters.
Return type:	`list`
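
The n-letter generation can be sketched as a sliding window over each token wrapped in `#` boundary markers. `ngram_letters` is a hypothetical helper written to reproduce the documented examples; it is not the library's code.

```python
def ngram_letters(tokens, ngram=3, reduce_dim=True):
    """Split each token into overlapping letter n-grams."""
    per_token = []
    for token in tokens:
        # Mark word boundaries so prefixes and suffixes stay distinguishable.
        padded = '#' + token + '#'
        grams = [padded[i:i + ngram]
                 for i in range(len(padded) - ngram + 1)]
        per_token.append(grams)
    if reduce_dim:
        # Flatten into a single list of n-grams.
        return [gram for grams in per_token for gram in grams]
    return per_token


print(ngram_letters(['word']))  # ['#wo', 'wor', 'ord', 'rd#']
```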

class matchzoo.processor_units.processor_units.ProcessorUnit¶

Bases: object

Processor unit that does not persist state (i.e. does not need `fit`).

transform(input)¶: Abstract base method; must be implemented in a subclass.

class matchzoo.processor_units.processor_units.PuncRemovalUnit¶

Bases: matchzoo.processor_units.processor_units.ProcessorUnit

Process unit for removing punctuation.

transform(tokens)¶

Remove punctuation from a list of tokens.

Parameters:	tokens (`list`) -- list of tokens.
Returns:	tokens without punctuation.
Return type:	`list`

class matchzoo.processor_units.processor_units.StatefulProcessorUnit¶

Bases: matchzoo.processor_units.processor_units.ProcessorUnit

Processor unit that persists state (i.e. needs `fit`).

fit(input)¶: Abstract base method; must be implemented in a subclass.

state¶: Get current state.
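
The split between stateless and stateful units can be sketched with Python's `abc` module. All class names below are hypothetical, and storing the state in a dictionary is an assumption suggested by the `state[...]` lookups used elsewhere on this page.

```python
from abc import ABC, abstractmethod


class ProcessorUnitSketch(ABC):
    """Stateless unit: subclasses only implement ``transform``."""

    @abstractmethod
    def transform(self, input):
        ...


class StatefulProcessorUnitSketch(ProcessorUnitSketch):
    """Stateful unit: ``fit`` must run before ``transform``."""

    def __init__(self):
        self._state = {}

    @property
    def state(self):
        return self._state

    @abstractmethod
    def fit(self, input):
        ...


class MaxLengthUnit(StatefulProcessorUnitSketch):
    """Hypothetical unit that learns the longest token length seen."""

    def fit(self, input):
        self._state['max_length'] = max(len(token) for token in input)

    def transform(self, input):
        limit = self._state['max_length']
        # Cut every token down to the fitted maximum length.
        return [token[:limit] for token in input]


unit = MaxLengthUnit()
unit.fit(['ab', 'abcd'])
print(unit.state)  # {'max_length': 4}
print(unit.transform(['abcdefg', 'x']))  # ['abcd', 'x']
```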

class matchzoo.processor_units.processor_units.StemmingUnit(stemmer='porter')¶

Bases: matchzoo.processor_units.processor_units.ProcessorUnit

Process unit for token stemming.

transform(tokens)¶

Reduce inflected words to their word stem, base, or root form.

Parameters:
	tokens (`list`) -- list of strings to be stemmed.
	stemmer -- stemmer to use, `porter` or `lancaster`.
Raises:	ValueError -- stemmer type should be `porter` or `lancaster`.
Returns:	list of stemmed tokens.
Return type:	`list`

class matchzoo.processor_units.processor_units.StopRemovalUnit(lang='english')¶

Bases: matchzoo.processor_units.processor_units.ProcessorUnit

Process unit to remove stop words.

Example

>>> unit = StopRemovalUnit()
>>> unit.transform(['a', 'the', 'test'])
['test']
>>> type(unit.stopwords)
<class 'list'>

stopwords¶

Get stopwords based on language.

Parameters:	lang -- language code.
Return type:	`list`
Returns:	list of stop words.

transform(tokens)¶

Remove stopwords from a list of tokens.

Parameters:
	tokens (`list`) -- list of tokens.
	lang -- language code for stopwords.
Returns:	list of tokens without stopwords.
Return type:	`list`
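
Stop-word removal amounts to a membership filter. The sketch below uses a deliberately tiny, hypothetical stop-word list (`SAMPLE_STOPWORDS`) instead of a language-specific one, and `remove_stopwords` is not a MatchZoo function.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
SAMPLE_STOPWORDS = ['a', 'an', 'the', 'of']


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in ``stopwords``."""
    # A set keeps each membership check constant-time on average.
    stopword_set = set(stopwords)
    return [token for token in tokens if token not in stopword_set]


print(remove_stopwords(['a', 'the', 'test'], SAMPLE_STOPWORDS))  # ['test']
```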

class matchzoo.processor_units.processor_units.TokenizeUnit¶

Bases: matchzoo.processor_units.processor_units.ProcessorUnit

Process unit for text tokenization.

transform(input)¶

Convert raw text input into a list of tokens.

Parameters:	input (`str`) -- raw textual input.
Returns:	tokens as a list.
Return type:	`list`

class matchzoo.processor_units.processor_units.VocabularyUnit¶

Bases: matchzoo.processor_units.processor_units.StatefulProcessorUnit

Vocabulary class.

Examples

>>> vocab = VocabularyUnit()
>>> vocab.fit(['A', 'B', 'C', 'D', 'E'])
>>> term_index = vocab.state['term_index']
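>>> term_index  # illustrative output; the actual index assignment may vary
{'E': 1, 'C': 2, 'D': 3, 'A': 4, 'B': 5}
>>> index_term = vocab.state['index_term']
>>> index_term  # illustrative output; the actual index assignment may vary
{1: 'C', 2: 'A', 3: 'E', 4: 'B', 5: 'D'}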

>>> term_index['out-of-vocabulary-term']
0
>>> index_term[0]
''
>>> index_term[42]
Traceback (most recent call last):
    ...
KeyError: 42

>>> a_index = term_index['A']
>>> c_index = term_index['C']
>>> vocab.transform(['C', 'A', 'C']) == [c_index, a_index, c_index]
True
>>> vocab.transform(['C', 'A', 'OOV']) == [c_index, a_index, 0]
True

>>> indices = vocab.transform('ABCDDZZZ')
>>> ''.join(vocab.state['index_term'][i] for i in indices)
'ABCDD'

class IndexTerm¶

Bases: dict

Map index to term.

class TermIndex¶

Bases: dict

Map term to index.

fit(tokens)¶: Build a TermIndex and an IndexTerm.

transform(tokens)¶

Transform a list of tokens to corresponding indices.

Return type:	`list`
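
The behaviour shown in the examples above (index 0 reserved for out-of-vocabulary terms, a `KeyError` for unknown indices) can be sketched with a `dict` subclass that defines `__missing__`. `TermIndexSketch` and `VocabularySketch` are hypothetical names, and sorting the terms before numbering them is an assumption made here so that the result is reproducible.

```python
class TermIndexSketch(dict):
    """Map term to index, returning 0 for unknown terms."""

    def __missing__(self, key):
        return 0


class VocabularySketch:
    """Build term/index mappings and convert tokens to indices."""

    def __init__(self):
        self.state = {}

    def fit(self, tokens):
        term_index = TermIndexSketch()
        # Index 0 is reserved and maps back to the empty string.
        index_term = {0: ''}
        # Assumption: sorted() fixes the numbering for reproducibility.
        for index, term in enumerate(sorted(set(tokens)), start=1):
            term_index[term] = index
            index_term[index] = term
        self.state = {'term_index': term_index, 'index_term': index_term}

    def transform(self, tokens):
        return [self.state['term_index'][token] for token in tokens]


vocab = VocabularySketch()
vocab.fit(['A', 'B', 'C', 'D', 'E'])
print(vocab.transform(['C', 'A', 'OOV']))  # [3, 1, 0]
```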

class matchzoo.processor_units.processor_units.WordHashingUnit(term_index)¶

Bases: matchzoo.processor_units.processor_units.ProcessorUnit

Word-hashing layer for DSSM-based models.

The input of WordHashingUnit should be a list of word sub-letter lists extracted from one document. The output is the word-hashing representation of this document.

NgramLetterUnit and VocabularyUnit are two essential prerequisites of WordHashingUnit.

Examples

>>> letters = [['#te', 'tes','est', 'st#'], ['oov']]
>>> word_hashing = WordHashingUnit(
...     term_index={'': 0,'st#': 1, '#te': 2, 'est': 3, 'tes': 4})
>>> hashing = word_hashing.transform(letters)
>>> hashing[0]
array([0., 1., 1., 1., 1., 0.])
>>> hashing[1]
array([1., 0., 0., 0., 0., 0.])
>>> hashing.shape
(2, 6)

transform(terms)¶

Transform a list of letters into the word hashing layer.

Parameters:	terms (`list`) -- list of tri_letters generated by `NgramLetterUnit`.
Return type:	`ndarray`
Returns:	Word hashing representation of tri-letters.
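
A dependency-free sketch of the word-hashing step, returning nested lists rather than an `ndarray`. `word_hashing` is a hypothetical function; the vector width of `len(term_index) + 1` is an assumption chosen to match the `(2, 6)` shape in the example above, and accumulating counts for repeated letters is likewise an assumption.

```python
from collections import Counter


def word_hashing(terms, term_index):
    """Turn each list of tri-letters into a count vector."""
    # Assumed width, matching the documented (2, 6) output shape.
    size = len(term_index) + 1
    rows = []
    for letters in terms:
        row = [0.0] * size
        for letter, count in Counter(letters).items():
            # Unknown tri-letters fall back to index 0.
            row[term_index.get(letter, 0)] += float(count)
        rows.append(row)
    return rows


letters = [['#te', 'tes', 'est', 'st#'], ['oov']]
term_index = {'': 0, 'st#': 1, '#te': 2, 'est': 3, 'tes': 4}
hashing = word_hashing(letters, term_index)
print(hashing[0])  # [0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
print(hashing[1])  # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```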

matchzoo.processor_units.processor_units.list_available()¶: List all available units.
