matchzoo.preprocessors.units package¶

Submodules¶

matchzoo.preprocessors.units.digit_removal module¶

class matchzoo.preprocessors.units.digit_removal.DigitRemoval¶

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit to remove digits.

transform(input_)¶

Remove digits from list of tokens.

Parameters:	input_ – list of tokens to be filtered.
Return tokens:	list of tokens without digits.
Return type:	`list`
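
The filtering described above can be sketched in plain Python. This is an illustration only, not MatchZoo's implementation: the helper name `remove_digit_tokens` is hypothetical, and the sketch assumes that a "digit" token is one for which `str.isdigit()` returns true.

```python
def remove_digit_tokens(tokens):
    """Return the tokens that are not made up entirely of digits.

    Assumption of this sketch: a token counts as a digit token
    when str.isdigit() is True for it.
    """
    return [token for token in tokens if not token.isdigit()]


filtered = remove_digit_tokens(['room', '101', 'is', 'open', '7'])
print(filtered)  # ['room', 'is', 'open']
```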

matchzoo.preprocessors.units.fixed_length module¶

class matchzoo.preprocessors.units.fixed_length.FixedLength(text_length, pad_value=0, pad_mode='pre', truncate_mode='pre')¶

Bases: matchzoo.preprocessors.units.unit.Unit

FixedLengthUnit Class.

Process unit to get the fixed length text.

Examples

>>> from matchzoo.preprocessors.units import FixedLength
>>> fixedlen = FixedLength(3)
>>> fixedlen.transform(list(range(1, 6))) == [3, 4, 5]
True
>>> fixedlen.transform(list(range(1, 3))) == [0, 1, 2]
True

transform(input_)¶

Transform list of tokenized tokens into the fixed length text.

Parameters:	input_ – list of tokenized tokens.
Return tokens:	list of tokenized tokens in fixed length.
Return type:	`list`
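
The padding and truncation behaviour shown in the examples can be sketched as a standalone function. The name `fixed_length` is hypothetical and this is not the library's code. The `'pre'` branches reproduce the documented examples; the `'post'` branches are an assumption that mirrors them at the other end of the list.

```python
def fixed_length(tokens, text_length, pad_value=0,
                 pad_mode='pre', truncate_mode='pre'):
    """Pad or truncate `tokens` to exactly `text_length` items."""
    if len(tokens) >= text_length:
        if truncate_mode == 'pre':
            # Drop items from the front and keep the last `text_length`.
            return list(tokens[len(tokens) - text_length:])
        # 'post' (assumed): drop items from the end instead.
        return list(tokens[:text_length])
    padding = [pad_value] * (text_length - len(tokens))
    if pad_mode == 'pre':
        # Padding goes in front of the tokens.
        return padding + list(tokens)
    # 'post' (assumed): padding goes after the tokens.
    return list(tokens) + padding


print(fixed_length([1, 2, 3, 4, 5], 3))  # [3, 4, 5]
print(fixed_length([1, 2], 3))           # [0, 1, 2]
```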

matchzoo.preprocessors.units.frequency_filter module¶

class matchzoo.preprocessors.units.frequency_filter.FrequencyFilter(low=0, high=inf, mode='df')¶

Bases: matchzoo.preprocessors.units.stateful_unit.StatefulUnit

Frequency filter unit.

Parameters:	low (`float`) – Lower bound, inclusive.
	high (`float`) – Upper bound, exclusive.
	mode (`str`) – One of tf (term frequency), df (document frequency), and idf (inverse document frequency).

Examples

>>> import matchzoo as mz

To filter based on term frequency (tf):

>>> tf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=2, mode='tf')
>>> tf_filter.fit([['A', 'B', 'B'], ['C', 'C', 'C']])
>>> tf_filter.transform(['A', 'B', 'C'])
['B', 'C']

To filter based on document frequency (df):

>>> df_filter = mz.preprocessors.units.FrequencyFilter(
...     low=2, mode='df')
>>> df_filter.fit([['A', 'B'], ['B', 'C']])
>>> df_filter.transform(['A', 'B', 'C'])
['B']

To filter based on inverse document frequency (idf):

>>> idf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=1.2, mode='idf')
>>> idf_filter.fit([['A', 'B'], ['B', 'C', 'D']])
>>> idf_filter.transform(['A', 'B', 'C'])
['A', 'C']

fit(list_of_tokens)¶: Fit list_of_tokens by calculating mode states.

transform(input_)¶

Transform a list of tokens by filtering out unwanted words.

Return type:	`list`
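
The difference between the tf and df modes can be sketched with `collections.Counter`. The helper names `fit_counts` and `filter_tokens` are hypothetical, the idf mode is left out, and dropping tokens that were never seen during fitting is an assumption of this sketch rather than documented behaviour.

```python
from collections import Counter


def fit_counts(list_of_tokens, mode='df'):
    """Count term frequency ('tf') or document frequency ('df')."""
    counts = Counter()
    for tokens in list_of_tokens:
        # For 'df' each term counts at most once per document;
        # for 'tf' every occurrence counts.
        counts.update(set(tokens) if mode == 'df' else tokens)
    return counts


def filter_tokens(tokens, counts, low=0, high=float('inf')):
    """Keep tokens whose count lies in the half-open range [low, high)."""
    # Assumption: tokens never seen while fitting are always dropped.
    return [t for t in tokens if t in counts and low <= counts[t] < high]


tf_counts = fit_counts([['A', 'B', 'B'], ['C', 'C', 'C']], mode='tf')
print(filter_tokens(['A', 'B', 'C'], tf_counts, low=2))  # ['B', 'C']

df_counts = fit_counts([['A', 'B'], ['B', 'C']], mode='df')
print(filter_tokens(['A', 'B', 'C'], df_counts, low=2))  # ['B']
```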

matchzoo.preprocessors.units.lemmatization module¶

class matchzoo.preprocessors.units.lemmatization.Lemmatization¶

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit for token lemmatization.

transform(input_)¶

Lemmatize a sequence of tokens.

Parameters:	input_ – list of tokens to be lemmatized.
Return tokens:	list of lemmatized tokens.
Return type:	`list`

matchzoo.preprocessors.units.lowercase module¶

class matchzoo.preprocessors.units.lowercase.Lowercase¶

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit to convert text to lower case.

transform(input_)¶

Convert list of tokens to lower case.

Parameters:	input_ – list of tokens.
Return tokens:	lower-cased list of tokens.
Return type:	`list`

matchzoo.preprocessors.units.matching_histogram module¶

class matchzoo.preprocessors.units.matching_histogram.MatchingHistogram(bin_size=30, embedding_matrix=None, normalize=True, mode='LCH')¶

Bases: matchzoo.preprocessors.units.unit.Unit

MatchingHistogramUnit Class.

Parameters:	bin_size (`int`) – The number of bins of the matching histogram.
	embedding_matrix – The word embedding matrix applied to calculate the matching histogram.
	normalize – Boolean, whether to normalize the embedding.
	mode (`str`) – The type of the histogram; it should be one of ‘CH’, ‘NG’, or ‘LCH’.

Examples

>>> import numpy as np
>>> from matchzoo.preprocessors.units import MatchingHistogram
>>> embedding_matrix = np.array([[1.0, -1.0], [1.0, 2.0], [1.0, 3.0]])
>>> text_left = [0, 1]
>>> text_right = [1, 2]
>>> histogram = MatchingHistogram(3, embedding_matrix, True, 'CH')
>>> histogram.transform([text_left, text_right])
[[3.0, 1.0, 1.0], [1.0, 2.0, 2.0]]

transform(input_)¶

Transform the input text.

Return type:	`list`

matchzoo.preprocessors.units.ngram_letter module¶

class matchzoo.preprocessors.units.ngram_letter.NgramLetter(ngram=3, reduce_dim=True)¶

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit for n-letter generation.

Tri-letters are used in DSSMModel. This unit is expected to run before the Vocab is created.

Examples

>>> from matchzoo.preprocessors.units import NgramLetter
>>> triletter = NgramLetter()
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
9
>>> rv
['#he', 'hel', 'ell', 'llo', 'lo#', '#wo', 'wor', 'ord', 'rd#']
>>> triletter = NgramLetter(reduce_dim=False)
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
2
>>> rv
[['#he', 'hel', 'ell', 'llo', 'lo#'], ['#wo', 'wor', 'ord', 'rd#']]

transform(input_)¶

Transform token into tri-letter.

For example, word should be represented as #wo, wor, ord and rd#.

Parameters:	input_ – list of tokens to be transformed.
Return n_letters:	generated n_letters.
Return type:	`list`
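
The n-letter generation shown in the examples can be sketched as a sliding window over each token wrapped in `#` boundary markers. The function name `ngram_letters` is hypothetical; this is an illustration of the documented behaviour, not the library's source.

```python
def ngram_letters(tokens, ngram=3, reduce_dim=True):
    """Split each token into overlapping n-letter chunks.

    With reduce_dim=True the chunks of all tokens are flattened into
    one list; otherwise one list of chunks is returned per token.
    """
    result = []
    for token in tokens:
        marked = '#' + token + '#'  # mark the word boundaries
        grams = [marked[i:i + ngram]
                 for i in range(len(marked) - ngram + 1)]
        if reduce_dim:
            result.extend(grams)
        else:
            result.append(grams)
    return result


print(ngram_letters(['word']))  # ['#wo', 'wor', 'ord', 'rd#']
```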

matchzoo.preprocessors.units.punc_removal module¶

class matchzoo.preprocessors.units.punc_removal.PuncRemoval¶

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit to remove punctuation.

transform(input_)¶

Remove punctuation from a list of tokens.

Parameters:	input_ – list of tokens.
Return rv:	tokens without punctuation.
Return type:	`list`
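
One way to picture this step is a filter that drops tokens consisting only of punctuation characters. The sketch below uses `string.punctuation` from the standard library; the exact rule the unit applies is not stated here, so the ASCII-only character set and the helper name `remove_punctuation_tokens` are assumptions.

```python
import string

# Assumption: "punctuation" means the ASCII set in string.punctuation.
PUNCTUATION = set(string.punctuation)


def remove_punctuation_tokens(tokens):
    """Drop tokens that consist only of punctuation characters."""
    return [
        token for token in tokens
        # Empty tokens are kept; only all-punctuation tokens are dropped.
        if not (token and all(ch in PUNCTUATION for ch in token))
    ]


cleaned = remove_punctuation_tokens(
    ['hello', ',', 'world', '!', "don't", '...'])
print(cleaned)  # ['hello', 'world', "don't"]
```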

matchzoo.preprocessors.units.stateful_unit module¶

class matchzoo.preprocessors.units.stateful_unit.StatefulUnit¶

Bases: matchzoo.preprocessors.units.unit.Unit

Unit with inner state.

Usually needs to be fit before transforming. All information gathered in the fit phase is stored in its context.

context¶: Get current context. Same as unit.state.

fit(input_)¶: Abstract base method; needs to be implemented in a subclass.

state¶

Get current context. Same as unit.context.

Deprecated since v2.2.0, and will be removed in the future. Use unit.context instead.
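
The fit-then-transform pattern can be sketched with stand-in classes. Everything below is hypothetical (`SketchUnit`, `SketchStatefulUnit`, and `KnownTokenFilter` are not MatchZoo classes); the sketch only shows how state gathered in `fit` can be kept in a context dictionary and read back in `transform`.

```python
class SketchUnit:
    """Stand-in for a stateless unit: only transform() is required."""

    def transform(self, input_):
        raise NotImplementedError


class SketchStatefulUnit(SketchUnit):
    """Stand-in for a unit that keeps state in a context dict."""

    def __init__(self):
        self._context = {}

    @property
    def context(self):
        """Return everything gathered during fit()."""
        return self._context

    def fit(self, input_):
        raise NotImplementedError


class KnownTokenFilter(SketchStatefulUnit):
    """Hypothetical unit: keeps only tokens seen during fit()."""

    def fit(self, input_):
        self._context['known'] = set(input_)

    def transform(self, input_):
        known = self._context['known']
        return [token for token in input_ if token in known]


unit = KnownTokenFilter()
unit.fit(['a', 'b', 'c'])
kept = unit.transform(['a', 'z', 'c'])
print(kept)  # ['a', 'c']
```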

matchzoo.preprocessors.units.stemming module¶

class matchzoo.preprocessors.units.stemming.Stemming(stemmer='porter')¶

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit for token stemming.

Parameters:	stemmer – stemmer to use, porter or lancaster.

transform(input_)¶

Reduce inflected words to their word stem, base, or root form.

Parameters:	input_ – list of strings to be stemmed.
Return type:	`list`

matchzoo.preprocessors.units.stop_removal module¶

class matchzoo.preprocessors.units.stop_removal.StopRemoval(lang='english')¶

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit to remove stop words.

Example

>>> from matchzoo.preprocessors.units import StopRemoval
>>> unit = StopRemoval()
>>> unit.transform(['a', 'the', 'test'])
['test']
>>> type(unit.stopwords)
<class 'list'>

stopwords¶

Get stopwords based on language.

Params lang:	language code.
Return type:	`list`
Returns:	list of stop words.

transform(input_)¶

Remove stopwords from list of tokenized tokens.

Parameters:	input_ – list of tokenized tokens. lang – language code for stopwords.
Return tokens:	list of tokenized tokens without stopwords.
Return type:	`list`

matchzoo.preprocessors.units.tokenize module¶

class matchzoo.preprocessors.units.tokenize.Tokenize¶

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit for text tokenization.

transform(input_)¶

Process input data from raw terms to list of tokens.

Parameters:	input_ – raw textual input.
Return tokens:	tokenized tokens as a list.
Return type:	`list`

matchzoo.preprocessors.units.unit module¶

class matchzoo.preprocessors.units.unit.Unit¶

Bases: object

Process unit that does not persist state (i.e. does not need fit).

transform(input_)¶: Abstract base method; needs to be implemented in a subclass.

matchzoo.preprocessors.units.vocabulary module¶

class matchzoo.preprocessors.units.vocabulary.Vocabulary(pad_value='<PAD>', oov_value='<OOV>')¶

Bases: matchzoo.preprocessors.units.stateful_unit.StatefulUnit

Vocabulary class.

Parameters:	pad_value (`str`) – The string value for the padding position.
	oov_value (`str`) – The string value for the out-of-vocabulary terms.

Examples

>>> from matchzoo.preprocessors.units import Vocabulary
>>> vocab = Vocabulary(pad_value='[PAD]', oov_value='[OOV]')
>>> vocab.fit(['A', 'B', 'C', 'D', 'E'])
>>> term_index = vocab.state['term_index']
>>> term_index  # doctest: +SKIP
{'[PAD]': 0, '[OOV]': 1, 'D': 2, 'A': 3, 'B': 4, 'C': 5, 'E': 6}
>>> index_term = vocab.state['index_term']
>>> index_term  # doctest: +SKIP
{0: '[PAD]', 1: '[OOV]', 2: 'D', 3: 'A', 4: 'B', 5: 'C', 6: 'E'}

>>> term_index['out-of-vocabulary-term']
1
>>> index_term[0]
'[PAD]'
>>> index_term[42]
Traceback (most recent call last):
    ...
KeyError: 42
>>> a_index = term_index['A']
>>> c_index = term_index['C']
>>> vocab.transform(['C', 'A', 'C']) == [c_index, a_index, c_index]
True
>>> vocab.transform(['C', 'A', '[OOV]']) == [c_index, a_index, 1]
True
>>> indices = vocab.transform(list('ABCDDZZZ'))
>>> ' '.join(vocab.state['index_term'][i] for i in indices)
'A B C D D [OOV] [OOV] [OOV]'

class TermIndex¶

Bases: dict

Map term to index.

fit(tokens)¶: Build a TermIndex and an IndexTerm.

transform(input_)¶

Transform a list of tokens to corresponding indices.

Return type:	`list`
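
The term-to-index mapping with an out-of-vocabulary fallback can be sketched with a `dict` subclass that defines `__missing__`. The names `TermIndexSketch` and `build_vocab` are hypothetical. Assigning indices in sorted order is a choice made here for reproducibility; the example above skips its printed mappings, which suggests the library does not guarantee an order.

```python
class TermIndexSketch(dict):
    """dict that maps unknown terms to the OOV index (1, as in the example)."""

    def __missing__(self, key):
        # Look-ups of unseen terms fall back to the OOV index
        # without inserting the term into the mapping.
        return 1


def build_vocab(tokens, pad_value='<PAD>', oov_value='<OOV>'):
    """Build term->index and index->term mappings from tokens."""
    term_index = TermIndexSketch({pad_value: 0, oov_value: 1})
    index_term = {0: pad_value, 1: oov_value}
    # Indices 0 and 1 are reserved; real terms start at 2.
    for index, term in enumerate(sorted(set(tokens)), start=2):
        term_index[term] = index
        index_term[index] = term
    return term_index, index_term


term_index, index_term = build_vocab(list('ABCDE'), '[PAD]', '[OOV]')
indices = [term_index[token] for token in 'ABCDDZZZ']
decoded = ' '.join(index_term[i] for i in indices)
print(decoded)  # A B C D D [OOV] [OOV] [OOV]
```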

matchzoo.preprocessors.units.word_hashing module¶

class matchzoo.preprocessors.units.word_hashing.WordHashing(term_index)¶

Bases: matchzoo.preprocessors.units.unit.Unit

Word-hashing layer for DSSM-based models.

The input of WordHashingUnit should be a list of word sub-letter lists extracted from one document. The output is the word-hashing representation of this document.

NgramLetterUnit and VocabularyUnit are two essential prerequisites of WordHashingUnit.

Examples

>>> from matchzoo.preprocessors.units import WordHashing
>>> letters = [['#te', 'tes','est', 'st#'], ['oov']]
>>> word_hashing = WordHashing(
...     term_index={
...      '_PAD': 0, 'OOV': 1, 'st#': 2, '#te': 3, 'est': 4, 'tes': 5
...      })
>>> hashing = word_hashing.transform(letters)
>>> hashing[0]
[0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
>>> hashing[1]
[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

transform(input_)¶

Transform list of letters into word hashing layer.

Parameters:	input_ – list of tri_letters generated by `NgramLetterUnit`.
Return type:	`list`
Returns:	Word hashing representation of tri-letters.
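
The representation in the example above can be read as one count vector per word, with one slot per entry of `term_index`. The sketch below reproduces that example in plain Python. The function name `word_hashing` is hypothetical, and sending unknown letters to index 1 is inferred from the example output rather than stated explicitly.

```python
from collections import Counter


def word_hashing(letters, term_index, oov_index=1):
    """Turn each word's n-letter list into a count vector."""
    vectors = []
    for word_letters in letters:
        vector = [0.0] * len(term_index)
        for letter, count in Counter(word_letters).items():
            # Unknown n-letters fall back to the OOV slot (assumed index 1).
            vector[term_index.get(letter, oov_index)] += float(count)
        vectors.append(vector)
    return vectors


term_index = {'_PAD': 0, 'OOV': 1, 'st#': 2, '#te': 3, 'est': 4, 'tes': 5}
hashing = word_hashing([['#te', 'tes', 'est', 'st#'], ['oov']], term_index)
print(hashing[0])  # [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(hashing[1])  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```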

Module contents¶

matchzoo.preprocessors.units.list_available()¶

Return type:	`list`