matchzoo.preprocessors.units package

Submodules

matchzoo.preprocessors.units.digit_removal module

class matchzoo.preprocessors.units.digit_removal.DigitRemoval

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit to remove digits.

transform(input_)

Remove digits from list of tokens.

Parameters

input – list of tokens to be filtered.

Return tokens

list of tokens without digits.

Return type

list
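
The filtering step can be illustrated with a minimal sketch. It assumes that a token is dropped when it consists solely of digit characters; `remove_digits` is a hypothetical stand-in for illustration, not the library's implementation, and the real unit may use a different test for what counts as a number.

```python
def remove_digits(tokens: list) -> list:
    """Drop tokens made up only of digit characters (assumed criterion)."""
    # str.isdigit() is False for mixed tokens such as 'a1' and for '',
    # so those tokens are kept unchanged and in their original order.
    return [token for token in tokens if not token.isdigit()]


print(remove_digits(['room', '101', 'costs', '42', 'euros']))
# ['room', 'costs', 'euros']
```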

matchzoo.preprocessors.units.fixed_length module

class matchzoo.preprocessors.units.fixed_length.FixedLength(text_length, pad_value=0, pad_mode='pre', truncate_mode='pre')

Bases: matchzoo.preprocessors.units.unit.Unit

FixedLengthUnit Class.

Process unit to get the fixed length text.

Examples

>>> from matchzoo.preprocessors.units import FixedLength
>>> fixedlen = FixedLength(3)
>>> fixedlen.transform(list(range(1, 6))) == [3, 4, 5]
True
>>> fixedlen.transform(list(range(1, 3))) == [0, 1, 2]
True

transform(input_)

Transform list of tokenized tokens into the fixed length text.

Parameters

input – list of tokenized tokens.

Return tokens

list of tokenized tokens in fixed length.

Return type

list
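
The padding and truncation behaviour can be sketched in plain Python. The 'pre' branches follow the documented examples above; the meaning of 'post' (keep the head, append the padding) is an assumption, and `fixed_length` is a hypothetical helper rather than the library's code.

```python
def fixed_length(tokens, text_length, pad_value=0,
                 pad_mode='pre', truncate_mode='pre'):
    """Pad or truncate `tokens` to exactly `text_length` items."""
    if len(tokens) >= text_length:
        # 'pre' drops items from the front and keeps the tail, as in the
        # documented example; 'post' is assumed to keep the head instead.
        if truncate_mode == 'pre':
            return tokens[len(tokens) - text_length:]
        return tokens[:text_length]
    padding = [pad_value] * (text_length - len(tokens))
    # 'pre' puts the padding in front; 'post' is assumed to append it.
    return padding + tokens if pad_mode == 'pre' else tokens + padding


print(fixed_length([1, 2, 3, 4, 5], 3))             # [3, 4, 5]
print(fixed_length([1, 2], 3))                      # [0, 1, 2]
print(fixed_length([1, 2], 3, pad_mode='post'))     # [1, 2, 0]
```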

matchzoo.preprocessors.units.frequency_filter module

class matchzoo.preprocessors.units.frequency_filter.FrequencyFilter(low=0, high=inf, mode='df')

Bases: matchzoo.preprocessors.units.stateful_unit.StatefulUnit

Frequency filter unit.

Parameters
  • low (float) – Lower bound, inclusive.

  • high (float) – Upper bound, exclusive.

  • mode (str) – One of tf (term frequency), df (document frequency), and idf (inverse document frequency).

Examples

>>> import matchzoo as mz

To filter based on term frequency (tf):

>>> tf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=2, mode='tf')
>>> tf_filter.fit([['A', 'B', 'B'], ['C', 'C', 'C']])
>>> tf_filter.transform(['A', 'B', 'C'])
['B', 'C']

To filter based on document frequency (df):

>>> df_filter = mz.preprocessors.units.FrequencyFilter(
...     low=2, mode='df')
>>> df_filter.fit([['A', 'B'], ['B', 'C']])
>>> df_filter.transform(['A', 'B', 'C'])
['B']

To filter based on inverse document frequency (idf):

>>> idf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=1.2, mode='idf')
>>> idf_filter.fit([['A', 'B'], ['B', 'C', 'D']])
>>> idf_filter.transform(['A', 'B', 'C'])
['A', 'C']

fit(list_of_tokens)

Fit on list_of_tokens by computing the statistics (tf, df, or idf) required by the selected mode.

transform(input_)

Transform a list of tokens by filtering out unwanted words.

Return type

list
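
How the three modes score a token can be sketched without the library. `FrequencyFilterSketch` is a hypothetical stand-in, not the matchzoo class. Two points are assumptions: the smoothed idf formula log((1 + N) / (1 + df)) + 1, chosen because it is consistent with the idf example above, and the decision to drop tokens never seen during fit.

```python
import collections
import math


class FrequencyFilterSketch:
    """Illustrative stand-in: keep tokens whose score lies in [low, high)."""

    def __init__(self, low=0, high=float('inf'), mode='df'):
        self.low, self.high, self.mode = low, high, mode
        self.scores = {}

    def fit(self, list_of_tokens):
        if self.mode == 'tf':
            # Term frequency: total occurrences across all documents.
            self.scores = dict(collections.Counter(
                token for doc in list_of_tokens for token in doc))
            return
        # Document frequency: number of documents containing the token.
        df = collections.Counter(
            token for doc in list_of_tokens for token in set(doc))
        if self.mode == 'df':
            self.scores = dict(df)
        else:
            # Assumed smoothed idf: log((1 + N) / (1 + df)) + 1.
            n = len(list_of_tokens)
            self.scores = {token: math.log((1 + n) / (1 + count)) + 1
                           for token, count in df.items()}

    def transform(self, tokens):
        # Lower bound inclusive, upper bound exclusive; tokens that were
        # not seen during fit are dropped (an assumption of this sketch).
        return [token for token in tokens
                if token in self.scores
                and self.low <= self.scores[token] < self.high]


idf_filter = FrequencyFilterSketch(low=1.2, mode='idf')
idf_filter.fit([['A', 'B'], ['B', 'C', 'D']])
print(idf_filter.transform(['A', 'B', 'C']))  # ['A', 'C']
```

In the idf case, B appears in both documents, so its score is log(3 / 3) + 1 = 1.0 and it falls below the lower bound of 1.2, while A and C score log(3 / 2) + 1, roughly 1.41.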

matchzoo.preprocessors.units.lemmatization module

class matchzoo.preprocessors.units.lemmatization.Lemmatization

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit for token lemmatization.

transform(input_)

Lemmatize a sequence of tokens.

Parameters

input – list of tokens to be lemmatized.

Return tokens

list of lemmatized tokens.

Return type

list

matchzoo.preprocessors.units.lowercase module

class matchzoo.preprocessors.units.lowercase.Lowercase

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit to convert text to lower case.

transform(input_)

Convert list of tokens to lower case.

Parameters

input – list of tokens.

Return tokens

lower-cased list of tokens.

Return type

list

matchzoo.preprocessors.units.matching_histogram module

class matchzoo.preprocessors.units.matching_histogram.MatchingHistogram(bin_size=30, embedding_matrix=None, normalize=True, mode='LCH')

Bases: matchzoo.preprocessors.units.unit.Unit

MatchingHistogramUnit Class.

Parameters
  • bin_size (int) – The number of bins of the matching histogram.

  • embedding_matrix – The word embedding matrix applied to calculate the matching histogram.

  • normalize (bool) – Whether to normalize the embedding.

  • mode (str) – The type of the histogram; it should be one of ‘CH’, ‘NG’, or ‘LCH’.

Examples

>>> import numpy as np
>>> from matchzoo.preprocessors.units import MatchingHistogram
>>> embedding_matrix = np.array([[1.0, -1.0], [1.0, 2.0], [1.0, 3.0]])
>>> text_left = [0, 1]
>>> text_right = [1, 2]
>>> histogram = MatchingHistogram(3, embedding_matrix, True, 'CH')
>>> histogram.transform([text_left, text_right])
[[3.0, 1.0, 1.0], [1.0, 2.0, 2.0]]

transform(input_)

Transform the input text.

Return type

list
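
The count-based ('CH') histogram can be sketched in pure Python to show where the numbers in the example come from. Everything here is an assumption made for illustration: `count_histogram` is a hypothetical helper, each bin is assumed to start at 1.0, and cosine similarities in [-1, 1] are assumed to map linearly onto the bin indices. The library's own computation may differ.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def count_histogram(text_left, text_right, embedding_matrix, bin_size):
    """Build one histogram row per left token (assumed 'CH' behaviour)."""
    rows = [l2_normalize(row) for row in embedding_matrix]
    histogram = []
    for left_id in text_left:
        # Every bin starts at 1.0 (assumed, consistent with the example).
        bins = [1.0] * bin_size
        for right_id in text_right:
            cosine = sum(a * b for a, b in zip(rows[left_id], rows[right_id]))
            # Map cosine in [-1, 1] linearly onto indices 0..bin_size - 1.
            index = int((cosine + 1.0) / 2.0 * (bin_size - 1))
            # Clamp to guard against floating-point overshoot.
            bins[min(max(index, 0), bin_size - 1)] += 1.0
        histogram.append(bins)
    return histogram


embedding_matrix = [[1.0, -1.0], [1.0, 2.0], [1.0, 3.0]]
histogram = count_histogram([0, 1], [1, 2], embedding_matrix, 3)
print(histogram[0])  # [3.0, 1.0, 1.0]
```

Token 0 has a negative cosine similarity with both right-hand tokens, so both matches fall into the lowest bin. In the second row, the bin that receives the self-match of token 1 depends on floating-point rounding near a similarity of 1.0, so this sketch makes no claim about it.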

matchzoo.preprocessors.units.ngram_letter module

class matchzoo.preprocessors.units.ngram_letter.NgramLetter(ngram=3, reduce_dim=True)

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit for n-letter generation.

Triletter is used in DSSMModel. This processor is expected to execute before Vocab has been created.

Examples

>>> from matchzoo.preprocessors.units import NgramLetter
>>> triletter = NgramLetter()
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
9
>>> rv
['#he', 'hel', 'ell', 'llo', 'lo#', '#wo', 'wor', 'ord', 'rd#']
>>> triletter = NgramLetter(reduce_dim=False)
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
2
>>> rv
[['#he', 'hel', 'ell', 'llo', 'lo#'], ['#wo', 'wor', 'ord', 'rd#']]

transform(input_)

Transform tokens into letter n-grams (tri-letters by default).

For example, word should be represented as #wo, wor, ord and rd#.

Parameters

input – list of tokens to be transformed.

Return n_letters

generated n_letters.

Return type

list
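
The n-letter generation described above can be sketched as a sliding window over each token after adding '#' boundary markers. `ngram_letters` is a hypothetical helper written for illustration; it reproduces the documented outputs but is not taken from the library.

```python
def ngram_letters(tokens, ngram=3, reduce_dim=True):
    """Split each token into overlapping letter n-grams with '#' boundaries."""
    result = []
    for token in tokens:
        marked = '#' + token + '#'
        # Slide a window of width `ngram` across the marked token.
        grams = [marked[i:i + ngram] for i in range(len(marked) - ngram + 1)]
        if reduce_dim:
            result.extend(grams)   # one flat list for all tokens
        else:
            result.append(grams)   # one sub-list per token
    return result


print(ngram_letters(['word']))
# ['#wo', 'wor', 'ord', 'rd#']
print(ngram_letters(['hello', 'word'], reduce_dim=False))
# [['#he', 'hel', 'ell', 'llo', 'lo#'], ['#wo', 'wor', 'ord', 'rd#']]
```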

matchzoo.preprocessors.units.punc_removal module

class matchzoo.preprocessors.units.punc_removal.PuncRemoval

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit to remove punctuation.

transform(input_)

Remove punctuation from a list of tokens.

Parameters

input – list of tokens.

Return rv

tokens without punctuation.

Return type

list

matchzoo.preprocessors.units.stateful_unit module

class matchzoo.preprocessors.units.stateful_unit.StatefulUnit

Bases: matchzoo.preprocessors.units.unit.Unit

Unit with inner state.

Usually needs to be fit before transforming. All information gathered in the fit phase is stored in its context.

property context

Get current context. Same as unit.state.

abstract fit(input_)

Abstract base method; needs to be implemented in a subclass.

property state

Get current context. Same as unit.context.

Deprecated since v2.2.0, and will be removed in the future. Use unit.context instead.
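
The fit-then-transform contract can be illustrated with a self-contained sketch. The `Unit` and `StatefulUnit` classes below are simplified stand-ins that mirror the interface described here (an abstract `fit`, an abstract `transform`, and a `context` property); they are not the library's classes, and `TokenIndex` is a hypothetical subclass invented for the example.

```python
import abc


class Unit(abc.ABC):
    """Stand-in for the stateless base class."""

    @abc.abstractmethod
    def transform(self, input_):
        """Transform one input; implemented by subclasses."""


class StatefulUnit(Unit):
    """Stand-in for a unit that stores what it learns in `context`."""

    def __init__(self):
        self._context = {}

    @property
    def context(self):
        return self._context

    @abc.abstractmethod
    def fit(self, input_):
        """Gather state from the input; implemented by subclasses."""


class TokenIndex(StatefulUnit):
    """Hypothetical unit: learns a token-to-index map, then applies it."""

    def fit(self, input_):
        # Assign the next free index to each token not seen before.
        for token in input_:
            self._context.setdefault(token, len(self._context))

    def transform(self, input_):
        # Unknown tokens map to -1 in this sketch.
        return [self._context.get(token, -1) for token in input_]


unit = TokenIndex()
unit.fit(['a', 'b', 'a', 'c'])
print(unit.context)                      # {'a': 0, 'b': 1, 'c': 2}
print(unit.transform(['c', 'a', 'z']))   # [2, 0, -1]
```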

matchzoo.preprocessors.units.stemming module

class matchzoo.preprocessors.units.stemming.Stemming(stemmer='porter')

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit for token stemming.

Parameters

stemmer – stemmer to use, porter or lancaster.

transform(input_)

Reduce inflected words to their word stem, base, or root form.

Parameters

input – list of strings to be stemmed.

Return type

list

matchzoo.preprocessors.units.stop_removal module

class matchzoo.preprocessors.units.stop_removal.StopRemoval(lang='english')

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit to remove stop words.

Example

>>> from matchzoo.preprocessors.units import StopRemoval
>>> unit = StopRemoval()
>>> unit.transform(['a', 'the', 'test'])
['test']
>>> type(unit.stopwords)
<class 'list'>

property stopwords: list

Get stopwords based on language.

Params lang

language code.

Return type

list

Returns

list of stop words.

transform(input_)

Remove stopwords from list of tokenized tokens.

Parameters
  • input – list of tokenized tokens.

  • lang – language code for stopwords.

Return tokens

list of tokenized tokens without stopwords.

Return type

list

matchzoo.preprocessors.units.tokenize module

matchzoo.preprocessors.units.unit module

class matchzoo.preprocessors.units.unit.Unit

Bases: object

Process unit that does not persist state (i.e. does not need to be fit).

abstract transform(input_)

Abstract base method; needs to be implemented in a subclass.
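
Because a stateless unit only has to implement `transform`, units can be applied one after another without any fit step. The sketch below uses a simplified stand-in for the base class and two hypothetical subclasses (`MinLength`, `Uppercase`); the `chain` helper is likewise an assumption made for illustration and is not part of the documented API.

```python
import abc


class Unit(abc.ABC):
    """Stand-in mirroring the interface above: one abstract transform."""

    @abc.abstractmethod
    def transform(self, input_):
        """Transform one input; implemented by subclasses."""


class MinLength(Unit):
    """Hypothetical stateless unit: drop tokens shorter than `min_len`."""

    def __init__(self, min_len=2):
        self._min_len = min_len

    def transform(self, input_):
        return [token for token in input_ if len(token) >= self._min_len]


class Uppercase(Unit):
    """Hypothetical stateless unit: upper-case every token."""

    def transform(self, input_):
        return [token.upper() for token in input_]


def chain(units, tokens):
    """Apply stateless units in order; no fit step is required."""
    for unit in units:
        tokens = unit.transform(tokens)
    return tokens


print(chain([MinLength(2), Uppercase()], ['a', 'to', 'match']))
# ['TO', 'MATCH']
```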

matchzoo.preprocessors.units.vocabulary module

matchzoo.preprocessors.units.word_hashing module

Module contents