matchzoo.preprocessors.units package¶
Submodules¶
matchzoo.preprocessors.units.digit_removal module¶
- class matchzoo.preprocessors.units.digit_removal.DigitRemoval¶
Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit to remove digits.
- transform(input_)¶
Remove digits from list of tokens.
- Parameters
input_ – list of tokens to be filtered.
- Return tokens
list of tokens without digits.
- Return type
list
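The filtering described above can be sketched without MatchZoo installed. The helper `remove_digit_tokens` is hypothetical, and the sketch assumes a token counts as a digit when `str.isdigit()` is true for it; the library's own check may differ, for example in how it treats decimals or signed numbers.

```python
def remove_digit_tokens(tokens):
    """Return the tokens that are not made up entirely of digits.

    Hypothetical helper for illustration; not MatchZoo's implementation.
    """
    # Assumption: a "digit" token is one for which str.isdigit() is True.
    return [token for token in tokens if not token.isdigit()]


filtered = remove_digit_tokens(['room', '42', 'b2b', '2019'])
print(filtered)
```

Mixed tokens such as 'b2b' survive in this sketch because only all-digit tokens are dropped.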
matchzoo.preprocessors.units.fixed_length module¶
- class matchzoo.preprocessors.units.fixed_length.FixedLength(text_length, pad_value=0, pad_mode='pre', truncate_mode='pre')¶
Bases:
matchzoo.preprocessors.units.unit.Unit
FixedLengthUnit Class.
Process unit to get the fixed length text.
Examples
>>> from matchzoo.preprocessors.units import FixedLength
>>> fixedlen = FixedLength(3)
>>> fixedlen.transform(list(range(1, 6))) == [3, 4, 5]
True
>>> fixedlen.transform(list(range(1, 3))) == [0, 1, 2]
True
- transform(input_)¶
Transform list of tokenized tokens into the fixed length text.
- Parameters
input_ – list of tokenized tokens.
- Return tokens
list of tokenized tokens in fixed length.
- Return type
list
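How the pad_mode and truncate_mode arguments interact can be sketched in plain Python. The function `to_fixed_length` is a hypothetical stand-in, not the library's code; it assumes 'pre' acts on the front of the list and 'post' on the back, which is consistent with the documented 'pre' examples.

```python
def to_fixed_length(tokens, text_length, pad_value=0,
                    pad_mode='pre', truncate_mode='pre'):
    """Pad or truncate `tokens` to exactly `text_length` items.

    Hypothetical helper for illustration; not MatchZoo's implementation.
    """
    for mode in (pad_mode, truncate_mode):
        if mode not in ('pre', 'post'):
            raise ValueError(f"unknown mode: {mode!r}")
    if len(tokens) >= text_length:
        if truncate_mode == 'pre':
            # Drop items from the front and keep the tail.
            return tokens[len(tokens) - text_length:]
        # Drop items from the back and keep the head.
        return tokens[:text_length]
    padding = [pad_value] * (text_length - len(tokens))
    # 'pre' puts the padding in front, 'post' appends it.
    return padding + tokens if pad_mode == 'pre' else tokens + padding


print(to_fixed_length([1, 2, 3, 4, 5], 3))                        # truncate front
print(to_fixed_length([1, 2, 3, 4, 5], 3, truncate_mode='post'))  # truncate back
print(to_fixed_length([1, 2], 3, pad_mode='post'))                # pad back
```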
matchzoo.preprocessors.units.frequency_filter module¶
- class matchzoo.preprocessors.units.frequency_filter.FrequencyFilter(low=0, high=inf, mode='df')¶
Bases:
matchzoo.preprocessors.units.stateful_unit.StatefulUnit
Frequency filter unit.
- Parameters
low (float) – Lower bound, inclusive.
high (float) – Upper bound, exclusive.
mode (str) – One of tf (term frequency), df (document frequency), and idf (inverse document frequency).
- Examples::
>>> import matchzoo as mz
- To filter based on term frequency (tf):
>>> tf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=2, mode='tf')
>>> tf_filter.fit([['A', 'B', 'B'], ['C', 'C', 'C']])
>>> tf_filter.transform(['A', 'B', 'C'])
['B', 'C']
- To filter based on document frequency (df):
>>> df_filter = mz.preprocessors.units.FrequencyFilter(
...     low=2, mode='df')
>>> df_filter.fit([['A', 'B'], ['B', 'C']])
>>> df_filter.transform(['A', 'B', 'C'])
['B']
- To filter based on inverse document frequency (idf):
>>> idf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=1.2, mode='idf')
>>> idf_filter.fit([['A', 'B'], ['B', 'C', 'D']])
>>> idf_filter.transform(['A', 'B', 'C'])
['A', 'C']
- fit(list_of_tokens)¶
Fit list_of_tokens by calculating mode states.
- transform(input_)¶
Transform a list of tokens by filtering out unwanted words.
- Return type
list
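The difference between the tf and df modes can be sketched with `collections.Counter`. The helpers `fit_counts` and `keep_in_range` are hypothetical and only mirror the documented bounds (low inclusive, high exclusive). The idf mode is left out because its exact formula is not stated here, and the handling of tokens never seen during fit is an assumption of this sketch.

```python
from collections import Counter


def fit_counts(list_of_tokens, mode='df'):
    """Count tokens over a corpus (hypothetical helper, not MatchZoo's code).

    'tf' counts every occurrence; 'df' counts each token once per document.
    """
    counts = Counter()
    for tokens in list_of_tokens:
        counts.update(set(tokens) if mode == 'df' else tokens)
    return counts


def keep_in_range(tokens, counts, low=0, high=float('inf')):
    """Keep tokens whose count satisfies low <= count < high."""
    # Assumption: unseen tokens have a count of 0.
    return [token for token in tokens if low <= counts[token] < high]


tf_counts = fit_counts([['A', 'B', 'B'], ['C', 'C', 'C']], mode='tf')
tf_kept = keep_in_range(['A', 'B', 'C'], tf_counts, low=2)

df_counts = fit_counts([['A', 'B'], ['B', 'C']], mode='df')
df_kept = keep_in_range(['A', 'B', 'C'], df_counts, low=2)

print(tf_kept, df_kept)
```

With the same inputs as the tf and df examples above, the sketch keeps the same tokens.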
matchzoo.preprocessors.units.lemmatization module¶
- class matchzoo.preprocessors.units.lemmatization.Lemmatization¶
Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit for token lemmatization.
- transform(input_)¶
Lemmatize a sequence of tokens.
- Parameters
input_ – list of tokens to be lemmatized.
- Return tokens
list of lemmatized tokens.
- Return type
list
matchzoo.preprocessors.units.lowercase module¶
- class matchzoo.preprocessors.units.lowercase.Lowercase¶
Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit for lowercasing text.
- transform(input_)¶
Convert list of tokens to lower case.
- Parameters
input_ – list of tokens.
- Return tokens
lower-cased list of tokens.
- Return type
list
matchzoo.preprocessors.units.matching_histogram module¶
- class matchzoo.preprocessors.units.matching_histogram.MatchingHistogram(bin_size=30, embedding_matrix=None, normalize=True, mode='LCH')¶
Bases:
matchzoo.preprocessors.units.unit.Unit
MatchingHistogramUnit Class.
- Parameters
bin_size (int) – The number of bins of the matching histogram.
embedding_matrix – The word embedding matrix applied to calculate the matching histogram.
normalize – Boolean, normalize the embedding or not.
mode (str) – The type of the histogram; it should be one of ‘CH’, ‘NG’, or ‘LCH’.
Examples
>>> import numpy as np
>>> from matchzoo.preprocessors.units import MatchingHistogram
>>> embedding_matrix = np.array([[1.0, -1.0], [1.0, 2.0], [1.0, 3.0]])
>>> text_left = [0, 1]
>>> text_right = [1, 2]
>>> histogram = MatchingHistogram(3, embedding_matrix, True, 'CH')
>>> histogram.transform([text_left, text_right])
[[3.0, 1.0, 1.0], [1.0, 2.0, 2.0]]
- transform(input_)¶
Transform the input text.
- Return type
list
matchzoo.preprocessors.units.ngram_letter module¶
- class matchzoo.preprocessors.units.ngram_letter.NgramLetter(ngram=3, reduce_dim=True)¶
Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit for n-letter generation.
Triletter is used in DSSMModel. This processor is expected to execute before Vocab has been created.
Examples
>>> from matchzoo.preprocessors.units import NgramLetter
>>> triletter = NgramLetter()
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
9
>>> rv
['#he', 'hel', 'ell', 'llo', 'lo#', '#wo', 'wor', 'ord', 'rd#']
>>> triletter = NgramLetter(reduce_dim=False)
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
2
>>> rv
[['#he', 'hel', 'ell', 'llo', 'lo#'], ['#wo', 'wor', 'ord', 'rd#']]
- transform(input_)¶
Transform token into tri-letter.
For example, word should be represented as #wo, wor, ord and rd#.
- Parameters
input_ – list of tokens to be transformed.
- Return n_letters
generated n_letters.
- Return type
list
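The n-letter generation can be sketched as a sliding window over each token wrapped in `#` boundary markers. The function `ngram_letters` is a hypothetical stand-in written to reproduce the documented outputs; it is not taken from the library source.

```python
def ngram_letters(tokens, ngram=3, reduce_dim=True):
    """Generate letter n-grams for each token.

    Hypothetical helper for illustration; not MatchZoo's implementation.
    """
    result = []
    for token in tokens:
        # Mark the word boundaries, e.g. 'word' -> '#word#'.
        padded = '#' + token + '#'
        # Slide a window of size `ngram` over the padded token.
        grams = [padded[i:i + ngram]
                 for i in range(len(padded) - ngram + 1)]
        if reduce_dim:
            result.extend(grams)   # one flat list for all tokens
        else:
            result.append(grams)   # one list per token
    return result


flat = ngram_letters(['hello', 'word'])
nested = ngram_letters(['hello', 'word'], reduce_dim=False)
print(flat)
print(nested)
```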
matchzoo.preprocessors.units.punc_removal module¶
- class matchzoo.preprocessors.units.punc_removal.PuncRemoval¶
Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit for removing punctuation.
- transform(input_)¶
Remove punctuation from a list of tokens.
- Parameters
input_ – list of tokens.
- Return rv
tokens without punctuation.
- Return type
list
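A minimal sketch of token-level punctuation removal, using only the standard library. The helper `remove_punctuation_tokens` is hypothetical, and it assumes that only tokens made up entirely of ASCII punctuation are dropped; the library may treat tokens with mixed content, or non-ASCII punctuation, differently.

```python
import string

# Assumption: "punctuation" means the ASCII set in string.punctuation.
PUNCTUATION = set(string.punctuation)


def remove_punctuation_tokens(tokens):
    """Drop tokens that consist only of punctuation characters.

    Hypothetical helper for illustration; not MatchZoo's implementation.
    """
    return [
        token for token in tokens
        if not (token and all(ch in PUNCTUATION for ch in token))
    ]


cleaned = remove_punctuation_tokens(['hello', ',', 'world', '!!', "don't"])
print(cleaned)
```

In this sketch a token such as "don't" is kept, because it is not punctuation only.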
matchzoo.preprocessors.units.stateful_unit module¶
- class matchzoo.preprocessors.units.stateful_unit.StatefulUnit¶
Bases:
matchzoo.preprocessors.units.unit.Unit
Unit with inner state.
Usually needs to be fit before transforming. All information gathered in the fit phase is stored in its context.
- property context¶
Get current context. Same as unit.state.
- abstract fit(input_)¶
Abstract base method; needs to be implemented in a subclass.
- property state¶
Get current context. Same as unit.context.
Deprecated since v2.2.0, and will be removed in the future. Use unit.context instead.
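The fit-then-transform pattern can be sketched with small stand-in classes. Everything below is hypothetical: the `Unit` and `StatefulUnit` classes only imitate the documented interface (an abstract fit, a context property), the context is assumed to be a dict, and `KnownTokenFilter` is an invented subclass used purely for illustration.

```python
import abc


class Unit(abc.ABC):
    """Stand-in for the documented Unit base class (hypothetical)."""

    @abc.abstractmethod
    def transform(self, input_):
        """Transform input_ and return the result."""


class StatefulUnit(Unit):
    """Stand-in for the documented StatefulUnit (hypothetical)."""

    def __init__(self):
        # Assumption: the context is a plain dict filled in by fit().
        self._context = {}

    @property
    def context(self):
        """Return the state gathered during fit."""
        return self._context

    @abc.abstractmethod
    def fit(self, input_):
        """Gather state from input_ into the context."""


class KnownTokenFilter(StatefulUnit):
    """Invented subclass: keeps only tokens seen during fit."""

    def fit(self, input_):
        self._context['known'] = set(input_)

    def transform(self, input_):
        return [token for token in input_ if token in self._context['known']]


unit = KnownTokenFilter()
unit.fit(['a', 'b'])
kept = unit.transform(['a', 'c', 'b'])
print(kept, unit.context)
```

Because fit is abstract, a subclass that does not implement it cannot be instantiated, which matches the note that units of this kind usually need to be fit before transforming.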
matchzoo.preprocessors.units.stemming module¶
- class matchzoo.preprocessors.units.stemming.Stemming(stemmer='porter')¶
Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit for token stemming.
- Parameters
stemmer – stemmer to use, porter or lancaster.
- transform(input_)¶
Reduce inflected words to their word stem, base, or root form.
- Parameters
input_ – list of strings to be stemmed.
- Return type
list
matchzoo.preprocessors.units.stop_removal module¶
- class matchzoo.preprocessors.units.stop_removal.StopRemoval(lang='english')¶
Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit to remove stop words.
Example
>>> from matchzoo.preprocessors.units import StopRemoval
>>> unit = StopRemoval()
>>> unit.transform(['a', 'the', 'test'])
['test']
>>> type(unit.stopwords)
<class 'list'>
- property stopwords: list¶
Get stopwords based on language.
- Params lang
language code.
- Return type
list
- Returns
list of stop words.
- transform(input_)¶
Remove stopwords from list of tokenized tokens.
- Parameters
input_ – list of tokenized tokens.
lang – language code for stopwords.
- Return tokens
list of tokenized tokens without stopwords.
- Return type
list
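Stop-word removal amounts to a membership test against a word list. The sketch below uses a tiny, made-up stop-word set (`SAMPLE_STOPWORDS`) and a hypothetical helper (`remove_stopwords`); the real unit loads its list according to lang, and that list is not reproduced here.

```python
# Made-up sample; the library's list for a given language will be much larger.
SAMPLE_STOPWORDS = frozenset({'a', 'an', 'the', 'of', 'and'})


def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Return the tokens that are not in `stopwords`.

    Hypothetical helper for illustration; not MatchZoo's implementation.
    """
    return [token for token in tokens if token not in stopwords]


remaining = remove_stopwords(['a', 'the', 'test'])
print(remaining)
```

The comparison here is case-sensitive, so lowercasing the tokens first is a reasonable preprocessing step under that assumption.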