matchzoo.preprocessors.units package
Submodules
matchzoo.preprocessors.units.digit_removal module
-
class matchzoo.preprocessors.units.digit_removal.DigitRemoval
Bases: matchzoo.preprocessors.units.unit.Unit
Process unit to remove digits.
-
transform(input_)
Remove digits from a list of tokens.
Parameters: input_ – list of tokens to be filtered.
Return tokens: list of tokens without digits.
Return type: list
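The filtering idea can be sketched in plain Python without MatchZoo. This is a minimal illustration, not the library's implementation: the helper name remove_digits is hypothetical, and the rule that a token is dropped when str.isdigit() is true is an assumption.

```python
def remove_digits(tokens):
    """Drop tokens made up entirely of digits (illustrative rule only)."""
    # Assumption: a token counts as a digit when str.isdigit() is true.
    return [token for token in tokens if not token.isdigit()]


print(remove_digits(["route", "66", "to", "2024", "a1"]))
```

Mixed tokens such as a1 survive under this rule; a stricter or looser definition of "digit" would change that.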
-
matchzoo.preprocessors.units.fixed_length module
-
class matchzoo.preprocessors.units.fixed_length.FixedLength(text_length, pad_value=0, pad_mode='pre', truncate_mode='pre')
Bases: matchzoo.preprocessors.units.unit.Unit
FixedLengthUnit class.
Process unit to get fixed-length text.
Examples
>>> from matchzoo.preprocessors.units import FixedLength
>>> fixedlen = FixedLength(3)
>>> fixedlen.transform(list(range(1, 6))) == [3, 4, 5]
True
>>> fixedlen.transform(list(range(1, 3))) == [0, 1, 2]
True
-
transform(input_)
Transform a list of tokens into fixed-length text.
Parameters: input_ – list of tokens.
Return tokens: list of tokens with fixed length.
Return type: list
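The padding and truncation behaviour shown in the FixedLength example can be sketched in plain Python. The function fix_length is a hypothetical stand-in; the 'pre' behaviour follows the documented example, while the 'post' behaviour (truncate or pad at the end) is an assumption.

```python
def fix_length(tokens, text_length, pad_value=0,
               pad_mode="pre", truncate_mode="pre"):
    """Pad or truncate `tokens` to exactly `text_length` items."""
    if len(tokens) >= text_length:
        # 'pre' truncation drops items from the front; 'post' (assumed)
        # drops them from the back.
        if truncate_mode == "pre":
            return tokens[len(tokens) - text_length:]
        return tokens[:text_length]
    padding = [pad_value] * (text_length - len(tokens))
    # 'pre' padding puts pad values in front; 'post' (assumed) appends them.
    return padding + tokens if pad_mode == "pre" else tokens + padding


print(fix_length([1, 2, 3, 4, 5], 3))  # [3, 4, 5]
print(fix_length([1, 2], 3))           # [0, 1, 2]
```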
-
matchzoo.preprocessors.units.frequency_filter module
-
class matchzoo.preprocessors.units.frequency_filter.FrequencyFilter(low=0, high=inf, mode='df')
Bases: matchzoo.preprocessors.units.stateful_unit.StatefulUnit
Frequency filter unit.
Parameters:
- low (float) – Lower bound, inclusive.
- high (float) – Upper bound, exclusive.
- mode (str) – One of tf (term frequency), df (document frequency), and idf (inverse document frequency).
Examples
>>> import matchzoo as mz
To filter based on term frequency (tf):
>>> tf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=2, mode='tf')
>>> tf_filter.fit([['A', 'B', 'B'], ['C', 'C', 'C']])
>>> tf_filter.transform(['A', 'B', 'C'])
['B', 'C']
To filter based on document frequency (df):
>>> df_filter = mz.preprocessors.units.FrequencyFilter(
...     low=2, mode='df')
>>> df_filter.fit([['A', 'B'], ['B', 'C']])
>>> df_filter.transform(['A', 'B', 'C'])
['B']
To filter based on inverse document frequency (idf):
>>> idf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=1.2, mode='idf')
>>> idf_filter.fit([['A', 'B'], ['B', 'C', 'D']])
>>> idf_filter.transform(['A', 'B', 'C'])
['A', 'C']
-
fit(list_of_tokens)
Fit on list_of_tokens by computing the per-term statistics (tf, df, or idf) that the selected mode requires.
-
transform(input_)
Transform a list of tokens by filtering out unwanted words.
Return type: list
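The three modes can be illustrated with a small standalone sketch. The helpers fit_scores and filter_tokens are hypothetical names, and the smoothed idf formula log((1 + N) / (1 + df)) + 1 is an assumption chosen because it reproduces the documented idf example; the library's exact formula is not stated in this page. Dropping tokens that were never seen during fitting is likewise an assumption.

```python
import math
from collections import Counter


def fit_scores(docs, mode="df"):
    """Score every term by tf, df, or a smoothed idf."""
    tf = Counter(term for doc in docs for term in doc)
    df = Counter(term for doc in docs for term in set(doc))
    if mode == "tf":
        return dict(tf)
    if mode == "df":
        return dict(df)
    if mode == "idf":
        # Assumed smoothing: log((1 + N) / (1 + df)) + 1.
        n = len(docs)
        return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    raise ValueError(f"unknown mode: {mode}")


def filter_tokens(tokens, scores, low=0, high=float("inf")):
    """Keep tokens whose score lies in [low, high); drop unseen tokens."""
    return [t for t in tokens if t in scores and low <= scores[t] < high]


scores = fit_scores([["A", "B"], ["B", "C", "D"]], mode="idf")
print(filter_tokens(["A", "B", "C"], scores, low=1.2))  # ['A', 'C']
```

In the idf case, B appears in every document, so its score is exactly 1.0 and it falls below the lower bound of 1.2.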
matchzoo.preprocessors.units.lemmatization module
-
class matchzoo.preprocessors.units.lemmatization.Lemmatization
Bases: matchzoo.preprocessors.units.unit.Unit
Process unit for token lemmatization.
-
transform(input_)
Lemmatize a sequence of tokens.
Parameters: input_ – list of tokens to be lemmatized.
Return tokens: list of lemmatized tokens.
Return type: list
-
matchzoo.preprocessors.units.lowercase module
-
class matchzoo.preprocessors.units.lowercase.Lowercase
Bases: matchzoo.preprocessors.units.unit.Unit
Process unit for lowercasing text.
-
transform(input_)
Convert a list of tokens to lower case.
Parameters: input_ – list of tokens.
Return tokens: lower-cased list of tokens.
Return type: list
-
matchzoo.preprocessors.units.matching_histogram module
-
class matchzoo.preprocessors.units.matching_histogram.MatchingHistogram(bin_size=30, embedding_matrix=None, normalize=True, mode='LCH')
Bases: matchzoo.preprocessors.units.unit.Unit
MatchingHistogramUnit class.
Parameters:
- bin_size (int) – The number of bins of the matching histogram.
- embedding_matrix – The word embedding matrix used to calculate the matching histogram.
- normalize – Boolean, whether to normalize the embedding.
- mode (str) – The type of the histogram; it should be one of ‘CH’, ‘NG’, or ‘LCH’.
Examples
>>> import numpy as np
>>> from matchzoo.preprocessors.units import MatchingHistogram
>>> embedding_matrix = np.array([[1.0, -1.0], [1.0, 2.0], [1.0, 3.0]])
>>> text_left = [0, 1]
>>> text_right = [1, 2]
>>> histogram = MatchingHistogram(3, embedding_matrix, True, 'CH')
>>> histogram.transform([text_left, text_right])
[[3.0, 1.0, 1.0], [1.0, 2.0, 2.0]]
-
transform(input_)
Transform the input text pair, given as [text_left, text_right], into a matching histogram.
Return type: list
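How a count histogram could be built from pairwise similarities is sketched below in plain Python. Everything here is a hedged illustration rather than the library's code: the function matching_histogram is hypothetical, the binning rule that maps a similarity in [-1, 1] onto a bin index is an assumption, and the starting count of 1.0 per bin is inferred from the documented output, where two term pairs yield a row summing to five. Only the count ('CH') and log-count ('LCH') variants are sketched.

```python
import math


def matching_histogram(left_vecs, right_vecs, bin_size=3, mode="CH"):
    """Histogram of pairwise dot products, one row per left-hand term."""
    rows = []
    for left in left_vecs:
        # Assumed: every bin starts at 1.0 (inferred from the doc example).
        hist = [1.0] * bin_size
        for right in right_vecs:
            sim = sum(a * b for a, b in zip(left, right))
            # Assumed binning: map [-1, 1] onto 0..bin_size-1.
            index = int((sim + 1.0) / 2.0 * (bin_size - 1))
            hist[index] += 1.0
        if mode == "LCH":
            hist = [math.log(count) for count in hist]
        rows.append(hist)
    return rows


# Unit-length vectors, so each dot product is a cosine similarity.
left = [[1.0, 0.0]]
right = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(matching_histogram(left, right))  # [[2.0, 2.0, 2.0]]
```

The three right-hand vectors have similarities 1, 0, and -1 with the left-hand vector, so each of the three bins receives exactly one extra count.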
matchzoo.preprocessors.units.ngram_letter module
-
class matchzoo.preprocessors.units.ngram_letter.NgramLetter(ngram=3, reduce_dim=True)
Bases: matchzoo.preprocessors.units.unit.Unit
Process unit for n-letter generation.
Triletter is used in DSSMModel. This processor is expected to execute before Vocab has been created.
Examples
>>> triletter = NgramLetter()
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
9
>>> rv
['#he', 'hel', 'ell', 'llo', 'lo#', '#wo', 'wor', 'ord', 'rd#']
>>> triletter = NgramLetter(reduce_dim=False)
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
2
>>> rv
[['#he', 'hel', 'ell', 'llo', 'lo#'], ['#wo', 'wor', 'ord', 'rd#']]
-
transform(input_)
Transform tokens into tri-letters.
For example, word should be represented as #wo, wor, ord and rd#.
Parameters: input_ – list of tokens to be transformed.
Return n_letters: generated n_letters.
Return type: list
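The n-letter split described above can be reproduced with a short sliding-window sketch. The function ngram_letters is a hypothetical standalone helper, not the library's method; it assumes each token is wrapped in one leading and one trailing # before windowing, which matches the documented output.

```python
def ngram_letters(tokens, ngram=3, reduce_dim=True):
    """Split each token into overlapping n-letter pieces padded with '#'."""
    per_token = []
    for token in tokens:
        padded = "#" + token + "#"
        # Slide a window of width `ngram` across the padded token.
        pieces = [padded[i:i + ngram]
                  for i in range(len(padded) - ngram + 1)]
        per_token.append(pieces)
    if reduce_dim:
        # Flatten into a single list covering the whole input.
        return [piece for pieces in per_token for piece in pieces]
    return per_token


print(ngram_letters(["word"]))  # ['#wo', 'wor', 'ord', 'rd#']
```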
-
matchzoo.preprocessors.units.punc_removal module
-
class matchzoo.preprocessors.units.punc_removal.PuncRemoval
Bases: matchzoo.preprocessors.units.unit.Unit
Process unit for removing punctuation.
-
transform(input_)
Remove punctuation from a list of tokens.
Parameters: input_ – list of tokens.
Return rv: tokens without punctuation.
Return type: list
-
matchzoo.preprocessors.units.stateful_unit module
-
class matchzoo.preprocessors.units.stateful_unit.StatefulUnit
Bases: matchzoo.preprocessors.units.unit.Unit
Unit with inner state.
Usually needs to be fit before transforming. All information gathered in the fit phase is stored in its context.
-
context
Get current context. Same as unit.state.
-
fit(input_)
Abstract base method; it needs to be implemented in a subclass.
-
state
Get current context. Same as unit.context.
Deprecated since v2.2.0, and will be removed in the future. Use unit.context instead.
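The fit-then-transform pattern can be shown with a toy class that does not depend on MatchZoo. KnownTokenFilter and its "known" context key are hypothetical names invented for this sketch; it only mirrors the idea that fit stores what it learns in a context dictionary and transform reads it back.

```python
class KnownTokenFilter:
    """Hypothetical stateful unit: fit() records state, transform() uses it."""

    def __init__(self):
        self._context = {}

    @property
    def context(self):
        """Expose everything gathered during the fit phase."""
        return self._context

    def fit(self, tokens):
        # Store what was learned so transform() can use it later.
        self._context["known"] = set(tokens)

    def transform(self, tokens):
        # Keep only tokens that were seen while fitting.
        return [t for t in tokens if t in self._context["known"]]


unit = KnownTokenFilter()
unit.fit(["a", "b"])
print(unit.transform(["a", "c", "b"]))  # ['a', 'b']
```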
-
matchzoo.preprocessors.units.stemming module
-
class matchzoo.preprocessors.units.stemming.Stemming(stemmer='porter')
Bases: matchzoo.preprocessors.units.unit.Unit
Process unit for token stemming.
Parameters: stemmer – stemmer to use, porter or lancaster.
-
transform(input_)
Reduce inflected words to their word stem, base, or root form.
Parameters: input_ – list of strings to be stemmed.
Return type: list
-
matchzoo.preprocessors.units.stop_removal module
-
class matchzoo.preprocessors.units.stop_removal.StopRemoval(lang='english')
Bases: matchzoo.preprocessors.units.unit.Unit
Process unit to remove stop words.
Example
>>> unit = StopRemoval()
>>> unit.transform(['a', 'the', 'test'])
['test']
>>> type(unit.stopwords)
<class 'list'>
-
stopwords
Get stopwords based on language.
Params lang: language code.
Return type: list
Returns: list of stop words.
-
transform(input_)
Remove stopwords from a list of tokens.
Parameters:
- input_ – list of tokens.
- lang – language code for stopwords.
Return tokens: list of tokens without stopwords.
Return type: list
-
matchzoo.preprocessors.units.tokenize module
-
class matchzoo.preprocessors.units.tokenize.Tokenize
Bases: matchzoo.preprocessors.units.unit.Unit
Process unit for text tokenization.
-
transform(input_)
Convert raw text into a list of tokens.
Parameters: input_ – raw textual input.
Return tokens: tokenized tokens as a list.
Return type: list
-
matchzoo.preprocessors.units.unit module
matchzoo.preprocessors.units.vocabulary module
-
class matchzoo.preprocessors.units.vocabulary.Vocabulary(pad_value='<PAD>', oov_value='<OOV>')
Bases: matchzoo.preprocessors.units.stateful_unit.StatefulUnit
Vocabulary class.
Parameters:
- pad_value (str) – The string value for the padding position.
- oov_value (str) – The string value for the out-of-vocabulary terms.
Examples
>>> vocab = Vocabulary(pad_value='[PAD]', oov_value='[OOV]')
>>> vocab.fit(['A', 'B', 'C', 'D', 'E'])
>>> term_index = vocab.state['term_index']
>>> term_index  # doctest: +SKIP
{'[PAD]': 0, '[OOV]': 1, 'D': 2, 'A': 3, 'B': 4, 'C': 5, 'E': 6}
>>> index_term = vocab.state['index_term']
>>> index_term  # doctest: +SKIP
{0: '[PAD]', 1: '[OOV]', 2: 'D', 3: 'A', 4: 'B', 5: 'C', 6: 'E'}
>>> term_index['out-of-vocabulary-term']
1
>>> index_term[0]
'[PAD]'
>>> index_term[42]
Traceback (most recent call last):
    ...
KeyError: 42
>>> a_index = term_index['A']
>>> c_index = term_index['C']
>>> vocab.transform(['C', 'A', 'C']) == [c_index, a_index, c_index]
True
>>> vocab.transform(['C', 'A', '[OOV]']) == [c_index, a_index, 1]
True
>>> indices = vocab.transform(list('ABCDDZZZ'))
>>> ' '.join(vocab.state['index_term'][i] for i in indices)
'A B C D D [OOV] [OOV] [OOV]'
-
class TermIndex
Bases: dict
Map term to index.
-
transform(input_)
Transform a list of tokens to corresponding indices.
Return type: list
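The lookup behaviour in the Vocabulary example, where an unknown term maps to index 1 while an unknown index raises KeyError, can be imitated with a dict subclass. This is a sketch under assumptions: OovDict and build_vocabulary are hypothetical names, the use of __missing__ is one possible way to get the fallback, and sorting the terms is only done here to make index assignment deterministic.

```python
class OovDict(dict):
    """Hypothetical term-to-index map; unknown terms get the OOV index."""

    def __missing__(self, key):
        # Assumed: index 1 is reserved for out-of-vocabulary terms,
        # as in the documented example.
        return 1


def build_vocabulary(tokens, pad_value="<PAD>", oov_value="<OOV>"):
    """Return (term_index, index_term) with padding at 0 and OOV at 1."""
    term_index = OovDict({pad_value: 0, oov_value: 1})
    index_term = {0: pad_value, 1: oov_value}
    # sorted() keeps the assignment deterministic in this sketch.
    for index, term in enumerate(sorted(set(tokens)), start=2):
        term_index[term] = index
        index_term[index] = term
    return term_index, index_term


term_index, index_term = build_vocabulary(list("ABCDE"), "[PAD]", "[OOV]")
indices = [term_index[t] for t in "ABCDDZZZ"]
print(" ".join(index_term[i] for i in indices))
```

Because index_term is a plain dict, index_term[42] still raises KeyError, matching the asymmetry shown in the example.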
matchzoo.preprocessors.units.word_hashing module
-
class matchzoo.preprocessors.units.word_hashing.WordHashing(term_index)
Bases: matchzoo.preprocessors.units.unit.Unit
Word-hashing layer for DSSM-based models.
The input of WordHashingUnit should be a list of word sub-letter lists extracted from one document. The output is the word-hashing representation of this document.
NgramLetterUnit and VocabularyUnit are two essential prerequisites of WordHashingUnit.
Examples
>>> letters = [['#te', 'tes','est', 'st#'], ['oov']]
>>> word_hashing = WordHashing(
...     term_index={
...         '_PAD': 0, 'OOV': 1, 'st#': 2, '#te': 3, 'est': 4, 'tes': 5
...     })
>>> hashing = word_hashing.transform(letters)
>>> hashing[0]
[0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
>>> hashing[1]
[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
-
transform(input_)
Transform a list of letters into the word-hashing layer.
Parameters: input_ – list of tri_letters generated by NgramLetterUnit.
Return type: list
Returns: Word hashing representation of tri-letters.
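The word-hashing representation from the example can be sketched as one count vector per word. The function word_hashing below is a hypothetical standalone helper; using counts rather than binary flags, sizing the vector by len(term_index), and sending unknown letters to index 1 are assumptions that happen to reproduce the documented output.

```python
def word_hashing(letters_per_word, term_index, oov_index=1):
    """Turn each word's n-letter list into a count vector over term_index."""
    size = len(term_index)
    vectors = []
    for letters in letters_per_word:
        counts = [0.0] * size
        for letter in letters:
            # Assumed: letters missing from term_index count as OOV.
            counts[term_index.get(letter, oov_index)] += 1.0
        vectors.append(counts)
    return vectors


term_index = {"_PAD": 0, "OOV": 1, "st#": 2, "#te": 3, "est": 4, "tes": 5}
letters = [["#te", "tes", "est", "st#"], ["oov"]]
print(word_hashing(letters, term_index))
```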
-