matchzoo.processor_units package
Submodules
matchzoo.processor_units.chain_transform module
Wrapper function that organizes a number of transform functions.
-
matchzoo.processor_units.chain_transform.chain_transform(units)
Compose unit transformations into a single function.
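The composition described above can be illustrated with a minimal sketch. It assumes each unit exposes a transform method and that units are applied in list order. The names chain_transform_sketch, Lowercase, and DropDigits are hypothetical stand-ins, not part of MatchZoo.

```python
from functools import reduce


def chain_transform_sketch(units):
    """Return one function that applies each unit's transform in order."""
    def composed(value):
        # Feed the output of each unit into the next one.
        return reduce(lambda acc, unit: unit.transform(acc), units, value)
    return composed


class Lowercase:
    """Hypothetical stand-in for a lower-casing unit."""

    def transform(self, tokens):
        return [token.lower() for token in tokens]


class DropDigits:
    """Hypothetical stand-in for a digit-removal unit."""

    def transform(self, tokens):
        return [token for token in tokens if not token.isdigit()]


pipeline = chain_transform_sketch([Lowercase(), DropDigits()])
print(pipeline(['Hello', '42', 'World']))  # ['hello', 'world']
```

An empty list of units yields the identity function in this sketch, which keeps the composition well defined.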
matchzoo.processor_units.processor_units module
MatchZoo toolkit for text pre-processing.
-
class matchzoo.processor_units.processor_units.DigitRemovalUnit
Bases: matchzoo.processor_units.processor_units.ProcessorUnit
Processor unit that removes digits.
-
transform(tokens)
Remove digits from a list of tokens.
Parameters: tokens (list) -- list of tokens to be filtered.
Returns: list of tokens without digits.
Return type: list
-
-
class matchzoo.processor_units.processor_units.FixedLengthUnit(text_length, pad_value=0, pad_mode='pre', truncate_mode='pre')
Bases: matchzoo.processor_units.processor_units.ProcessorUnit
FixedLengthUnit class.
Processor unit that pads or truncates a list of tokens to a fixed length.
Examples
>>> fixedlen = FixedLengthUnit(3)
>>> fixedlen.transform(range(1, 6)) == [3, 4, 5]
True
>>> fixedlen = FixedLengthUnit(3)
>>> fixedlen.transform(range(1, 3)) == [0, 1, 2]
True
-
transform(tokens)
Transform a list of tokens into a fixed-length list.
Parameters: tokens (list) -- list of tokenized tokens.
Returns: list of tokenized tokens in fixed length.
Return type: list
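The padding and truncation behaviour shown in the examples can be sketched in plain Python. The 'pre' branches below reproduce the documented outputs. The 'post' branches are an assumption about what the other mode does, and fix_length is a hypothetical helper rather than the library's code.

```python
def fix_length(tokens, text_length, pad_value=0,
               pad_mode='pre', truncate_mode='pre'):
    """Pad or truncate tokens so the result has exactly text_length items."""
    tokens = list(tokens)
    if len(tokens) > text_length:
        if truncate_mode == 'pre':
            # Drop items from the front, keeping the last text_length items.
            return tokens[len(tokens) - text_length:]
        # Assumed 'post' behaviour: drop items from the end.
        return tokens[:text_length]
    padding = [pad_value] * (text_length - len(tokens))
    if pad_mode == 'pre':
        # Put the padding in front of the tokens.
        return padding + tokens
    # Assumed 'post' behaviour: put the padding after the tokens.
    return tokens + padding


print(fix_length(range(1, 6), 3))  # [3, 4, 5]
print(fix_length(range(1, 3), 3))  # [0, 1, 2]
```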
-
-
class matchzoo.processor_units.processor_units.FrequencyFilterUnit(low=0, high=inf, mode='df')
Bases: matchzoo.processor_units.processor_units.StatefulProcessorUnit
Frequency filter unit.
Parameters:
- low -- Lower bound, inclusive.
- high -- Upper bound, exclusive.
- mode -- One of tf (term frequency), df (document frequency), and idf (inverse document frequency).
Examples
>>> import matchzoo as mz
To filter based on term frequency (tf):
>>> tf_filter = mz.processor_units.FrequencyFilterUnit(
...     low=2, mode='tf')
>>> tf_filter.fit([['A', 'B', 'B'], ['C', 'C', 'C']])
>>> tf_filter.transform(['A', 'B', 'C'])
['B', 'C']
To filter based on document frequency (df):
>>> df_filter = mz.processor_units.FrequencyFilterUnit(
...     low=2, mode='df')
>>> df_filter.fit([['A', 'B'], ['B', 'C']])
>>> df_filter.transform(['A', 'B', 'C'])
['B']
To filter based on inverse document frequency (idf):
>>> idf_filter = mz.processor_units.FrequencyFilterUnit(
...     low=1.2, mode='idf')
>>> idf_filter.fit([['A', 'B'], ['B', 'C', 'D']])
>>> idf_filter.transform(['A', 'B', 'C'])
['A', 'C']
-
fit(list_of_tokens)
Fit on list_of_tokens by calculating the state for the selected mode.
-
transform(tokens)
Transform a list of tokens by filtering out unwanted words.
Return type: list
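The three filtering modes can be sketched without MatchZoo. This is a minimal illustration. The class name FrequencyFilterSketch is hypothetical, dropping tokens that were never seen during fit is an assumption, and the smoothed idf formula log((1 + N) / (1 + df)) + 1 is assumed here because it reproduces the idf example above. The source does not state the exact formula.

```python
import math
from collections import Counter


class FrequencyFilterSketch:
    """Keep tokens whose statistic lies in the half-open range [low, high)."""

    def __init__(self, low=0, high=float('inf'), mode='df'):
        self.low, self.high, self.mode = low, high, mode
        self.stats = {}

    def fit(self, list_of_tokens):
        if self.mode == 'tf':
            # Term frequency: total occurrences across all documents.
            self.stats = Counter(t for doc in list_of_tokens for t in doc)
            return
        # Document frequency: number of documents containing the term.
        df = Counter(t for doc in list_of_tokens for t in set(doc))
        if self.mode == 'df':
            self.stats = df
        elif self.mode == 'idf':
            n_docs = len(list_of_tokens)
            # Assumed smoothed idf; not confirmed by the source.
            self.stats = {term: math.log((1 + n_docs) / (1 + count)) + 1
                          for term, count in df.items()}
        else:
            raise ValueError("mode must be 'tf', 'df' or 'idf'")

    def transform(self, tokens):
        # Tokens unseen during fit are dropped (an assumption of this sketch).
        return [t for t in tokens
                if t in self.stats and self.low <= self.stats[t] < self.high]


idf_filter = FrequencyFilterSketch(low=1.2, mode='idf')
idf_filter.fit([['A', 'B'], ['B', 'C', 'D']])
print(idf_filter.transform(['A', 'B', 'C']))  # ['A', 'C']
```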
-
class matchzoo.processor_units.processor_units.LemmatizationUnit
Bases: matchzoo.processor_units.processor_units.ProcessorUnit
Processor unit for token lemmatization.
-
transform(tokens)
Lemmatize a sequence of tokens.
Parameters: tokens (list) -- list of tokens to be lemmatized.
Returns: list of lemmatized tokens.
Return type: list
-
-
class matchzoo.processor_units.processor_units.LowercaseUnit
Bases: matchzoo.processor_units.processor_units.ProcessorUnit
Processor unit for lower-casing text.
-
transform(tokens)
Convert a list of tokens to lower case.
Parameters: tokens (list) -- list of tokens.
Returns: lower-cased list of tokens.
Return type: list
-
-
class matchzoo.processor_units.processor_units.MatchingHistogramUnit(bin_size=30, embedding_matrix=None, normalize=True, mode='LCH')
Bases: matchzoo.processor_units.processor_units.ProcessorUnit
MatchingHistogramUnit class.
Parameters:
- bin_size (int) -- The number of bins in the matching histogram.
- embedding_matrix -- The word embedding matrix used to calculate the matching histogram.
- normalize -- Boolean; whether to normalize the embedding.
- mode (str) -- The type of the histogram; it should be one of 'CH', 'NG', or 'LCH'.
Examples
>>> import numpy as np
>>> embedding_matrix = np.array([[1.0, -1.0], [1.0, 2.0], [1.0, 3.0]])
>>> text_left = [0, 1]
>>> text_right = [1, 2]
>>> histogram = MatchingHistogramUnit(3, embedding_matrix, True, 'CH')
>>> histogram.transform([text_left, text_right])
[[3.0, 1.0, 1.0], [1.0, 2.0, 2.0]]
-
transform(text_pair)
Transform the input text.
Return type: list
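A matching histogram of this kind can be sketched in pure Python. Everything below is an assumption made for illustration. The binning rule int((sim + 1) / 2 * (bin_size - 1)) and the bins starting at 1.0 are chosen because they are consistent with the example output above, only the 'CH' and 'LCH' modes are covered, and matching_histogram is a hypothetical helper rather than the library's implementation.

```python
import math


def _unit(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def matching_histogram(text_left, text_right, embeddings,
                       bin_size=30, normalize=True, mode='LCH'):
    """Bin the similarity of every left term against every right term."""
    if normalize:
        embeddings = [_unit(v) for v in embeddings]
    histogram = []
    for left in text_left:
        bins = [1.0] * bin_size  # assumed: every bin starts at 1.0
        for right in text_right:
            sim = sum(a * b
                      for a, b in zip(embeddings[left], embeddings[right]))
            sim = max(-1.0, min(1.0, sim))  # guard against rounding drift
            # Assumed binning rule: map [-1, 1] onto bin indices.
            bins[int((sim + 1.0) / 2.0 * (bin_size - 1))] += 1.0
        if mode == 'LCH':
            # Log-count histogram: take the log of each count.
            bins = [math.log(b) for b in bins]
        histogram.append(bins)
    return histogram


# Axis-aligned unit vectors keep the similarities exact: 1, 0 and -1.
vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(matching_histogram([0], [0, 1, 2], vectors, bin_size=3, mode='CH'))
# [[2.0, 2.0, 2.0]]
```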
-
class matchzoo.processor_units.processor_units.NgramLetterUnit(ngram=3, reduce_dim=True)
Bases: matchzoo.processor_units.processor_units.ProcessorUnit
Processor unit for n-letter generation.
Triletter is used in DSSMModel. This processor is expected to execute before Vocab has been created.
Examples
>>> triletter = NgramLetterUnit()
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
9
>>> rv
['#he', 'hel', 'ell', 'llo', 'lo#', '#wo', 'wor', 'ord', 'rd#']
>>> triletter = NgramLetterUnit(reduce_dim=False)
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
2
>>> rv
[['#he', 'hel', 'ell', 'llo', 'lo#'], ['#wo', 'wor', 'ord', 'rd#']]
-
transform(tokens)
Transform tokens into tri-letters.
For example, word should be represented as #wo, wor, ord and rd#.
Parameters: tokens (list) -- list of tokens to be transformed.
Returns: the generated n-letters.
Return type: list
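The n-letter generation can be sketched as a sliding window over each token wrapped in '#' markers. This is a minimal sketch that reproduces the outputs shown above. The function name ngram_letters is hypothetical and the code is not taken from the library.

```python
def ngram_letters(tokens, ngram=3, reduce_dim=True):
    """Split each token into overlapping letter n-grams."""
    per_token = []
    for token in tokens:
        # Mark the word boundaries, e.g. 'word' becomes '#word#'.
        padded = '#' + token + '#'
        # Slide a window of width ngram across the padded token.
        grams = [padded[i:i + ngram]
                 for i in range(len(padded) - ngram + 1)]
        per_token.append(grams)
    if reduce_dim:
        # Flatten into one list covering all tokens.
        return [gram for grams in per_token for gram in grams]
    return per_token


print(ngram_letters(['word']))  # ['#wo', 'wor', 'ord', 'rd#']
```

With reduce_dim=False the sketch keeps one list per token, which is the nested form that a word-hashing step needs.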
-
-
class matchzoo.processor_units.processor_units.ProcessorUnit
Bases: object
Processor unit that does not persist state (i.e. does not need fit).
-
transform(input)
Abstract base method; must be implemented in a subclass.
-
-
class matchzoo.processor_units.processor_units.PuncRemovalUnit
Bases: matchzoo.processor_units.processor_units.ProcessorUnit
Processor unit for removing punctuation.
-
transform(tokens)
Remove punctuation from a list of tokens.
Parameters: tokens (list) -- list of tokens.
Returns: tokens without punctuation.
Return type: list
-
-
class matchzoo.processor_units.processor_units.StatefulProcessorUnit
Bases: matchzoo.processor_units.processor_units.ProcessorUnit
Processor unit that persists state (i.e. needs fit).
-
fit(input)
Abstract base method; must be implemented in a subclass.
-
state
Get current state.
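The split between stateless and stateful units can be sketched with abstract base classes. This is an illustration of the pattern only. UnitSketch, StatefulUnitSketch, and KnownTokensUnit are hypothetical names, and storing the state in a dict is an assumption.

```python
import abc


class UnitSketch(abc.ABC):
    """Stateless unit: only transform is required."""

    @abc.abstractmethod
    def transform(self, input_):
        """Must be implemented in a subclass."""


class StatefulUnitSketch(UnitSketch):
    """Stateful unit: fit must run before transform is meaningful."""

    def __init__(self):
        self._state = {}

    @property
    def state(self):
        """Get the current state."""
        return self._state

    @abc.abstractmethod
    def fit(self, input_):
        """Must be implemented in a subclass."""


class KnownTokensUnit(StatefulUnitSketch):
    """Hypothetical unit that keeps only tokens seen during fit."""

    def fit(self, input_):
        self._state['known'] = set(input_)

    def transform(self, input_):
        return [t for t in input_ if t in self._state.get('known', set())]


unit = KnownTokensUnit()
unit.fit(['a', 'b'])
print(unit.transform(['a', 'z', 'b']))  # ['a', 'b']
```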
-
-
class matchzoo.processor_units.processor_units.StemmingUnit(stemmer='porter')
Bases: matchzoo.processor_units.processor_units.ProcessorUnit
Processor unit for token stemming.
-
transform(tokens)
Reduce inflected words to their word stem, base, or root form.
Parameters:
- tokens (list) -- list of strings to be stemmed.
- stemmer -- stemmer to use, porter or lancaster.
Raises: ValueError -- stemmer type should be porter or lancaster.
Returns: list of stemmed tokens.
Return type: list
-
-
class matchzoo.processor_units.processor_units.StopRemovalUnit(lang='english')
Bases: matchzoo.processor_units.processor_units.ProcessorUnit
Processor unit that removes stop words.
Example
>>> unit = StopRemovalUnit()
>>> unit.transform(['a', 'the', 'test'])
['test']
>>> type(unit.stopwords)
<class 'list'>
-
stopwords
Get stopwords based on language.
Parameters: lang -- language code.
Return type: list
Returns: list of stop words.
-
transform(tokens)
Remove stopwords from a list of tokenized tokens.
Parameters:
- tokens (list) -- list of tokenized tokens.
- lang -- language code for stopwords.
Returns: list of tokenized tokens without stopwords.
Return type: list
-
-
class matchzoo.processor_units.processor_units.TokenizeUnit
Bases: matchzoo.processor_units.processor_units.ProcessorUnit
Processor unit for text tokenization.
-
transform(input)
Process input data from raw text into a list of tokens.
Parameters: input (str) -- raw textual input.
Returns: tokenized tokens as a list.
Return type: list
-
-
class matchzoo.processor_units.processor_units.VocabularyUnit
Bases: matchzoo.processor_units.processor_units.StatefulProcessorUnit
Vocabulary class.
Examples
>>> vocab = VocabularyUnit()
>>> vocab.fit(['A', 'B', 'C', 'D', 'E'])
>>> term_index = vocab.state['term_index']
>>> term_index  # illustrative output; the index assignment may differ
{'E': 1, 'C': 2, 'D': 3, 'A': 4, 'B': 5}
>>> index_term = vocab.state['index_term']
>>> index_term  # inverse of the mapping above
{1: 'E', 2: 'C', 3: 'D', 4: 'A', 5: 'B'}
>>> term_index['out-of-vocabulary-term']
0
>>> index_term[0]
''
>>> index_term[42]
Traceback (most recent call last):
    ...
KeyError: 42
>>> a_index = term_index['A']
>>> c_index = term_index['C']
>>> vocab.transform(['C', 'A', 'C']) == [c_index, a_index, c_index]
True
>>> vocab.transform(['C', 'A', 'OOV']) == [c_index, a_index, 0]
True
>>> indices = vocab.transform('ABCDDZZZ')
>>> ''.join(vocab.state['index_term'][i] for i in indices)
'ABCDD'
-
class IndexTerm
Bases: dict
Map index to term.
-
class TermIndex
Bases: dict
Map term to index.
-
transform(tokens)
Transform a list of tokens to corresponding indices.
Return type: list
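The out-of-vocabulary behaviour in the examples (unknown terms map to 0, index 0 maps to an empty string, other unknown indices raise KeyError) can be sketched with dict subclasses that define __missing__. The class names below are hypothetical, and assigning indices in sorted order starting at 1 is an assumption made so the sketch is deterministic.

```python
class TermIndexSketch(dict):
    """Map term to index; unknown terms map to 0."""

    def __missing__(self, key):
        return 0


class IndexTermSketch(dict):
    """Map index to term; index 0 maps to the empty string."""

    def __missing__(self, key):
        if key == 0:
            return ''
        raise KeyError(key)


class VocabularySketch:
    """Hypothetical stand-in showing the fit/transform flow."""

    def __init__(self):
        self.state = {}

    def fit(self, tokens):
        term_index = TermIndexSketch()
        index_term = IndexTermSketch()
        # Assumed ordering: sorted unique terms, indices starting at 1.
        for index, term in enumerate(sorted(set(tokens)), start=1):
            term_index[term] = index
            index_term[index] = term
        self.state = {'term_index': term_index, 'index_term': index_term}

    def transform(self, tokens):
        return [self.state['term_index'][token] for token in tokens]


vocab = VocabularySketch()
vocab.fit(['A', 'B', 'C', 'D', 'E'])
indices = vocab.transform('ABCDDZZZ')
print(indices)  # [1, 2, 3, 4, 4, 0, 0, 0]
print(''.join(vocab.state['index_term'][i] for i in indices))  # ABCDD
```

Reserving index 0 for unknown terms means a round trip through both mappings silently drops out-of-vocabulary input, as the last line shows.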
-
class matchzoo.processor_units.processor_units.WordHashingUnit(term_index)
Bases: matchzoo.processor_units.processor_units.ProcessorUnit
Word-hashing layer for DSSM-based models.
The input of WordHashingUnit should be a list of word sub-letter lists extracted from one document. The output is the word-hashing representation of this document. NgramLetterUnit and VocabularyUnit are two essential prerequisites of WordHashingUnit.
Examples
>>> letters = [['#te', 'tes','est', 'st#'], ['oov']]
>>> word_hashing = WordHashingUnit(
...     term_index={'': 0,'st#': 1, '#te': 2, 'est': 3, 'tes': 4})
>>> hashing = word_hashing.transform(letters)
>>> hashing[0]
array([0., 1., 1., 1., 1., 0.])
>>> hashing[1]
array([1., 0., 0., 0., 0., 0.])
>>> hashing.shape
(2, 6)
-
transform(terms)
Transform a list of letters into the word hashing layer.
Parameters: terms (list) -- list of tri-letters generated by NgramLetterUnit.
Return type: ndarray
Returns: Word hashing representation of tri-letters.
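The word-hashing step can be sketched as one count vector per word. This minimal sketch uses plain lists instead of an ndarray. The vector width len(term_index) + 1 is chosen to match the shape (2, 6) in the example above, accumulating unknown letters in slot 0 is an assumption, and word_hashing is a hypothetical helper.

```python
from collections import Counter


def word_hashing(terms, term_index):
    """Return one letter-count vector per word in terms."""
    width = len(term_index) + 1  # matches the shape shown in the example
    rows = []
    for letters in terms:
        row = [0.0] * width
        for letter, count in Counter(letters).items():
            # Unknown letters fall into slot 0 (an assumption here).
            row[term_index.get(letter, 0)] += float(count)
        rows.append(row)
    return rows


term_index = {'': 0, 'st#': 1, '#te': 2, 'est': 3, 'tes': 4}
letters = [['#te', 'tes', 'est', 'st#'], ['oov']]
for row in word_hashing(letters, term_index):
    print(row)
# [0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
# [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```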
-
-
matchzoo.processor_units.processor_units.list_available()
List all available units.