matchzoo.preprocessors package
Subpackages
- matchzoo.preprocessors.units package
- Submodules
- matchzoo.preprocessors.units.digit_removal module
- matchzoo.preprocessors.units.fixed_length module
- matchzoo.preprocessors.units.frequency_filter module
- matchzoo.preprocessors.units.lemmatization module
- matchzoo.preprocessors.units.lowercase module
- matchzoo.preprocessors.units.matching_histogram module
- matchzoo.preprocessors.units.ngram_letter module
- matchzoo.preprocessors.units.punc_removal module
- matchzoo.preprocessors.units.stateful_unit module
- matchzoo.preprocessors.units.stemming module
- matchzoo.preprocessors.units.stop_removal module
- matchzoo.preprocessors.units.tokenize module
- matchzoo.preprocessors.units.unit module
- matchzoo.preprocessors.units.vocabulary module
- matchzoo.preprocessors.units.word_hashing module
- Module contents
Submodules
matchzoo.preprocessors.basic_preprocessor module
Basic Preprocessor.
class matchzoo.preprocessors.basic_preprocessor.BasicPreprocessor(fixed_length_left=30, fixed_length_right=30, filter_mode='df', filter_low_freq=2, filter_high_freq=inf, remove_stop_words=False)
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
Basic preprocessor helper.
Parameters:
- fixed_length_left (int) – Integer, maximum length of left in the data_pack.
- fixed_length_right (int) – Integer, maximum length of right in the data_pack.
- filter_mode (str) – String, mode used by FrequenceFilterUnit. Can be ‘df’, ‘cf’, or ‘idf’.
- filter_low_freq (float) – Float, lower bound value used by FrequenceFilterUnit.
- filter_high_freq (float) – Float, upper bound value used by FrequenceFilterUnit.
- remove_stop_words (bool) – Bool, whether to use StopRemovalUnit.
Example
>>> import matchzoo as mz
>>> train_data = mz.datasets.toy.load_data('train')
>>> test_data = mz.datasets.toy.load_data('test')
>>> preprocessor = mz.preprocessors.BasicPreprocessor(
...     fixed_length_left=10,
...     fixed_length_right=20,
...     filter_mode='df',
...     filter_low_freq=2,
...     filter_high_freq=1000,
...     remove_stop_words=True
... )
>>> preprocessor = preprocessor.fit(train_data, verbose=0)
>>> preprocessor.context['input_shapes']
[(10,), (20,)]
>>> preprocessor.context['vocab_size']
226
>>> processed_train_data = preprocessor.transform(train_data,
...                                               verbose=0)
>>> type(processed_train_data)
<class 'matchzoo.data_pack.data_pack.DataPack'>
>>> test_data_transformed = preprocessor.transform(test_data,
...                                                verbose=0)
>>> type(test_data_transformed)
<class 'matchzoo.data_pack.data_pack.DataPack'>
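To make the filter and fixed-length parameters more concrete, here is a minimal pure-Python sketch of document-frequency filtering followed by padding or truncation. It is not MatchZoo's implementation: the function names are hypothetical, and the bound handling (inclusive lower, exclusive upper) and the choice to pad at the end with 0 are assumptions made for illustration.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df


def filter_by_df(docs, low=2, high=float("inf")):
    """Keep terms whose document frequency n satisfies low <= n < high.

    Assumption: inclusive lower bound, exclusive upper bound.
    """
    df = document_frequency(docs)
    keep = {term for term, n in df.items() if low <= n < high}
    return [[t for t in tokens if t in keep] for tokens in docs]


def to_fixed_length(items, length, pad_value=0):
    """Truncate or pad `items` so that the result has exactly `length` entries.

    Assumption: extra items are cut from the end, padding is added at the end.
    """
    items = list(items[:length])
    return items + [pad_value] * (length - len(items))


docs = [["how", "are", "you"], ["how", "old", "are", "you"], ["fine"]]
filtered = filter_by_df(docs, low=2)
padded = [to_fixed_length(tokens, 4, pad_value="<pad>") for tokens in filtered]
```

In this toy corpus "old" and "fine" each appear in only one document, so a lower bound of 2 removes them before the sequences are brought to a common length.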
matchzoo.preprocessors.build_unit_from_data_pack module
Build unit from data pack.
matchzoo.preprocessors.build_unit_from_data_pack.build_unit_from_data_pack(unit, data_pack, mode='both', flatten=True, verbose=1)
Build a StatefulUnit from a DataPack object.
Parameters:
- unit (StatefulUnit) – StatefulUnit object to be built.
- data_pack (DataPack) – The input DataPack object.
- mode (str) – One of ‘left’, ‘right’, and ‘both’, to determine the source data for building the StatefulUnit.
- flatten (bool) – Whether to flatten the DataPack. True organizes the DataPack text as a flat list; False organizes it as a list of lists.
- verbose (int) – Verbosity.
Return type: StatefulUnit
Returns: A built StatefulUnit object.
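The effect of mode and flatten can be sketched without MatchZoo at all. The snippet below is an illustration only: collect_texts and TokenSet are hypothetical names, and plain lists of token lists stand in for the left and right text columns of a DataPack.

```python
def collect_texts(left, right, mode="both", flatten=True):
    """Gather texts from the chosen side(s) of a toy data pack.

    `left` and `right` are lists of token lists (hypothetical stand-ins
    for the left and right text columns).
    """
    if mode not in ("left", "right", "both"):
        raise ValueError(f"unknown mode: {mode!r}")
    corpus = []
    if mode in ("left", "both"):
        corpus.extend(left)
    if mode in ("right", "both"):
        corpus.extend(right)
    if flatten:
        # One flat list of tokens across all selected texts.
        return [token for tokens in corpus for token in tokens]
    # One list per text (a list of lists).
    return corpus


class TokenSet:
    """Hypothetical stateful unit: remembers every token it was fitted on."""

    def __init__(self):
        self.state = set()

    def fit(self, tokens):
        self.state = set(tokens)


left = [["a", "b"], ["c"]]
right = [["b", "d"]]
unit = TokenSet()
unit.fit(collect_texts(left, right, mode="both", flatten=True))
```

A unit that needs corpus-wide token statistics would typically be fitted on the flattened form, while a unit that needs per-text information would want the list of lists.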
matchzoo.preprocessors.build_vocab_unit module
matchzoo.preprocessors.build_vocab_unit.build_vocab_unit(data_pack, mode='both', verbose=1)
Build a preprocessor.units.Vocabulary given data_pack.
The data_pack should be preprocessed beforehand, and each item in the text_left and text_right columns of the data_pack should be a list of tokens.
Parameters:
- data_pack (DataPack) – The DataPack to build vocabulary upon.
- mode (str) – One of ‘left’, ‘right’, and ‘both’, to determine the source data for building the vocabulary unit.
- verbose (int) – Verbosity.
Return type: Vocabulary
Returns: A built vocabulary unit.
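As a rough illustration of why each item must already be a list of tokens, the following sketch builds a term-to-index mapping from token lists. The function names are hypothetical, and reserving index 0 for padding and index 1 for out-of-vocabulary terms is an assumption of this sketch, not a statement about the Vocabulary unit.

```python
def build_vocab(token_lists, pad="<PAD>", oov="<OOV>"):
    """Assign an integer index to every distinct token, in order of first use.

    Assumption: indices 0 and 1 are reserved for padding and unknown terms.
    """
    term_index = {pad: 0, oov: 1}
    for tokens in token_lists:
        for token in tokens:
            if token not in term_index:
                term_index[token] = len(term_index)
    return term_index


def encode(tokens, term_index, oov="<OOV>"):
    """Replace each token by its index, falling back to the OOV index."""
    return [term_index.get(token, term_index[oov]) for token in tokens]


vocab = build_vocab([["hello", "world"], ["hello", "there"]])
encoded = encode(["hello", "unknown"], vocab)
```

If the texts were still raw strings, iterating over them would yield characters instead of tokens, which is why tokenization has to happen first.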
matchzoo.preprocessors.cdssm_preprocessor module
CDSSM Preprocessor.
class matchzoo.preprocessors.cdssm_preprocessor.CDSSMPreprocessor(fixed_length_left=10, fixed_length_right=40, with_word_hashing=True)
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
CDSSM Model preprocessor.
fit(data_pack, verbose=1)
Fit pre-processing context for transformation.
Parameters:
- verbose (int) – Verbosity.
- data_pack (DataPack) – Data_pack to be preprocessed.
Returns: CDSSMPreprocessor instance.
transform(data_pack, verbose=1)
Apply transformation on data, create letter-ngram representation.
Parameters:
- data_pack (DataPack) – Inputs to be preprocessed.
- verbose (int) – Verbosity.
Return type: DataPack
Returns: Transformed data as DataPack object.
with_word_hashing
with_word_hashing getter.
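A letter-ngram representation can be sketched as follows. This is an illustration under stated assumptions rather than the preprocessor's actual code: the helper names are hypothetical, '#' is assumed as the word-boundary marker, and the sequence is padded with empty lists so that word order is kept up to a fixed number of words.

```python
def letter_ngrams(word, n=3):
    """Return the letter n-grams of `word` after adding '#' boundary marks."""
    marked = f"#{word}#"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def text_to_ngram_sequence(tokens, fixed_length, n=3):
    """Map each token to its letter n-grams, keeping word order.

    The result always has `fixed_length` entries: extra tokens are dropped
    and missing positions are filled with empty lists (an assumption).
    """
    sequence = [letter_ngrams(token, n) for token in tokens[:fixed_length]]
    return sequence + [[] for _ in range(fixed_length - len(sequence))]


grams = letter_ngrams("word")
sequence = text_to_ngram_sequence(["hi"], fixed_length=2)
```

Keeping one n-gram list per word position is what lets a convolutional model slide a window over neighbouring words.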
matchzoo.preprocessors.chain_transform module
Wrapper function organizes a number of transform functions.
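The idea of organizing several transform functions into one can be sketched with plain callables. The name chain below is hypothetical and the sketch does not reproduce the module's signature; it only shows left-to-right composition.

```python
from functools import reduce


def chain(functions):
    """Return one function that applies `functions` from left to right."""
    def chained(value):
        return reduce(lambda acc, fn: fn(acc), functions, value)
    return chained


# Lowercase, split on whitespace, then keep alphabetic tokens only.
pipeline = chain([
    str.lower,
    str.split,
    lambda tokens: [t for t in tokens if t.isalpha()],
])
result = pipeline("Hello World 42")
```

Wrapping the steps this way means the whole sequence can be passed around and applied as a single function.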
matchzoo.preprocessors.dssm_preprocessor module
DSSM Preprocessor.
class matchzoo.preprocessors.dssm_preprocessor.DSSMPreprocessor(with_word_hashing=True)
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
DSSM Model preprocessor.
fit(data_pack, verbose=1)
Fit pre-processing context for transformation.
Parameters:
- verbose (int) – Verbosity.
- data_pack (DataPack) – Data_pack to be preprocessed.
Returns: DSSMPreprocessor instance.
transform(data_pack, verbose=1)
Apply transformation on data, create tri-letter representation.
Parameters:
- data_pack (DataPack) – Inputs to be preprocessed.
- verbose (int) – Verbosity.
Return type: DataPack
Returns: Transformed data as DataPack object.
with_word_hashing
with_word_hashing getter.
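A tri-letter representation with word hashing can be sketched as a count vector over the tri-letters seen during fitting. The names below are hypothetical, '#' is assumed as the boundary marker, and ignoring tri-letters that were not seen when the index was built is an assumption of this sketch.

```python
def tri_letters(word):
    """Return the letter trigrams of `word` after adding '#' boundary marks."""
    marked = f"#{word}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def build_index(token_lists):
    """Assign an index to every distinct tri-letter, in order of first use."""
    index = {}
    for tokens in token_lists:
        for token in tokens:
            for gram in tri_letters(token):
                index.setdefault(gram, len(index))
    return index


def hash_text(tokens, index):
    """Turn a token list into a tri-letter count vector of size len(index)."""
    vector = [0] * len(index)
    for token in tokens:
        for gram in tri_letters(token):
            if gram in index:  # unseen tri-letters are skipped (assumption)
                vector[index[gram]] += 1
    return vector


index = build_index([["cat", "hat"]])
cat_vector = hash_text(["cat"], index)
hat_vector = hash_text(["hat", "hat"], index)
```

Because "cat" and "hat" share the tri-letter "at#", their vectors overlap in one position even though the words differ, which is the property word hashing relies on.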
matchzoo.preprocessors.naive_preprocessor module
Naive Preprocessor.
class matchzoo.preprocessors.naive_preprocessor.NaivePreprocessor
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
Naive preprocessor.
Example
>>> import matchzoo as mz
>>> train_data = mz.datasets.toy.load_data()
>>> test_data = mz.datasets.toy.load_data(stage='test')
>>> preprocessor = mz.preprocessors.NaivePreprocessor()
>>> train_data_processed = preprocessor.fit_transform(train_data,
...                                                   verbose=0)
>>> type(train_data_processed)
<class 'matchzoo.data_pack.data_pack.DataPack'>
>>> test_data_transformed = preprocessor.transform(test_data,
...                                                verbose=0)
>>> type(test_data_transformed)
<class 'matchzoo.data_pack.data_pack.DataPack'>