matchzoo.data_generator package
Submodules
matchzoo.data_generator.data_generator module
Base generator.
class matchzoo.data_generator.data_generator.DataGenerator(data_pack, batch_size=32, shuffle=True)
Bases: keras.utils.data_utils.Sequence
Abstract base class of all MatchZoo generators.
Every generator must implement the _get_batch_of_transformed_samples() method.
Examples
>>> import matchzoo as mz
>>> raw_data = mz.datasets.toy.load_data()
>>> data_generator = DataGenerator(raw_data, batch_size=3,
...                                shuffle=False)
>>> len(data_generator)
34
>>> data_generator.num_instance
100
>>> x, y = data_generator[-1]
>>> type(x)
<class 'dict'>
>>> x.keys()
dict_keys(['id_left', 'text_left', 'id_right', 'text_right'])
>>> type(x['id_left'])
<class 'numpy.ndarray'>
>>> type(x['id_right'])
<class 'numpy.ndarray'>
>>> type(x['text_left'])
<class 'numpy.ndarray'>
>>> type(x['text_right'])
<class 'numpy.ndarray'>
>>> type(y)
<class 'numpy.ndarray'>
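The figures in the example follow from simple arithmetic: 100 instances in batches of 3 give 34 batches, and the last batch (index -1) holds the single leftover instance. The sketch below reproduces that calculation in plain Python. The helper names `num_batches` and `batch_slice` are hypothetical and are not part of MatchZoo.

```python
import math


def num_batches(num_instance: int, batch_size: int) -> int:
    """Number of batches needed to cover every instance (the last may be short)."""
    return math.ceil(num_instance / batch_size)


def batch_slice(index: int, num_instance: int, batch_size: int) -> range:
    """Instance positions covered by batch `index`; negative indices count from the end."""
    total = num_batches(num_instance, batch_size)
    if index < 0:
        index += total
    if not 0 <= index < total:
        raise IndexError("batch index out of range")
    start = index * batch_size
    return range(start, min(start + batch_size, num_instance))


print(num_batches(100, 3))            # 34
print(list(batch_slice(-1, 100, 3)))  # [99]
```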
num_instance
Return the number of instances.
Return type: int
on_epoch_end()
Reorganize the index array when an epoch ends.
reset()
Reset the generator to the beginning.
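One plausible reading of these two methods is that the generator keeps an array of instance indices and rebuilds it, reshuffled when shuffle is enabled, whenever it is reset or an epoch ends. The sketch below illustrates that reading only. The class `IndexedBatches` is hypothetical, and the assumption that `on_epoch_end` simply calls `reset` is not confirmed by the text above.

```python
import random


class IndexedBatches:
    """Hypothetical stand-in that only tracks the order instances are visited in."""

    def __init__(self, num_instance: int, shuffle: bool = True, seed: int = 0):
        self._num_instance = num_instance
        self._shuffle = shuffle
        self._rng = random.Random(seed)
        self.reset()

    def reset(self) -> None:
        """Rebuild the index array from scratch, shuffling it if requested."""
        self.indices = list(range(self._num_instance))
        if self._shuffle:
            self._rng.shuffle(self.indices)

    def on_epoch_end(self) -> None:
        """Assumed behaviour: draw a fresh visiting order for the next epoch."""
        self.reset()


ordered = IndexedBatches(5, shuffle=False)
ordered.on_epoch_end()
print(ordered.indices)           # [0, 1, 2, 3, 4]

shuffled = IndexedBatches(5, shuffle=True)
shuffled.on_epoch_end()
print(sorted(shuffled.indices))  # [0, 1, 2, 3, 4]
```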
matchzoo.data_generator.dpool_data_generator module
Data generator with dynamic pooling.
class matchzoo.data_generator.dpool_data_generator.DPoolDataGenerator(data_pack, fixed_length_left, fixed_length_right, compress_ratio_left=1, compress_ratio_right=1, batch_size=32, shuffle=True)
Bases: matchzoo.data_generator.data_generator.DataGenerator
Generate data with dynamic pooling.
Examples
>>> import matchzoo as mz
>>> raw_data = mz.datasets.toy.load_data()
>>> preprocessor = mz.preprocessors.BasicPreprocessor(
...     fixed_length_left=10,
...     fixed_length_right=40,
...     remove_stop_words=True)
>>> processed_data = preprocessor.fit_transform(raw_data)
>>> data_generator = DPoolDataGenerator(processed_data, 3, 10,
...     batch_size=3, shuffle=False)
>>> len(data_generator)
34
>>> data_generator.num_instance
100
>>> x, y = data_generator[-1]
>>> type(x)
<class 'dict'>
>>> x.keys()
dict_keys(['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'dpool_index'])
>>> type(x['id_left'])
<class 'numpy.ndarray'>
>>> type(x['id_right'])
<class 'numpy.ndarray'>
>>> type(x['text_left'])
<class 'numpy.ndarray'>
>>> type(x['text_right'])
<class 'numpy.ndarray'>
>>> type(y)
<class 'numpy.ndarray'>
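The text above does not say how the `dpool_index` entry is built. A common way to implement dynamic pooling is to stretch each text's true length over the fixed length, so that every slot of the fixed-size grid points back to a real token position. The sketch below assumes that approach and leaves out the compression ratios. The function names are hypothetical and the actual MatchZoo computation may differ.

```python
def stretch_indices(true_length: int, fixed_length: int) -> list:
    """Map each of `fixed_length` slots back to a position in the unpadded text."""
    if true_length <= 0:
        # Empty text: every slot falls back to position 0.
        return [0] * fixed_length
    stride = fixed_length / true_length
    return [int(slot / stride) for slot in range(fixed_length)]


def dpool_index(length_left, length_right, fixed_left, fixed_right):
    """Grid of (left position, right position) pairs, fixed_left x fixed_right."""
    rows = stretch_indices(length_left, fixed_left)
    cols = stretch_indices(length_right, fixed_right)
    return [[(r, c) for c in cols] for r in rows]


print(stretch_indices(2, 6))    # [0, 0, 0, 1, 1, 1]
grid = dpool_index(2, 3, 3, 6)
print(len(grid), len(grid[0]))  # 3 6
```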
class matchzoo.data_generator.dpool_data_generator.DPoolPairDataGenerator(data_pack, fixed_length_left, fixed_length_right, compress_ratio_left=1, compress_ratio_right=1, num_dup=1, num_neg=1, batch_size=32, shuffle=True)
Bases: matchzoo.data_generator.pair_data_generator.PairDataGenerator
Generate pair-wise data with dynamic pooling.
Examples
>>> import numpy as np
>>> np.random.seed(111)
>>> import matchzoo as mz
>>> raw_data = mz.datasets.toy.load_data()
>>> preprocessor = mz.preprocessors.BasicPreprocessor(
...     fixed_length_left=10,
...     fixed_length_right=40,
...     remove_stop_words=True)
>>> processed_data = preprocessor.fit_transform(raw_data)
>>> data_generator = DPoolPairDataGenerator(processed_data, 3, 10,
...     1, 1, 2, 1, 3, False)
>>> data_generator.num_instance
10
>>> len(data_generator)
4
>>> x, y = data_generator[0]
>>> type(x)
<class 'dict'>
>>> x.keys()
dict_keys(['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'dpool_index'])
>>> type(x['id_left'])
<class 'numpy.ndarray'>
>>> type(x['id_right'])
<class 'numpy.ndarray'>
>>> type(x['text_left'])
<class 'numpy.ndarray'>
>>> type(x['text_right'])
<class 'numpy.ndarray'>
>>> len(x['id_left'])
6
>>> len(x['id_right'])
6
>>> type(y)
<class 'numpy.ndarray'>
matchzoo.data_generator.dynamic_data_generator module
Dynamic data generator with transform function inside.
class matchzoo.data_generator.dynamic_data_generator.DynamicDataGenerator(func, *args, **kwargs)
Bases: matchzoo.data_generator.data_generator.DataGenerator
Data generator with a preprocessing unit inside.
Examples
>>> import matchzoo as mz
>>> raw_data = mz.datasets.toy.load_data()
>>> data_generator = DynamicDataGenerator(len, data_pack=raw_data,
...     batch_size=1, shuffle=False)
>>> len(data_generator)
100
>>> x, y = data_generator[0]
>>> type(x)
<class 'dict'>
>>> x.keys()
dict_keys(['id_left', 'text_left', 'id_right', 'text_right'])
>>> type(x['id_left'])
<class 'numpy.ndarray'>
>>> type(x['id_right'])
<class 'numpy.ndarray'>
>>> type(x['text_left'])
<class 'numpy.ndarray'>
>>> type(x['text_right'])
<class 'numpy.ndarray'>
>>> type(y)
<class 'numpy.ndarray'>
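The idea behind a generator with the transform function inside is that the function runs when a batch is requested rather than once up front over the whole data set. The sketch below shows that pattern on a plain list of strings. The class `LazyBatches` is hypothetical and does not reproduce the MatchZoo API or the DataPack type.

```python
import math
from typing import Callable, List, Sequence


class LazyBatches:
    """Hypothetical generator that applies `func` to each batch on access."""

    def __init__(self, func: Callable[[List], List], data: Sequence,
                 batch_size: int = 32):
        self._func = func
        self._data = list(data)
        self._batch_size = batch_size

    def __len__(self) -> int:
        return math.ceil(len(self._data) / self._batch_size)

    def __getitem__(self, index: int) -> List:
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("batch index out of range")
        start = index * self._batch_size
        batch = self._data[start:start + self._batch_size]
        # The transform runs here, per batch, not at construction time.
        return self._func(batch)


lower = LazyBatches(lambda batch: [text.lower() for text in batch],
                    ["How ARE you", "Fine THANKS", "Bye"], batch_size=2)
print(len(lower))  # 2
print(lower[0])    # ['how are you', 'fine thanks']
print(lower[-1])   # ['bye']
```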
matchzoo.data_generator.histogram_data_generator module
Data generator with matching histogram.
class matchzoo.data_generator.histogram_data_generator.HistogramDataGenerator(data_pack, embedding_matrix, bin_size=30, hist_mode='CH', batch_size=32, shuffle=True)
Bases: matchzoo.data_generator.data_generator.DataGenerator
Generate data with matching histogram.
Parameters:
- data_pack (DataPack) -- The input data pack.
- embedding_matrix (ndarray) -- The embedding matrix used to generate the matching histogram.
- bin_size (int) -- The number of bins in the histogram.
- hist_mode (str) -- The mode of the MatchingHistogramUnit, one of CH, NH, and LCH.
- batch_size (int) -- The batch size.
- shuffle (bool) -- Whether to shuffle the data while generating a batch.
Examples
>>> import matchzoo as mz
>>> raw_data = mz.datasets.toy.load_data()
>>> preprocessor = mz.preprocessors.BasicPreprocessor()
>>> processed_data = preprocessor.fit_transform(raw_data)
>>> raw_embedding = mz.embedding.load_from_file(
...     mz.datasets.embeddings.EMBED_10_GLOVE
... )
>>> embedding_matrix = raw_embedding.build_matrix(
...     preprocessor.context['vocab_unit'].state['term_index']
... )
>>> data_generator = HistogramDataGenerator(processed_data,
...     embedding_matrix, 3, 'CH', batch_size=3, shuffle=False
... )
>>> x, y = data_generator[-1]
>>> type(x)
<class 'dict'>
>>> x.keys()
dict_keys(['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'match_histogram'])
>>> type(x['match_histogram'])
<class 'numpy.ndarray'>
>>> x['match_histogram'].shape
(1, 30, 3)
>>> type(y)
<class 'numpy.ndarray'>
class matchzoo.data_generator.histogram_data_generator.HistogramPairDataGenerator(data_pack, embedding_matrix, bin_size=30, hist_mode='CH', num_dup=1, num_neg=1, batch_size=32, shuffle=True)
Bases: matchzoo.data_generator.pair_data_generator.PairDataGenerator
Generate pair-wise data with matching histogram.
Parameters:
- data_pack (DataPack) -- The input data pack.
- embedding_matrix (ndarray) -- The embedding matrix used to generate the matching histogram.
- bin_size (int) -- The number of bins in the histogram.
- hist_mode (str) -- The mode of the MatchingHistogramUnit, one of CH, NH, and LCH.
- batch_size (int) -- The batch size.
- shuffle (bool) -- Whether to shuffle the data while generating a batch.
Examples
>>> import numpy as np
>>> np.random.seed(111)
>>> import matchzoo as mz
>>> raw_data = mz.datasets.toy.load_data()
>>> preprocessor = mz.preprocessors.BasicPreprocessor()
>>> processed_data = preprocessor.fit_transform(raw_data)
>>> raw_embedding = mz.embedding.load_from_file(
...     mz.datasets.embeddings.EMBED_10_GLOVE
... )
>>> embedding_matrix = raw_embedding.build_matrix(
...     preprocessor.context['vocab_unit'].state['term_index']
... )
>>> data_generator = HistogramPairDataGenerator(processed_data,
...     embedding_matrix, 3, 'CH', 1, 1, 3, False)
>>> len(data_generator)
2
>>> x, y = data_generator[0]
>>> type(x)
<class 'dict'>
>>> x.keys()
dict_keys(['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'match_histogram'])
>>> type(x['match_histogram'])
<class 'numpy.ndarray'>
>>> x['match_histogram'].shape
(6, 30, 3)
>>> type(y)
<class 'numpy.ndarray'>
matchzoo.data_generator.histogram_data_generator.match_histogram_generator(x, match_hist_unit)
Generate the matching histogram for the input.
Parameters:
- x (dict) -- The input dict.
- match_hist_unit (MatchingHistogramUnit) -- The histogram unit, a MatchingHistogramUnit.
Return type: ndarray
Returns: The matching histogram.
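The three modes CH, NH, and LCH are named above but not defined. A common reading, assumed here, is that CH is a raw count histogram of similarity scores for each left-text term, NH normalizes each row so it sums to one, and LCH takes the logarithm of the counts. The sketch also assumes the similarities lie in [-1, 1] (for example, cosine similarity of unit-length embeddings) and starts every bin at one so the logarithm is always defined. The function `matching_histogram` is hypothetical and is not the MatchingHistogramUnit itself.

```python
import math


def matching_histogram(similarities, bin_size=30, mode="CH"):
    """Bin each left term's similarity scores (all in [-1, 1]) into one row."""
    if mode not in ("CH", "NH", "LCH"):
        raise ValueError(f"unknown mode: {mode}")
    hist = []
    for row in similarities:        # one row per left-text term
        counts = [1.0] * bin_size   # start at 1 so log() is always defined
        for value in row:           # one value per right-text term
            bin_index = int((value + 1.0) / 2.0 * (bin_size - 1))
            counts[bin_index] += 1.0
        if mode == "NH":
            total = sum(counts)
            counts = [c / total for c in counts]
        elif mode == "LCH":
            counts = [math.log(c) for c in counts]
        hist.append(counts)
    return hist


sims = [[1.0, -1.0, 0.0], [0.5, 0.5, 0.5]]
print(matching_histogram(sims, bin_size=3, mode="CH"))
# [[2.0, 2.0, 2.0], [1.0, 4.0, 1.0]]
```

The result has one row per left-text term and `bin_size` columns, which is consistent with the trailing dimension of 3 in the examples that pass a bin size of 3.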
matchzoo.data_generator.histogram_data_generator.trunc_text(input_text, length)
Truncate the input text according to the input length.
Parameters:
- input_text (list) -- The input text to be truncated.
- length (list) -- The length used to truncate the text.
Return type: list
Returns: The truncated text.
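Because the length argument is documented as a list, one natural reading is that each text is cut to its own entry in that list. The sketch below works under that assumption. The function `truncate_each` is a hypothetical illustration and not the library source.

```python
def truncate_each(input_text: list, length: list) -> list:
    """Cut each text in `input_text` to the matching entry in `length`."""
    return [text[:limit] for text, limit in zip(input_text, length)]


print(truncate_each([[1, 2, 3, 4], [5, 6, 7]], [2, 5]))  # [[1, 2], [5, 6, 7]]
```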
matchzoo.data_generator.pair_data_generator module
Pair-wise data generator.
class matchzoo.data_generator.pair_data_generator.PairDataGenerator(data_pack, num_dup=1, num_neg=1, batch_size=32, shuffle=True)
Bases: matchzoo.data_generator.data_generator.DataGenerator
Generate pair-wise data.
Examples
>>> import numpy as np
>>> np.random.seed(111)
>>> import matchzoo as mz
>>> raw_data = mz.datasets.toy.load_data()
>>> data_generator = PairDataGenerator(raw_data, 2, 1, 3, False)
>>> data_generator.num_instance
10
>>> len(data_generator)
4
>>> x, y = data_generator[0]
>>> type(x)
<class 'dict'>
>>> x.keys()
dict_keys(['id_left', 'text_left', 'id_right', 'text_right'])
>>> type(x['id_left'])
<class 'numpy.ndarray'>
>>> type(x['id_right'])
<class 'numpy.ndarray'>
>>> type(x['text_left'])
<class 'numpy.ndarray'>
>>> type(x['text_right'])
<class 'numpy.ndarray'>
>>> len(x['id_left'])
6
>>> len(x['id_right'])
6
>>> type(y)
<class 'numpy.ndarray'>
num_instance
Get the total number of pairs.
Return type: int
classmethod reorganize_data_pack(data_pack, num_dup=1, num_neg=1)
Reorganize the data pack into pair-wise format.
Parameters:
- data_pack (DataPack) -- The input DataPack.
- num_dup (int) -- Number of duplicates for each positive sample.
- num_neg (int) -- Number of negative samples associated with each positive sample.
Returns: The reorganized DataPack object.
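A sketch of what reorganizing into pair-wise format can look like: each positive sample is repeated num_dup times, and each copy is followed by num_neg negatives drawn from rows that share the same left id. The sketch works on plain tuples rather than a DataPack. It assumes a row is a negative for another row when it has a lower label, and it skips positives that have too few negatives. The function `reorganize_pairs` is hypothetical and the sampling rules used by MatchZoo may differ.

```python
import random


def reorganize_pairs(rows, num_dup=1, num_neg=1, seed=0):
    """Return rows ordered as repeated [positive, num_neg negatives] groups.

    `rows` is a list of (id_left, id_right, label) tuples. A row counts as a
    negative for another row when it shares id_left and has a lower label.
    """
    rng = random.Random(seed)
    by_left = {}
    for row in rows:
        by_left.setdefault(row[0], []).append(row)

    pairs = []
    for group in by_left.values():
        for positive in group:
            negatives = [r for r in group if r[2] < positive[2]]
            if len(negatives) < num_neg:
                continue  # not enough lower-labelled rows to form a group
            for _ in range(num_dup):
                pairs.append(positive)
                pairs.extend(rng.sample(negatives, num_neg))
    return pairs


rows = [("q1", "d1", 1), ("q1", "d2", 0), ("q1", "d3", 0), ("q2", "d4", 0)]
pairs = reorganize_pairs(rows, num_dup=2, num_neg=1)
print(len(pairs))                 # 4
print([row[2] for row in pairs])  # [1, 0, 1, 0]
```

Under this layout every group holds 1 + num_neg rows, so a batch of 3 groups with num_neg=1 carries 6 rows, which is consistent with len(x['id_left']) being 6 in the PairDataGenerator example.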