matchzoo.data_generator package

Submodules

matchzoo.data_generator.data_generator module

Base generator.

class matchzoo.data_generator.data_generator.DataGenerator(data_pack, batch_size=32, shuffle=True)

Bases: keras.utils.data_utils.Sequence

Abstract base class of all matchzoo generators.

Every generator must implement the _get_batch_of_transformed_samples() method.

Examples

>>> import matchzoo as mz
>>> raw_data = mz.datasets.toy.load_data()
>>> data_generator = DataGenerator(raw_data, batch_size=3,
...                                shuffle=False)
>>> len(data_generator)
34
>>> data_generator.num_instance
100
>>> x, y = data_generator[-1]
>>> type(x)
<class 'dict'>
>>> x.keys()
dict_keys(['id_left', 'text_left', 'id_right', 'text_right'])
>>> type(x['id_left'])
<class 'numpy.ndarray'>
>>> type(x['id_right'])
<class 'numpy.ndarray'>
>>> type(x['text_left'])
<class 'numpy.ndarray'>
>>> type(x['text_right'])
<class 'numpy.ndarray'>
>>> type(y)
<class 'numpy.ndarray'>
num_instance

Return the number of instances.

Return type: int
on_epoch_end()

Reorganize the index array when an epoch ends.

reset()

Reset the generator to the beginning.
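
The batching behaviour shown in the examples above (a length of 34 for 100 instances at batch_size=3, and a final batch holding a single instance) can be sketched without Keras or matchzoo. The class below is a hypothetical stand-in, not the library's implementation: the name BatchIndexSketch, the seed argument, and the choice to have on_epoch_end() delegate to reset() are all assumptions made for illustration.

```python
import math
import random


class BatchIndexSketch:
    """Hypothetical stand-in that only tracks which instance indices form each batch."""

    def __init__(self, num_instance, batch_size=32, shuffle=True, seed=None):
        self._num_instance = num_instance
        self._batch_size = batch_size
        self._shuffle = shuffle
        self._rng = random.Random(seed)  # assumption: local RNG for reproducibility
        self.reset()

    def __len__(self):
        # Number of batches; the last batch may be smaller than batch_size.
        return math.ceil(self._num_instance / self._batch_size)

    def __getitem__(self, item):
        # Support negative indices such as [-1] for the last batch.
        if item < 0:
            item += len(self)
        if not 0 <= item < len(self):
            raise IndexError("batch index out of range")
        start = item * self._batch_size
        return self._order[start:start + self._batch_size]

    def reset(self):
        # Rebuild the index array from the beginning, shuffling if requested.
        self._order = list(range(self._num_instance))
        if self._shuffle:
            self._rng.shuffle(self._order)

    def on_epoch_end(self):
        # Assumption: reorganizing the index array amounts to a reset.
        self.reset()


batches = BatchIndexSketch(100, batch_size=3, shuffle=False)
print(len(batches))   # 34
print(batches[-1])    # [99]
```

With shuffle=False the order is stable across epochs; with shuffle=True each call to on_epoch_end() in this sketch produces a fresh permutation.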

matchzoo.data_generator.dpool_data_generator module

Data generator with dynamic pooling.

class matchzoo.data_generator.dpool_data_generator.DPoolDataGenerator(data_pack, fixed_length_left, fixed_length_right, compress_ratio_left=1, compress_ratio_right=1, batch_size=32, shuffle=True)

Bases: matchzoo.data_generator.data_generator.DataGenerator

Generate data with dynamic pooling.

Examples

>>> import matchzoo as mz
>>> raw_data = mz.datasets.toy.load_data()
>>> preprocessor = mz.preprocessors.BasicPreprocessor(
...     fixed_length_left=10,
...     fixed_length_right=40,
...     remove_stop_words=True)
>>> processed_data = preprocessor.fit_transform(raw_data)
>>> data_generator = DPoolDataGenerator(processed_data, 3, 10,
...     batch_size=3, shuffle=False)
>>> len(data_generator)
34
>>> data_generator.num_instance
100
>>> x, y = data_generator[-1]
>>> type(x)
<class 'dict'>
>>> x.keys()
dict_keys(['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'dpool_index'])
>>> type(x['id_left'])
<class 'numpy.ndarray'>
>>> type(x['id_right'])
<class 'numpy.ndarray'>
>>> type(x['text_left'])
<class 'numpy.ndarray'>
>>> type(x['text_right'])
<class 'numpy.ndarray'>
>>> type(y)
<class 'numpy.ndarray'>
class matchzoo.data_generator.dpool_data_generator.DPoolPairDataGenerator(data_pack, fixed_length_left, fixed_length_right, compress_ratio_left=1, compress_ratio_right=1, num_dup=1, num_neg=1, batch_size=32, shuffle=True)

Bases: matchzoo.data_generator.pair_data_generator.PairDataGenerator

Generate pair-wise data with dynamic pooling.

Examples

>>> import numpy as np
>>> np.random.seed(111)
>>> import matchzoo as mz
>>> raw_data = mz.datasets.toy.load_data()
>>> preprocessor = mz.preprocessors.BasicPreprocessor(
...     fixed_length_left=10,
...     fixed_length_right=40,
...     remove_stop_words=True)
>>> processed_data = preprocessor.fit_transform(raw_data)
>>> data_generator = DPoolPairDataGenerator(processed_data, 3, 10,
...     1, 1, 2, 1, 3, False)
>>> data_generator.num_instance
10
>>> len(data_generator)
4
>>> x, y = data_generator[0]
>>> type(x)
<class 'dict'>
>>> x.keys()
dict_keys(['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'dpool_index'])
>>> type(x['id_left'])
<class 'numpy.ndarray'>
>>> type(x['id_right'])
<class 'numpy.ndarray'>
>>> type(x['text_left'])
<class 'numpy.ndarray'>
>>> type(x['text_right'])
<class 'numpy.ndarray'>
>>> len(x['id_left'])
6
>>> len(x['id_right'])
6
>>> type(y)
<class 'numpy.ndarray'>
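
The dpool_index entry in the batches above is not described further in this reference. As a rough illustration of the idea behind dynamic pooling, the sketch below maps every cell of a fixed-size grid back to a position in the actual (shorter or longer) left and right texts, so that a pooling layer can read from valid positions only. The function name dpool_index_sketch and the stride formula are assumptions, the compress_ratio_left and compress_ratio_right parameters are deliberately left out, and the library's own index layout may differ.

```python
def dpool_index_sketch(length_left, length_right, fixed_left, fixed_right):
    """Return a fixed_left x fixed_right grid of (left_pos, right_pos) tuples.

    Hypothetical helper: each fixed-grid cell points at the source position
    it would pool from, given the true text lengths.
    """
    # Stride = how many fixed-grid cells cover one real token.
    # An empty text falls back to a stride that maps every cell to position 0.
    stride_left = fixed_left / length_left if length_left else fixed_left
    stride_right = fixed_right / length_right if length_right else fixed_right

    rows = [int(i / stride_left) for i in range(fixed_left)]
    cols = [int(j / stride_right) for j in range(fixed_right)]
    return [[(r, c) for c in cols] for r in rows]


# A left text of 2 tokens and a right text of 3 tokens on a 4 x 6 grid.
grid = dpool_index_sketch(2, 3, 4, 6)
print([cell[0] for cell in (row[0] for row in grid)])  # [0, 0, 1, 1]
print([cell[1] for cell in grid[0]])                   # [0, 0, 1, 1, 2, 2]
```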

matchzoo.data_generator.dynamic_data_generator module

Dynamic data generator with a built-in transform function.

class matchzoo.data_generator.dynamic_data_generator.DynamicDataGenerator(func, *args, **kwargs)

Bases: matchzoo.data_generator.data_generator.DataGenerator

Data generator with a built-in preprocessing unit.

Examples

>>> import matchzoo as mz
>>> raw_data = mz.datasets.toy.load_data()
>>> data_generator = DynamicDataGenerator(len, data_pack=raw_data,
...                                       batch_size=1, shuffle=False)
>>> len(data_generator)
100
>>> x, y = data_generator[0]
>>> type(x)
<class 'dict'>
>>> x.keys()
dict_keys(['id_left', 'text_left', 'id_right', 'text_right'])
>>> type(x['id_left'])
<class 'numpy.ndarray'>
>>> type(x['id_right'])
<class 'numpy.ndarray'>
>>> type(x['text_left'])
<class 'numpy.ndarray'>
>>> type(x['text_right'])
<class 'numpy.ndarray'>
>>> type(y)
<class 'numpy.ndarray'>
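
The idea of a generator that carries its own transform function can be illustrated in a few lines of plain Python. This is only a sketch under assumptions: DynamicBatchesSketch is a hypothetical class, it works on a flat list of texts rather than a DataPack, and it assumes the function is applied to each text at the moment a batch is requested.

```python
import math


class DynamicBatchesSketch:
    """Hypothetical generator that applies `func` lazily, one batch at a time."""

    def __init__(self, func, texts, batch_size=32):
        self._func = func
        self._texts = list(texts)
        self._batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self._texts) / self._batch_size)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError("batch index out of range")
        start = index * self._batch_size
        batch = self._texts[start:start + self._batch_size]
        # The transform runs here, not at construction time.
        return [self._func(text) for text in batch]


gen = DynamicBatchesSketch(len, ["a b", "hello", ""], batch_size=2)
print(len(gen))  # 2
print(gen[0])    # [3, 5]
```

Deferring the transform to batch time keeps the stored data unchanged, at the cost of repeating the work every epoch.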

matchzoo.data_generator.histogram_data_generator module

Data generator with matching histogram.

class matchzoo.data_generator.histogram_data_generator.HistogramDataGenerator(data_pack, embedding_matrix, bin_size=30, hist_mode='CH', batch_size=32, shuffle=True)

Bases: matchzoo.data_generator.data_generator.DataGenerator

Generate data with matching histogram.

Parameters:
  • data_pack (DataPack) -- The input data pack.
  • embedding_matrix (ndarray) -- The embedding matrix used to generate the matching histogram.
  • bin_size (int) -- The number of bins in the histogram.
  • hist_mode (str) -- The mode of the MatchingHistogramUnit, one of CH, NH, and LCH.
  • batch_size (int) -- The batch size.
  • shuffle (bool) -- Whether to shuffle the data while generating a batch.
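
To make bin_size and hist_mode concrete, the sketch below bins a list of similarity scores in [-1, 1] into equal-width bins. It reads CH, NH, and LCH as count-based, normalized, and log-count histograms respectively. The function match_histogram_sketch is hypothetical, and the exact bin boundaries, the clamping, and the use of log(1 + count) to avoid log(0) are assumptions that may not match MatchingHistogramUnit.

```python
import math


def match_histogram_sketch(similarities, bin_size=30, hist_mode="CH"):
    """Bin similarity scores in [-1, 1] into `bin_size` equal-width bins."""
    counts = [0.0] * bin_size
    for score in similarities:
        score = max(-1.0, min(1.0, score))  # clamp out-of-range scores
        # Map [-1, 1] onto [0, bin_size); a score of exactly 1 goes in the last bin.
        index = min(int((score + 1.0) / 2.0 * bin_size), bin_size - 1)
        counts[index] += 1.0

    if hist_mode == "CH":   # raw counts
        return counts
    if hist_mode == "NH":   # counts normalized to sum to 1
        total = sum(counts)
        return [c / total for c in counts] if total else counts
    if hist_mode == "LCH":  # log of counts (assumption: log(1 + count))
        return [math.log1p(c) for c in counts]
    raise ValueError("hist_mode must be one of 'CH', 'NH', 'LCH'")


scores = [-1.0, 0.0, 1.0, 1.0]
print(match_histogram_sketch(scores, bin_size=4, hist_mode="CH"))  # [1.0, 0.0, 1.0, 2.0]
print(match_histogram_sketch(scores, bin_size=4, hist_mode="NH"))  # [0.25, 0.0, 0.25, 0.5]
```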

Examples

>>> import matchzoo as mz
>>> raw_data = mz.datasets.toy.load_data()
>>> preprocessor = mz.preprocessors.BasicPreprocessor()
>>> processed_data = preprocessor.fit_transform(raw_data)
>>> raw_embedding = mz.embedding.load_from_file(
...     mz.datasets.embeddings.EMBED_10_GLOVE
... )
>>> embedding_matrix = raw_embedding.build_matrix(
...     preprocessor.context['vocab_unit'].state['term_index']
... )
>>> data_generator = HistogramDataGenerator(processed_data,
...     embedding_matrix, 3, 'CH', batch_size=3, shuffle=False
... )
>>> x, y = data_generator[-1]
>>> type(x)
<class 'dict'>
>>> x.keys()
dict_keys(['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'match_histogram'])
>>> type(x['match_histogram'])
<class 'numpy.ndarray'>
>>> x['match_histogram'].shape
(1, 30, 3)
>>> type(y)
<class 'numpy.ndarray'>
class matchzoo.data_generator.histogram_data_generator.HistogramPairDataGenerator(data_pack, embedding_matrix, bin_size=30, hist_mode='CH', num_dup=1, num_neg=1, batch_size=32, shuffle=True)

Bases: matchzoo.data_generator.pair_data_generator.PairDataGenerator

Generate pair-wise data with matching histogram.

Parameters:
  • data_pack (DataPack) -- The input data pack.
  • embedding_matrix (ndarray) -- The embedding matrix used to generate the matching histogram.
  • bin_size (int) -- The number of bins in the histogram.
  • hist_mode (str) -- The mode of the MatchingHistogramUnit, one of CH, NH, and LCH.
  • batch_size (int) -- The batch size.
  • shuffle (bool) -- Whether to shuffle the data while generating a batch.

Examples

>>> import numpy as np
>>> np.random.seed(111)
>>> import matchzoo as mz
>>> raw_data = mz.datasets.toy.load_data()
>>> preprocessor = mz.preprocessors.BasicPreprocessor()
>>> processed_data = preprocessor.fit_transform(raw_data)
>>> raw_embedding = mz.embedding.load_from_file(
...     mz.datasets.embeddings.EMBED_10_GLOVE
... )
>>> embedding_matrix = raw_embedding.build_matrix(
...     preprocessor.context['vocab_unit'].state['term_index']
... )
>>> data_generator = HistogramPairDataGenerator(processed_data,
...     embedding_matrix, 3, 'CH', 1, 1, 3, False)
>>> len(data_generator)
2
>>> x, y = data_generator[0]
>>> type(x)
<class 'dict'>
>>> x.keys()
dict_keys(['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'match_histogram'])
>>> type(x['match_histogram'])
<class 'numpy.ndarray'>
>>> x['match_histogram'].shape
(6, 30, 3)
>>> type(y)
<class 'numpy.ndarray'>
matchzoo.data_generator.histogram_data_generator.match_histogram_generator(x, match_hist_unit)

Generate the matching histogram for the input.

Parameters:
  • x (dict) -- The input dict.
  • match_hist_unit (MatchingHistogramUnit) -- The histogram unit.
Return type: ndarray
Returns: The matching histogram.

matchzoo.data_generator.histogram_data_generator.trunc_text(input_text, length)

Truncate the input text according to the input length.

Parameters:
  • input_text (list) -- The input text to be truncated.
  • length (list) -- The lengths used to truncate the text.
Return type: list
Returns: The truncated text.
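
Since both arguments are documented as lists, one plausible reading is row-wise truncation: each entry of the input text is cut to the matching entry of the length list. The sketch below follows that reading. The name trunc_text_sketch is hypothetical, and the assumption that the two lists line up one-to-one is not confirmed by the text above.

```python
def trunc_text_sketch(input_text, length):
    """Cut each row of `input_text` to the corresponding entry in `length`."""
    # Assumption: input_text[i] is a token list and length[i] is its cut-off.
    return [row[:limit] for row, limit in zip(input_text, length)]


tokens = [[11, 12, 13], [21, 22]]
print(trunc_text_sketch(tokens, [2, 1]))  # [[11, 12], [21]]
```

A cut-off longer than the row leaves that row unchanged, because Python slicing never reads past the end of a list.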

matchzoo.data_generator.pair_data_generator module

Pair-wise data generator.

class matchzoo.data_generator.pair_data_generator.PairDataGenerator(data_pack, num_dup=1, num_neg=1, batch_size=32, shuffle=True)

Bases: matchzoo.data_generator.data_generator.DataGenerator

Generate pair-wise data.

Examples

>>> import numpy as np
>>> np.random.seed(111)
>>> import matchzoo as mz
>>> raw_data = mz.datasets.toy.load_data()
>>> data_generator = PairDataGenerator(raw_data, 2, 1, 3, False)
>>> data_generator.num_instance
10
>>> len(data_generator)
4
>>> x, y = data_generator[0]
>>> type(x)
<class 'dict'>
>>> x.keys()
dict_keys(['id_left', 'text_left', 'id_right', 'text_right'])
>>> type(x['id_left'])
<class 'numpy.ndarray'>
>>> type(x['id_right'])
<class 'numpy.ndarray'>
>>> type(x['text_left'])
<class 'numpy.ndarray'>
>>> type(x['text_right'])
<class 'numpy.ndarray'>
>>> len(x['id_left'])
6
>>> len(x['id_right'])
6
>>> type(y)
<class 'numpy.ndarray'>
num_instance

Get the total number of pairs.

Return type: int
classmethod reorganize_data_pack(data_pack, num_dup=1, num_neg=1)

Re-organize the data pack as pair-wise format.

Parameters:
  • data_pack (DataPack) -- The input DataPack.
  • num_dup (int) -- Number of duplicates for each positive sample.
  • num_neg (int) -- Number of negative samples associated with each positive sample.
Returns: The reorganized DataPack object.
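
The roles of num_dup and num_neg can be illustrated on plain (id_left, id_right, label) tuples instead of a DataPack. The sketch below is hypothetical and makes several assumptions: a sample counts as "positive" whenever the same left id has at least one lower-labelled sample, negatives are drawn with replacement, and the seed argument exists only to make the example repeatable. The library's own grouping and sampling may differ.

```python
import random
from collections import defaultdict


def reorganize_pairs_sketch(relations, num_dup=1, num_neg=1, seed=0):
    """Emit each positive num_dup times, each followed by num_neg negatives."""
    rng = random.Random(seed)

    # Group the relation rows by their left id, preserving input order.
    groups = defaultdict(list)
    for row in relations:
        groups[row[0]].append(row)

    rows = []
    for group in groups.values():
        for positive in group:
            # Negatives are rows of the same group with a strictly lower label.
            negatives = [r for r in group if r[2] < positive[2]]
            if not negatives:
                continue
            for _ in range(num_dup):
                rows.append(positive)
                rows.extend(rng.choice(negatives) for _ in range(num_neg))
    return rows


relations = [
    ("q1", "d1", 1), ("q1", "d2", 0), ("q1", "d3", 0),
    ("q2", "d4", 1),  # no negative available, so q2 yields no pairs
]
pairs = reorganize_pairs_sketch(relations, num_dup=2, num_neg=1)
print(len(pairs))  # 4
```

In this layout every positive occupies 1 + num_neg consecutive rows, which is consistent with the examples above where a batch of 3 pairs with num_neg=1 yields 6 entries in x['id_left'].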

Module contents