matchzoo.data_pack package

Submodules

matchzoo.data_pack.build_unit_from_data_pack module

Build unit from data pack.

matchzoo.data_pack.build_unit_from_data_pack.build_unit_from_data_pack(unit, data_pack, mode='both', flatten=True, verbose=1)

Build a StatefulProcessorUnit from a DataPack object.

Parameters:

unit (`StatefulProcessorUnit`) -- The `StatefulProcessorUnit` object to be built.
data_pack (`DataPack`) -- The input `DataPack` object.
mode (`str`) -- One of 'left', 'right', and 'both'; determines the source data for building the unit.
flatten (`bool`) -- Whether to flatten the data pack text. True organizes the `DataPack` text as a list, and False organizes it as a list of lists.
verbose (`int`) -- Verbosity.

Return type:	`StatefulProcessorUnit`
Returns:	A built `StatefulProcessorUnit` object.
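
The effect of mode and flatten can be sketched in plain Python. The helper collect_texts below is hypothetical and is not part of MatchZoo; it only illustrates, assuming each text is already a list of tokens, how the source text might be gathered before a unit is built on it.

```python
from typing import Dict, List


def collect_texts(texts: Dict[str, List[List[str]]],
                  mode: str = 'both',
                  flatten: bool = True) -> list:
    """Gather tokenized texts from the chosen side(s).

    Hypothetical helper for illustration only, not MatchZoo's code.
    """
    if mode not in ('left', 'right', 'both'):
        raise ValueError(f"mode must be 'left', 'right' or 'both', got {mode!r}")
    sides = ['left', 'right'] if mode == 'both' else [mode]
    # One entry per text: a list of token lists.
    corpus = [tokens for side in sides for tokens in texts[side]]
    if flatten:
        # Collapse everything into a single flat list of tokens.
        return [token for tokens in corpus for token in tokens]
    return corpus


texts = {
    'left': [['how', 'are', 'you'], ['hello']],
    'right': [['fine', 'thanks']],
}
flat = collect_texts(texts, mode='both', flatten=True)
nested = collect_texts(texts, mode='left', flatten=False)
print(flat)    # ['how', 'are', 'you', 'hello', 'fine', 'thanks']
print(nested)  # [['how', 'are', 'you'], ['hello']]
```

With flatten=True every token ends up in one list, while flatten=False keeps one list of tokens per text.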

matchzoo.data_pack.build_vocab_unit module

Build a processor_units.VocabularyUnit given data_pack.

matchzoo.data_pack.build_vocab_unit.build_vocab_unit(data_pack, mode='both', verbose=1)

Build a processor_units.VocabularyUnit given data_pack.

The data_pack should be preprocessed beforehand, and each item in the text_left and text_right columns of the data_pack should be a list of tokens.

Parameters:

data_pack (`DataPack`) -- The `DataPack` to build the vocabulary upon.
mode (`str`) -- One of 'left', 'right', and 'both'; determines the source data for building the `VocabularyUnit`.
verbose (`int`) -- Verbosity.

Return type:	`VocabularyUnit`
Returns:	A built vocabulary unit.
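
Why each item must be a list of tokens can be illustrated without MatchZoo. The function build_term_index below is a hypothetical, simplified stand-in for a vocabulary unit: it assigns an integer index to each distinct token in first-seen order. The actual `VocabularyUnit` may reserve indices or order terms differently.

```python
from typing import Dict, Iterable, List


def build_term_index(texts: Iterable[List[str]]) -> Dict[str, int]:
    """Map each distinct token to an index in first-seen order.

    Hypothetical helper; the index assignment is an assumption.
    """
    term_index: Dict[str, int] = {}
    for tokens in texts:
        for token in tokens:
            if token not in term_index:
                term_index[token] = len(term_index)
    return term_index


# Tokenized stand-ins for the text_left and text_right columns.
text_left = [['deep', 'learning'], ['text', 'matching']]
text_right = [['deep', 'text', 'matching', 'models']]

# Similar in spirit to mode='both': both columns contribute terms.
term_index = build_term_index(text_left + text_right)
print(term_index)
# {'deep': 0, 'learning': 1, 'text': 2, 'matching': 3, 'models': 4}
```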

matchzoo.data_pack.data_pack module

MatchZoo DataPack, with pair-wise tuples (features) and context as input.

class matchzoo.data_pack.data_pack.DataPack(relation, left, right)

Bases: object

MatchZoo DataPack data structure, which stores data frames and context.

DataPack is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A DataPack consists of three parts: left, right and relation, each of which is a pandas.DataFrame.

Parameters:

relation (`DataFrame`) -- Stores the relation between left and right documents, using their ids.
left (`DataFrame`) -- Stores the content or features for id_left.
right (`DataFrame`) -- Stores the content or features for id_right.

Example

>>> import pandas as pd
>>> from matchzoo.data_pack.data_pack import DataPack
>>> left = [
...     ['qid1', 'query 1'],
...     ['qid2', 'query 2']
... ]
>>> right = [
...     ['did1', 'document 1'],
...     ['did2', 'document 2']
... ]
>>> relation = [['qid1', 'did1', 1], ['qid2', 'did2', 1]]
>>> relation_df = pd.DataFrame(relation)
>>> left = pd.DataFrame(left)
>>> right = pd.DataFrame(right)
>>> dp = DataPack(
...     relation=relation_df,
...     left=left,
...     right=right,
... )
>>> len(dp)
2

DATA_FILENAME = 'data.dill'

class FrameView(data_pack)

Bases: object

FrameView.

append_text_length()

Append length_left and length_right columns.

Parameters:	inplace -- True to modify in place, False to return a modified copy. (default: False)

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> 'length_left' in data_pack.frame[0].columns
False
>>> new_data_pack = data_pack.append_text_length()
>>> 'length_left' in new_data_pack.frame[0].columns
True
>>> 'length_left' in data_pack.frame[0].columns
False
>>> data_pack.append_text_length(inplace=True)
>>> 'length_left' in data_pack.frame[0].columns
True

apply_on_text(func, mode='both', rename=None, verbose=1)

Apply func to text columns based on mode.

Parameters:

func (Callable) -- The function to apply.
mode (str) -- One of "both", "left" and "right".
rename (Optional[str]) -- If set, use new names for results instead of replacing the original columns. To set rename in "both" mode, use a tuple of str, e.g. ("text_left_new_name", "text_right_new_name").
inplace -- True to modify in place, False to return a modified copy. (default: False)
verbose (int) -- Verbosity.

Returns:	A modified copy of the data pack if inplace is False.

Examples

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> frame = data_pack.frame

To apply len on the left text and add the result as 'length_left':

>>> data_pack.apply_on_text(len, mode='left',
...                         rename='length_left',
...                         inplace=True)
>>> list(frame[0].columns)
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'label']

To do the same to the right text:

>>> data_pack.apply_on_text(len, mode='right',
...                         rename='length_right',
...                         inplace=True)
>>> list(frame[0].columns)
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'label']

To do the same to both texts at the same time:

>>> data_pack.apply_on_text(len, mode='both',
...                         rename=('extra_left', 'extra_right'),
...                         inplace=True)
>>> list(frame[0].columns)
['id_left', 'text_left', 'length_left', 'extra_left', 'id_right', 'text_right', 'length_right', 'extra_right', 'label']

To suppress outputs:

>>> data_pack.apply_on_text(len, mode='both', verbose=0,
...                         inplace=True)

copy()

Return type:	`DataPack`
Returns:	A deep copy.

drop_label()

Remove the label column from the data pack.

Parameters:	inplace -- True to modify in place, False to return a modified copy. (default: False)

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> data_pack.has_label
True
>>> data_pack.drop_label(inplace=True)
>>> data_pack.has_label
False

frame

View the data pack as a pandas.DataFrame.

The returned data frame is created by merging the left data frame, the right data frame and the relation data frame. Use [] to access an item or a slice of items.

Return type:	`FrameView`
Returns:	A `matchzoo.DataPack.FrameView` instance.

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> type(data_pack.frame)
<class 'matchzoo.data_pack.data_pack.DataPack.FrameView'>
>>> frame_slice = data_pack.frame[0:5]
>>> type(frame_slice)
<class 'pandas.core.frame.DataFrame'>
>>> list(frame_slice.columns)
['id_left', 'text_left', 'id_right', 'text_right', 'label']
>>> full_frame = data_pack.frame()
>>> len(full_frame) == len(data_pack)
True
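
The merge that produces the frame can be pictured with plain dictionaries. The sketch below is not MatchZoo's implementation; it assumes the column names shown in the example above (id_left, text_left, id_right, text_right, label) and joins each relation row with the matching left and right entries. merge_frame is a hypothetical helper.

```python
# Left and right tables keyed by id (stand-ins for the two data frames).
left = {'qid1': 'query 1', 'qid2': 'query 2'}
right = {'did1': 'document 1', 'did2': 'document 2'}

# Each relation row links one left id to one right id, with a label.
relation = [
    ('qid1', 'did1', 1),
    ('qid2', 'did2', 1),
    ('qid1', 'did2', 0),
]


def merge_frame(relation, left, right):
    """Join relation rows with their left and right texts (illustrative)."""
    rows = []
    for id_left, id_right, label in relation:
        rows.append({
            'id_left': id_left,
            'text_left': left[id_left],
            'id_right': id_right,
            'text_right': right[id_right],
            'label': label,
        })
    return rows


frame = merge_frame(relation, left, right)
print(len(frame) == len(relation))  # True: one merged row per relation row
print(frame[2]['text_left'], '|', frame[2]['text_right'])  # query 1 | document 2
```

Keeping texts in the left and right tables and only ids in the relation means a text shared by several pairs is stored once.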

has_label

Returns:	True if the label column exists, False otherwise.

Return type:	`bool`

left

Get the left data frame of the DataPack.

Return type:	`DataFrame`

one_hot_encode_label(num_classes=2)

One-hot encode the label column of relation.

Parameters:

num_classes -- Number of classes.
inplace -- True to modify in place, False to return a modified copy. (default: False)

Returns:	A modified copy of the data pack if inplace is False.
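
As a rough illustration of what one-hot encoding does to a label column, the sketch below uses plain Python lists. one_hot is a hypothetical helper, and the requirement that labels are integers in the range [0, num_classes) is an assumption of this sketch rather than something stated above.

```python
from typing import List


def one_hot(label: int, num_classes: int = 2) -> List[int]:
    """Return a vector with a single 1 at position `label` (illustrative)."""
    if not 0 <= label < num_classes:
        raise ValueError(f'label {label} is outside [0, {num_classes})')
    vector = [0] * num_classes
    vector[label] = 1
    return vector


labels = [0, 1, 1, 0]
encoded = [one_hot(label, num_classes=2) for label in labels]
print(encoded)  # [[1, 0], [0, 1], [0, 1], [1, 0]]
```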

relation

Get the relation data frame of the DataPack.

Return type:	`DataFrame`

right

Get the right data frame of the DataPack.

Return type:	`DataFrame`

save(dirpath)

Save the DataPack object.

A saved DataPack is represented as a directory containing a DataPack object (transformed user input as features and context). The object is saved by pickle.

Parameters:	dirpath (`Union`[`str`, `Path`]) -- Directory path of the saved `DataPack`.

shuffle()

Shuffle the data pack by shuffling the relation column.

Parameters:	inplace -- True to modify in place, False to return a modified copy. (default: False)

Example

>>> import matchzoo as mz
>>> import numpy.random
>>> numpy.random.seed(0)
>>> data_pack = mz.datasets.toy.load_data()
>>> orig_ids = data_pack.relation['id_left']
>>> shuffled = data_pack.shuffle()
>>> (shuffled.relation['id_left'] != orig_ids).any()
True

unpack()

Unpack the data for training.

The return value can be fed directly to model.fit or model.fit_generator.

Return type:	`Tuple`[`Dict`[`str`, `numpy.ndarray`], `Optional`[`numpy.ndarray`]]
Returns:	A tuple of (X, y). y is None if self has no label.

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> X, y = data_pack.unpack()
>>> type(X)
<class 'dict'>
>>> sorted(X.keys())
['id_left', 'id_right', 'text_left', 'text_right']
>>> type(y)
<class 'numpy.ndarray'>
>>> X, y = data_pack.drop_label().unpack()
>>> type(y)
<class 'NoneType'>

matchzoo.data_pack.data_pack.load_data_pack(dirpath)

Load a DataPack. The reverse function of save().

Parameters:	dirpath (`Union`[`str`, `Path`]) -- Directory path of the saved `DataPack`.
Return type:	`DataPack`
Returns:	A `DataPack` instance.
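
The save and load round trip can be sketched with the standard library alone. This is an illustration under assumptions, not MatchZoo's code: save_object and load_object are hypothetical helpers, a plain dict stands in for a DataPack, and storing the object under the DATA_FILENAME name inside the directory is assumed from the attribute documented above.

```python
import pickle
import tempfile
from pathlib import Path

# File name taken from the DATA_FILENAME attribute; using it as the
# name of the file inside dirpath is an assumption of this sketch.
DATA_FILENAME = 'data.dill'


def save_object(obj, dirpath):
    """Pickle obj into a file inside dirpath, creating the directory."""
    dirpath = Path(dirpath)
    dirpath.mkdir(parents=True, exist_ok=True)
    with open(dirpath / DATA_FILENAME, 'wb') as f:
        pickle.dump(obj, f)


def load_object(dirpath):
    """Reverse of save_object: read the pickled object back."""
    with open(Path(dirpath) / DATA_FILENAME, 'rb') as f:
        return pickle.load(f)


# A plain dict stands in for a DataPack here.
stand_in = {
    'relation': [('qid1', 'did1', 1)],
    'left': {'qid1': 'query 1'},
    'right': {'did1': 'document 1'},
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'my_data_pack'
    save_object(stand_in, target)
    restored = load_object(target)

print(restored == stand_in)  # True
```

Note that dirpath names a directory, not a file, in both directions.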

matchzoo.data_pack.pack module

Convert a list of inputs into the format expected by DataPack.

matchzoo.data_pack.pack.pack(df)

Pack a DataPack using df.

The df must have text_left and text_right columns. Optionally, the df can have id_left and id_right columns to index text_left and text_right respectively. id_left and id_right will be automatically generated if not specified.

Parameters:	df (`DataFrame`) -- Input `pandas.DataFrame` to use.

Examples

>>> import matchzoo as mz
>>> import pandas as pd
>>> df = pd.DataFrame(data={'text_left': list('AABC'),
...                         'text_right': list('abbc'),
...                         'label': [0, 1, 1, 0]})
>>> mz.pack(df).frame()
  id_left text_left id_right text_right  label
0     L-0         A      R-0          a      0
1     L-0         A      R-1          b      1
2     L-1         B      R-1          b      1
3     L-2         C      R-2          c      0

Return type:	`DataPack`
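
The generated ids in the example output above suggest that identical texts share one id. The sketch below reproduces that pattern in plain Python; generate_ids is a hypothetical helper written only to match the example, and it is not confirmed to be how MatchZoo generates ids.

```python
from typing import Dict, List


def generate_ids(texts: List[str], prefix: str) -> List[str]:
    """Give each distinct text an id such as 'L-0', in first-seen order.

    Hypothetical helper, written only to reproduce the ids in the
    example output.
    """
    seen: Dict[str, str] = {}
    ids = []
    for text in texts:
        if text not in seen:
            seen[text] = f'{prefix}-{len(seen)}'
        ids.append(seen[text])
    return ids


id_left = generate_ids(list('AABC'), prefix='L')
id_right = generate_ids(list('abbc'), prefix='R')
print(id_left)   # ['L-0', 'L-0', 'L-1', 'L-2']
print(id_right)  # ['R-0', 'R-1', 'R-1', 'R-2']
```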
