matchzoo.data_pack package¶
Submodules¶
matchzoo.data_pack.data_pack module¶
MatchZoo DataPack, which takes pair-wise tuples (features) and context as input.
- class matchzoo.data_pack.data_pack.DataPack(relation, left, right)¶
Bases:
object
MatchZoo DataPack data structure, which stores data frames and context.
DataPack is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A DataPack consists of three parts: left, right and relation, each of which is a pandas.DataFrame.
- Parameters
relation (DataFrame) – Stores the relation between left and right documents using their ids.
left (DataFrame) – Stores the content or features for id_left.
right (DataFrame) – Stores the content or features for id_right.
Example
>>> import pandas as pd
>>> from matchzoo import DataPack
>>> left = [
...     ['qid1', 'query 1'],
...     ['qid2', 'query 2']
... ]
>>> right = [
...     ['did1', 'document 1'],
...     ['did2', 'document 2']
... ]
>>> relation = [['qid1', 'did1', 1], ['qid2', 'did2', 1]]
>>> relation_df = pd.DataFrame(relation)
>>> left = pd.DataFrame(left)
>>> right = pd.DataFrame(right)
>>> dp = DataPack(
...     relation=relation_df,
...     left=left,
...     right=right,
... )
>>> len(dp)
2
- DATA_FILENAME = 'data.dill'¶
- class FrameView(data_pack)¶
Bases:
object
FrameView.
- append_text_length(verbose=1)¶
Append length_left and length_right columns.
- Parameters
inplace – True to modify inplace, False to return a modified copy. (default: False)
verbose – Verbosity.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> 'length_left' in data_pack.frame[0].columns
False
>>> new_data_pack = data_pack.append_text_length(verbose=0)
>>> 'length_left' in new_data_pack.frame[0].columns
True
>>> 'length_left' in data_pack.frame[0].columns
False
>>> data_pack.append_text_length(inplace=True, verbose=0)
>>> 'length_left' in data_pack.frame[0].columns
True
- apply_on_text(func, mode='both', rename=None, verbose=1)¶
Apply func to text columns based on mode.
- Parameters
func (Callable) – The function to apply.
mode (str) – One of “both”, “left” and “right”.
rename (Optional[str]) – If set, use new names for results instead of replacing the original columns. To set rename in “both” mode, use a tuple of str, e.g. (“text_left_new_name”, “text_right_new_name”).
inplace – True to modify inplace, False to return a modified copy. (default: False)
verbose (int) – Verbosity.
- Examples::
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> frame = data_pack.frame
- To apply len on the left text and add the result as ‘length_left’:
>>> data_pack.apply_on_text(len, mode='left',
...                         rename='length_left',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'label']
- To do the same to the right text:
>>> data_pack.apply_on_text(len, mode='right',
...                         rename='length_right',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'label']
- To do the same to both texts at the same time:
>>> data_pack.apply_on_text(len, mode='both',
...                         rename=('extra_left', 'extra_right'),
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'extra_left', 'id_right', 'text_right', 'length_right', 'extra_right', 'label']
- To suppress outputs:
>>> data_pack.apply_on_text(len, mode='both', verbose=0,
...                         inplace=True)
- drop_invalid()¶
Remove rows from the data pack where the length is zero.
- Parameters
inplace – True to modify inplace, False to return a modified copy. (default: False)
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> data_pack.append_text_length(inplace=True, verbose=0)
>>> data_pack.drop_invalid(inplace=True)
- drop_label()¶
Remove label column from the data pack.
- Parameters
inplace – True to modify inplace, False to return a modified copy. (default: False)
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> data_pack.has_label
True
>>> data_pack.drop_label(inplace=True)
>>> data_pack.has_label
False
- property frame: matchzoo.data_pack.data_pack.DataPack.FrameView¶
View the data pack as a pandas.DataFrame.
The returned data frame is created by merging the left data frame, the right data frame and the relation data frame. Use [] to access an item or a slice of items.
- Return type
FrameView
- Returns
A matchzoo.DataPack.FrameView instance.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> type(data_pack.frame)
<class 'matchzoo.data_pack.data_pack.DataPack.FrameView'>
>>> frame_slice = data_pack.frame[0:5]
>>> type(frame_slice)
<class 'pandas.core.frame.DataFrame'>
>>> list(frame_slice.columns)
['id_left', 'text_left', 'id_right', 'text_right', 'label']
>>> full_frame = data_pack.frame()
>>> len(full_frame) == len(data_pack)
True
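The merge described above can be pictured with a small plain-Python sketch. This is only an illustration of the idea, not MatchZoo's implementation: the helper `merge_frame` and the dictionary layout are hypothetical stand-ins for the pandas data frames.

```python
# Plain-Python stand-in for the merge (hypothetical helper, not MatchZoo code):
# each relation row is joined with the text stored for its left and right ids.
left = {"qid1": "query 1", "qid2": "query 2"}          # id_left -> text_left
right = {"did1": "document 1", "did2": "document 2"}   # id_right -> text_right
relation = [("qid1", "did1", 1), ("qid2", "did2", 0)]  # (id_left, id_right, label)


def merge_frame(left, right, relation):
    """Return one flat row (a dict) per relation row."""
    rows = []
    for id_left, id_right, label in relation:
        rows.append({
            "id_left": id_left,
            "text_left": left[id_left],
            "id_right": id_right,
            "text_right": right[id_right],
            "label": label,
        })
    return rows


frame = merge_frame(left, right, relation)
print(len(frame) == len(relation))  # one merged row per relation row
print(list(frame[0].keys()))
```

Because the relation drives the join, the merged view always has exactly as many rows as the relation, which is consistent with the `len(full_frame) == len(data_pack)` check in the example.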
- property has_label: bool¶
True if label column exists, False otherwise.
- Return type
bool
- one_hot_encode_label(num_classes=2)¶
One-hot encode label column of relation.
- Parameters
num_classes – Number of classes.
inplace – True to modify inplace, False to return a modified copy. (default: False)
- Returns
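As a rough illustration of what one-hot encoding a label means, here is a minimal plain-Python sketch. The helper `one_hot` is hypothetical and works on lists; it is not the MatchZoo method and makes no claim about how the library stores the result.

```python
def one_hot(label, num_classes=2):
    """Return a list of length num_classes with a 1 at index `label`."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside [0, {num_classes})")
    vector = [0] * num_classes
    vector[label] = 1
    return vector


labels = [0, 1, 1, 0]
encoded = [one_hot(label, num_classes=2) for label in labels]
print(encoded)  # [[1, 0], [0, 1], [0, 1], [1, 0]]
```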
- property relation¶
relation getter.
- save(dirpath)¶
Save the DataPack object.
A saved DataPack is represented as a directory containing a DataPack object (transformed user input as features and context), which is saved by pickle.
- Parameters
dirpath (Union[str, Path]) – Directory path of the saved DataPack.
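The directory-based layout can be sketched with the standard library alone. This is an assumption-laden illustration: `save_object` and `load_object` are hypothetical helpers, the standard `pickle` module is used here, and the real method may serialize differently or treat an existing directory differently. Only the file name comes from `DATA_FILENAME` above.

```python
import pickle
import tempfile
from pathlib import Path

DATA_FILENAME = "data.dill"  # file name taken from DataPack.DATA_FILENAME


def save_object(obj, dirpath):
    """Create dirpath if needed and pickle obj into dirpath/DATA_FILENAME."""
    dirpath = Path(dirpath)
    dirpath.mkdir(parents=True, exist_ok=True)
    with open(dirpath / DATA_FILENAME, "wb") as f:
        pickle.dump(obj, f)


def load_object(dirpath):
    """Read back the object stored by save_object."""
    with open(Path(dirpath) / DATA_FILENAME, "rb") as f:
        return pickle.load(f)


# Round trip through a temporary directory; a dict stands in for the data pack.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "my_data_pack"
    save_object({"relation": [("qid1", "did1", 1)]}, target)
    restored = load_object(target)

print(restored)
```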
- shuffle()¶
Shuffle the data pack by shuffling the relation column.
- Parameters
inplace – True to modify inplace, False to return a modified copy. (default: False)
Example
>>> import matchzoo as mz
>>> import numpy.random
>>> numpy.random.seed(0)
>>> data_pack = mz.datasets.toy.load_data()
>>> orig_ids = data_pack.relation['id_left']
>>> shuffled = data_pack.shuffle()
>>> (shuffled.relation['id_left'] != orig_ids).any()
True
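The non-inplace behaviour (return a shuffled copy, leave the original untouched) can be sketched in plain Python. The helper `shuffle_relation` is hypothetical and uses the standard `random` module rather than whatever MatchZoo uses internally.

```python
import random


def shuffle_relation(relation, seed=None):
    """Return a shuffled copy of the relation rows; the input is left untouched."""
    rng = random.Random(seed)
    shuffled = list(relation)  # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    return shuffled


relation = [("qid1", "did1", 1), ("qid2", "did2", 0), ("qid3", "did3", 1)]
shuffled = shuffle_relation(relation, seed=0)
print(sorted(shuffled) == sorted(relation))  # same rows, possibly reordered
```

Only the relation rows move; the left and right contents are looked up by id, so they do not need to be reordered.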
- unpack()¶
Unpack the data for training.
The return value can be fed directly to model.fit or model.fit_generator.
- Return type
Tuple[Dict[str, array], Optional[array]]
- Returns
A tuple of (X, y). y is None if self has no label.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> X, y = data_pack.unpack()
>>> type(X)
<class 'dict'>
>>> sorted(X.keys())
['id_left', 'id_right', 'text_left', 'text_right']
>>> type(y)
<class 'numpy.ndarray'>
>>> X, y = data_pack.drop_label().unpack()
>>> type(y)
<class 'NoneType'>
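The shape of the (X, y) result can be illustrated with a plain-Python sketch. The helper `unpack_rows` is hypothetical, and it returns lists where the real method returns arrays; it only shows how columns become dictionary entries and how a missing label yields `None`.

```python
def unpack_rows(rows):
    """Split flat rows into (X, y); y is None when the rows carry no label."""
    if not rows:
        return {}, None
    feature_names = [name for name in rows[0] if name != "label"]
    # One entry per column, each holding that column's values in row order.
    X = {name: [row[name] for row in rows] for name in feature_names}
    y = [row["label"] for row in rows] if "label" in rows[0] else None
    return X, y


rows = [
    {"id_left": "qid1", "text_left": "query 1",
     "id_right": "did1", "text_right": "document 1", "label": 1},
    {"id_left": "qid2", "text_left": "query 2",
     "id_right": "did2", "text_right": "document 2", "label": 0},
]
X, y = unpack_rows(rows)
print(sorted(X.keys()))
print(y)

# Dropping the label column first gives y = None.
unlabeled = [{k: v for k, v in row.items() if k != "label"} for row in rows]
X_unlabeled, y_unlabeled = unpack_rows(unlabeled)
print(y_unlabeled)
```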
matchzoo.data_pack.pack module¶
Convert a list of inputs into the format expected by DataPack.
- matchzoo.data_pack.pack.pack(df)¶
Pack a DataPack using df.
The df must have text_left and text_right columns. Optionally, the df can have id_left, id_right to index text_left and text_right respectively. id_left, id_right will be automatically generated if not specified.
- Parameters
df (DataFrame) – Input pandas.DataFrame to use.
- Examples::
>>> import matchzoo as mz
>>> import pandas as pd
>>> df = pd.DataFrame(data={'text_left': list('AABC'),
...                         'text_right': list('abbc'),
...                         'label': [0, 1, 1, 0]})
>>> mz.pack(df).frame()
  id_left text_left id_right text_right  label
0     L-0         A      R-0          a      0
1     L-0         A      R-1          b      1
2     L-1         B      R-1          b      1
3     L-2         C      R-2          c      0
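The generated ids in the output above suggest that identical texts share one id, numbered in order of first appearance. The sketch below reproduces that pattern in plain Python; `assign_ids` is a hypothetical helper inferred from the example output, not the library's actual code.

```python
def assign_ids(texts, prefix):
    """Give each distinct text an id like 'L-0', numbered by first appearance."""
    seen = {}
    ids = []
    for text in texts:
        if text not in seen:
            seen[text] = f"{prefix}-{len(seen)}"
        ids.append(seen[text])
    return ids


print(assign_ids(list("AABC"), "L"))  # ['L-0', 'L-0', 'L-1', 'L-2']
print(assign_ids(list("abbc"), "R"))  # ['R-0', 'R-1', 'R-1', 'R-2']
```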
- Return type