matchzoo.data_pack package
Submodules
matchzoo.data_pack.data_pack module
Matchzoo DataPack, pair-wise tuple (feature) and context as input.
class matchzoo.data_pack.data_pack.DataPack(relation, left, right)
Bases: object
MatchZoo DataPack data structure, storing data frames and context.
DataPack is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A DataPack consists of three parts: left, right and relation, each of which is a pandas.DataFrame.
Parameters:
- relation (DataFrame) – Stores the relation between the left document and the right document, using ids.
- left (DataFrame) – Stores the content or features for id_left.
- right (DataFrame) – Stores the content or features for id_right.
Example
>>> import pandas as pd
>>> from matchzoo.data_pack.data_pack import DataPack
>>> left = [
...     ['qid1', 'query 1'],
...     ['qid2', 'query 2']
... ]
>>> right = [
...     ['did1', 'document 1'],
...     ['did2', 'document 2']
... ]
>>> relation = [['qid1', 'did1', 1], ['qid2', 'did2', 1]]
>>> relation_df = pd.DataFrame(relation)
>>> left = pd.DataFrame(left)
>>> right = pd.DataFrame(right)
>>> dp = DataPack(
...     relation=relation_df,
...     left=left,
...     right=right,
... )
>>> len(dp)
2
DATA_FILENAME = 'data.dill'
class FrameView(data_pack)
Bases: object
FrameView.
append_text_length(verbose=1)
Append length_left and length_right columns.
Parameters:
- inplace – True to modify inplace, False to return a modified copy. (default: False)
- verbose – Verbosity.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> 'length_left' in data_pack.frame[0].columns
False
>>> new_data_pack = data_pack.append_text_length(verbose=0)
>>> 'length_left' in new_data_pack.frame[0].columns
True
>>> 'length_left' in data_pack.frame[0].columns
False
>>> data_pack.append_text_length(inplace=True, verbose=0)
>>> 'length_left' in data_pack.frame[0].columns
True
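The inplace parameter appears in the parameter lists of several methods on this page but not in their signatures. One way such behaviour can be provided is a decorator that adds the keyword to every mutating method. The sketch below illustrates that pattern only; optional_inplace and TinyPack are hypothetical names, and this is not presented as MatchZoo's implementation.

```python
import copy
import functools


def optional_inplace(func):
    """Hypothetical decorator: add an `inplace` keyword to a mutating method."""

    @functools.wraps(func)
    def wrapper(self, *args, inplace=False, **kwargs):
        # Work on the object itself, or on a deep copy when inplace is False.
        target = self if inplace else copy.deepcopy(self)
        func(target, *args, **kwargs)
        # Return the modified copy, or None when the object was changed in place.
        return None if inplace else target

    return wrapper


class TinyPack:
    """Hypothetical stand-in for a data pack; rows are plain dicts."""

    def __init__(self, rows):
        self.rows = rows

    @optional_inplace
    def append_text_length(self):
        # Assumption: length here is simply len() of the raw string.
        for row in self.rows:
            row["length_left"] = len(row["text_left"])


pack = TinyPack([{"text_left": "query 1"}])
new_pack = pack.append_text_length()            # returns a modified copy
untouched = "length_left" not in pack.rows[0]   # original is unchanged
result = pack.append_text_length(inplace=True)  # modifies pack, returns None
```

With this pattern the default call leaves the original object untouched, which matches the False/True/False/True sequence in the example above.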
apply_on_text(func, mode='both', rename=None, verbose=1)
Apply func to text columns based on mode.
Parameters:
- func (Callable) – The function to apply.
- mode (str) – One of "both", "left" and "right".
- rename (Optional[str]) – If set, use new names for results instead of replacing the original columns. To set rename in "both" mode, use a tuple of str, e.g. ("text_left_new_name", "text_right_new_name").
- inplace – True to modify inplace, False to return a modified copy. (default: False)
- verbose (int) – Verbosity.
Examples:
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> frame = data_pack.frame

To apply len on the left text and add the result as 'length_left':
>>> data_pack.apply_on_text(len, mode='left',
...                         rename='length_left',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'label']

To do the same to the right text:
>>> data_pack.apply_on_text(len, mode='right',
...                         rename='length_right',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'label']

To do the same to both texts at the same time:
>>> data_pack.apply_on_text(len, mode='both',
...                         rename=('extra_left', 'extra_right'),
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'extra_left', 'id_right', 'text_right', 'length_right', 'extra_right', 'label']

To suppress outputs:
>>> data_pack.apply_on_text(len, mode='both', verbose=0,
...                         inplace=True)
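The interaction between mode and rename can be easier to follow on plain data. The following is a minimal sketch of the documented rules, assuming rows are dicts with text_left and text_right keys; the helper is hypothetical, leaves out inplace and verbose, and does not reproduce the column ordering shown in the examples above.

```python
def apply_on_text(rows, func, mode="both", rename=None):
    """Hypothetical helper mirroring the documented mode/rename rules."""
    if mode == "both":
        sources = ("text_left", "text_right")
        # In "both" mode, rename is expected to be a pair of names (or None).
        targets = tuple(rename) if rename is not None else sources
    elif mode in ("left", "right"):
        sources = ("text_" + mode,)
        targets = (rename,) if rename is not None else sources
    else:
        raise ValueError("mode must be 'both', 'left' or 'right'")

    result = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        for source, target in zip(sources, targets):
            # Without rename, target == source and the text is replaced.
            new_row[target] = func(row[source])
        result.append(new_row)
    return result


rows = [{"text_left": "ab", "text_right": "xyz"}]
replaced = apply_on_text(rows, str.upper, mode="left")
renamed = apply_on_text(rows, len, mode="both",
                        rename=("length_left", "length_right"))
```

Here replaced overwrites text_left, while renamed keeps both text columns and adds two new ones.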
drop_label()
Remove label column from the data pack.
Parameters: inplace – True to modify inplace, False to return a modified copy. (default: False)
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> data_pack.has_label
True
>>> data_pack.drop_label(inplace=True)
>>> data_pack.has_label
False
frame
View the data pack as a pandas.DataFrame.
The returned data frame is created by merging the left data frame, the right data frame and the relation data frame. Use [] to access an item or a slice of items.
Return type: FrameView
Returns: A matchzoo.DataPack.FrameView instance.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> type(data_pack.frame)
<class 'matchzoo.data_pack.data_pack.DataPack.FrameView'>
>>> frame_slice = data_pack.frame[0:5]
>>> type(frame_slice)
<class 'pandas.core.frame.DataFrame'>
>>> list(frame_slice.columns)
['id_left', 'text_left', 'id_right', 'text_right', 'label']
>>> full_frame = data_pack.frame()
>>> len(full_frame) == len(data_pack)
True
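The merge described for frame can be pictured as a lookup join: each relation row pulls in the text stored under its left id and its right id. The sketch below uses plain dicts and tuples in place of data frames, with the ids from the DataPack example at the top of this page. build_frame is a hypothetical name, and the real view returns a pandas.DataFrame.

```python
# Left and right tables, keyed by id (stand-ins for the left/right data frames).
left = {"qid1": "query 1", "qid2": "query 2"}
right = {"did1": "document 1", "did2": "document 2"}
# Relation rows: (id_left, id_right, label).
relation = [("qid1", "did1", 1), ("qid2", "did2", 1)]


def build_frame(relation, left, right):
    """Hypothetical helper: one merged row per relation row."""
    rows = []
    for id_left, id_right, label in relation:
        rows.append({
            "id_left": id_left,
            "text_left": left[id_left],      # looked up in the left table
            "id_right": id_right,
            "text_right": right[id_right],   # looked up in the right table
            "label": label,
        })
    return rows


frame = build_frame(relation, left, right)
columns = list(frame[0])
```

The merged view has one row per relation row, which is why len(full_frame) == len(data_pack) holds in the example above.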
has_label
True if the label column exists, False otherwise.
Return type: bool
one_hot_encode_label(num_classes=2)
One-hot encode label column of relation.
Parameters:
- num_classes – Number of classes.
- inplace – True to modify inplace, False to return a modified copy. (default: False)
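As an illustration of what one-hot encoding a label column means, here is a minimal sketch on a plain list of integer labels. The one_hot helper is hypothetical, and the sketch assumes labels are integers in the range 0 to num_classes - 1; it does not show how MatchZoo stores the encoded values.

```python
def one_hot(label, num_classes=2):
    """Hypothetical helper: return a 0/1 list with a single 1 at index `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in range(num_classes)")
    vector = [0] * num_classes
    vector[label] = 1
    return vector


labels = [0, 1, 1, 0]
encoded = [one_hot(label, num_classes=2) for label in labels]
```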
relation
Getter for relation.
save(dirpath)
Save the DataPack object.
A saved DataPack is represented as a directory containing a DataPack object (transformed user input as features and context), which is saved by pickle.
Parameters: dirpath (Union[str, Path]) – Directory path of the saved DataPack.
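The directory layout described above can be sketched with the standard library. This is an illustration under assumptions, not MatchZoo's code: save_object and load_object are hypothetical helpers, the file name is taken from the DATA_FILENAME attribute listed earlier, and creating the directory when it is missing is an assumed behaviour.

```python
import pickle
import tempfile
from pathlib import Path

DATA_FILENAME = "data.dill"  # file name as listed in the class attribute


def save_object(obj, dirpath):
    """Hypothetical helper: pickle obj into dirpath / DATA_FILENAME."""
    dirpath = Path(dirpath)
    # Assumption: create the target directory if it does not exist yet.
    dirpath.mkdir(parents=True, exist_ok=True)
    with open(dirpath / DATA_FILENAME, "wb") as file:
        pickle.dump(obj, file)


def load_object(dirpath):
    """Hypothetical counterpart: read the object back from the directory."""
    with open(Path(dirpath) / DATA_FILENAME, "rb") as file:
        return pickle.load(file)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "my_pack"
    save_object({"relation": [("qid1", "did1", 1)]}, target)
    saved_files = sorted(path.name for path in target.iterdir())
    restored = load_object(target)
```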
shuffle()
Shuffle the data pack by shuffling the relation column.
Parameters: inplace – True to modify inplace, False to return a modified copy. (default: False)
Example
>>> import matchzoo as mz
>>> import numpy.random
>>> numpy.random.seed(0)
>>> data_pack = mz.datasets.toy.load_data()
>>> orig_ids = data_pack.relation['id_left']
>>> shuffled = data_pack.shuffle()
>>> (shuffled.relation['id_left'] != orig_ids).any()
True
unpack()
Unpack the data for training.
The return value can be fed directly to model.fit or model.fit_generator.
Return type: Tuple[Dict[str, np.array], Optional[np.array]]
Returns: A tuple of (X, y). y is None if self has no label.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> X, y = data_pack.unpack()
>>> type(X)
<class 'dict'>
>>> sorted(X.keys())
['id_left', 'id_right', 'text_left', 'text_right']
>>> type(y)
<class 'numpy.ndarray'>
>>> X, y = data_pack.drop_label().unpack()
>>> type(y)
<class 'NoneType'>
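The shape of the (X, y) tuple can be sketched on plain rows: every column except label becomes an entry of X, and label, when present, becomes y. The helper below is hypothetical, assumes a non-empty list of dict rows, and uses lists where unpack returns numpy arrays.

```python
def unpack_rows(rows):
    """Hypothetical helper: split non-empty dict rows into (X, y)."""
    first = rows[0]
    # Every column except the label is treated as a model input.
    feature_names = [name for name in first if name != "label"]
    X = {name: [row[name] for row in rows] for name in feature_names}
    # y is None when the rows carry no label column.
    y = [row["label"] for row in rows] if "label" in first else None
    return X, y


rows = [
    {"id_left": "qid1", "text_left": "query 1",
     "id_right": "did1", "text_right": "document 1", "label": 1},
    {"id_left": "qid2", "text_left": "query 2",
     "id_right": "did2", "text_right": "document 2", "label": 0},
]
X, y = unpack_rows(rows)

# Dropping the label column leaves X unchanged and makes y None.
unlabeled = [{k: v for k, v in row.items() if k != "label"} for row in rows]
X_unlabeled, y_unlabeled = unpack_rows(unlabeled)
```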
matchzoo.data_pack.pack module
Convert a list of inputs into the format expected by DataPack.
matchzoo.data_pack.pack.pack(df)
Pack a DataPack using df.
The df must have text_left and text_right columns. Optionally, the df can have id_left, id_right to index text_left and text_right respectively. id_left, id_right will be automatically generated if not specified.
Parameters: df (DataFrame) – Input pandas.DataFrame to use.
Examples:
>>> import matchzoo as mz
>>> import pandas as pd
>>> df = pd.DataFrame(data={'text_left': list('AABC'),
...                         'text_right': list('abbc'),
...                         'label': [0, 1, 1, 0]})
>>> mz.pack(df).frame()
  id_left text_left id_right text_right  label
0     L-0         A      R-0          a      0
1     L-0         A      R-1          b      1
2     L-1         B      R-1          b      1
3     L-2         C      R-2          c      0
Return type: DataPack
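In the output above, identical texts share one generated id ('A' maps to L-0 twice, 'b' maps to R-1 twice). A numbering scheme consistent with that output is sketched below: each distinct text gets the next index in order of first appearance. generate_ids is a hypothetical helper, and the sketch is not a claim about how pack is implemented.

```python
def generate_ids(texts, prefix):
    """Hypothetical helper: one id per distinct text, in order of first appearance."""
    seen = {}
    ids = []
    for text in texts:
        if text not in seen:
            # New text: assign the next free index.
            seen[text] = prefix + str(len(seen))
        ids.append(seen[text])
    return ids


left_ids = generate_ids(list("AABC"), "L-")
right_ids = generate_ids(list("abbc"), "R-")
```

For the inputs of the example, this yields the same id_left and id_right columns as the frame printed above.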