matchzoo.data_pack package
Submodules
matchzoo.data_pack.build_unit_from_data_pack module
Build unit from data pack.
matchzoo.data_pack.build_unit_from_data_pack.build_unit_from_data_pack(unit, data_pack, mode='both', flatten=True, verbose=1)
Build a StatefulProcessorUnit from a DataPack object.
Parameters:
- unit (StatefulProcessorUnit) -- The StatefulProcessorUnit object to be built.
- data_pack (DataPack) -- The input DataPack object.
- mode (str) -- One of 'left', 'right', and 'both', to determine the source data for building the VocabularyUnit.
- flatten (bool) -- Whether to flatten the data pack. True to organize the DataPack text as a list, False to organize it as a list of lists.
- verbose (int) -- Verbosity.
Returns: A built StatefulProcessorUnit object.
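The effect of flatten can be pictured with plain Python lists. The following is a minimal sketch, not MatchZoo's implementation: collect_corpus is a hypothetical helper, and the token lists stand in for the preprocessed text entries of a DataPack.

```python
from typing import List, Union


def collect_corpus(
    texts: List[List[str]], flatten: bool = True
) -> Union[List[str], List[List[str]]]:
    """Gather token lists into the corpus a unit would be fitted on.

    Hypothetical helper for illustration only.
    """
    corpus: list = []
    for tokens in texts:
        if flatten:
            # flatten=True: every token lands in one flat list.
            corpus.extend(tokens)
        else:
            # flatten=False: each text keeps its own inner list.
            corpus.append(list(tokens))
    return corpus


texts = [["deep", "learning"], ["text", "matching"]]
flat = collect_corpus(texts, flatten=True)
nested = collect_corpus(texts, flatten=False)
print(flat)    # ['deep', 'learning', 'text', 'matching']
print(nested)  # [['deep', 'learning'], ['text', 'matching']]
```

A flat corpus suits units that only need token statistics, while the nested form keeps the boundaries between texts.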
matchzoo.data_pack.build_vocab_unit module
Build a processor_units.VocabularyUnit given data_pack.
matchzoo.data_pack.build_vocab_unit.build_vocab_unit(data_pack, mode='both', verbose=1)
Build a processor_units.VocabularyUnit given data_pack.
The data_pack should be preprocessed beforehand, and each item in the text_left and text_right columns of the data_pack should be a list of tokens.
Parameters:
- data_pack (DataPack) -- The DataPack to build vocabulary upon.
- mode (str) -- One of 'left', 'right', and 'both', to determine the source data for building the VocabularyUnit.
- verbose (int) -- Verbosity.
Returns: A built vocabulary unit.
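To make the role of mode concrete, here is a small sketch in plain Python of building a term index from tokenized text. It is an illustration under stated assumptions, not the VocabularyUnit itself: build_term_index is a hypothetical function, indices are assigned in order of first appearance, and special entries such as padding or out-of-vocabulary terms are ignored.

```python
from typing import Dict, List


def build_term_index(
    text_left: List[List[str]],
    text_right: List[List[str]],
    mode: str = "both",
) -> Dict[str, int]:
    """Map each distinct token to an integer index (hypothetical helper)."""
    if mode not in ("left", "right", "both"):
        raise ValueError("mode must be 'left', 'right' or 'both'")

    # Pick the source texts according to mode.
    sources: List[List[str]] = []
    if mode in ("left", "both"):
        sources.extend(text_left)
    if mode in ("right", "both"):
        sources.extend(text_right)

    term_index: Dict[str, int] = {}
    for tokens in sources:
        for token in tokens:
            if token not in term_index:
                # Assumption: indices follow order of first appearance.
                term_index[token] = len(term_index)
    return term_index


text_left = [["how", "are", "you"], ["hello"]]
text_right = [["fine", "thank", "you"]]

vocab_both = build_term_index(text_left, text_right, mode="both")
vocab_right = build_term_index(text_left, text_right, mode="right")
print(len(vocab_both))  # 6
print(vocab_right)      # {'fine': 0, 'thank': 1, 'you': 2}
```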
matchzoo.data_pack.data_pack module
MatchZoo DataPack, pair-wise tuple (feature) and context as input.
class matchzoo.data_pack.data_pack.DataPack(relation, left, right)
Bases: object
MatchZoo DataPack data structure, which stores dataframes and context.
DataPack is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A DataPack consists of three parts: left, right and relation, each of which is a pandas.DataFrame.
Parameters:
- relation (DataFrame) -- Stores the relation between left and right documents using their ids.
- left (DataFrame) -- Stores the content or features for id_left.
- right (DataFrame) -- Stores the content or features for id_right.
Example
>>> import pandas as pd
>>> from matchzoo.data_pack.data_pack import DataPack
>>> left = [
...     ['qid1', 'query 1'],
...     ['qid2', 'query 2']
... ]
>>> right = [
...     ['did1', 'document 1'],
...     ['did2', 'document 2']
... ]
>>> relation = [['qid1', 'did1', 1], ['qid2', 'did2', 1]]
>>> relation_df = pd.DataFrame(relation)
>>> left = pd.DataFrame(left)
>>> right = pd.DataFrame(right)
>>> dp = DataPack(
...     relation=relation_df,
...     left=left,
...     right=right,
... )
>>> len(dp)
2
DATA_FILENAME = 'data.dill'
class FrameView(data_pack)
Bases: object
FrameView.
append_text_length()
Append length_left and length_right columns.
Parameters:
- inplace -- True to modify inplace, False to return a modified copy. (default: False)
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> 'length_left' in data_pack.frame[0].columns
False
>>> new_data_pack = data_pack.append_text_length()
>>> 'length_left' in new_data_pack.frame[0].columns
True
>>> 'length_left' in data_pack.frame[0].columns
False
>>> data_pack.append_text_length(inplace=True)
>>> 'length_left' in data_pack.frame[0].columns
True
apply_on_text(func, mode='both', rename=None, verbose=1)
Apply func to text columns based on mode.
Parameters:
- func (Callable) -- The function to apply.
- mode (str) -- One of "both", "left" and "right".
- rename (Optional[str]) -- If set, use new names for results instead of replacing the original columns. To set rename in "both" mode, use a tuple of str, e.g. ("text_left_new_name", "text_right_new_name").
- inplace -- True to modify inplace, False to return a modified copy. (default: False)
- verbose (int) -- Verbosity.
Examples:
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> frame = data_pack.frame
To apply len on the left text and add the result as 'length_left':
>>> data_pack.apply_on_text(len, mode='left',
...                         rename='length_left',
...                         inplace=True)
>>> list(frame[0].columns)
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'label']
To do the same to the right text:
>>> data_pack.apply_on_text(len, mode='right',
...                         rename='length_right',
...                         inplace=True)
>>> list(frame[0].columns)
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'label']
To do the same to both texts at the same time:
>>> data_pack.apply_on_text(len, mode='both',
...                         rename=('extra_left', 'extra_right'),
...                         inplace=True)
>>> list(frame[0].columns)
['id_left', 'text_left', 'length_left', 'extra_left', 'id_right', 'text_right', 'length_right', 'extra_right', 'label']
To suppress outputs:
>>> data_pack.apply_on_text(len, mode='both', verbose=0,
...                         inplace=True)
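The interplay of rename and the copy-versus-inplace behaviour described for apply_on_text can be sketched without pandas. In the snippet below a table is just a dict of column lists, and apply_on_column is a hypothetical helper that always returns a modified copy; it mirrors the documented behaviour only in spirit.

```python
from typing import Callable, Dict, List, Optional


def apply_on_column(
    table: Dict[str, List],
    column: str,
    func: Callable,
    rename: Optional[str] = None,
) -> Dict[str, List]:
    """Return a copy of table with func applied to one column.

    Hypothetical helper: with rename set, the result is stored under the
    new name; otherwise it replaces the original column.
    """
    result = {name: list(values) for name, values in table.items()}
    target = rename if rename is not None else column
    result[target] = [func(item) for item in table[column]]
    return result


left = {
    "id_left": ["L-0", "L-1"],
    "text_left": [["a", "b", "c"], ["d"]],
}

with_length = apply_on_column(left, "text_left", len, rename="length_left")
replaced = apply_on_column(left, "text_left", len)

print(list(with_length))        # ['id_left', 'text_left', 'length_left']
print(with_length["length_left"])  # [3, 1]
print(replaced["text_left"])    # [3, 1]
print(list(left))               # ['id_left', 'text_left']
```

Returning a copy by default keeps the original data available for further preprocessing steps.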
drop_label()
Remove the label column from the data pack.
Parameters:
- inplace -- True to modify inplace, False to return a modified copy. (default: False)
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> data_pack.has_label
True
>>> data_pack.drop_label(inplace=True)
>>> data_pack.has_label
False
frame
View the data pack as a pandas.DataFrame.
The returned data frame is created by merging the left data frame, the right data frame and the relation data frame. Use [] to access an item or a slice of items.
Return type: FrameView
Returns: A matchzoo.DataPack.FrameView instance.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> type(data_pack.frame)
<class 'matchzoo.data_pack.data_pack.DataPack.FrameView'>
>>> frame_slice = data_pack.frame[0:5]
>>> type(frame_slice)
<class 'pandas.core.frame.DataFrame'>
>>> list(frame_slice.columns)
['id_left', 'text_left', 'id_right', 'text_right', 'label']
>>> full_frame = data_pack.frame()
>>> len(full_frame) == len(data_pack)
True
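The merge behind frame can be illustrated with ordinary dictionaries. This is a rough sketch under simplifying assumptions, not the pandas-based implementation: merge_frame is a hypothetical function, left and right are lookups from id to text, and each relation row carries id_left, id_right and label.

```python
from typing import Dict, List


def merge_frame(
    relation: List[Dict], left: Dict[str, str], right: Dict[str, str]
) -> List[Dict]:
    """Join each relation row with its left and right text (hypothetical)."""
    rows = []
    for rel in relation:
        rows.append({
            "id_left": rel["id_left"],
            "text_left": left[rel["id_left"]],      # look up left content
            "id_right": rel["id_right"],
            "text_right": right[rel["id_right"]],   # look up right content
            "label": rel["label"],
        })
    return rows


left = {"qid1": "query 1", "qid2": "query 2"}
right = {"did1": "document 1", "did2": "document 2"}
relation = [
    {"id_left": "qid1", "id_right": "did1", "label": 1},
    {"id_left": "qid2", "id_right": "did2", "label": 1},
]

merged = merge_frame(relation, left, right)
print(len(merged))         # 2
print(list(merged[0]))     # ['id_left', 'text_left', 'id_right', 'text_right', 'label']
print(merged[1]["text_right"])  # document 2
```

The merged view has one row per relation entry, which is why its length equals the length of the data pack in the example above.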
has_label
Returns: True if the label column exists, False otherwise.
Return type: bool
one_hot_encode_label(num_classes=2)
One-hot encode the label column of relation.
Parameters:
- num_classes -- Number of classes.
- inplace -- True to modify inplace, False to return a modified copy. (default: False)
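As a reminder of what one-hot encoding does to a label column, the sketch below encodes integer class labels as indicator vectors in plain Python. The one_hot function is a hypothetical stand-in; it assumes labels are integers in the range 0 to num_classes - 1.

```python
from typing import List


def one_hot(label: int, num_classes: int = 2) -> List[int]:
    """Encode an integer class label as an indicator vector (hypothetical)."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    vector = [0] * num_classes
    vector[label] = 1  # mark the position of the class
    return vector


labels = [0, 1, 1, 0]
encoded = [one_hot(label, num_classes=2) for label in labels]
print(encoded)  # [[1, 0], [0, 1], [0, 1], [1, 0]]
```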
relation
Get relation() of DataPack.
Return type: DataFrame
save(dirpath)
Save the DataPack object.
A saved DataPack is represented as a directory containing a DataPack object (transformed user input as features and context), which is saved by pickle.
Parameters:
- dirpath (Union[str, Path]) -- Directory path of the saved DataPack.
shuffle()
Shuffle the data pack by shuffling the relation column.
Parameters:
- inplace -- True to modify inplace, False to return a modified copy. (default: False)
Example
>>> import matchzoo as mz
>>> import numpy.random
>>> numpy.random.seed(0)
>>> data_pack = mz.datasets.toy.load_data()
>>> orig_ids = data_pack.relation['id_left']
>>> shuffled = data_pack.shuffle()
>>> (shuffled.relation['id_left'] != orig_ids).any()
True
unpack()
Unpack the data for training.
The return value can be fed directly to model.fit or model.fit_generator.
Return type: Tuple[Dict[str, np.array], Optional[np.array]]
Returns: A tuple of (X, y). y is None if self has no label.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> X, y = data_pack.unpack()
>>> type(X)
<class 'dict'>
>>> sorted(X.keys())
['id_left', 'id_right', 'text_left', 'text_right']
>>> type(y)
<class 'numpy.ndarray'>
>>> X, y = data_pack.drop_label().unpack()
>>> type(y)
<class 'NoneType'>
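The (X, y) shape returned by unpack can be mimicked with plain Python containers. The sketch below is only an approximation: unpack_rows is a hypothetical function, it works on a non-empty list of row dicts, and it returns lists where the real method returns NumPy arrays.

```python
from typing import Dict, List, Optional, Tuple


def unpack_rows(rows: List[Dict]) -> Tuple[Dict[str, List], Optional[List]]:
    """Split row dicts into a feature dict X and a label list y.

    Hypothetical helper; assumes rows is non-empty and all rows share keys.
    """
    feature_names = [name for name in rows[0] if name != "label"]
    X = {name: [row[name] for row in rows] for name in feature_names}
    # y is None when the rows carry no label, as described for unpack.
    y = [row["label"] for row in rows] if "label" in rows[0] else None
    return X, y


labeled = [
    {"id_left": "L-0", "text_left": [1, 2], "id_right": "R-0",
     "text_right": [3], "label": 1},
    {"id_left": "L-1", "text_left": [4], "id_right": "R-1",
     "text_right": [5, 6], "label": 0},
]
unlabeled = [{k: v for k, v in row.items() if k != "label"} for row in labeled]

X, y = unpack_rows(labeled)
X_unlabeled, y_unlabeled = unpack_rows(unlabeled)
print(sorted(X))      # ['id_left', 'id_right', 'text_left', 'text_right']
print(y)              # [1, 0]
print(y_unlabeled)    # None
```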
matchzoo.data_pack.pack module
Convert a list of input into the format expected by DataPack.
matchzoo.data_pack.pack.pack(df)
Pack a DataPack using df.
The df must have text_left and text_right columns. Optionally, the df can have id_left and id_right columns to index text_left and text_right respectively. id_left and id_right will be generated automatically if not specified.
Parameters:
- df (DataFrame) -- Input pandas.DataFrame to use.
Examples:
>>> import matchzoo as mz
>>> import pandas as pd
>>> df = pd.DataFrame(data={'text_left': list('AABC'),
...                         'text_right': list('abbc'),
...                         'label': [0, 1, 1, 0]})
>>> mz.pack(df).frame()
  id_left text_left id_right text_right  label
0     L-0         A      R-0          a      0
1     L-0         A      R-1          b      1
2     L-1         B      R-1          b      1
3     L-2         C      R-2          c      0
Return type: DataPack
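The generated ids in the example output give identical texts the same id (both 'A' rows share L-0). One way to reproduce that pattern in plain Python is sketched below. The assign_ids function is hypothetical and inferred from the example alone; how pack derives ids internally is not stated here.

```python
from typing import Dict, List


def assign_ids(texts: List[str], prefix: str) -> List[str]:
    """Give each distinct text an id of the form prefix + running number.

    Hypothetical helper: repeated texts reuse the id of their first
    occurrence, matching the pattern in the example output.
    """
    seen: Dict[str, str] = {}
    ids = []
    for text in texts:
        if text not in seen:
            seen[text] = f"{prefix}{len(seen)}"
        ids.append(seen[text])
    return ids


id_left = assign_ids(list("AABC"), "L-")
id_right = assign_ids(list("abbc"), "R-")
print(id_left)   # ['L-0', 'L-0', 'L-1', 'L-2']
print(id_right)  # ['R-0', 'R-1', 'R-1', 'R-2']
```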