crslab.data.dataloader package
Submodules
-
class crslab.data.dataloader.base.BaseDataLoader(opt, dataset)[source]
Bases: abc.ABC
Abstract class of dataloader.
Notes
'scale' can be set in config to limit the size of dataset.
- Parameters
opt (Config or dict) – config for dataloader or the whole system.
dataset – dataset.
-
conv_batchify(batch)[source]
Batchify data for conversation after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train conversation part.
-
conv_interact(data)[source]
Process user input data for system to converse.
- Parameters
data – user input data.
- Returns
data for system in converse.
-
conv_process_fn()[source]
Process whole data for conversation before batch_fn.
- Returns
processed dataset. Defaults to returning self.dataset unchanged.
-
get_conv_data(batch_size, shuffle=True)[source]
get_data wrapper for conversation.
You can implement your own process_fn in conv_process_fn and batch_fn in conv_batchify.
- Parameters
batch_size (int) – size of each batch.
shuffle (bool, optional) – whether to shuffle the dataset. Defaults to True.
- Yields
tuple or dict of torch.Tensor – batch data for conversation.
-
get_data(batch_fn, batch_size, shuffle=True, process_fn=None)[source]
Collate batch data for system to fit.
- Parameters
batch_fn (func) – function to collate data.
batch_size (int) – size of each batch.
shuffle (bool, optional) – whether to shuffle the dataset. Defaults to True.
process_fn (func, optional) – function to process dataset before batchify. Defaults to None.
- Yields
tuple or dict of torch.Tensor – batch data for system to fit
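The control flow of get_data can be sketched in plain Python. This is not CRSLab's actual implementation, only an illustration of the documented pattern: optionally run process_fn over the whole dataset, shuffle, slice into batches, and collate each batch with batch_fn before yielding it.

```python
import random

def get_data(dataset, batch_fn, batch_size, shuffle=True, process_fn=None):
    """Sketch of the get_data pattern described above."""
    # Optionally preprocess the whole dataset once, before batching.
    data = process_fn(dataset) if process_fn is not None else dataset
    if shuffle:
        data = list(data)
        random.shuffle(data)
    # Slice into batches and collate each one with batch_fn.
    for start in range(0, len(data), batch_size):
        yield batch_fn(data[start:start + batch_size])

# Example batch_fn: zip (context, response) pairs into two parallel lists.
examples = [([1, 2], [3]), ([4], [5, 6]), ([7, 8, 9], [10])]
batches = list(get_data(examples, batch_fn=lambda b: list(zip(*b)),
                        batch_size=2, shuffle=False))
print(batches[0])  # → [([1, 2], [4]), ([3], [5, 6])]
```

In the real dataloader, batch_fn would additionally pad and tensorize each field so the batch can be fed to the model.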
-
get_policy_data(batch_size, shuffle=True)[source]
get_data wrapper for policy.
You can implement your own process_fn in policy_process_fn and batch_fn in policy_batchify.
- Parameters
batch_size (int) – size of each batch.
shuffle (bool, optional) – whether to shuffle the dataset. Defaults to True.
- Yields
tuple or dict of torch.Tensor – batch data for policy.
-
get_rec_data(batch_size, shuffle=True)[source]
get_data wrapper for recommendation.
You can implement your own process_fn in rec_process_fn and batch_fn in rec_batchify.
- Parameters
batch_size (int) – size of each batch.
shuffle (bool, optional) – whether to shuffle the dataset. Defaults to True.
- Yields
tuple or dict of torch.Tensor – batch data for recommendation.
-
policy_batchify(batch)[source]
Batchify data for policy after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train policy part.
-
policy_process_fn()[source]
Process whole data for policy before batch_fn.
- Returns
processed dataset. Defaults to returning self.dataset unchanged.
-
rec_batchify(batch)[source]
Batchify data for recommendation after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train recommendation part.
-
rec_interact(data)[source]
Process user input data for system to recommend.
- Parameters
data – user input data.
- Returns
data for system to recommend.
-
class crslab.data.dataloader.kbrd.KBRDDataLoader(opt, dataset, vocab)[source]
Bases: crslab.data.dataloader.base.BaseDataLoader
Dataloader for model KBRD.
Notes
You can set the following parameters in config:
'context_truncate': the maximum length of context.
'response_truncate': the maximum length of response.
'entity_truncate': the maximum length of mentioned entities in context.
The following values must be specified in vocab:
'pad', 'start', 'end', 'pad_entity'
The above values specify the ids of the needed special tokens.
- Parameters
opt (Config or dict) – config for dataloader or the whole system.
dataset – data for model.
vocab (dict) – all kinds of useful size, idx and map between token and idx.
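The truncation settings above all cap a list of token or entity indexes at a configured length. The truncate helper below is a hypothetical illustration of that idea, not part of CRSLab's API.

```python
def truncate(vec, max_length, truncate_tail=True):
    """Keep at most max_length indexes, dropping from the tail by
    default, or from the head when truncate_tail is False."""
    if max_length is None or len(vec) <= max_length:
        return vec
    return vec[:max_length] if truncate_tail else vec[-max_length:]

# Context usually keeps the most recent tokens, so it is head-truncated.
context = [5, 6, 7, 8, 9, 10]
print(truncate(context, 4, truncate_tail=False))  # → [7, 8, 9, 10]
```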
-
conv_batchify(batch)[source]
Batchify data for conversation after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train conversation part.
-
conv_process_fn(*args, **kwargs)[source]
Process whole data for conversation before batch_fn.
- Returns
processed dataset. Defaults to returning self.dataset unchanged.
-
policy_batchify(*args, **kwargs)[source]
Batchify data for policy after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train policy part.
-
class crslab.data.dataloader.kgsf.KGSFDataLoader(opt, dataset, vocab)[source]
Bases: crslab.data.dataloader.base.BaseDataLoader
Dataloader for model KGSF.
Notes
You can set the following parameters in config:
'context_truncate': the maximum length of context.
'response_truncate': the maximum length of response.
'entity_truncate': the maximum length of mentioned entities in context.
'word_truncate': the maximum length of mentioned words in context.
The following values must be specified in vocab:
'pad', 'start', 'end', 'pad_entity', 'pad_word'
The above values specify the ids of the needed special tokens.
'n_entity': the number of entities in the entity KG of dataset.
- Parameters
opt (Config or dict) – config for dataloader or the whole system.
dataset – data for model.
vocab (dict) – all kinds of useful size, idx and map between token and idx.
-
conv_batchify(batch)[source]
Batchify data for conversation after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train conversation part.
-
conv_process_fn(*args, **kwargs)[source]
Process whole data for conversation before batch_fn.
- Returns
processed dataset. Defaults to returning self.dataset unchanged.
-
policy_batchify(*args, **kwargs)[source]
Batchify data for policy after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train policy part.
-
class crslab.data.dataloader.redial.ReDialDataLoader(opt, dataset, vocab)[source]
Bases: crslab.data.dataloader.base.BaseDataLoader
Dataloader for model ReDial.
Notes
You can set the following parameters in config:
'utterance_truncate': the maximum length of a single utterance.
'conversation_truncate': the maximum length of the whole conversation.
The following values must be specified in vocab:
'pad', 'start', 'end', 'unk'
The above values specify the ids of the needed special tokens.
'ind2tok': map from index to token.
'n_entity': number of entities in the entity KG of dataset.
'vocab_size': size of vocab.
- Parameters
opt (Config or dict) – config for dataloader or the whole system.
dataset – data for model.
vocab (dict) – all kinds of useful size, idx and map between token and idx.
-
conv_batchify(batch)[source]
Batchify data for conversation after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train conversation part.
-
conv_process_fn()[source]
Process whole data for conversation before batch_fn.
- Returns
processed dataset. Defaults to returning self.dataset unchanged.
-
policy_batchify(batch)[source]
Batchify data for policy after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train policy part.
-
class crslab.data.dataloader.tgredial.TGReDialDataLoader(opt, dataset, vocab)[source]
Bases: crslab.data.dataloader.base.BaseDataLoader
Dataloader for model TGReDial.
Notes
You can set the following parameters in config:
'context_truncate': the maximum length of context.
'response_truncate': the maximum length of response.
'entity_truncate': the maximum length of mentioned entities in context.
'word_truncate': the maximum length of mentioned words in context.
'item_truncate': the maximum length of mentioned items in context.
The following values must be specified in vocab:
'pad', 'start', 'end', 'unk', 'pad_entity', 'pad_word'
The above values specify the ids of the needed special tokens.
'ind2tok': map from index to token.
'tok2ind': map from token to index.
'vocab_size': size of vocab.
'id2entity': map from index to entity.
'n_entity': number of entities in the entity KG of dataset.
'sent_split' (optional): token used to split sentence. Defaults to 'end'.
'word_split' (optional): token used to split word. Defaults to 'end'.
'pad_topic' (optional): token used to pad topic.
'ind2topic' (optional): map from index to topic.
- Parameters
opt (Config or dict) – config for dataloader or the whole system.
dataset – data for model.
vocab (dict) – all kinds of useful size, idx and map between token and idx.
-
conv_batchify(batch)[source]
Batchify data for conversation after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train conversation part.
-
conv_interact(data)[source]
Process user input data for system to converse.
- Parameters
data – user input data.
- Returns
data for system in converse.
-
policy_batchify(batch)[source]
Batchify data for policy after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train policy part.
-
policy_process_fn(*args, **kwargs)[source]
Process whole data for policy before batch_fn.
- Returns
processed dataset. Defaults to returning self.dataset unchanged.
-
rec_batchify(batch)[source]
Batchify data for recommendation after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train recommendation part.
-
crslab.data.dataloader.utils.add_start_end_token_idx(vec: list, start_token_idx: Optional[int] = None, end_token_idx: Optional[int] = None)[source]
Optionally add a start token at the beginning and an end token at the end.
- Parameters
vec – source list composed of indexes.
start_token_idx – index of start token.
end_token_idx – index of end token.
- Returns
list with the start and/or end token index added.
- Return type
list
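The documented behaviour can be sketched as follows; this is an illustrative re-implementation, not the library source.

```python
def add_start_end_token_idx(vec, start_token_idx=None, end_token_idx=None):
    """Prepend start_token_idx and/or append end_token_idx when given."""
    res = list(vec)  # copy so the source list is untouched
    if start_token_idx is not None:
        res = [start_token_idx] + res
    if end_token_idx is not None:
        res = res + [end_token_idx]
    return res

print(add_start_end_token_idx([4, 5, 6], start_token_idx=1, end_token_idx=2))
# → [1, 4, 5, 6, 2]
```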
-
crslab.data.dataloader.utils.get_onehot(data_list, categories) → torch.Tensor[source]
Transform lists of labels into one-hot vectors.
- Parameters
data_list (list of list of int) – source data.
categories (int) – number of label classes.
- Returns
one-hot labels.
- Return type
torch.Tensor
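A plain-Python sketch of the transformation (the real function returns a torch.Tensor; nested lists are used here to keep the example dependency-free). Because data_list holds lists of label ids, each row is effectively multi-hot when an example carries several labels.

```python
def get_onehot(data_list, categories):
    """Turn each inner list of label ids into a 0/1 row of
    length `categories`."""
    onehot = []
    for labels in data_list:
        row = [0] * categories
        for label in labels:
            row[label] = 1  # mark every label this example carries
        onehot.append(row)
    return onehot

print(get_onehot([[0, 2], [1]], categories=3))  # → [[1, 0, 1], [0, 1, 0]]
```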
-
crslab.data.dataloader.utils.merge_utt(conversation, split_token_idx=None, keep_split_in_tail=False, final_token_idx=None)[source]
Merge utterances in one conversation into a single list.
- Parameters
conversation (list of list of int) – conversation consisting of utterances, each a list of token indexes.
split_token_idx (int) – index of split token. Defaults to None.
keep_split_in_tail (bool) – whether to put the split token at the tail (True) or the head (False) of each utterance. Defaults to False.
final_token_idx (int) – index of final token. Defaults to None.
- Returns
tokens of all utterances in one list.
- Return type
list
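One plausible reading of these parameters, sketched in plain Python; the exact placement of the split token may differ in the library source, so treat this as an illustration only.

```python
def merge_utt(conversation, split_token_idx=None, keep_split_in_tail=False,
              final_token_idx=None):
    """Flatten utterances into one list, optionally inserting
    split_token_idx at the head or tail of each utterance and
    appending final_token_idx at the very end."""
    merged = []
    for utt in conversation:
        if split_token_idx is not None and not keep_split_in_tail:
            merged.append(split_token_idx)
        merged.extend(utt)
        if split_token_idx is not None and keep_split_in_tail:
            merged.append(split_token_idx)
    if final_token_idx is not None:
        merged.append(final_token_idx)
    return merged

print(merge_utt([[1, 2], [3]], split_token_idx=9, keep_split_in_tail=True,
                final_token_idx=0))  # → [1, 2, 9, 3, 9, 0]
```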
-
crslab.data.dataloader.utils.merge_utt_replace(conversation, detect_token=None, replace_token=None, method='in')[source]
-
crslab.data.dataloader.utils.padded_tensor(items: List[Union[List[int], torch.LongTensor]], pad_idx: int = 0, pad_tail: bool = True, max_len: Optional[int] = None) → torch.LongTensor[source]
Create a padded matrix from an uneven list of lists.
Returns the padded matrix. The matrix is right-padded (filled to the right) by default; set pad_tail to False to left-pad instead.
- Parameters
items (list[iter[int]]) – list of items to pad.
pad_idx (int) – the value to use for padding.
pad_tail (bool) – if True (default), pad at the tail (right); otherwise pad at the head (left).
max_len (int) – if None, the max length is the maximum item length.
- Returns
padded tensor.
- Return type
Tensor[int64]
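The padding logic can be sketched with plain lists (the real function returns a torch.LongTensor); the interaction between max_len and the longest item here is an assumption, so check the source before relying on it.

```python
def padded_tensor(items, pad_idx=0, pad_tail=True, max_len=None):
    """Pad an uneven list of index lists into a rectangular matrix."""
    # Width of the matrix: the longest item, optionally raised to
    # max_len (assumed semantics).
    width = max(len(item) for item in items)
    if max_len is not None:
        width = max(width, max_len)
    rows = []
    for item in items:
        pad = [pad_idx] * (width - len(item))
        rows.append(list(item) + pad if pad_tail else pad + list(item))
    return rows

print(padded_tensor([[1, 2, 3], [4]], pad_tail=False))  # → [[1, 2, 3], [0, 0, 4]]
```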