crslab.data.dataloader package
Submodules
-
class crslab.data.dataloader.base.BaseDataLoader(opt, dataset)[source]
Bases: abc.ABC
Abstract class of dataloader.
Notes
'scale' can be set in config to limit the size of dataset.
- Parameters
opt (Config or dict) – config for dataloader or the whole system.
dataset – dataset.
-
conv_batchify(batch)[source]
Batchify data for conversation after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train conversation part.
-
conv_interact(data)[source]
Process user input data for system to converse.
- Parameters
data – user input data.
- Returns
data for system in converse.
-
conv_process_fn()[source]
Process whole data for conversation before batch_fn.
- Returns
processed dataset. Defaults to returning self.dataset unchanged.
-
get_conv_data(batch_size, shuffle=True)[source]
get_data wrapper for conversation.
You can implement your own process_fn in conv_process_fn and batch_fn in conv_batchify.
- Parameters
batch_size (int) – size of each batch.
shuffle (bool, optional) – whether to shuffle the dataset. Defaults to True.
- Yields
tuple or dict of torch.Tensor – batch data for conversation.
-
get_data(batch_fn, batch_size, shuffle=True, process_fn=None)[source]
Collate batch data for system to fit.
- Parameters
batch_fn (func) – function to collate data.
batch_size (int) – size of each batch.
shuffle (bool, optional) – whether to shuffle the dataset. Defaults to True.
process_fn (func, optional) – function to process dataset before batchify. Defaults to None.
- Yields
tuple or dict of torch.Tensor – batch data for system to fit
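The control flow of get_data can be sketched in plain Python. This is not CRSLab's actual implementation, only an illustration of the documented pattern: optionally run process_fn over the whole dataset, shuffle, slice into batches, and collate each batch with batch_fn before yielding it.

```python
import random

def get_data(dataset, batch_fn, batch_size, shuffle=True, process_fn=None):
    """Sketch of the get_data pattern described above."""
    # Optionally preprocess the whole dataset once, before batching.
    data = process_fn(dataset) if process_fn is not None else dataset
    if shuffle:
        data = list(data)
        random.shuffle(data)
    # Slice into batches and collate each one with batch_fn.
    for start in range(0, len(data), batch_size):
        yield batch_fn(data[start:start + batch_size])

# Example batch_fn: zip (context, response) pairs into two parallel lists.
examples = [([1, 2], [3]), ([4], [5, 6]), ([7, 8, 9], [10])]
batches = list(get_data(examples, batch_fn=lambda b: list(zip(*b)),
                        batch_size=2, shuffle=False))
print(batches[0])  # → [([1, 2], [4]), ([3], [5, 6])]
```

In the real dataloader, batch_fn would additionally pad and tensorize each field so the batch can be fed to the model.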
-
get_policy_data(batch_size, shuffle=True)[source]
get_data wrapper for policy.
You can implement your own process_fn in policy_process_fn and batch_fn in policy_batchify.
- Parameters
batch_size (int) – size of each batch.
shuffle (bool, optional) – whether to shuffle the dataset. Defaults to True.
- Yields
tuple or dict of torch.Tensor – batch data for policy.
-
get_rec_data(batch_size, shuffle=True)[source]
get_data wrapper for recommendation.
You can implement your own process_fn in rec_process_fn and batch_fn in rec_batchify.
- Parameters
batch_size (int) – size of each batch.
shuffle (bool, optional) – whether to shuffle the dataset. Defaults to True.
- Yields
tuple or dict of torch.Tensor – batch data for recommendation.
-
policy_batchify(batch)[source]
Batchify data for policy after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train policy part.
-
policy_process_fn()[source]
Process whole data for policy before batch_fn.
- Returns
processed dataset. Defaults to returning self.dataset unchanged.
-
rec_batchify(batch)[source]
Batchify data for recommendation after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train recommendation part.
-
rec_interact(data)[source]
Process user input data for system to recommend.
- Parameters
data – user input data.
- Returns
data for system to recommend.
-
class crslab.data.dataloader.kbrd.KBRDDataLoader(opt, dataset, vocab)[source]
Bases: crslab.data.dataloader.base.BaseDataLoader
Dataloader for model KBRD.
Notes
You can set the following parameters in config:
'context_truncate': the maximum length of context.
'response_truncate': the maximum length of response.
'entity_truncate': the maximum length of mentioned entities in context.
The following values must be specified in vocab:
'pad', 'start', 'end', 'pad_entity'
The above values specify the ids of the needed special tokens.
- Parameters
opt (Config or dict) – config for dataloader or the whole system.
dataset – data for model.
vocab (dict) – all kinds of useful size, idx and map between token and idx.
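The truncation settings above all cap a list of token or entity indexes at a configured length. The truncate helper below is a hypothetical illustration of that idea, not part of CRSLab's API.

```python
def truncate(vec, max_length, truncate_tail=True):
    """Keep at most max_length indexes, dropping from the tail by
    default, or from the head when truncate_tail is False."""
    if max_length is None or len(vec) <= max_length:
        return vec
    return vec[:max_length] if truncate_tail else vec[-max_length:]

# Context usually keeps the most recent tokens, so it is head-truncated.
context = [5, 6, 7, 8, 9, 10]
print(truncate(context, 4, truncate_tail=False))  # → [7, 8, 9, 10]
```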
-
conv_batchify(batch)[source]
Batchify data for conversation after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train conversation part.
-
conv_process_fn(*args, **kwargs)[source]
Process whole data for conversation before batch_fn.
- Returns
processed dataset. Defaults to returning self.dataset unchanged.
-
policy_batchify(*args, **kwargs)[source]
Batchify data for policy after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train policy part.
-
class crslab.data.dataloader.kgsf.KGSFDataLoader(opt, dataset, vocab)[source]
Bases: crslab.data.dataloader.base.BaseDataLoader
Dataloader for model KGSF.
Notes
You can set the following parameters in config:
'context_truncate': the maximum length of context.
'response_truncate': the maximum length of response.
'entity_truncate': the maximum length of mentioned entities in context.
'word_truncate': the maximum length of mentioned words in context.
The following values must be specified in vocab:
'pad', 'start', 'end', 'pad_entity', 'pad_word'
The above values specify the ids of the needed special tokens.
'n_entity': the number of entities in the entity KG of dataset.
- Parameters
opt (Config or dict) – config for dataloader or the whole system.
dataset – data for model.
vocab (dict) – all kinds of useful size, idx and map between token and idx.
-
conv_batchify(batch)[source]
Batchify data for conversation after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train conversation part.
-
conv_process_fn(*args, **kwargs)[source]
Process whole data for conversation before batch_fn.
- Returns
processed dataset. Defaults to returning self.dataset unchanged.
-
policy_batchify(*args, **kwargs)[source]
Batchify data for policy after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train policy part.
-
class crslab.data.dataloader.redial.ReDialDataLoader(opt, dataset, vocab)[source]
Bases: crslab.data.dataloader.base.BaseDataLoader
Dataloader for model ReDial.
Notes
You can set the following parameters in config:
'utterance_truncate': the maximum length of a single utterance.
'conversation_truncate': the maximum length of the whole conversation.
The following values must be specified in vocab:
'pad', 'start', 'end', 'unk'
The above values specify the ids of the needed special tokens.
'ind2tok': map from index to token.
'n_entity': number of entities in the entity KG of dataset.
'vocab_size': size of vocab.
- Parameters
opt (Config or dict) – config for dataloader or the whole system.
dataset – data for model.
vocab (dict) – all kinds of useful size, idx and map between token and idx.
-
conv_batchify(batch)[source]
Batchify data for conversation after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train conversation part.
-
conv_process_fn()[source]
Process whole data for conversation before batch_fn.
- Returns
processed dataset. Defaults to returning self.dataset unchanged.
-
policy_batchify(batch)[source]
Batchify data for policy after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train policy part.
-
class crslab.data.dataloader.tgredial.TGReDialDataLoader(opt, dataset, vocab)[source]
Bases: crslab.data.dataloader.base.BaseDataLoader
Dataloader for model TGReDial.
Notes
You can set the following parameters in config:
'context_truncate': the maximum length of context.
'response_truncate': the maximum length of response.
'entity_truncate': the maximum length of mentioned entities in context.
'word_truncate': the maximum length of mentioned words in context.
'item_truncate': the maximum length of mentioned items in context.
The following values must be specified in vocab:
'pad', 'start', 'end', 'unk', 'pad_entity', 'pad_word'
The above values specify the ids of the needed special tokens.
'ind2tok': map from index to token.
'tok2ind': map from token to index.
'vocab_size': size of vocab.
'id2entity': map from index to entity.
'n_entity': number of entities in the entity KG of dataset.
'sent_split' (optional): token used to split sentence. Defaults to 'end'.
'word_split' (optional): token used to split word. Defaults to 'end'.
'pad_topic' (optional): token used to pad topic.
'ind2topic' (optional): map from index to topic.
- Parameters
opt (Config or dict) – config for dataloader or the whole system.
dataset – data for model.
vocab (dict) – all kinds of useful size, idx and map between token and idx.
-
conv_batchify(batch)[source]
Batchify data for conversation after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train conversation part.
-
conv_interact(data)[source]
Process user input data for system to converse.
- Parameters
data – user input data.
- Returns
data for system in converse.
-
policy_batchify(batch)[source]
Batchify data for policy after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train policy part.
-
policy_process_fn(*args, **kwargs)[source]
Process whole data for policy before batch_fn.
- Returns
processed dataset. Defaults to returning self.dataset unchanged.
-
rec_batchify(batch)[source]
Batchify data for recommendation after process.
- Parameters
batch (list) – processed batch dataset.
- Returns
batch data for the system to train recommendation part.
-
crslab.data.dataloader.utils.add_start_end_token_idx(vec: list, start_token_idx: Optional[int] = None, end_token_idx: Optional[int] = None)[source]
Optionally add a start token at the beginning and an end token at the end.
- Parameters
vec – source list composed of indexes.
start_token_idx – index of start token.
end_token_idx – index of end token.
- Returns
list with the start and/or end token index added.
- Return type
list
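The documented behaviour can be sketched as follows; this is an illustrative re-implementation, not the library source.

```python
def add_start_end_token_idx(vec, start_token_idx=None, end_token_idx=None):
    """Prepend start_token_idx and/or append end_token_idx when given."""
    res = list(vec)  # copy so the source list is untouched
    if start_token_idx is not None:
        res = [start_token_idx] + res
    if end_token_idx is not None:
        res = res + [end_token_idx]
    return res

print(add_start_end_token_idx([4, 5, 6], start_token_idx=1, end_token_idx=2))
# → [1, 4, 5, 6, 2]
```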
-
crslab.data.dataloader.utils.get_onehot(data_list, categories) → torch.Tensor[source]
Transform lists of labels into one-hot vectors.
- Parameters
data_list (list of list of int) – source data.
categories (int) – number of label classes.
- Returns
one-hot labels.
- Return type
torch.Tensor
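A plain-Python sketch of the transformation (the real function returns a torch.Tensor; nested lists are used here to keep the example dependency-free). Because data_list holds lists of label ids, each row is effectively multi-hot when an example carries several labels.

```python
def get_onehot(data_list, categories):
    """Turn each inner list of label ids into a 0/1 row of
    length `categories`."""
    onehot = []
    for labels in data_list:
        row = [0] * categories
        for label in labels:
            row[label] = 1  # mark every label this example carries
        onehot.append(row)
    return onehot

print(get_onehot([[0, 2], [1]], categories=3))  # → [[1, 0, 1], [0, 1, 0]]
```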
-
crslab.data.dataloader.utils.merge_utt(conversation, split_token_idx=None, keep_split_in_tail=False, final_token_idx=None)[source]
Merge utterances in one conversation into a single list.
- Parameters
conversation (list of list of int) – conversation consisting of utterances, each a list of token indexes.
split_token_idx (int) – index of split token. Defaults to None.
keep_split_in_tail (bool) – whether to put the split token at the tail (True) or the head (False) of each utterance. Defaults to False.
final_token_idx (int) – index of final token. Defaults to None.
- Returns
tokens of all utterances in one list.
- Return type
list
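One plausible reading of these parameters, sketched in plain Python; the exact placement of the split token may differ in the library source, so treat this as an illustration only.

```python
def merge_utt(conversation, split_token_idx=None, keep_split_in_tail=False,
              final_token_idx=None):
    """Flatten utterances into one list, optionally inserting
    split_token_idx at the head or tail of each utterance and
    appending final_token_idx at the very end."""
    merged = []
    for utt in conversation:
        if split_token_idx is not None and not keep_split_in_tail:
            merged.append(split_token_idx)
        merged.extend(utt)
        if split_token_idx is not None and keep_split_in_tail:
            merged.append(split_token_idx)
    if final_token_idx is not None:
        merged.append(final_token_idx)
    return merged

print(merge_utt([[1, 2], [3]], split_token_idx=9, keep_split_in_tail=True,
                final_token_idx=0))  # → [1, 2, 9, 3, 9, 0]
```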
-
crslab.data.dataloader.utils.merge_utt_replace(conversation, detect_token=None, replace_token=None, method='in')[source]
-
crslab.data.dataloader.utils.padded_tensor(items: List[Union[List[int], torch.LongTensor]], pad_idx: int = 0, pad_tail: bool = True, max_len: Optional[int] = None) → torch.LongTensor[source]
Create a padded matrix from an uneven list of lists.
Returns the padded matrix. The matrix is right-padded (filled to the right) by default; set pad_tail to False to left-pad instead.
- Parameters
items (list[iter[int]]) – list of items to pad.
pad_idx (int) – the value to use for padding.
pad_tail (bool) – if True (default), pad at the tail (right); otherwise pad at the head (left).
max_len (int) – if None, the max length is the maximum item length.
- Returns
padded tensor.
- Return type
Tensor[int64]
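The padding logic can be sketched with plain lists (the real function returns a torch.LongTensor); the interaction between max_len and the longest item here is an assumption, so check the source before relying on it.

```python
def padded_tensor(items, pad_idx=0, pad_tail=True, max_len=None):
    """Pad an uneven list of index lists into a rectangular matrix."""
    # Width of the matrix: the longest item, optionally raised to
    # max_len (assumed semantics).
    width = max(len(item) for item in items)
    if max_len is not None:
        width = max(width, max_len)
    rows = []
    for item in items:
        pad = [pad_idx] * (width - len(item))
        rows.append(list(item) + pad if pad_tail else pad + list(item))
    return rows

print(padded_tensor([[1, 2, 3], [4]], pad_tail=False))  # → [[1, 2, 3], [0, 0, 4]]
```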