crslab.data.dataloader package

Submodules

class crslab.data.dataloader.base.BaseDataLoader(opt, dataset)[source]

Bases: abc.ABC

Abstract base class for dataloaders.

Notes

'scale' can be set in the config to limit the size of the dataset.

Parameters
  • opt (Config or dict) – config for the dataloader or the whole system.

  • dataset – dataset to load.

conv_batchify(batch)[source]

Batchify data for conversation after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the conversation part.

conv_interact(data)[source]

Process user input data for the system to converse.

Parameters

data – user input data.

Returns

data for the system to converse.

conv_process_fn()[source]

Process the whole dataset for conversation before batch_fn.

Returns

processed dataset. Defaults to returning self.dataset unchanged.

get_conv_data(batch_size, shuffle=True)[source]

get_data wrapper for conversation.

You can implement your own process_fn in conv_process_fn and your own batch_fn in conv_batchify.

Parameters
  • batch_size (int) –

  • shuffle (bool, optional) – Defaults to True.

Yields

tuple or dict of torch.Tensor – batch data for conversation.

get_data(batch_fn, batch_size, shuffle=True, process_fn=None)[source]

Collate batch data for the system to fit.

Parameters
  • batch_fn (func) – function to collate data.

  • batch_size (int) –

  • shuffle (bool, optional) – Defaults to True.

  • process_fn (func, optional) – function to process the dataset before batchify. Defaults to None.

Yields

tuple or dict of torch.Tensor – batch data for the system to fit.
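For orientation, here is a minimal sketch of the batching loop that get_data implements, written as a standalone generator; the shuffling and slicing details are assumptions based on the documented behavior, not the library's exact code.

    import random

    def get_data_sketch(dataset, batch_fn, batch_size, shuffle=True, process_fn=None):
        # Optionally preprocess the whole dataset once before batching
        # (mirrors the *_process_fn hooks, which take no arguments).
        data = process_fn() if process_fn is not None else dataset
        indexes = list(range(len(data)))
        if shuffle:
            random.shuffle(indexes)
        # Slice the (possibly shuffled) data into batches and collate each one.
        for start in range(0, len(indexes), batch_size):
            batch = [data[i] for i in indexes[start:start + batch_size]]
            yield batch_fn(batch)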

get_policy_data(batch_size, shuffle=True)[source]

get_data wrapper for policy.

You can implement your own process_fn in self.policy_process_fn and your own batch_fn in policy_batchify.

Parameters
  • batch_size (int) –

  • shuffle (bool, optional) – Defaults to True.

Yields

tuple or dict of torch.Tensor – batch data for policy.

get_rec_data(batch_size, shuffle=True)[source]

get_data wrapper for recommendation.

You can implement your own process_fn in rec_process_fn and your own batch_fn in rec_batchify.

Parameters
  • batch_size (int) –

  • shuffle (bool, optional) – Defaults to True.

Yields

tuple or dict of torch.Tensor – batch data for recommendation.

policy_batchify(batch)[source]

Batchify data for policy after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the policy part.

policy_process_fn()[source]

Process the whole dataset for policy before batch_fn.

Returns

processed dataset. Defaults to returning self.dataset unchanged.

rec_batchify(batch)[source]

Batchify data for recommendation after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the recommendation part.

rec_interact(data)[source]

Process user input data for the system to recommend.

Parameters

data – user input data.

Returns

data for the system to recommend.

rec_process_fn()[source]

Process the whole dataset for recommendation before batch_fn.

Returns

processed dataset. Defaults to returning self.dataset unchanged.

retain_recommender_target()[source]

Keep only the data whose role is recommender.

Returns

the recommender part of self.dataset.

class crslab.data.dataloader.kbrd.KBRDDataLoader(opt, dataset, vocab)[source]

Bases: crslab.data.dataloader.base.BaseDataLoader

Dataloader for model KBRD.

Notes

You can set the following parameters in config:

  • 'context_truncate': the maximum length of context.

  • 'response_truncate': the maximum length of response.

  • 'entity_truncate': the maximum length of mentioned entities in context.

The following values must be specified in vocab:

  • 'pad'

  • 'start'

  • 'end'

  • 'pad_entity'

The above values specify the ids of the needed special tokens.

Parameters
  • opt (Config or dict) – config for the dataloader or the whole system.

  • dataset – data for the model.

  • vocab (dict) – all kinds of useful sizes, indexes, and mappings between tokens and indexes.
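A hypothetical usage sketch follows; the truncation values are illustrative, and dataset and vocab stand in for objects produced elsewhere in crslab.data.

    from crslab.data.dataloader.kbrd import KBRDDataLoader

    # Illustrative truncation lengths, not library defaults.
    opt = {'context_truncate': 256, 'response_truncate': 30, 'entity_truncate': 28}
    loader = KBRDDataLoader(opt, dataset, vocab)  # dataset and vocab are assumed

    # Each batch is a tuple or dict of torch.Tensor, as documented for get_rec_data.
    for batch in loader.get_rec_data(batch_size=32, shuffle=True):
        pass  # feed the batch to the recommendation model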

conv_batchify(batch)[source]

Batchify data for conversation after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the conversation part.

conv_process_fn(*args, **kwargs)[source]

Process the whole dataset for conversation before batch_fn.

Returns

processed dataset. Defaults to returning self.dataset unchanged.

policy_batchify(*args, **kwargs)[source]

Batchify data for policy after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the policy part.

rec_batchify(batch)[source]

Batchify data for recommendation after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the recommendation part.

rec_process_fn()[source]

Process the whole dataset for recommendation before batch_fn.

Returns

processed dataset. Defaults to returning self.dataset unchanged.

class crslab.data.dataloader.kgsf.KGSFDataLoader(opt, dataset, vocab)[source]

Bases: crslab.data.dataloader.base.BaseDataLoader

Dataloader for model KGSF.

Notes

You can set the following parameters in config:

  • 'context_truncate': the maximum length of context.

  • 'response_truncate': the maximum length of response.

  • 'entity_truncate': the maximum length of mentioned entities in context.

  • 'word_truncate': the maximum length of mentioned words in context.

The following values must be specified in vocab:

  • 'pad'

  • 'start'

  • 'end'

  • 'pad_entity'

  • 'pad_word'

The above values specify the ids of the needed special tokens.

  • 'n_entity': the number of entities in the entity KG of the dataset.

Parameters
  • opt (Config or dict) – config for the dataloader or the whole system.

  • dataset – data for the model.

  • vocab (dict) – all kinds of useful sizes, indexes, and mappings between tokens and indexes.
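A hypothetical usage sketch showing the extra pretraining loop that KGSF adds on top of the base wrappers; the config values and the dataset and vocab objects are assumptions.

    from crslab.data.dataloader.kgsf import KGSFDataLoader

    opt = {'context_truncate': 256, 'response_truncate': 30,
           'entity_truncate': 28, 'word_truncate': 28}  # illustrative values
    loader = KGSFDataLoader(opt, dataset, vocab)  # dataset and vocab are assumed

    # KGSF runs a pretraining stage before the main recommendation task.
    for batch in loader.get_pretrain_data(batch_size=32, shuffle=True):
        pass  # pretraining step
    for batch in loader.get_rec_data(batch_size=32):
        pass  # recommendation step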

conv_batchify(batch)[source]

Batchify data for conversation after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the conversation part.

conv_process_fn(*args, **kwargs)[source]

Process the whole dataset for conversation before batch_fn.

Returns

processed dataset. Defaults to returning self.dataset unchanged.

get_pretrain_data(batch_size, shuffle=True)[source]

get_data wrapper for pretraining.

policy_batchify(*args, **kwargs)[source]

Batchify data for policy after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the policy part.

pretrain_batchify(batch)[source]

Batchify data for pretraining after processing.

rec_batchify(batch)[source]

Batchify data for recommendation after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the recommendation part.

rec_process_fn()[source]

Process the whole dataset for recommendation before batch_fn.

Returns

processed dataset. Defaults to returning self.dataset unchanged.

class crslab.data.dataloader.redial.ReDialDataLoader(opt, dataset, vocab)[source]

Bases: crslab.data.dataloader.base.BaseDataLoader

Dataloader for model ReDial.

Notes

You can set the following parameters in config:

  • 'utterance_truncate': the maximum length of a single utterance.

  • 'conversation_truncate': the maximum length of the whole conversation.

The following values must be specified in vocab:

  • 'pad'

  • 'start'

  • 'end'

  • 'unk'

The above values specify the ids of the needed special tokens.

  • 'ind2tok': map from index to token.

  • 'n_entity': number of entities in the entity KG of the dataset.

  • 'vocab_size': size of vocab.

Parameters
  • opt (Config or dict) – config for the dataloader or the whole system.

  • dataset – data for the model.

  • vocab (dict) – all kinds of useful sizes, indexes, and mappings between tokens and indexes.
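A hypothetical usage sketch; the truncation values are illustrative, and dataset and vocab are assumed to come from the data module.

    from crslab.data.dataloader.redial import ReDialDataLoader

    opt = {'utterance_truncate': 30, 'conversation_truncate': 256}  # illustrative values
    loader = ReDialDataLoader(opt, dataset, vocab)  # dataset and vocab are assumed
    for batch in loader.get_conv_data(batch_size=32, shuffle=True):
        pass  # feed the batch to the conversation model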

conv_batchify(batch)[source]

Batchify data for conversation after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the conversation part.

conv_process_fn()[source]

Process the whole dataset for conversation before batch_fn.

Returns

processed dataset. Defaults to returning self.dataset unchanged.

policy_batchify(batch)[source]

Batchify data for policy after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the policy part.

rec_batchify(batch)[source]

Batchify data for recommendation after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the recommendation part.

rec_process_fn(*args, **kwargs)[source]

Process the whole dataset for recommendation before batch_fn.

Returns

processed dataset. Defaults to returning self.dataset unchanged.

class crslab.data.dataloader.tgredial.TGReDialDataLoader(opt, dataset, vocab)[source]

Bases: crslab.data.dataloader.base.BaseDataLoader

Dataloader for model TGReDial.

Notes

You can set the following parameters in config:

  • 'context_truncate': the maximum length of context.

  • 'response_truncate': the maximum length of response.

  • 'entity_truncate': the maximum length of mentioned entities in context.

  • 'word_truncate': the maximum length of mentioned words in context.

  • 'item_truncate': the maximum length of mentioned items in context.

The following values must be specified in vocab:

  • 'pad'

  • 'start'

  • 'end'

  • 'unk'

  • 'pad_entity'

  • 'pad_word'

The above values specify the ids of the needed special tokens.

  • 'ind2tok': map from index to token.

  • 'tok2ind': map from token to index.

  • 'vocab_size': size of vocab.

  • 'id2entity': map from index to entity.

  • 'n_entity': number of entities in the entity KG of the dataset.

  • 'sent_split' (optional): token used to split sentences. Defaults to 'end'.

  • 'word_split' (optional): token used to split words. Defaults to 'end'.

  • 'pad_topic' (optional): token used to pad topic.

  • 'ind2topic' (optional): map from index to topic.

Parameters
  • opt (Config or dict) – config for the dataloader or the whole system.

  • dataset – data for the model.

  • vocab (dict) – all kinds of useful sizes, indexes, and mappings between tokens and indexes.
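A hypothetical usage sketch; TGReDial additionally supports the policy (topic prediction) task, and the config values plus the dataset and vocab objects are assumptions.

    from crslab.data.dataloader.tgredial import TGReDialDataLoader

    opt = {'context_truncate': 256, 'response_truncate': 30, 'entity_truncate': 28,
           'word_truncate': 28, 'item_truncate': 28}  # illustrative values
    loader = TGReDialDataLoader(opt, dataset, vocab)  # dataset and vocab are assumed
    for batch in loader.get_policy_data(batch_size=32, shuffle=True):
        pass  # feed the batch to the policy model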

conv_batchify(batch)[source]

Batchify data for conversation after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the conversation part.

conv_interact(data)[source]

Process user input data for the system to converse.

Parameters

data – user input data.

Returns

data for the system to converse.

policy_batchify(batch)[source]

Batchify data for policy after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the policy part.

policy_process_fn(*args, **kwargs)[source]

Process the whole dataset for policy before batch_fn.

Returns

processed dataset. Defaults to returning self.dataset unchanged.

rec_batchify(batch)[source]

Batchify data for recommendation after processing.

Parameters

batch (list) – processed batch dataset.

Returns

batch data for the system to train the recommendation part.

rec_interact(data)[source]

Process user input data for the system to recommend.

Parameters

data – user input data.

Returns

data for the system to recommend.

rec_process_fn(*args, **kwargs)[source]

Process the whole dataset for recommendation before batch_fn.

Returns

processed dataset. Defaults to returning self.dataset unchanged.

crslab.data.dataloader.utils.add_start_end_token_idx(vec: list, start_token_idx: Optional[int] = None, end_token_idx: Optional[int] = None)[source]

Optionally add a start token at the beginning and an end token at the end.

Parameters
  • vec – source list composed of indexes.

  • start_token_idx – index of start token.

  • end_token_idx – index of end token.

Returns

the list with start and/or end token indexes added.

Return type

list
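A usage sketch; the indexes are arbitrary examples, and the expected results follow from the documented behavior.

    from crslab.data.dataloader.utils import add_start_end_token_idx

    vec = [11, 12, 13]
    add_start_end_token_idx(vec, start_token_idx=1, end_token_idx=2)  # expected [1, 11, 12, 13, 2]
    add_start_end_token_idx(vec, end_token_idx=2)                     # expected [11, 12, 13, 2]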

crslab.data.dataloader.utils.get_onehot(data_list, categories) → torch.Tensor[source]

Transform lists of labels into one-hot vectors.

Parameters
  • data_list (list of list of int) – source data.

  • categories (int) – number of label classes.

Returns

one-hot labels.

Return type

torch.Tensor
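A usage sketch; the assumption that each inner list of labels produces one multi-hot row is the natural reading of the signature, not a confirmed detail.

    from crslab.data.dataloader.utils import get_onehot

    # Two samples: the first carries labels 0 and 2, the second carries label 1.
    labels = [[0, 2], [1]]
    onehot = get_onehot(labels, categories=3)
    # Expected shape (2, 3), with rows [1, 0, 1] and [0, 1, 0].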

crslab.data.dataloader.utils.merge_utt(conversation, split_token_idx=None, keep_split_in_tail=False, final_token_idx=None)[source]

Merge the utterances of one conversation into a single list.

Parameters
  • conversation (list of list of int) – a conversation consisting of utterances, each consisting of tokens.

  • split_token_idx (int) – index of the split token. Defaults to None.

  • keep_split_in_tail (bool) – whether to put the split token at the tail rather than the head of each utterance. Defaults to False.

  • final_token_idx (int) – index of the final token. Defaults to None.

Returns

tokens of all utterances in one list.

Return type

list
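A usage sketch under assumed semantics: the split token separates consecutive utterances and final_token_idx is appended at the very end; the exact placement rules are not confirmed by the documentation.

    from crslab.data.dataloader.utils import merge_utt

    conv = [[5, 6], [7, 8]]
    merge_utt(conv)                                         # expected [5, 6, 7, 8]
    merge_utt(conv, split_token_idx=99)                     # expected [5, 6, 99, 7, 8] (assumed)
    merge_utt(conv, split_token_idx=99, final_token_idx=9)  # expected [5, 6, 99, 7, 8, 9] (assumed)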

crslab.data.dataloader.utils.merge_utt_replace(conversation, detect_token=None, replace_token=None, method='in')[source]

crslab.data.dataloader.utils.padded_tensor(items: List[Union[List[int], torch.LongTensor]], pad_idx: int = 0, pad_tail: bool = True, max_len: Optional[int] = None) → torch.LongTensor[source]

Create a padded matrix from an uneven list of lists.

The matrix is right-padded (filled to the right) by default, but is left-padded when pad_tail is set to False.

Parameters
  • items (list of list of int or torch.LongTensor) – list of sequences to pad.

  • pad_idx (int) – the value to use for padding.

  • pad_tail (bool) – pad at the tail (right) if True, otherwise at the head (left). Defaults to True.

  • max_len (int, optional) – if None, the padded length is the maximum item length.

Returns

padded tensor.

Return type

Tensor[int64]
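A usage sketch; the right-padded result follows from the documented default, and the left-padded variant assumes pad_tail=False flips the side.

    from crslab.data.dataloader.utils import padded_tensor

    items = [[1, 2], [3, 4, 5]]
    padded_tensor(items, pad_idx=0)
    # expected tensor([[1, 2, 0],
    #                  [3, 4, 5]])
    padded_tensor(items, pad_idx=0, pad_tail=False)
    # expected tensor([[0, 1, 2],   (assumed left padding)
    #                  [3, 4, 5]])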

crslab.data.dataloader.utils.truncate(vec, max_length, truncate_tail=True)[source]

Truncate vec so that its length is no more than max_length.

Parameters
  • vec (list) – source list.

  • max_length (int) –

  • truncate_tail (bool, optional) – truncate from the tail if True, otherwise from the head. Defaults to True.

Returns

truncated vec.

Return type

list
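A usage sketch; head truncation for truncate_tail=False is inferred from the parameter name.

    from crslab.data.dataloader.utils import truncate

    vec = [1, 2, 3, 4, 5]
    truncate(vec, max_length=3)                       # expected [1, 2, 3]
    truncate(vec, max_length=3, truncate_tail=False)  # expected [3, 4, 5] (assumed)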

Module contents