Dataset¶

`BaseDataset`
`NodeClassificationDataset`	Base class for datasets used in the node classification task.
`LinkPredictionDataset`	metric: Accuracy, multi-label f1 or multi-class f1.
`RecommendationDataset`
`AcademicDataset`
`HGBDataset`
`OHGBDataset`
`GTNDataset`	GTN Dataset.
`AsLinkPredictionDataset`	Repurpose a dataset for the link prediction task.
`AsNodeClassificationDataset`	Repurpose a dataset for a standard semi-supervised transductive node prediction task.
`EdgeClassificationDataset`	Base class for datasets used in the edge classification task.
`HypergraphDataset`

class BaseDataset(*args, **kwargs)[source]¶: Bases: ABC

class NodeClassificationDataset(*args, **kwargs)[source]¶

Bases: BaseDataset

NodeClassificationDataset is the base class for datasets used in the node classification task. Subclasses should provide attributes such as the graph (g), category and num_classes, and should implement the functions get_labels() and get_split().

g¶

The heterogeneous graph.

Type: dgl.DGLHeteroGraph

category¶

The category (or target) node type to be predicted. In general, only one node type is predicted.

Type: str

num_classes¶

The target node will be classified into num_classes categories.

Type: int

has_feature¶

Whether the dataset has node features. Default: False.

Type: bool

multi_label¶

Whether nodes can have multiple labels. Default: False. For now, only HGBn-IMDB is multi-label.

Type: bool
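The contract described above (a few attributes plus get_labels() and get_split()) can be sketched as follows. This is only an illustration: NodeClassificationSketch and ToyPaperDataset are hypothetical names, the graph is replaced by a plain dict, and nothing here uses DGL or the real base class.

```python
from abc import ABC, abstractmethod


class NodeClassificationSketch(ABC):
    """Hypothetical stand-in for the base class; it does not use DGL."""

    def __init__(self):
        self.g = None             # the heterogeneous graph in the real class
        self.category = None      # target node type (str)
        self.num_classes = None   # number of classes (int)
        self.has_feature = False
        self.multi_label = False

    @abstractmethod
    def get_labels(self):
        """Return the labels of the target nodes."""

    @abstractmethod
    def get_split(self):
        """Return train_idx, val_idx, test_idx."""


class ToyPaperDataset(NodeClassificationSketch):
    """Toy subclass: six 'paper' nodes with two classes."""

    def __init__(self):
        super().__init__()
        self.g = {"paper": 6, "author": 3}  # placeholder: node counts per type
        self.category = "paper"
        self.num_classes = 2
        self._labels = [0, 1, 0, 1, 1, 0]

    def get_labels(self):
        return self._labels

    def get_split(self):
        return [0, 1, 2, 3], [4], [5]


dataset = ToyPaperDataset()
train_idx, val_idx, test_idx = dataset.get_split()
print(dataset.category, dataset.num_classes, len(dataset.get_labels()))
```

Because both methods are abstract in the sketch, a subclass that omits either one cannot be instantiated.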

class LinkPredictionDataset(*args, **kwargs)[source]¶

metric: accuracy, multi-label F1 or multi-class F1. Default: accuracy.

get_split(val_ratio=0.1, test_ratio=0.2)[source]¶

Get the subgraphs for training, validation and testing. Generally, the original graph already has train_mask and test_mask in edata; otherwise the edges are split automatically.

If the original graph does not have train_mask in edata, it is assumed to have no valid_mask or test_mask either, and its edges are split into train/valid/test at 0.7/0.1/0.2 by default.

If the dataset has no validation_mask, the validation edges are split off randomly from the training edges.

Parameters

val_ratio (float) – The ratio of validation edges. Default: 0.1
test_ratio (float) – The ratio of test edges. Default: 0.2

Return type: train_hg
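As a rough illustration of the automatic 0.7/0.1/0.2 split described above, the sketch below partitions edge IDs at random. The function name split_edge_ids and the seed argument are assumptions made for this example; the actual method works on graph edata and returns subgraphs.

```python
import random


def split_edge_ids(num_edges, val_ratio=0.1, test_ratio=0.2, seed=0):
    """Randomly partition edge IDs 0..num_edges-1 into train/valid/test lists."""
    ids = list(range(num_edges))
    random.Random(seed).shuffle(ids)
    n_val = int(round(num_edges * val_ratio))
    n_test = int(round(num_edges * test_ratio))
    val_ids = ids[:n_val]
    test_ids = ids[n_val:n_val + n_test]
    train_ids = ids[n_val + n_test:]  # whatever remains is used for training
    return train_ids, val_ids, test_ids


train_ids, val_ids, test_ids = split_edge_ids(1000)
print(len(train_ids), len(val_ids), len(test_ids))
```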

class RecommendationDataset(*args, **kwargs)[source]¶

class AcademicDataset(name, raw_dir=None, force_reload=False, verbose=True)[source]¶

download()[source]¶

Override to implement your own logic for downloading the data.

It is recommended to download the data to the self.raw_dir folder. This step can be skipped if the dataset is already in self.raw_dir.

process()[source]¶: Override to implement your own logic for processing the input data.

save()[source]¶

Override to implement your own logic for saving the processed dataset to files.

It is recommended to use dgl.data.utils.save_graphs to save DGL graphs to files and dgl.data.utils.save_info to save extra information to files.

load()[source]¶

Override to implement your own logic for loading the saved dataset from files.

It is recommended to use dgl.data.utils.load_graphs to load DGL graphs from files and dgl.data.utils.load_info to load extra information into a Python dict.

has_cache()[source]¶

Override to implement your own logic for deciding whether a cached dataset exists.

Returns False by default.
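The five hooks above fit together as a simple caching lifecycle. The sketch below imitates that lifecycle with the standard library only (pickle instead of dgl.data.utils). The class CachedDatasetSketch is hypothetical, and the call order shown (check the cache first, otherwise download, process and save) is an assumption about how the base class drives the hooks, not something stated on this page.

```python
import os
import pickle
import tempfile


class CachedDatasetSketch:
    """Hypothetical stand-in mimicking the download/process/save/load/has_cache hooks."""

    def __init__(self, name, raw_dir, force_reload=False):
        self.name = name
        self.raw_dir = raw_dir
        self.cache_path = os.path.join(raw_dir, name + ".pkl")
        self.calls = []  # records which hooks ran, for illustration
        if not force_reload and self.has_cache():
            self.load()
        else:
            self.download()
            self.process()
            self.save()

    def download(self):
        self.calls.append("download")
        os.makedirs(self.raw_dir, exist_ok=True)  # a real dataset would fetch files here

    def process(self):
        self.calls.append("process")
        self.data = {"edges": [(0, 1), (1, 2)]}  # toy processed result

    def save(self):
        self.calls.append("save")
        with open(self.cache_path, "wb") as f:
            pickle.dump(self.data, f)

    def load(self):
        self.calls.append("load")
        with open(self.cache_path, "rb") as f:
            self.data = pickle.load(f)

    def has_cache(self):
        return os.path.exists(self.cache_path)


raw_dir = tempfile.mkdtemp()
first = CachedDatasetSketch("toy", raw_dir)
second = CachedDatasetSketch("toy", raw_dir)
print(first.calls)   # ['download', 'process', 'save']
print(second.calls)  # ['load']
```

Passing force_reload=True in the sketch skips the cache check and runs the full download/process/save path again.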

class HGBDataset(name, raw_dir=None, force_reload=False, verbose=True)[source]¶

download()[source]¶

Override to implement your own logic for downloading the data.

It is recommended to download the data to the self.raw_dir folder. This step can be skipped if the dataset is already in self.raw_dir.

process()[source]¶: Override to implement your own logic for processing the input data.

save()[source]¶

Override to implement your own logic for saving the processed dataset to files.

It is recommended to use dgl.data.utils.save_graphs to save DGL graphs to files and dgl.data.utils.save_info to save extra information to files.

load()[source]¶

Override to implement your own logic for loading the saved dataset from files.

It is recommended to use dgl.data.utils.load_graphs to load DGL graphs from files and dgl.data.utils.load_info to load extra information into a Python dict.

has_cache()[source]¶

Override to implement your own logic for deciding whether a cached dataset exists.

Returns False by default.

class OHGBDataset(name, raw_dir=None, force_reload=False, verbose=True)[source]¶

download()[source]¶

Override to implement your own logic for downloading the data.

It is recommended to download the data to the self.raw_dir folder. This step can be skipped if the dataset is already in self.raw_dir.

process()[source]¶: Override to implement your own logic for processing the input data.

save()[source]¶

Override to implement your own logic for saving the processed dataset to files.

It is recommended to use dgl.data.utils.save_graphs to save DGL graphs to files and dgl.data.utils.save_info to save extra information to files.

load()[source]¶

Override to implement your own logic for loading the saved dataset from files.

It is recommended to use dgl.data.utils.load_graphs to load DGL graphs from files and dgl.data.utils.load_info to load extra information into a Python dict.

has_cache()[source]¶

Override to implement your own logic for deciding whether a cached dataset exists.

Returns False by default.

class GTNDataset(name, raw_dir=None, force_reload=False, verbose=False, transform=None)[source]¶

GTN Dataset.

It contains the three datasets used in the NeurIPS’19 paper Graph Transformer Networks <https://arxiv.org/abs/1911.06455>: two citation network datasets, DBLP and ACM, and a movie dataset, IMDB. DBLP contains three types of nodes (papers (P), authors (A), conferences (C)), four types of edges (PA, AP, PC, CP), and the research areas of authors as labels. ACM contains three types of nodes (papers (P), authors (A), subjects (S)), four types of edges (PA, AP, PS, SP), and the categories of papers as labels. Each node in these two datasets is represented as a bag-of-words of keywords. IMDB contains three types of nodes (movies (M), actors (A), and directors (D)), and its labels are the genres of movies. Its node features are bag-of-words representations of plots.

Dataset statistics:

Dataset  Nodes  Edges  Edge types  Features  Training  Validation  Test
DBLP     18405  67946  4           334       800       400         2857
ACM      8994   25922  4           1902      600       300         2125
IMDB     12772  37288  4           1256      300       300         2339

Data source link: <https://drive.google.com/file/d/1qOZ3QjqWMIIvWjzrIdRe3EA4iKzPi6S5/view?usp=sharing>

Parameters

name (str) – Name of the dataset. Supported dataset names are ‘dblp4GTN’, ‘acm4GTN’ and ‘imdb4GTN’.
raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: False
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

Examples

>>> dataset = GTNDataset(name='imdb4GTN')
>>> graph = dataset[0]

process()[source]¶: Override to implement your own logic for processing the input data.

save()[source]¶

Override to implement your own logic for saving the processed dataset to files.

It is recommended to use dgl.data.utils.save_graphs to save DGL graphs to files and dgl.data.utils.save_info to save extra information to files.

load()[source]¶

Override to implement your own logic for loading the saved dataset from files.

It is recommended to use dgl.data.utils.load_graphs to load DGL graphs from files and dgl.data.utils.load_info to load extra information into a Python dict.

class AsLinkPredictionDataset(dataset, target_link, target_link_r, split_ratio=None, neg_ratio=3, neg_sampler='global', **kwargs)[source]¶

Repurpose a dataset for the link prediction task.

The created dataset includes the data needed for link prediction. It keeps only the first graph in the provided dataset, generates train/val/test edges according to the given split ratio, and generates the corresponding negative edges based on neg_ratio. The generated edges are cached to disk for fast reloading. If the provided split ratio differs from the cached one, the dataset is re-processed.

Parameters

dataset (DGLDataset) – The dataset to be converted.
split_ratio ((float, float, float), optional) – Split ratios for the training, validation and test sets. Must sum to one.
neg_ratio (int, optional) – Controls how many negative samples are drawn. The number of negative samples will be at most neg_ratio * num_positive_edges.
target_link (list[tuple[str, str, str]]) – The edge types on which predictions are made.
target_link_r (list[tuple[str, str, str]], optional) – The reverse edge types of the target links. Used to remove the reverse edges of val/test edges from the training graph.
neg_sampler (str, optional) – How the negative edges of val/test edges are sampled: ‘global’ or ‘per_source’.
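The bound on the number of negatives (at most neg_ratio * num_positive_edges) can be pictured with the sketch below. It draws that many random candidate pairs and discards any that collide with a positive edge or with each other, which is one plausible reason the final count can fall below the bound. This is an assumption for illustration only; the function name and its arguments are hypothetical and this is not the class's actual sampler.

```python
import random


def sample_global_negatives(pos_edges, num_src, num_dst, neg_ratio=3, seed=0):
    """Draw up to neg_ratio * len(pos_edges) negative (src, dst) pairs uniformly."""
    rng = random.Random(seed)
    positives = set(pos_edges)
    negatives = set()
    for _ in range(neg_ratio * len(pos_edges)):
        candidate = (rng.randrange(num_src), rng.randrange(num_dst))
        if candidate not in positives:  # drop candidates that are real edges
            negatives.add(candidate)    # the set also drops duplicate candidates
    return sorted(negatives)


pos_edges = [(0, 0), (0, 1), (1, 2), (2, 3)]
neg_edges = sample_global_negatives(pos_edges, num_src=3, num_dst=4)
print(len(neg_edges), "negatives for", len(pos_edges), "positives")
```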

train_graph¶

The DGLHeteroGraph for training

Type: DGLHeteroGraph

pos_val_graph¶

The DGLHeteroGraph containing positive validation edges

Type: DGLHeteroGraph

pos_test_graph¶

The DGLHeteroGraph containing positive test edges

Type: DGLHeteroGraph

neg_val_graph¶

The DGLHeteroGraph containing negative validation edges

Type: DGLHeteroGraph

neg_test_graph¶

The DGLHeteroGraph containing negative test edges

Type: DGLHeteroGraph

process()[source]¶: Override to implement your own logic for processing the input data.

has_cache()[source]¶

Override to implement your own logic for deciding whether a cached dataset exists.

Returns False by default.

load()[source]¶

Override to implement your own logic for loading the saved dataset from files.

It is recommended to use dgl.data.utils.load_graphs to load DGL graphs from files and dgl.data.utils.load_info to load extra information into a Python dict.

save()[source]¶

Override to implement your own logic for saving the processed dataset to files.

It is recommended to use dgl.data.utils.save_graphs to save DGL graphs to files and dgl.data.utils.save_info to save extra information to files.

class AsNodeClassificationDataset(data, name=None, labeled_nodes_split_ratio=None, prediction_ratio=None, target_ntype=None, label_feat_name='label', label_mask_feat_name=None, **kwargs)[source]¶

Repurpose a dataset for a standard semi-supervised transductive node prediction task.

The class converts a given dataset into a new dataset object that:

Contains only one heterogeneous graph, accessible from dataset[0].

The graph stores:

Node labels in g.nodes[target_ntype].data['label'].

Train/val/test masks in g.nodes[target_ntype].data['train_mask'], g.nodes[target_ntype].data['val_mask'], and g.nodes[target_ntype].data['test_mask'] respectively.

In addition, the dataset contains the following attributes:

num_classes, the number of classes to predict.

train_idx, val_idx, test_idx, train/val/test indexes.

The class keeps only the first graph in the provided dataset and generates train/val/test masks according to the given split ratio. The generated masks are cached to disk for fast reloading. If the provided split ratio differs from the cached one, the dataset is re-processed.

Parameters

data (DGLDataset or DGLHeteroGraph) – The dataset or graph to be converted.
name (str) – The dataset name. Optional when data is DGLDataset. Required when data is DGLHeteroGraph.
labeled_nodes_split_ratio ((float, float, float), optional) – Split ratios for training, validation and test sets. Must sum to 1. If None, we will use the train_mask, val_mask and test_mask from the original graph.
prediction_ratio (float, optional) – The ratio of the number of prediction nodes to the number of all unlabeled nodes. Must be between 0 and 1. If None, the pred_mask from the original graph is used.
target_ntype (str) – The node type to add split mask for.
label_feat_name (str, optional) – The feature name of label. If None, we will use the name “label”.
label_mask_feat_name (str, optional) – The feature name of the mask indicating the indices of nodes with labels. None means that all nodes are labeled.

num_classes¶

Number of classes to predict.

Type: int

train_idx¶

A 1-D integer tensor of training node IDs.

Type: Tensor

val_idx¶

A 1-D integer tensor of validation node IDs.

Type: Tensor

test_idx¶

A 1-D integer tensor of test node IDs.

Type: Tensor

pred_idx¶

A 1-D integer tensor of prediction node IDs.

Type: Tensor
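The relationship between a split ratio, the boolean masks and the index attributes can be sketched as follows. The example uses plain Python lists in place of tensors and assumes every node is labeled; make_split_masks and mask_to_idx are hypothetical helper names, not part of the class.

```python
import random


def make_split_masks(num_nodes, split_ratio=(0.8, 0.1, 0.1), seed=0):
    """Build boolean train/val/test masks from a (train, val, test) ratio."""
    if abs(sum(split_ratio) - 1.0) > 1e-6:
        raise ValueError("split ratios must sum to 1")
    order = list(range(num_nodes))
    random.Random(seed).shuffle(order)
    n_train = int(round(num_nodes * split_ratio[0]))
    n_val = int(round(num_nodes * split_ratio[1]))
    groups = {
        "train_mask": order[:n_train],
        "val_mask": order[n_train:n_train + n_val],
        "test_mask": order[n_train + n_val:],  # the remainder goes to test
    }
    masks = {}
    for key, members in groups.items():
        mask = [False] * num_nodes
        for node_id in members:
            mask[node_id] = True
        masks[key] = mask
    return masks


def mask_to_idx(mask):
    """Convert a boolean mask into a sorted list of node IDs."""
    return [node_id for node_id, flag in enumerate(mask) if flag]


masks = make_split_masks(10)
train_idx = mask_to_idx(masks["train_mask"])
val_idx = mask_to_idx(masks["val_mask"])
test_idx = mask_to_idx(masks["test_mask"])
print(len(train_idx), len(val_idx), len(test_idx))
```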

process()[source]¶: Override to implement your own logic for processing the input data.

has_cache()[source]¶

Override to implement your own logic for deciding whether a cached dataset exists.

Returns False by default.

load()[source]¶

Override to implement your own logic for loading the saved dataset from files.

It is recommended to use dgl.data.utils.load_graphs to load DGL graphs from files and dgl.data.utils.load_info to load extra information into a Python dict.

save()[source]¶

Override to implement your own logic for saving the processed dataset to files.

It is recommended to use dgl.data.utils.save_graphs to save DGL graphs to files and dgl.data.utils.save_info to save extra information to files.

class EdgeClassificationDataset(*args, **kwargs)[source]¶

EdgeClassificationDataset is the base class for datasets used in the edge classification task. Subclasses should provide attributes such as the graph (g), category and num_classes, and should implement the functions get_labels() and get_split().

g¶

The heterogeneous graph.

Type: dgl.DGLHeteroGraph

category¶

The category (or target) node type to be predicted. In general, only one node type is predicted.

Type: str

num_classes¶

The target node will be classified into num_classes categories.

Type: int

has_feature¶

Whether the dataset has node features. Default: False.

Type: bool

multi_label¶

Whether nodes can have multiple labels. Default: False. For now, only HGBn-IMDB is multi-label.

Type: bool

get_labels()[source]¶

Subclasses should override this function. It returns the labels of the target nodes.

Note

In general, the labels are a th.LongTensor. For a multi-label dataset, however, they should be a th.FloatTensor. Otherwise a RuntimeError is raised: Expected object of scalar type Long but got scalar type Float for argument #2 'target' in call to _thnn_nll_loss_forward

Returns: labels
Return type: torch.Tensor
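The dtype difference in the note follows from how the two kinds of labels are usually encoded: one integer class index per item in the single-label case, and one multi-hot float vector per item in the multi-label case. The sketch below shows both encodings with plain Python lists standing in for tensors; the function names are hypothetical.

```python
def single_label_targets(class_ids):
    """Single-label case: one integer class index per item (Long-style labels)."""
    return [int(c) for c in class_ids]


def multi_hot_targets(label_sets, num_classes):
    """Multi-label case: one float vector per item with 1.0 at each active class."""
    rows = []
    for labels in label_sets:
        row = [0.0] * num_classes
        for class_id in labels:
            row[class_id] = 1.0
        rows.append(row)
    return rows


print(single_label_targets([2, 0, 1]))
print(multi_hot_targets([{0, 2}, {1}, set()], num_classes=3))
```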

get_split(validation=True)[source]¶

Get the train, validation and test indices.

Parameters

validation (bool) – Whether to split off a validation set. Default: True. If False, val_idx is the same as train_idx.

Returns

train_idx, val_idx, test_idx

Return type

torch.Tensor, torch.Tensor, torch.Tensor
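The effect of the validation flag can be sketched as follows. How the validation set is carved out is not specified here, so the sketch assumes a random hold-out from the training indices. get_split_sketch, val_fraction and seed are hypothetical names, and plain lists stand in for tensors.

```python
import random


def get_split_sketch(train_idx, test_idx, validation=True, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx, test_idx); reuse train as val when validation=False."""
    train_idx = list(train_idx)
    test_idx = list(test_idx)
    if not validation:
        # No separate validation set: val_idx is simply a copy of train_idx.
        return train_idx, list(train_idx), test_idx
    shuffled = train_idx[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    val = sorted(shuffled[:n_val])
    train = sorted(shuffled[n_val:])
    return train, val, test_idx


train, val, test = get_split_sketch(range(10), range(10, 13), validation=False)
print(train == val)  # True
```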