Dataset¶
| Class | Description |
| --- | --- |
| NodeClassificationDataset | Base class for datasets that can be used in the node classification task. |
| LinkPredictionDataset | metric: Accuracy, multi-label f1 or multi-class f1. |
| GTNDataset | GTN Dataset. |
| AsLinkPredictionDataset | Repurpose a dataset for a link prediction task. |
| AsNodeClassificationDataset | Repurpose a dataset for a standard semi-supervised transductive node prediction task. |
| EdgeClassificationDataset | Base class for datasets that can be used in the edge classification task. |

- class NodeClassificationDataset(*args, **kwargs)[source]¶
Bases: BaseDataset
NodeClassificationDataset is a base class for datasets that can be used in the node classification task. A subclass should therefore provide attributes such as graph, category and num_classes, and it should implement the functions get_labels() and get_split().
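The contract described above can be sketched in plain Python so that it runs without DGL or PyTorch. `ToyNodeClassificationDataset`, its constructor arguments, and the tuple layout returned by `get_split()` are assumptions made for illustration. The real base class and its return types may differ.

```python
class ToyNodeClassificationDataset:
    """Hypothetical stand-in showing what a subclass is expected to provide."""

    def __init__(self, labels, train_ids, valid_ids, test_ids, category="paper"):
        self.category = category                    # target node type to predict
        self._labels = list(labels)                 # one class id per target node
        self.num_classes = len(set(self._labels))   # number of distinct classes
        self._split = (list(train_ids), list(valid_ids), list(test_ids))

    def get_labels(self):
        """Return the labels of the target nodes."""
        return self._labels

    def get_split(self):
        """Return (train, valid, test) node IDs; this tuple layout is an assumption."""
        return self._split


dataset = ToyNodeClassificationDataset(
    labels=[0, 1, 1, 2, 0, 2],
    train_ids=[0, 1, 2],
    valid_ids=[3],
    test_ids=[4, 5],
)
train_ids, valid_ids, test_ids = dataset.get_split()
# Look up the labels of the training nodes only.
train_labels = [dataset.get_labels()[i] for i in train_ids]
```

A real subclass would additionally hold the heterogeneous graph in `g` and return tensors rather than lists.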
- g¶
The heterogeneous graph.
- Type
dgl.DGLHeteroGraph
- category¶
The category (or target) node type to be predicted. In general, only one node type is predicted.
- Type
- class LinkPredictionDataset(*args, **kwargs)[source]¶
metric: Accuracy, multi-label f1 or multi-class f1. Default: accuracy
- get_split(val_ratio=0.1, test_ratio=0.2)[source]¶
Get subgraphs for train, valid and test. Generally, the original graph has train_mask and test_mask in edata; otherwise the edges are split automatically.
If the original graph does not have train_mask in edata, it is assumed that there is no valid_mask or test_mask either, and the edges of the original graph are split into train/valid/test at 0.7/0.1/0.2.
If the dataset has no validation mask, the train edges are split randomly to obtain the validation edges.
- Parameters
val_ratio (float) – The ratio of validation edges. Default: 0.1
test_ratio (float) – The ratio of test edges. Default: 0.2
- Return type
train_hg
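The automatic 0.7/0.1/0.2 split can be illustrated with a small sketch. It partitions edge IDs only and uses the standard library instead of DGL. The function name `split_edge_ids` and the `seed` argument are hypothetical, and the library's own splitting logic may differ in detail.

```python
import random


def split_edge_ids(num_edges, val_ratio=0.1, test_ratio=0.2, seed=0):
    """Randomly partition edge IDs into train/valid/test lists."""
    if not 0 <= val_ratio + test_ratio < 1:
        raise ValueError("val_ratio + test_ratio must be in [0, 1)")
    edge_ids = list(range(num_edges))
    random.Random(seed).shuffle(edge_ids)        # seeded for reproducibility
    num_val = int(num_edges * val_ratio)
    num_test = int(num_edges * test_ratio)
    val_ids = edge_ids[:num_val]
    test_ids = edge_ids[num_val:num_val + num_test]
    train_ids = edge_ids[num_val + num_test:]    # the remainder goes to train
    return train_ids, val_ids, test_ids


train_ids, val_ids, test_ids = split_edge_ids(1000)
```

With the default ratios the remainder assigned to training is about 70% of the edges. Building the three subgraphs from these IDs is left out of the sketch.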
- class AcademicDataset(name, raw_dir=None, force_reload=False, verbose=True)[source]¶
- download()[source]¶
Override to implement your own logic for downloading data.
It is recommended to download the data to the self.raw_dir folder. This can be skipped if the dataset is already in self.raw_dir.
- save()[source]¶
Override to implement your own logic for saving the processed dataset into files.
It is recommended to use dgl.data.utils.save_graphs to save DGL graphs into files and dgl.data.utils.save_info to save extra information into files.
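The download and save hooks follow a common caching pattern, sketched here with the standard library only. `ToyCachedDataset`, its file names, the `process()` helper, and the boolean returned by `download()` are all hypothetical. JSON files stand in for the graph files that dgl.data.utils.save_graphs and dgl.data.utils.save_info would write.

```python
import json
import os
import tempfile


class ToyCachedDataset:
    """Hypothetical stand-in for a dataset class with download() and save() hooks."""

    def __init__(self, name, raw_dir):
        self.name = name
        self.raw_dir = raw_dir
        self.raw_path = os.path.join(raw_dir, name + ".json")
        self.save_path = os.path.join(raw_dir, name + "_processed.json")
        self.edges = None

    def download(self):
        # Nothing to do when the raw file is already in self.raw_dir.
        if os.path.exists(self.raw_path):
            return False
        os.makedirs(self.raw_dir, exist_ok=True)
        # A real implementation would fetch the file from a remote source here;
        # this sketch writes a tiny edge list instead.
        with open(self.raw_path, "w") as f:
            json.dump({"edges": [[0, 1], [1, 2]]}, f)
        return True

    def process(self):
        # Load the raw file into memory.
        with open(self.raw_path) as f:
            self.edges = json.load(f)["edges"]

    def save(self):
        # Stand-in for saving the graph plus extra information.
        with open(self.save_path, "w") as f:
            json.dump({"edges": self.edges, "num_edges": len(self.edges)}, f)


with tempfile.TemporaryDirectory() as tmp:
    ds = ToyCachedDataset("toy", raw_dir=tmp)
    first = ds.download()     # writes the raw file
    second = ds.download()    # skipped: the file is already present
    ds.process()
    ds.save()
    saved_ok = os.path.exists(ds.save_path)
```

The return value of `download()` exists only so the skip behaviour is visible in the example.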
- class HGBDataset(name, raw_dir=None, force_reload=False, verbose=True)[source]¶
- download()[source]¶
Override to implement your own logic for downloading data.
It is recommended to download the data to the self.raw_dir folder. This can be skipped if the dataset is already in self.raw_dir.
- save()[source]¶
Override to implement your own logic for saving the processed dataset into files.
It is recommended to use dgl.data.utils.save_graphs to save DGL graphs into files and dgl.data.utils.save_info to save extra information into files.
- class OHGBDataset(name, raw_dir=None, force_reload=False, verbose=True)[source]¶
- download()[source]¶
Override to implement your own logic for downloading data.
It is recommended to download the data to the self.raw_dir folder. This can be skipped if the dataset is already in self.raw_dir.
- save()[source]¶
Override to implement your own logic for saving the processed dataset into files.
It is recommended to use dgl.data.utils.save_graphs to save DGL graphs into files and dgl.data.utils.save_info to save extra information into files.
- class GTNDataset(name, raw_dir=None, force_reload=False, verbose=False, transform=None)[source]¶
GTN Dataset.
It contains three datasets used in the NeurIPS’19 paper Graph Transformer Networks <https://arxiv.org/abs/1911.06455>: two citation network datasets, DBLP and ACM, and a movie dataset, IMDB. DBLP contains three types of nodes (papers (P), authors (A), conferences (C)), four types of edges (PA, AP, PC, CP), and research areas of authors as labels. ACM contains three types of nodes (papers (P), authors (A), subjects (S)), four types of edges (PA, AP, PS, SP), and categories of papers as labels. Each node in these two datasets is represented as a bag-of-words of keywords. IMDB contains three types of nodes (movies (M), actors (A), and directors (D)), and its labels are the genres of movies. Its node features are bag-of-words representations of plots.
Dataset statistics:
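| Dataset | Nodes | Edges | Edge types | Features | Training | Validation | Test |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DBLP | 18405 | 67946 | 4 | 334 | 800 | 400 | 2857 |
| ACM | 8994 | 25922 | 4 | 1902 | 600 | 300 | 2125 |
| IMDB | 12772 | 37288 | 4 | 1256 | 300 | 300 | 2339 |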
Data source link: <https://drive.google.com/file/d/1qOZ3QjqWMIIvWjzrIdRe3EA4iKzPi6S5/view?usp=sharing>
- Parameters
name (str) – Name of the dataset. Supported dataset names are ‘dblp4GTN’, ‘acm4GTN’ and ‘imdb4GTN’.
raw_dir (str) – The directory that will store the downloaded data, or the directory that already stores the input data. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: False
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
Examples
>>> dataset = GTNDataset(name='imdb4GTN')
>>> graph = dataset[0]
- class AsLinkPredictionDataset(dataset, target_link, target_link_r, split_ratio=None, neg_ratio=3, neg_sampler='global', **kwargs)[source]¶
Repurpose a dataset for a link prediction task.
The created dataset will include the data needed for link prediction. It keeps only the first graph in the provided dataset and generates train/val/test edges according to the given split ratio, along with the corresponding negative edges based on neg_ratio. The generated edges are cached to disk for fast reloading. If the provided split ratio differs from the cached one, the dataset is re-processed.
- Parameters
dataset (DGLDataset) – The dataset to be converted.
split_ratio ((float, float, float), optional) – Split ratios for training, validation and test sets. Must sum to one.
neg_ratio (int, optional) – How many negative samples to draw per positive edge. The number of negative samples will be less than or equal to neg_ratio * num_positive_edges.
target_link (list[tuple[str, str, str]]) – The edge types on which predictions are made.
target_link_r (list[tuple[str, str, str]], optional) – The reverse edge types of the target links. Used to remove the reverse edges of val/test edges from the train graph.
neg_sampler (str, optional) – How negative edges of val/test edges are sampled: ‘global’ or ‘per_source’.
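To illustrate how neg_ratio bounds the number of negative edges, here is a simplified sketch of the two sampling modes on a graph with a single node type. The function `sample_negatives` is hypothetical, and the interpretation of ‘global’ (both endpoints drawn uniformly) and ‘per_source’ (only the destination redrawn) is an assumption; the library's samplers operate on heterogeneous graphs and may behave differently.

```python
import random


def sample_negatives(pos_edges, num_nodes, neg_ratio=3, mode="global", seed=0):
    """Sample (src, dst) pairs that are not in pos_edges."""
    rng = random.Random(seed)
    existing = set(pos_edges)
    negatives = []
    if mode == "global":
        # Draw both endpoints uniformly; draws that hit a real edge are dropped,
        # so fewer than neg_ratio * len(pos_edges) pairs may come back.
        for _ in range(neg_ratio * len(pos_edges)):
            pair = (rng.randrange(num_nodes), rng.randrange(num_nodes))
            if pair not in existing:
                negatives.append(pair)
    elif mode == "per_source":
        # Keep each positive edge's source and redraw only the destination.
        for src, _ in pos_edges:
            for _ in range(neg_ratio):
                pair = (src, rng.randrange(num_nodes))
                if pair not in existing:
                    negatives.append(pair)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return negatives


pos_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
neg_global = sample_negatives(pos_edges, num_nodes=10, mode="global")
neg_per_source = sample_negatives(pos_edges, num_nodes=10, mode="per_source")
```

Because draws that coincide with a positive edge are discarded rather than retried, the result can be shorter than neg_ratio * num_positive_edges, which matches the "equal or less than" wording above.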
- train_graph¶
The DGLHeteroGraph for training
- Type
DGLHeteroGraph
- pos_val_graph¶
The DGLHeteroGraph containing positive validation edges
- Type
DGLHeteroGraph
- pos_test_graph¶
The DGLHeteroGraph containing positive test edges
- Type
DGLHeteroGraph
- neg_val_graph¶
The DGLHeteroGraph containing negative validation edges
- Type
DGLHeteroGraph
- neg_test_graph¶
The DGLHeteroGraph containing negative test edges
- Type
DGLHeteroGraph
- has_cache()[source]¶
Override to implement your own logic for deciding whether a cached dataset exists.
Returns False by default.
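One way to decide whether a cache is usable, given that a changed split ratio triggers re-processing, is to store the ratio next to the cached data and compare it on load. The sketch below assumes a JSON cache file and uses free functions named `has_cache` and `write_cache`; both the file layout and the function signatures are hypothetical, not the library's.

```python
import json
import os
import tempfile


def has_cache(cache_path, split_ratio):
    """Return True only if a cache file exists and records the same split ratio."""
    if not os.path.exists(cache_path):
        return False
    with open(cache_path) as f:
        cached = json.load(f)
    return cached.get("split_ratio") == list(split_ratio)


def write_cache(cache_path, split_ratio):
    """Record the split ratio that the cached edges were generated with."""
    with open(cache_path, "w") as f:
        json.dump({"split_ratio": list(split_ratio)}, f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "split.json")
    before = has_cache(path, (0.8, 0.1, 0.1))     # no file yet
    write_cache(path, (0.8, 0.1, 0.1))
    same = has_cache(path, (0.8, 0.1, 0.1))       # same ratio: cache is valid
    changed = has_cache(path, (0.6, 0.2, 0.2))    # different ratio: re-process
```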
- class AsNodeClassificationDataset(data, name=None, labeled_nodes_split_ratio=None, prediction_ratio=None, target_ntype=None, label_feat_name='label', label_mask_feat_name=None, **kwargs)[source]¶
Repurpose a dataset for a standard semi-supervised transductive node prediction task.
The class converts a given dataset into a new dataset object that:
- Contains only one heterogeneous graph, accessible from dataset[0].
- The graph stores:
  - Node labels in g.nodes[target_ntype].data['label'].
  - Train/val/test masks in g.nodes[target_ntype].data['train_mask'], g.nodes[target_ntype].data['val_mask'], and g.nodes[target_ntype].data['test_mask'] respectively.
In addition, the dataset contains the following attributes:
- num_classes, the number of classes to predict.
- train_idx, val_idx, test_idx, the train/val/test indexes.
The class keeps only the first graph in the provided dataset and generates train/val/test masks according to the given split ratio. The generated masks are cached to disk for fast reloading. If the provided split ratio differs from the cached one, the dataset is re-processed.
- Parameters
data (DGLDataset or DGLHeteroGraph) – The dataset or graph to be converted.
name (str) – The dataset name. Optional when data is DGLDataset. Required when data is DGLHeteroGraph.
labeled_nodes_split_ratio ((float, float, float), optional) – Split ratios for training, validation and test sets. Must sum to 1. If None, we will use the train_mask, val_mask and test_mask from the original graph.
prediction_ratio (float, optional) – The ratio of number of prediction nodes to all unlabeled nodes. Prediction_ratio ranges from 0 to 1. If None, we will use the pred_mask from the original graph.
target_ntype (str) – The node type to add split mask for.
label_feat_name (str, optional) – The feature name of label. If None, we will use the name “label”.
label_mask_feat_name (str, optional) – The feature name of the mask indicating the indices of nodes with labels. None means that all nodes are labeled.
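The way a split ratio turns into train/val/test masks over the labeled nodes can be sketched as follows. This is a plain-Python illustration using lists of booleans instead of tensors. `make_split_masks`, its `seed` argument, and the choice to give the rounding remainder to the test set are assumptions, not the library's implementation.

```python
import random


def make_split_masks(num_nodes, split_ratio, labeled_mask=None, seed=0):
    """Return boolean train/val/test masks over num_nodes nodes."""
    train_r, val_r, test_r = split_ratio
    if abs(train_r + val_r + test_r - 1.0) > 1e-6:
        raise ValueError("split ratios must sum to 1")
    if labeled_mask is None:
        labeled_mask = [True] * num_nodes    # None means every node is labeled
    labeled = [i for i, flag in enumerate(labeled_mask) if flag]
    random.Random(seed).shuffle(labeled)
    n_train = int(len(labeled) * train_r)
    n_val = int(len(labeled) * val_r)
    parts = {
        "train_mask": labeled[:n_train],
        "val_mask": labeled[n_train:n_train + n_val],
        "test_mask": labeled[n_train + n_val:],    # remainder goes to test
    }
    masks = {}
    for name, ids in parts.items():
        mask = [False] * num_nodes
        for i in ids:
            mask[i] = True
        masks[name] = mask
    return masks


# Nodes 8 and 9 are unlabeled and therefore end up in no mask.
masks = make_split_masks(10, (0.6, 0.2, 0.2), labeled_mask=[True] * 8 + [False] * 2)
train_idx = [i for i, flag in enumerate(masks["train_mask"]) if flag]
```

The last line shows how an index list such as train_idx relates to its mask: it holds the positions where the mask is True.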
- train_idx¶
A 1-D integer tensor of training node IDs.
- Type
Tensor
- val_idx¶
A 1-D integer tensor of validation node IDs.
- Type
Tensor
- test_idx¶
A 1-D integer tensor of test node IDs.
- Type
Tensor
- pred_idx¶
A 1-D integer tensor of prediction node IDs.
- Type
Tensor
- has_cache()[source]¶
Override to implement your own logic for deciding whether a cached dataset exists.
Returns False by default.
- class EdgeClassificationDataset(*args, **kwargs)[source]¶
EdgeClassificationDataset is a base class for datasets that can be used in the edge classification task. A subclass should therefore provide attributes such as graph, category and num_classes, and it should implement the functions get_labels() and get_split().
- g¶
The heterogeneous graph.
- Type
dgl.DGLHeteroGraph
- category¶
The category (or target) node type to be predicted. In general, only one node type is predicted.
- Type
- multi_label¶
Whether a node can have multiple labels. Default: False. For now, only HGBn-IMDB is multi-label.
- Type
- get_labels()[source]¶
Subclasses should override this function. It returns the labels of the target nodes.
Note
In general, the labels are a th.LongTensor. For a multi-label dataset, however, they should be a th.FloatTensor; otherwise the following error is raised: RuntimeError: Expected object of scalar type Long but got scalar type Float for argument #2 'target' in call to _thnn_nll_loss_forward
- Returns
labels
- Return type
torch.Tensor
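The difference between the two label layouts can be shown without PyTorch. In this sketch, single-label data is encoded as one integer class ID per node (what a th.LongTensor would hold), and multi-label data as one multi-hot float row per node (what a th.FloatTensor would hold). `encode_labels` is a hypothetical helper, and converting its output to tensors is left out so the example has no dependencies.

```python
def encode_labels(node_labels, num_classes, multi_label=False):
    """Encode per-node labels as class IDs or as multi-hot float rows."""
    if not multi_label:
        # Single-label: one integer class ID per node.
        return [int(label) for label in node_labels]
    # Multi-label: one float row per node, 1.0 for each class the node has.
    rows = []
    for classes in node_labels:
        row = [0.0] * num_classes
        for c in classes:
            row[c] = 1.0
        rows.append(row)
    return rows


single = encode_labels([2, 0, 1], num_classes=3)
multi = encode_labels([[0, 2], [1], []], num_classes=3, multi_label=True)
```

A node with several classes cannot be described by a single integer, which is why the multi-label case uses a float matrix instead of a vector of class IDs.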