openhgnn.dataset.GTNDataset

class GTNDataset(name, raw_dir=None, force_reload=False, verbose=False, transform=None)

GTN Dataset.

It contains the three datasets used in the NeurIPS’19 paper Graph Transformer Networks <https://arxiv.org/abs/1911.06455>: two citation network datasets, DBLP and ACM, and a movie dataset, IMDB.

DBLP contains three types of nodes (papers (P), authors (A), conferences (C)), four types of edges (PA, AP, PC, CP), and the research areas of authors as labels. ACM contains three types of nodes (papers (P), authors (A), subjects (S)), four types of edges (PA, AP, PS, SP), and the categories of papers as labels. Each node in these two datasets is represented as a bag-of-words of keywords.

IMDB contains three types of nodes (movies (M), actors (A), directors (D)), and its labels are the genres of movies. Its node features are bag-of-words representations of plots.

Dataset statistics:

Dataset  Nodes  Edges  Edge types  Features  Training  Validation  Test
DBLP     18405  67946  4           334       800       400         2857
ACM      8994   25922  4           1902      600       300         2125
IMDB     12772  37288  4           1256      300       300         2339
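
To make the split sizes easier to compare across datasets, the following minimal sketch adds up the Training, Validation and Test columns from the statistics table above and reports each split as a fraction of that sum. The numbers are copied from the table. The names `SPLITS` and `split_summary` are hypothetical helpers, not part of OpenHGNN, and the sketch assumes the three columns together cover all labeled nodes.

```python
# Split sizes copied from the statistics table: (training, validation, test).
SPLITS = {
    "DBLP": (800, 400, 2857),
    "ACM": (600, 300, 2125),
    "IMDB": (300, 300, 2339),
}


def split_summary(splits):
    """Return {dataset: (total, train_frac, val_frac, test_frac)}.

    `total` is the sum of the three split sizes. Treating it as the number
    of labeled nodes is an assumption, not something the table states.
    """
    summary = {}
    for name, (train, val, test) in splits.items():
        total = train + val + test
        summary[name] = (total, train / total, val / total, test / total)
    return summary


SUMMARY = split_summary(SPLITS)

for name, (total, train_frac, val_frac, test_frac) in SUMMARY.items():
    print(
        f"{name}: {total} nodes in splits, "
        f"train {train_frac:.1%}, val {val_frac:.1%}, test {test_frac:.1%}"
    )
```

In all three datasets the test split is by far the largest, so reported accuracy rests on far more nodes than the model is trained on.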

Data source link: <https://drive.google.com/file/d/1qOZ3QjqWMIIvWjzrIdRe3EA4iKzPi6S5/view?usp=sharing>

Parameters:

name (str) – Name of the dataset. Supported dataset names are ‘dblp4GTN’, ‘acm4GTN’ and ‘imdb4GTN’.
raw_dir (str) – Directory in which to store the downloaded data, or a directory that already contains the input data. Default: ~/.dgl/
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: False
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
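
Because `transform` accepts a single callable, several transformations have to be combined into one before they are passed in. The sketch below shows one way to do that in plain Python. `compose` and `add_tag` are hypothetical helpers, not part of OpenHGNN or DGL, and a list stands in for the DGLGraph so the example runs without downloading any data.

```python
def compose(*transforms):
    """Chain several graph -> graph callables into a single callable.

    The transforms are applied left to right. With no transforms, the
    returned callable hands back its input unchanged.
    """
    def apply_all(graph):
        for transform in transforms:
            graph = transform(graph)
        return graph
    return apply_all


def add_tag(tag):
    """Build a stand-in transform that returns a new list with `tag` appended."""
    def transform(graph):
        return graph + [tag]  # new object; the input is left untouched
    return transform


# Stand-in demonstration: a list plays the role of the graph.
original = []
pipeline = compose(add_tag("first"), add_tag("second"))
result = pipeline(original)
print(result, original)

# Hypothetical use with the real dataset (not executed here):
#   dataset = GTNDataset(name='acm4GTN', transform=pipeline)
```

Since the transform is applied before every access, it is probably safest to keep it cheap and to return a new object instead of mutating the input, as the stand-in does.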

Examples

>>> from openhgnn.dataset import GTNDataset
>>> dataset = GTNDataset(name='imdb4GTN')
>>> graph = dataset[0]