Developer Guide

Evaluate a new dataset

You can use your own dataset if necessary. In this section, we take HGBn-ACM, a node classification dataset, as an example.

How to build a new dataset

First step: Process dataset

Here we give a demo of processing HGBn-ACM, a node classification dataset.

First, download HGBn-ACM from the link. After that, we process it into a dgl.heterograph.

The following code snippet is an example of creating a heterogeneous graph in DGL.

>>> import dgl
>>> import torch as th

>>> # Create a heterograph with 3 node types and 3 canonical edge types.
>>> graph_data = {
...    ('drug', 'interacts', 'drug'): (th.tensor([0, 1]), th.tensor([1, 2])),
...    ('drug', 'interacts', 'gene'): (th.tensor([0, 1]), th.tensor([2, 3])),
...    ('drug', 'treats', 'disease'): (th.tensor([1]), th.tensor([2]))
... }
>>> g = dgl.heterograph(graph_data)
>>> g.ntypes
['disease', 'drug', 'gene']
>>> g.etypes
['interacts', 'interacts', 'treats']
>>> g.canonical_etypes
[('drug', 'interacts', 'drug'),
 ('drug', 'interacts', 'gene'),
 ('drug', 'treats', 'disease')]

We recommend setting the feature name to 'h'.

>>> g.nodes['drug'].data['h'] = th.ones(3, 1)

DGL provides dgl.save_graphs() and dgl.load_graphs() for saving and loading heterogeneous graphs in binary format, so we can use dgl.save_graphs() to store the graph on disk.

>>> dgl.save_graphs("demo_graph.bin", g)
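
Later, dgl.load_graphs() reads the file back. It returns a list of graphs together with a label dictionary, so we take the first element:

>>> g_list, _ = dgl.load_graphs("demo_graph.bin")
>>> g = g_list[0]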

Second step: Add extra information

After the first step we get a binary file named demo_graph.bin, which we should move into the directory openhgnn/dataset/. The next step is to specify the dataset information in NodeClassificationDataset.py.

For example, we should set category, num_classes and multi_label (if necessary), which specify the node type to predict classes for, the number of classes, and whether the task is multi-label classification, respectively. In the snippet below they are set to 'drug', 3 and False to match the demo graph built above. Please refer to Base Node Classification Dataset for more details.

if name_dataset == 'demo_graph':
    data_path = './openhgnn/dataset/demo_graph.bin'
    g, _ = load_graphs(data_path)
    g = g[0].long()
    self.category = 'drug'    # node type to predict classes for
    self.num_classes = 3      # number of classes
    self.multi_label = False  # whether the task is multi-label classification
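
Besides the graph itself, a node classification dataset needs labels (and usually split masks) on the category node type. A minimal sketch of attaching them in the first step, before saving the graph; the field names label and train_mask are assumptions here, so match whatever NodeClassificationDataset.py actually reads:

>>> # Assumed field names; align them with what the dataset class expects.
>>> g.nodes['drug'].data['label'] = th.tensor([0, 1, 2])
>>> g.nodes['drug'].data['train_mask'] = th.tensor([True, True, False])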

Third step: Optional evaluation

We can use demo_graph as our dataset name to evaluate an existing model.

python main.py -m GTN -d demo_graph -t node_classification -g 0 --use_best_config

If you use a different dataset name, you should also modify build_dataset accordingly.
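
A hedged sketch of what that change could look like; the actual dispatch logic in openhgnn/dataset/__init__.py may differ, and 'my_new_dataset' is a hypothetical name:

# In openhgnn/dataset/__init__.py (sketch; adapt to the existing dispatch logic).
def build_dataset(dataset, task):
    if task == 'node_classification':
        # Route the new dataset name to the node classification dataset class.
        if dataset in ['demo_graph', 'my_new_dataset']:
            return NodeClassificationDataset(dataset)
    ...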

Apply a new model

In this section, we will create a model named RGAT, which is not in our models package.

How to build a new model

First step: Register model

We should create a class RGAT that inherits BaseModel and register the model with @register_model(str).

from openhgnn.models import BaseModel, register_model
@register_model('RGAT')
class RGAT(BaseModel):
    ...
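
Under the hood, a registry decorator like this is typically just a dictionary insert. A minimal sketch of the pattern (illustrative only; OpenHGNN's actual implementation may differ):

SUPPORTED_MODELS = {}

def register_model(name):
    def decorator(cls):
        SUPPORTED_MODELS[name] = cls  # make the class discoverable by its name
        return cls
    return decorator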

Second step: Implement functions

We must implement the class method build_model_from_args, as well as other functions like __init__, forward, etc.

import torch.nn as nn
import torch.nn.functional as F

...
class RGAT(BaseModel):
    @classmethod
    def build_model_from_args(cls, args, hg):
        return cls(in_dim=args.hidden_dim,
                   out_dim=args.hidden_dim,
                   h_dim=args.out_dim,
                   etypes=hg.etypes,
                   num_heads=args.num_heads,
                   dropout=args.dropout)

    def __init__(self, in_dim, out_dim, h_dim, etypes, num_heads, dropout):
        super(RGAT, self).__init__()
        self.rel_names = list(set(etypes))
        self.layers = nn.ModuleList()
        # Input layer: project in_dim to h_dim with multi-head attention.
        self.layers.append(RGATLayer(
            in_dim, h_dim, num_heads, self.rel_names, activation=F.relu, dropout=dropout))
        # Output layer: no activation on the final representations.
        self.layers.append(RGATLayer(
            h_dim, out_dim, num_heads, self.rel_names, activation=None))

    def forward(self, hg, h_dict=None):
        if hasattr(hg, 'ntypes'):
            # Full-graph training: hg is a DGLGraph.
            for layer in self.layers:
                h_dict = layer(hg, h_dict)
        else:
            # Mini-batch training: hg is a list of blocks, one per layer.
            for layer, block in zip(self.layers, hg):
                h_dict = layer(block, h_dict)
        return h_dict

Here we do not give the implementation details of RGATLayer. For more reading, check out: RGATLayer.

Note

In OpenHGNN, we preprocess the features of the dataset outside of the model. Specifically, we use a linear layer with bias for each node type to map all node features to a shared feature space. So the parameter h_dict of forward is not the original feature dict, and your model does not need to preprocess features, as sketched below.
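
A minimal sketch of this per-node-type projection; the class name HeteroLinearProjection is hypothetical, not OpenHGNN's actual preprocessing code:

import torch.nn as nn

class HeteroLinearProjection(nn.Module):  # hypothetical name
    def __init__(self, in_dims, hidden_dim):
        super().__init__()
        # One linear layer (with bias) per node type, keyed by node type name.
        self.linears = nn.ModuleDict({
            ntype: nn.Linear(dim, hidden_dim) for ntype, dim in in_dims.items()
        })

    def forward(self, h_dict):
        # h_dict: node type -> raw features; output lives in a shared space.
        return {ntype: self.linears[ntype](h) for ntype, h in h_dict.items()}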

Third step: Add to supported models dictionary

We should add a new entry to SUPPORTED_MODELS in models/__init__.py.
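
For example, assuming SUPPORTED_MODELS maps model names to their classes (check the existing entries for the exact value format):

# In openhgnn/models/__init__.py (sketch; follow the format of existing entries).
SUPPORTED_MODELS = {
    # ... existing entries ...
    'RGAT': RGAT,
}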

Apply to a new scenario

In this section, we will apply OpenHGNN to a recommendation scenario, which involves building a new task and a new trainerflow.

How to build a new task

First step: Register task

We should create a class Recommendation that inherits BaseTask and register it with @register_task(str).

from openhgnn.tasks import BaseTask, register_task
@register_task('recommendation')
class Recommendation(BaseTask):
    ...

Second step: Implement methods

We should implement the methods related to evaluation metrics and loss functions.

import torch.nn.functional as F

class Recommendation(BaseTask):
    """Recommendation task."""
    def __init__(self, args):
        super(Recommendation, self).__init__()
        self.n_dataset = args.dataset
        self.dataset = build_dataset(args.dataset, 'recommendation')
        # Positive/negative training graphs plus validation and test splits.
        self.train_hg, self.train_neg_hg, self.val_hg, self.test_hg = self.dataset.get_split()
        self.evaluator = Evaluator(args.seed)

    def get_loss_fn(self):
        # Scores are raw logits; BCE-with-logits applies the sigmoid internally.
        return F.binary_cross_entropy_with_logits

    def evaluate(self, y_true, y_score, name):
        if name == 'ndcg':
            return self.evaluator.ndcg(y_true, y_score)
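
A trainerflow then consumes the task roughly like this (a usage sketch; scores and targets are hypothetical tensors of raw logits and binary labels):

task = Recommendation(args)
loss_fn = task.get_loss_fn()
loss = loss_fn(scores, targets)                # BCE on raw logits
ndcg = task.evaluate(targets, scores, 'ndcg')  # evaluation metric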

Finally

We should add a new entry to SUPPORTED_TASKS in tasks/__init__.py.

How to build a new trainerflow

First step: Register trainerflow

We should create a class that inherits BaseFlow and register the trainerflow with @register_flow(str).

from openhgnn.trainerflow import BaseFlow, register_flow
@register_flow('Recommendation')
class Recommendation(BaseFlow):
    ...

Second step: Implement methods

The function train() is declared as an abstract method, so it must be overridden, or the trainerflow cannot be instantiated. The following gives an example of the training loop.

import torch

...
class Recommendation(BaseFlow):
    def __init__(self, args=None):
        super(Recommendation, self).__init__(args)
        self.target_link = self.task.dataset.target_link
        self.model = build_model(self.model).build_model_from_args(self.args, self.hg)
        self.evaluator = self.task.get_evaluator(self.metric)

    def train(self):
        # epoch_iter is created in the omitted setup code (e.g., a range over max_epoch).
        for epoch in epoch_iter:
            self._full_train_step()
            self._full_test_step()

    def _full_train_step(self):
        self.model.train()
        logits = self.model(self.hg)[self.category]
        loss = self.loss_fn(logits[self.train_idx], self.labels[self.train_idx])
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()

    def _full_test_step(self, modes=None, logits=None):
        self.model.eval()
        with torch.no_grad():
            # mask and pred come from evaluation details omitted in this sketch.
            loss = self.loss_fn(logits[mask], self.labels[mask]).item()
            metric = self.task.evaluate(pred, name=self.metric, mask=mask)
            return metric, loss

Finally

We should add a new entry to SUPPORTED_FLOWS in trainerflow/__init__.py.
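
After registering the task and the trainerflow, the new scenario can be run from the command line just like before; the model and dataset names below are placeholders:

python main.py -m RGAT -d my_rec_dataset -t recommendation -g 0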