Disentangled Ontology Embedding for Zero-shot Learning

Knowledge Graphs (KGs) and their variant, ontologies, have been widely used for knowledge representation, and have been shown to be quite effective in augmenting Zero-shot Learning (ZSL). However, existing ZSL methods that utilize KGs all neglect the intrinsic complexity of the inter-class relationships represented in KGs. One typical feature is that a class is often related to other classes in different semantic aspects. In this paper, we focus on ontologies for augmenting ZSL, and propose to learn disentangled ontology embeddings, guided by ontology properties, to capture and utilize more fine-grained class relationships in different aspects. We also contribute a new ZSL framework named DOZSL, which contains two new ZSL solutions, based on generative models and graph propagation models respectively, for effectively utilizing the disentangled ontology embeddings. Extensive evaluations have been conducted on five benchmarks across zero-shot image classification (ZS-IMGC) and zero-shot KG completion (ZS-KGC). DOZSL often achieves better performance than the state-of-the-art, and its components have been verified by ablation and case studies. Our code and datasets are available at https://github.com/zjukg/DOZSL.


INTRODUCTION
Zero-shot Learning (ZSL), which enables models to predict new classes that have no training samples (i.e., unseen classes), has attracted much research interest in many machine learning tasks, such as image classification [7,36], relation extraction [20] and Knowledge Graph (KG) completion [25,31]. To handle these unseen classes, most existing ZSL methods adopt a knowledge transfer strategy: transferring samples, sample features or model parameters from the classes that have training samples (i.e., seen classes) to the unseen classes, with the guidance of some auxiliary information which usually depicts the relationships between classes. For example, in zero-shot image classification (ZS-IMGC), some studies utilize visual attributes of objects to transfer image features learned from seen classes to unseen classes and build classifiers for the latter [19,37]. Other popular auxiliary information includes class names [7], textual descriptions [25,40] and so on.
Recently, more and more studies leverage KGs [14,24], an increasingly popular solution for managing graph-structured data, to represent complex auxiliary information for augmenting ZSL [2]. KGs, which are composed of relational facts, can model diverse relationships between classes. For example, Wang et al. [34] incorporate class hierarchies from a lexical KG named WordNet [23]; works such as [10,26] explore common-sense class knowledge from ConceptNet [28]. As a kind of KG, ontologies, also known as ontological schemas when they act as parts of KGs for meta information, can represent more complex and logical inter-class relationships. For example, Chen et al. [3] use an ontology in OWL¹ to express the compositionality of classes; Geng et al. [9] define the domain and range constraints of KG relations using ontological schemas, as shown in Figure 1 (b). In addition, ontologies are also able to represent and integrate traditional auxiliary information such as attributes and textual descriptions. For example, as Figure 1 (a) shows, animal visual attributes with binary values can be represented in a graph with the attributes transformed into entities.
To exploit these KGs, two ZSL paradigms have been widely investigated. One is a pipeline with two main steps. First, the KG is embedded, so that the ZSL classes, which are already aligned with KG entities, are represented as vectors with their relationships kept in the vector space. Second, a compatibility function between the class vector and the sample input (or features) is learned. It can either be a mapping function, which projects the sample input and the class vector into the same space such that a testing sample can be matched with an arbitrary class via, e.g., Euclidean distance [3,7,20], or a generative model, which generates labeled samples or features for unseen classes [9,25]. The other paradigm is based on graph information propagation. It often uses Graph Neural Networks (GNNs) to propagate classifier parameters or sample features from nodes of seen classes to nodes of unseen classes [4,16,34]. Methods of both paradigms, together with KGs, often lead to state-of-the-art performance on many ZSL tasks.
Nevertheless, existing methods of both paradigms still leave considerable room for improvement. In a real-world KG, an entity is often linked to other entities for knowledge of different aspects. For example, Kobe Bryant is connected to NBA teams for his career knowledge, and connected to his daughters for family knowledge. This also happens in those KGs (especially ontologies) used for augmenting ZSL. As shown in Figure 1 (a), Zebra is connected to Horse via rdfs:subClassOf for knowledge on taxonomy, and connected to Tiger and Panda via imgc:hasAttribute for knowledge on visual characteristics. Thus the vector representation of Zebra should be closer to Horse than to Tiger and Panda considering the aspect of taxonomy, and be closer to Tiger and Panda than to Horse considering the visual characteristics. The existing KG-based ZSL methods all neglect this important characteristic of entangled KG semantics, which prevents them from capturing more fine-grained inter-class relationships in different aspects, and limits their performance.
¹ Web Ontology Language (https://www.w3.org/TR/owl-features/)
In this work, we focus on augmenting ZSL with ontologies, propose to investigate disentangled ontology embeddings, and develop a general ZSL framework named DOZSL. DOZSL first learns multiple disentangled vector representations (embeddings) for each class according to its semantics of different aspects defined in an ontology, using a new disentangled embedding method with ontology property-aware neighborhood aggregation and triple scoring. It then adopts an entangled ZSL learner, which builds upon either a Generative Adversarial Network (GAN)-based generative model or a Graph Convolutional Network (GCN)-based graph propagation model, to incorporate these disentangled class representations. To apply the generative model, we concatenate the disentangled representations; to apply the propagation model, we generate one graph per semantic aspect from the disentangled representations. We evaluate DOZSL on five datasets for zero-shot image classification (ZS-IMGC) and zero-shot KG completion (ZS-KGC). See Figure 1 for segments of the ontology for one IMGC dataset and the ontological schema for a KG to complete. In summary, our contributions are the following:
• To the best of our knowledge, this is among the first works to investigate disentangled semantic embeddings for ZSL.


RELATED WORK
Widely used auxiliary information includes class attributes [19,37,38], textual information [7,40] and KGs [6,9,26,34]. To support ZSL, they are often embedded to generate one semantic vector for each class, such as binary/numerical attribute vectors, pre-trained word embeddings, learnable sentence embeddings, and KG embeddings. Next, a compatibility function between the class vectors and the vector representations of samples is often learned to conduct knowledge transfer. A mapping function is a typical practice, which maps the image features to the space of the class vector [3,7,19], or vice versa [39], or both to a shared common space [8]. However, all of these mappings are trained on seen data, and thus have a strong bias towards seen classes during prediction, especially in generalized ZSL. Recently, thanks to generative models such as GANs [12], several methods [37,40] have been proposed to synthesize samples (or features) for unseen classes conditioned on their class vectors. This converts the ZSL problem into a standard supervised learning problem, with the aforementioned bias issue alleviated.
Besides, to explicitly exploit the structural inter-class relationships in a KG, some ZSL works explore a graph information propagation strategy. In these works, classes are often aligned with KG entities, and a powerful GNN such as GCN [17] is trained to output a classifier (i.e., a class-specific parameter vector) for each class, through which the classifiers of unseen classes are approximated by aggregating the classifiers of seen classes. One typical work is by Wang et al. [34]; the subsequent works adopt similar ideas but vary in optimizing the graph propagation [11,16]. In particular, some of them consider the multiple types of relations in the KGs by developing multi-relational GCNs [4], or by splitting a multi-relation KG into multiple single-relation graphs and applying several parameter-shared GCNs to propagate features [32].

Zero-shot KG Completion (ZS-KGC).
In this task, a KG composed of relational facts is to be completed. It is denoted as G = {E, R, T}, where E is a set of entities, R is a set of relations, and T = {(h, r, t) | h, t ∈ E; r ∈ R} is a set of relational facts in the form of RDF triples. The completion is to predict a missing but plausible triple with two of h, r, t given. Typical KGC methods first embed entities and relations into vector spaces (i.e., e_h, e_t and e_r) and conduct vector computations to discover missing triples. The embeddings are trained on existing triples, under the assumption that all testing entities and relations are available at training time. ZS-KGC is thus proposed to predict for unseen entities or relations that are newly added during testing and have no associated training triples.
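As a minimal illustration of the embedding-based scoring described above, the following sketch implements the TransE score (the translation assumption e_h + e_r ≈ e_t) with toy two-dimensional embeddings. This is an illustration of the general idea, not the actual models used later in the paper.

```python
import numpy as np

def transe_score(e_h, e_r, e_t):
    """TransE plausibility score: higher (less negative) means more plausible.

    TransE models a fact (h, r, t) as a translation e_h + e_r ~ e_t,
    so the score is the negative distance between e_h + e_r and e_t.
    """
    return -np.linalg.norm(e_h + e_r - e_t)

# Toy embeddings where (h, r, t_good) is designed to hold almost exactly.
e_h = np.array([1.0, 0.0])
e_r = np.array([0.0, 1.0])
e_t_good = np.array([1.0, 1.0])   # equals e_h + e_r
e_t_bad = np.array([-1.0, 0.0])

assert transe_score(e_h, e_r, e_t_good) > transe_score(e_h, e_r, e_t_bad)
```

A KGC model ranks candidate tail entities by this score; ZS-KGC arises when the relation r has no training triples at all, so e_r cannot be learned this way.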
Some ZS-KGC approaches deal with unseen entities by utilizing their auxiliary connections with seen entities [33], introducing their textual descriptions [31], or learning entity-independent graph representations that naturally generalize to unseen entities [5,29]. In contrast, works on unseen relations are relatively underexplored. Both Qin et al. [25] and Geng et al. [9] leverage GANs to synthesize valid embeddings for unseen relations conditioned on their auxiliary information, which is textual descriptions and ontological schemas, respectively.
In this study, we target unseen relations. Two disjoint relation sets are given: the seen relation set R_s and the unseen relation set R_u. The triple set T_s = {(h, r_s, t) | h, t ∈ E; r_s ∈ R_s} is collected for training, and T_u = {(h, r_u, t) | h, t ∈ E; r_u ∈ R_u} is collected to evaluate the completion of triples of unseen relations. Following previous works, a closed set of entities is considered, i.e., each entity that appears in the testing set has appeared during training.

Ontology
Ontologies are widely used for representing and exchanging general or domain knowledge, often with hierarchical concepts as the backbone and properties for describing semantic relationships [15]. In this study, we use a simple form of ontology, namely RDF Schema (RDFS)², while more complicated OWL ontologies can be transformed into RDFS ones following some criteria. An ontology can be used as the schema of a KG, defining entity types, relations and so on. Accordingly, we represent an ontology as O = {C, P, T_O}, where C is the set of concepts (a.k.a. types), P is the set of properties, and T_O ⊆ C × P × C is the set of triples. To serve as auxiliary information for ZSL, an ontology models the relevant domain knowledge of a given ZSL task. For example, in IMGC, concepts are used to represent image classes and image attributes; in KGC, ontology triples can be used to define the domains (i.e., head entity types) and ranges (i.e., tail entity types) of KG relations. Note that we sometimes also call a concept a concept node when introducing ontology embedding.
Ontology properties can be either built-in properties of RDFS, such as rdfs:subClassOf and rdfs:subPropertyOf, or user-defined for a specific task, such as imgc:hasAttribute. Figure 1 shows two ontology segments for ZS-IMGC and ZS-KGC. The triple (Zebra, imgc:hasAttribute, Stripe) means that the animal class Zebra has the visual attribute Stripe, while the triple (radiostation_in_city, rdfs:subPropertyOf, has_office_in_city) means that the KG relation radiostation_in_city is a sub-relation of has_office_in_city. It is worth mentioning that properties, like concepts, are often defined with hierarchies. One general property is often defined for the semantics of one aspect, and more sub-properties are then defined for more fine-grained semantics. Thus we can often easily find the relevant properties for different semantic aspects of an ontology by simply inspecting the property hierarchies.
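To make the role of properties concrete, the ontology segments above can be written as plain (subject, property, object) triples and grouped by property, i.e., by semantic aspect. The triples below are taken from the examples in the text (Tiger having the Stripe attribute is our own illustrative addition); the helper function is a sketch, not part of DOZSL.

```python
# RDFS-style ontology triples based on the examples above.
triples = [
    ("Zebra", "rdfs:subClassOf", "Horse"),
    ("Zebra", "imgc:hasAttribute", "Stripe"),
    ("Tiger", "imgc:hasAttribute", "Stripe"),
    ("radiostation_in_city", "rdfs:subPropertyOf", "has_office_in_city"),
]

def neighbors_by_property(triples, concept):
    """Group a concept's neighbors by the connecting property, so each
    group corresponds to one semantic aspect (taxonomy, attributes, ...)."""
    grouped = {}
    for s, p, o in triples:
        if s == concept:
            grouped.setdefault(p, []).append(o)
    return grouped

zebra = neighbors_by_property(triples, "Zebra")
assert zebra == {"rdfs:subClassOf": ["Horse"], "imgc:hasAttribute": ["Stripe"]}
```

The per-property grouping is exactly the structure the disentangled encoder in Section 3 exploits: each property family supplies the neighborhood for one embedding component.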
In our ZS-KGC case, we adopt the ontologies developed in [9] as the auxiliary information for completing relational facts of their corresponding KGs in the zero-shot setting, where KG relations are modeled as ontology concepts and their meta-relationships are modeled by ontology properties. Our DOZSL framework contains a disentangled ontology encoder to learn disentangled representations for all concept nodes in an ontology, through which fine-grained inter-concept relationships can be captured and well utilized in the downstream zero-shot learning and prediction steps.

Disentangled Representation Learning
The goal of disentangled representation learning is to learn embeddings that separate the various explanatory factors behind the data. In the graph domain, DisenGCN [22] is the first work to learn disentangled node representations; it uses a neighborhood routing mechanism to identify the latent factor that may have caused the link from a given node to one of its neighbors. However, it mainly focuses on homogeneous graphs with a single relation type. To process graphs with more diverse relation types, DisenE [18] and DisenKGAT [35], which leverage an attention mechanism and a dynamic assignment mechanism, respectively, disentangle the entity embeddings according to the relations in a KG. Different from these works, we propose to learn disentangled ontology embeddings in terms of the characteristics of the ontology used for ZSL, and develop a novel disentanglement mechanism guided by the properties in an ontology.
There are also some works that explore disentangled representation learning in ZSL [21,38]. However, they all focus on disentangling the representations of samples, such as the image features learned by CNNs; none of them consider learning disentangled representations of the auxiliary information, especially when richer but more complex auxiliary information is introduced. In contrast, our work makes the first attempt in this direction.

METHODOLOGY
As shown in Figure 2, DOZSL includes two core modules: a Disentangled Ontology Encoder, which learns disentangled ontology embeddings, and an Entangled ZSL Learner, which utilizes the embeddings in generation-based and propagation-based ZSL methods.

Disentangled Ontology Encoder
In DOZSL, the embedding of each concept node c is disentangled into multiple distinct components as h_c = [h_c^1, h_c^2, ..., h_c^K], where K is the number of components, h_c^k ∈ R^d represents the k-th component encoding the semantics of one aspect of c, and d is the component embedding size.
To learn a disentangled embedding for each concept, we first aggregate information from the graph neighborhood that characterizes it. In the aggregation of each component of a concept, only a subset of neighbors actually carries valuable information, since each component represents a specific semantic aspect. To identify the aspect-specific subset, we follow the attention-based neighborhood routing strategy of previous works [22,35]. Also, considering the various relation types in ontologies, we propose a property-aware attention mechanism. Specifically, for the k-th aspect, the attention value of a neighbor c_j of concept c_i is computed as the similarity of the k-th component embeddings of c_i and c_j in the subspace of their connecting property p, following the assumption that when a neighbor contributes more to c_i in the aggregation, their property-aware representations are more similar. Formally:

α_{(j,p)}^{l,k} = softmax_{(c_j,p) ∈ N_i} ( h_{i,p}^{l,k} • h_{j,p}^{l,k} ),   with h_{j,p}^{l,k} = W_p h_j^{l,k},

where l ∈ {0, 1, ..., L − 1} with L the number of aggregation layers, h_{j,p}^{l,k} is the k-th component embedding of c_j w.r.t. property p in the l-th aggregation layer, • denotes the Hadamard product (summed to give a dot-product similarity), and W_p is a learnable projection matrix of p for projecting c_j's k-th component embedding h_j^{l,k} into the property-specific subspace.
N_i is the set of pairs of neighboring concept nodes and properties of c_i derived from the ontology triple set T_O; it also includes c_i itself with a special self-connection property p_self. A dot-product similarity is adopted here.
With the attention values, we separately aggregate the neighborhood information for each component, and also update the property embeddings after each aggregation, as:

h_i^{l+1,k} = Σ_{(c_j,p) ∈ N_i} α_{(j,p)}^{l,k} φ(h_j^{l,k}, h_p^l),   h_p^{l+1} = Θ^l h_p^l,

where h_p^l is the embedding of property p in the l-th layer, Θ^l is the layer-specific linear transformation matrix for p, and φ is a combination operator for fusing the information of neighboring concept nodes and property edges; following CompGCN [30], we implement it via, e.g., vector multiplication. h_i^{0,k} is randomly initialized, and h_i^{L,k}, output at the last layer, encodes the neighborhood information specific to aspect k. We set h_c^k = h_c^{L,k} for simplicity. To further improve the disentanglement, we propose to refine the semantics of each disentangled component embedding of a concept according to its associated properties. This is inspired by a characteristic of knowledge in ontologies: properties are often organized in hierarchies, so one general property can usually be selected to represent one distinct semantic aspect of a concept; for example, the properties imgc:hasAttribute and rdfs:subClassOf in Figure 1 represent the semantics of animal visual characteristics and taxonomy, respectively.
To achieve this goal, we (i) select a set of properties covering the semantic aspects of the concepts to encode (e.g., imgc:hasAttribute for visual characteristics in the ontology for IMGC) and set the number of disentangled components to the number of selected properties, and (ii) design a property-guided triple scoring mechanism that extracts the property-specific components to score the validity of an ontology triple. Specifically, for an ontology triple (c_i, p_k, c_j), we extract the k-th components of c_i and c_j with respect to property p_k, and leverage the score function of a KG embedding method to calculate the triple score with the extracted components. In this way, we accurately endow each component embedding with a specific semantic meaning w.r.t. its property. Here, the score function of TransE [1] is adopted to compute the triple score as:

s(c_i, p_k, c_j) = σ( −‖ h_{c_i}^k + h_{p_k} − h_{c_j}^k ‖ ),

where h_{c_i}^k and h_{c_j}^k denote the extracted component embeddings of concepts c_i and c_j respectively, h_{p_k} represents the embedding of property p_k, and σ is the logistic sigmoid function. A higher score indicates a stronger relatedness between h_{c_i}^k, h_{p_k} and h_{c_j}^k. Finally, we use the standard cross entropy with label smoothing to train the whole disentangled ontology encoder:

L = −(1/B) Σ_{(c_i, p_k)} (1/|C|) Σ_{c_t ∈ C} [ y_t · log s(c_i, p_k, c_t) + (1 − y_t) · log(1 − s(c_i, p_k, c_t)) ],

where B is the batch size, C is the concept node set of the ontology, and y_t is the label of the given query (c_i, p_k), whose value is 1 when the triple (c_i, p_k, c_t) holds and 0 otherwise.
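The component-extraction step can be sketched as follows. This is a minimal numpy illustration of property-guided component scoring under our own toy setup (the function name, dimensions and random embeddings are ours, not the paper's implementation): the k-th slice of a concatenated disentangled embedding is scored against property p_k with a TransE-style function.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def component_triple_score(h_ci, h_cj, h_p, k, d):
    """Property-guided triple score (sketch): extract the k-th component of
    each concept embedding and apply a TransE-style score, so that the k-th
    component is pushed to carry the semantics of property p_k."""
    ci_k = h_ci[k * d:(k + 1) * d]   # k-th component of concept c_i
    cj_k = h_cj[k * d:(k + 1) * d]   # k-th component of concept c_j
    return sigmoid(-np.linalg.norm(ci_k + h_p - cj_k))

rng = np.random.default_rng(0)
K, d = 2, 4                       # e.g., taxonomy and attribute aspects
h_ci = rng.normal(size=K * d)     # disentangled embedding [h^1; h^2]
h_p = rng.normal(size=d)          # embedding of property p_1
# A tail concept constructed so that (c_i, p_1, c_j) holds exactly on k = 1.
h_cj = h_ci.copy()
h_cj[d:] = h_ci[d:] + h_p

assert component_triple_score(h_ci, h_cj, h_p, 1, d) > \
       component_triple_score(h_ci, h_cj, h_p, 0, d)
```

Because only one component participates in each triple's score, gradients from triples of property p_k flow only into the k-th component, which is what ties each component to one semantic aspect.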

Entangled ZSL Learner
With the disentangled ontology embeddings, we next show how to utilize them for ZSL. Specifically, we develop two kinds of methods.
Generation-based.
In consideration of the effectiveness of GANs in learning the compatibility between class vectors and their samples, the first method is generation-based, leveraging GANs to generate discriminative sample features. We first fuse the disentangled component embeddings of a class into a single class vector via concatenation (i.e., h_c = [h_c^1; h_c^2; ...; h_c^K]), and then adopt a typical GAN scheme for feature generation. Specifically, the GAN consists of three networks: a generator G synthesizing sample features for a class from random noise conditioned on its embedding; a feature extractor F providing the real sample features; and a discriminator D distinguishing the generated features from the real ones. We generate sample features instead of raw samples for both higher accuracy and efficiency, as in many works [9,25,37].
Formally, for a class c_i, the generator G takes as input its embedding and a random noise vector z sampled from a Normal distribution, and generates its features: x̃ = G(z, h_{c_i}). The loss of G is defined as:

L_G = −E[D(x̃, h_{c_i})] + λ_1 L_cls(x̃) + λ_2 ‖ E[x̃] − E[x] ‖²,

where the first term is the Wasserstein loss, the second term is a supervised classification loss for classifying the synthesized features, and the third regularizes the mean of the generated features of each class towards the mean of its real features. The latter two both encourage the generated features to have more inter-class discrimination. λ_1 and λ_2 are the corresponding weight coefficients. The discriminator D takes as input the synthesized features x̃ from G and the real features x from F. Its loss is defined as:

L_D = E[D(x̃, h_{c_i})] − E[D(x, h_{c_i})] + λ E[(‖ ∇_x̂ D(x̂, h_{c_i}) ‖ − 1)²],

where the first two terms approximate the Wasserstein distance between the distributions of x and x̃. The last term is the gradient penalty enforcing the gradient of D to have unit norm, in which x̂ = εx + (1 − ε)x̃ with ε ∼ U(0, 1), and λ is the weight coefficient.
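The loss terms above can be sketched numerically. The toy setup below is ours (a linear critic D(x) = w·x on random "real" and "fake" feature batches, with the classification term omitted); it is chosen because, for a linear critic, the input gradient is w everywhere, so the gradient penalty has a closed form and no autograd framework is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 64, 8
real = rng.normal(loc=1.0, size=(n, dim))   # "real" features from F
fake = rng.normal(loc=0.0, size=(n, dim))   # "generated" features from G
w = rng.normal(size=dim)                    # toy linear critic D(x) = x @ w

def critic(x):
    return x @ w

# Discriminator loss: Wasserstein terms + gradient penalty.
eps = rng.uniform(size=(n, 1))
x_hat = eps * real + (1 - eps) * fake       # interpolated samples (gradient
                                            # of a linear critic is w at any x_hat)
lam = 10.0
wasserstein = critic(fake).mean() - critic(real).mean()
grad_penalty = lam * (np.linalg.norm(w) - 1.0) ** 2
loss_D = wasserstein + grad_penalty

# Generator terms: fool the critic + match the per-class feature means.
loss_G_adv = -critic(fake).mean()
loss_G_mean = np.sum((fake.mean(axis=0) - real.mean(axis=0)) ** 2)
```

In the full model the critic is a conditional neural network, the gradient penalty is estimated by automatic differentiation at the interpolated points x̂, and the classification term is added with weight λ_1.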
In view of the different data forms in different ZSL tasks, we adopt different feature extractors F. For ZS-IMGC, we employ ResNet101 [13] to extract the features of images following previous works [36]; for ZS-KGC, we follow [9,25] to learn cluster-structured features for KG relations. In general, F is trained in advance with only samples of seen classes, and is fixed during adversarial training. Our framework is also compatible with different feature extractors.
With a well-trained GAN, we use the generator G to synthesize features and train task-specific prediction models for unseen classes. In ZS-IMGC, we train a softmax classifier over the unseen classes to classify their testing images; in ZS-KGC, a testing triple is completed by calculating the similarity between the generated embedding of the relation r and the joint embedding of the entity pair (h, t).

Propagation-based.
With the disentangled concept embeddings, more fine-grained relatedness between concepts can be utilized. Therefore, as shown in Figure 2, we generate one semantic graph for each component, where nodes correspond to the classes (relations in KGC) in the dataset and edges are generated by thresholding the cosine similarity between the component embeddings of two class nodes, and conduct graph propagation on it to transfer features between classes under each semantic aspect. The initial node features are the classes' component embeddings. Formally, we represent the k-th semantic graph as G_k(A_k, X_k), where X_k ∈ R^{N×d} is the input feature matrix of the graph nodes, and A_k ∈ R^{N×N} is the graph adjacency matrix indicating the connections among the N classes, defined as:

A_k[i, j] = 1 if cos(h_{c_i}^k, h_{c_j}^k) ≥ τ, and 0 otherwise,

where τ denotes the similarity threshold.
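Building one such adjacency matrix from component embeddings can be sketched as below (a minimal numpy version with our own toy embeddings and threshold; self-loops are kept, which is a common convention, though the paper does not state it):

```python
import numpy as np

def semantic_adjacency(X, tau):
    """Build one semantic graph's adjacency: connect two class nodes when
    the cosine similarity of their component embeddings is >= tau."""
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X_unit @ X_unit.T          # pairwise cosine similarities
    A = (sim >= tau).astype(float)
    np.fill_diagonal(A, 1.0)         # keep self-loops
    return A

# Toy component embeddings for 3 classes: the first two are nearly parallel.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
A = semantic_adjacency(X, tau=0.9)
# A connects classes 0 and 1 but isolates class 2.
```

Each of the K components yields its own adjacency, so a pair of classes can be neighbors under one aspect (e.g., visual attributes) and disconnected under another (e.g., taxonomy).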
Since G_k is a graph with a single relation, we use GCN for feature propagation. Each graph convolutional layer performs:

H_k^{l+1} = ReLU( Â_k H_k^l Φ^l ),

where Â_k is the normalized adjacency matrix, and Φ^l is a layer-specific weight matrix shared among all semantic graphs; H_k^0 = X_k. For each semantic graph, the GCN outputs a set of node embeddings Z_k ∈ R^{N×d'}, through which we obtain a set of classifiers W for all N classes as W = f(Z_1, Z_2, ..., Z_K), where f is a fusion function. In our experiments, we implement f by averaging, W = (1/K) Σ_k Z_k, or by linear transformation, W = W_1([Z_1; Z_2; ...; Z_K]), where W_1 ∈ R^{Kd'×d'} is a trainable transformation matrix. Then, following [16,32,34], we compute the Mean Square Error between the fused classifiers and the ground-truth classifiers as the loss function:

L = (1/|W_S|) Σ_{w_c ∈ W_S} ‖ w_c − w̃_c ‖²,

where W_S ⊂ W is the set of classifiers of the seen classes and w̃_c denotes the corresponding ground truth. Different from a traditional classifier, which is a network trained using labeled samples, the classifier here is actually a real-valued vector that represents the class-specific features, obtained in our paper by averaging the features of all the training samples of one class. The sample features are extracted via the feature extractor F mentioned in Section 3.2.1. By using these ground-truth seen classifiers to supervise the training of the GCNs, classifiers of the unseen classes can be learned by aggregation. During prediction, for an input testing sample, we first extract its features using the same feature extractor, and then perform classification or completion by calculating the similarity between the learned classifiers and the extracted features.
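The propagation-and-fusion pipeline above can be sketched end to end. The following is a toy numpy version under our own assumptions (one GCN layer, ReLU activation, symmetric adjacency normalization, averaging fusion, random data); it is meant to show the data flow, not reproduce the trained model.

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2} commonly used by GCN."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, Phi):
    """One graph convolution: propagate over the graph, then transform."""
    return np.maximum(A_hat @ H @ Phi, 0.0)

rng = np.random.default_rng(0)
N, d_in, d_out, K = 4, 6, 5, 2
Phi = rng.normal(size=(d_in, d_out))        # shared across the K semantic graphs
Z = []
for k in range(K):
    A = np.eye(N)
    A[0, 1] = A[1, 0] = 1.0                 # toy adjacency for aspect k
    X = rng.normal(size=(N, d_in))          # component embeddings as node features
    Z.append(gcn_layer(normalize_adj(A), X, Phi))

W = sum(Z) / K                              # fuse by averaging
ground_truth = rng.normal(size=(2, d_out))  # stand-in "classifiers" of 2 seen classes
loss = np.mean((W[:2] - ground_truth) ** 2) # MSE supervised on seen classes only
```

Only the rows of W belonging to seen classes enter the loss; the rows for unseen classes are shaped purely by propagation over the semantic graphs, which is how their classifiers are obtained.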

EVALUATION
4.1 Experiment Settings
4.1.1 Datasets and Ontologies. For ZS-IMGC, we use a popular benchmark named Animals with Attributes (AwA) [36] and two benchmarks, ImNet-A and ImNet-O, extracted from ImageNet by Geng et al. [9]. AwA is for coarse-grained animal image classification with 50 classes and 37,322 images. ImNet-A is for more fine-grained animal image classification and ImNet-O is for fine-grained general object classification. The classes are split into a seen set and an unseen set, following [36]. For ZS-KGC, we use two KGs provided in [25] for completion, i.e., NELL-ZS and Wiki-ZS, extracted from NELL and Wikidata³, respectively. In each KG, the relations are split into a training set with seen relations, and a validation set and a testing set with unseen relations, following [25]. Accordingly, their associated triples compose a training set, a validation set and a testing set. It is ensured that all entities are seen. Each dataset has an ontology as its auxiliary information. We use the ontologies developed in [9] and take the latest version released in [10]. For ZS-IMGC, the ontologies contain class hierarchies (taxonomies), class visual attributes and attribute hierarchies. In our property-guided disentangled embedding, we select two general properties: rdfs:subClassOf for the semantic aspect of taxonomy, and imgc:hasAttribute for the semantic aspect of visual characteristics. For ZS-KGC, the ontologies contain type constraints on the head and tail entities of relations, represented by the properties rdfs:domain and rdfs:range, relation hierarchies represented by the property rdfs:subPropertyOf, and type hierarchies represented by the property rdfs:subClassOf. These four properties are selected as the general properties used in the ontology encoder. See Table 1 for detailed statistics.
³ NELL (http://rtw.ml.cmu.edu/rtw/) and Wikidata (https://www.wikidata.org/)

4.1.2 Variants of DOZSL and Baselines. For the disentangled ontology encoder, we compare two settings for the component embeddings that are fed to the triple scoring (Eq. (4)): aggregating neighborhood information (Eq. (1) and (3)), and randomly initializing the component embeddings without neighborhood aggregation. This leads to two DOZSL variants. Meanwhile, they can be combined with two downstream ZSL methods: generation-based with GAN and propagation-based with GCN. Thus we have four DOZSL variants, denoted as "DOZSL(X+Y)", where X can be AGG (neighborhood aggregation) or RD (random initialization), and Y can be GAN or GCN.
The baselines include generation-based and propagation-based ZSL methods that often achieve state-of-the-art performance on many ZSL datasets. OntoZSL [9] is a generation-based method that uses GANs to synthesize samples, where we take TransE as its ontology encoder for a fair comparison. DGP [16] is a propagation-based method using a two-layer GCN which only supports single-relation graphs. To deal with the multi-relation ontology graph, we take the method proposed in [32] as a baseline. Meanwhile, two relation-aware GNNs, RGCN [27] and CompGCN [30], are also used to implement another two propagation-based ZSL baselines. We also consider different disentangled and non-disentangled semantic embedding methods for more baselines. For non-disentangled embedding, we choose the classical TransE, and RGAT, which also performs attentive relation-aware graph aggregation. For disentangled embedding, we choose two state-of-the-art methods, DisenE [18] and DisenKGAT [35]. These embedding methods can also be combined with the GAN-based and GCN-based ZSL learners as in DOZSL, leading to baselines such as "DisenKGAT+GAN". Note that "TransE+GAN" is equivalent to OntoZSL.

Evaluation Metrics.
For ZS-IMGC, we report macro accuracy following [36], where the accuracy of each class is first calculated on its testing images, and the accuracies of all testing classes are then averaged. For standard ZSL testing, we compute accuracy on all unseen classes, denoted as Acc; for generalized ZSL testing, we first calculate the accuracy on all seen classes and on all unseen classes separately, denoted as Acc_s and Acc_u respectively, and then report the harmonic mean H = (2 × Acc_s × Acc_u) / (Acc_s + Acc_u).
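These two metrics are straightforward to implement; a minimal sketch (with toy labels of our own) is:

```python
import numpy as np

def macro_accuracy(y_true, y_pred):
    """Per-class accuracy, averaged over classes (macro accuracy)."""
    classes = np.unique(y_true)
    return float(np.mean([(y_pred[y_true == c] == c).mean() for c in classes]))

def harmonic_mean(acc_seen, acc_unseen):
    """H metric for generalized ZSL."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

y_true = np.array([0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 1, 1])
# class 0: 1/2 correct; class 1: 4/4 correct -> macro accuracy 0.75
assert macro_accuracy(y_true, y_pred) == 0.75
assert abs(harmonic_mean(0.6, 0.3) - 0.4) < 1e-12
```

Note that macro accuracy weights every class equally regardless of its number of testing images, and H collapses towards zero when either Acc_s or Acc_u is poor, which is why it is preferred for generalized ZSL.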
Our ZS-KGC task is to predict the tail entity t given a head entity h and an unseen relation r_u. Thus for a testing input (h, r_u), we rank a set of candidate entities according to their predicted scores of being the tail entity, and examine the rank of the ground-truth tail entity: the smaller the rank, the better the performance. As in most KGC works, we report Mean Reciprocal Rank (MRR) and hits@k (i.e., the ratio of testing samples whose ground truths are ranked in the top-k positions), with k set to 1, 5 and 10. Different from ZS-IMGC, where predicting the class label of an image can be confused with other classes, the prediction for a seen relation in ZS-KGC is relatively independent of the prediction for an unseen relation. Thus the generalized ZSL testing setting in ZS-KGC, which would be a simple addition of normal KGC, is not considered in our paper.
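Given the 1-based ranks of the ground-truth tails, MRR and hits@k can be computed as in this short sketch (the toy ranks are ours):

```python
def mrr_and_hits(ranks, ks=(1, 5, 10)):
    """Mean Reciprocal Rank and hits@k from the 1-based ranks of the
    ground-truth tail entities among the candidate entities."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return mrr, hits

ranks = [1, 2, 10, 50]   # toy ranks for four testing triples
mrr, hits = mrr_and_hits(ranks)
assert hits[1] == 0.25 and hits[10] == 0.75
assert abs(mrr - (1 + 0.5 + 0.1 + 0.02) / 4) < 1e-12
```

MRR rewards high ranks smoothly, while hits@k only checks whether the ground truth falls in the top k, which is why the two can disagree on which method is better.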

ZS-IMGC.
The results are reported based on the following settings. For the ontology encoder, we set the component embedding size and the property embedding size to 100. K is set to 2 (corresponding to rdfs:subClassOf and imgc:hasAttribute) for all DOZSL(RD) variants, but to 5 for all DOZSL(AGG) variants, since two reverse properties and a self-connection property are added during aggregation. The initial learning rate is set to 0.001. The number of aggregation layers for DOZSL(AGG) variants is set to 1.

Table 2: Accuracy and H (%) of ZS-IMGC on AwA, ImNet-A and ImNet-O; MRR and hits@k (%) of ZS-KGC on NELL-ZS and Wiki-ZS. The best results in a method category (resp. in the whole column) are in bold (resp. underlined). TransE+GAN equals OntoZSL.
For the ZSL learner, we employ ResNet101 to extract 2,048-dimensional image features. It is ensured that the unseen classes of all three datasets have never appeared in training ResNet101. Regarding the GAN, the generator and discriminator both consist of two fully connected layers with 4,096 hidden units; their learning rates are both set to 0.0001; the dimension of the noise vector z is set to 100; λ_1, λ_2 and λ are set to 0.01, 5 and 10, respectively. Regarding the GCN, the size of the classifier vector is 2,048; 2 convolutional layers with a hidden dimension of 2,048 are used; the learning rate is set to 0.001. As for the optimum similarity threshold τ for creating semantic graphs, we provide a detailed evaluation in Section 4.3.
For the baselines DisenE and DisenKGAT, we test different K values and report the better ones in the main body, with the complete results attached in Appendix A. For more details, please see our released code.
Overall Results. The results are shown on the left side of Table 2. We can see that DOZSL always achieves the best performance on AwA and ImNet-O, no matter which downstream ZSL learner is applied (+GAN or +GCN). On ImNet-A, DOZSL is still the best in most cases; although DOZSL does not outperform RGCN-ZSL on one metric, the result is still comparable.
Results on Ontology Encoders. First, we find that the methods with our disentangled embeddings often outperform those with non-disentangled embeddings. In particular, DOZSL(AGG) outperforms RGAT and TransE on all the datasets no matter which ZSL learner is used. Second, we find that DOZSL(AGG) often performs better than DOZSL(RD) on most metrics. This indicates the superiority of capturing neighborhood information in learning disentangled ontology embeddings. Third, our property-guided component-wise triple scoring is quite effective in learning disentangled embeddings. This can be verified by the fact that DOZSL(AGG) outperforms DisenE and DisenKGAT on all three datasets. Even without aggregation, DOZSL(RD) is still quite good in most cases.
Results on ZSL Learners. Using either GAN or GCN makes our framework perform better than the baselines. In particular, when the input ontology embedding is fixed, we can often select one of the two for better performance. For example, on AwA, i) DOZSL(RD+GAN) performs worse than DisenE+GAN and DisenE+GCN, but DOZSL(RD+GCN) outperforms both; ii) using GCN with DOZSL(AGG) achieves good performance, while using GAN with DOZSL(AGG) achieves even higher performance on both metrics. Moreover, our DOZSL variants with GCN perform better than previous propagation-based ZSL methods in most situations, illustrating that our method more effectively captures the structural class relationships in ontologies.

ZS-KGC.
For the ontology encoder, we re-use the settings in ZS-IMGC. The dimension of the component embedding and the property embedding is set to 200. K is 4 for DOZSL(RD) and 9 for DOZSL(AGG), considering the reverse properties and the self-connection property. The feature extractors are pre-trained to extract 200-dimensional and 100-dimensional relation features for NELL-ZS and Wiki-ZS, respectively, following the settings in [9,25], with TransE-based embeddings as the input. For the ZSL learner, we also employ the same GAN and GCN architectures as in ZS-IMGC, but with some different settings. Regarding the GAN for NELL-ZS, the generator has 250 hidden units, while the discriminator has 200 hidden units. Regarding the GAN for Wiki-ZS, the corresponding unit numbers are 200 and 100. For both datasets, the noise vector size is set to 15; λ1 and λ2 are set to 1 and 3, respectively. Regarding the GCN, the classifier vector size is 200 for NELL-ZS and 100 for Wiki-ZS. As in ZS-IMGC, the selection of similarity thresholds for creating semantic graphs is evaluated in Section 4.3; different K values are tested for DisenE and DisenKGAT, with the optimum performance reported in Table 2 and the complete results attached in Appendix A.

Overall Results. The results are presented on the right of Table 2. On NELL-ZS, our method achieves the best results on hits@10 and hits@5; DOZSL(RD+GAN) and DOZSL(RD+GCN) are both very competitive with the baseline RGCN-ZSL and better than the other baselines on hits@1 and MRR. On Wiki-ZS, two baselines RGAT+GAN and RGCN-ZSL

Ablation Studies
We conduct extensive ablation studies to analyze the impact of different factors in DOZSL, including the property-guided triple scoring, the neighborhood aggregation, the similarity threshold for constructing semantic graphs, and the classifier fusion.

Property Guided Triple Scoring. We replace the property-guided triple scoring in DOZSL(RD) and DOZSL(AGG) with the widely-adopted attentive triple scoring and keep the same setting of K. This leads to two new variants, denoted as DOZSL(RD_atten) and DOZSL(AGG_atten), respectively. The variants' results with GAN are reported in Table 3; the results with GCN are attached in Appendix B. We find that DOZSL(RD_atten) and DOZSL(AGG_atten) always obtain dramatically worse results than DOZSL(RD) and DOZSL(AGG), respectively, on all the datasets of the two tasks, with the only exception of DOZSL(RD_atten+GAN) on AwA. These results illustrate the effectiveness of our proposed property-guided triple scoring. The exception may be due to the imbalanced numbers of associated triples of different properties in AwA's ontology: imgc:hasAttribute has 1,562 associated triples, which can well train its corresponding component, while rdfs:subClassOf has only 197 associated triples, making its corresponding component under-fitted. The two components are concatenated and fed to the GAN together, so they may influence each other. In contrast, the GCN-based method, which performs independent feature propagation in isolated semantic graphs, suffers less from this imbalance issue.

Neighborhood Aggregation. In DOZSL, we aggregate information from all the neighboring concepts in the ontology, with an attention mechanism for combination. Here, we test a more straightforward solution, i.e., aggregating information from a neighborhood subset which only includes concepts that are connected by the property corresponding to the embedding component. This leads to new variants denoted by DOZSL(AGG_sub). The results with GAN are shown in Table 3, and the results with GCN
are in Appendix B. In comparison with DOZSL(AGG), DOZSL(AGG_sub) performs worse on most metrics across the two tasks, except for DOZSL(AGG_sub+GAN) on ImNet-O and DOZSL(AGG_sub+GCN) on NELL-ZS. The overall worse results of DOZSL(AGG_sub) indicate that learning a component embedding should (attentively) aggregate all the neighboring concepts rather than select a part of them according to specific properties. The exceptions may be due to the simple neighborhoods in NELL-ZS and ImNet-O and/or the independent propagation in each semantic graph.

Similarity Threshold and Classifier Fusion. We compare different similarity thresholds ranging from 0.85 to 0.999 for constructing semantic graphs, and compare different classifier fusion functions, under different ontology encoding methods. The results are reported in Figure 4 in Appendix C, from which we find that the optimum similarity threshold varies when different ontology encoding methods are used, and that the two fusion functions, Average and Linear Transformation, both positively contribute to the learning of the classifier. Please see Appendix C for more details.
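The property-guided component-wise triple scoring ablated above can be sketched as follows: each ontology property is bound to exactly one of the K embedding components, and a triple (h, p, t) is scored using only that component of the head and tail concept embeddings. The TransE-style distance used here is an illustrative choice; all names and dimensions are assumptions for the sketch, not the paper's exact score function.

```python
import numpy as np

K, DIM = 4, 200  # number of components and component dimension (illustrative)
rng = np.random.default_rng(0)

# Each concept has K component embeddings, one per semantic aspect
concept_emb = {c: rng.normal(size=(K, DIM))
               for c in ["league_players", "sports_league"]}
prop_emb = {"rdfs:domain": rng.normal(size=DIM)}
# The key idea: each property guides exactly one component
prop_to_component = {"rdfs:domain": 0}

def component_score(h, p, t):
    """Score a triple using only the component that property p is bound to."""
    k = prop_to_component[p]
    diff = concept_emb[h][k] + prop_emb[p] - concept_emb[t][k]
    return -float(np.linalg.norm(diff))  # higher = more plausible

score = component_score("league_players", "rdfs:domain", "sports_league")
```

Because each property only receives gradients through its own component, the components specialize to different semantic aspects, which is what the attentive-scoring variant loses.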

Case Study
We use examples from NELL-ZS to analyze the disentanglement of the concept embeddings we learned. In the left of Figure 3, we visualize the component embeddings of KG relations learned from NELL-ZS's ontology by DOZSL(RD), where different colors indicate different components. We find that i) the embeddings are clustered into different groups under each component's subspace, and ii) the component embeddings of each relation fall into different clusters across different components. These observations illustrate that i) our method indeed captures the semantic similarity among relation concepts under each semantic aspect, and ii) different relatedness is presented across different aspects.
Also, to further verify that different components represent different semantic aspects, for each relation, we randomly select two neighbors from the cluster of each component. The right of Figure 3 presents two examples. For the relation league_players, its two neighbors from the first component are league_teams and league_coaches; the head entity types of these three relations are identical, i.e., sports_league. Its two neighbors from the second component are athlete_beat_athlete and sports_team_position_athlete, whose tail entity types are all athlete. According to these two examples, we find that the four components respectively reflect four semantic aspects of the relations, i.e., rdfs:domain, rdfs:range, rdfs:subPropertyOf and rdfs:subClassOf, and we can also conclude that the semantics of one component are fixed across different relations.

CONCLUSION AND DISCUSSION
In this study, we focused on ontology-augmented ZSL and proposed a novel property-guided disentangled ontology embedding method.
With the new disentangled embeddings, different semantic aspects of ZSL classes are figured out and more fine-grained inter-class relationships are extracted, through which the ontology can be better utilized. To utilize these disentangled embeddings, we also developed a general ZSL framework named DOZSL, including a GAN-based generative model and a GCN-based propagation model. Extensive evaluations with ablation studies and case studies on five datasets of ZS-IMGC and ZS-KGC show that DOZSL often outperforms the state-of-the-art baselines and that its components are quite effective. DOZSL is compatible with both ZSL learners we developed, and together they lead to higher robustness and better performance. Meanwhile, the performance of DOZSL is less competitive than the state-of-the-art on one of the five datasets. This motivates us to take an in-depth analysis of this dataset and its ontology, and to develop more robust disentangled embedding methods and ZSL learners in the future. We also notice that some relation-aware GNNs such as RGCN achieve quite promising results on some datasets, which motivates us to study propagation-based ZSL learners with these GNNs. Lastly, we will apply and evaluate DOZSL in other tasks such as open information extraction and visual question answering.

A SENSITIVITY STUDY OF DISENE AND DISENKGAT
In this section, we study the sensitivity of the number of components K used in the baselines DisenE [18] and DisenKGAT [35].
Specifically, we set K to 2 and 4, two values with which the baselines perform well, and experiment with the GAN-based learner. The results on the five datasets of the two ZSL tasks are presented in Table 4. We find that DisenE achieves higher performance on all three ZS-IMGC datasets and on Wiki-ZS when K = 2, and better results on most metrics on NELL-ZS when K = 4. As for DisenKGAT, the optimum K values on AwA, ImNet-A, ImNet-O, NELL-ZS and Wiki-ZS are 4, 2, 2, 2 and 4, respectively.

B ABLATION STUDY OF THE ONTOLOGY ENCODER WITH GCN-BASED METHODS
In this section, we report the results of the ablation studies on the property-guided triple scoring and the neighborhood aggregation in the disentangled ontology encoder when the GCN-based learner is used. The results are shown in Table 5.

C ABLATION STUDY OF THE GCN-BASED LEARNER
In this section, we study the impact of the similarity threshold and the classifier fusion function under different disentangled ontology embeddings, using all our evaluation datasets. The results are presented in Figure 4. Specifically, we report the metric acc (i.e., the standard ZSL testing setting) for the ZS-IMGC task, and the metrics hits@10 and MRR for the ZS-KGC task. Moreover, the curve of the Average fusion function is marked with circles, while the curve of the Linear Transformation fusion function is marked with triangles. Different ontology encoding methods are presented in different colors.
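The semantic-graph construction whose threshold is studied above can be sketched as follows: two classes are connected in the graph of one component when the cosine similarity of their component embeddings reaches the threshold. The function name and toy embeddings are illustrative; the thresholds compared in the ablation range from 0.85 to 0.999.

```python
import numpy as np

def semantic_graph(comp_emb, threshold=0.95):
    """Build edges of one component's semantic graph by cosine similarity."""
    # Normalize rows so the dot product equals cosine similarity
    x = comp_emb / np.linalg.norm(comp_emb, axis=1, keepdims=True)
    sim = x @ x.T
    return [(i, j)
            for i in range(len(x)) for j in range(i + 1, len(x))
            if sim[i, j] >= threshold]

# Toy component embeddings for three classes: the first two are nearly parallel
emb = np.array([[1.0, 0.00],
                [0.99, 0.14],
                [0.0, 1.00]])
edges = semantic_graph(emb, threshold=0.95)  # only classes 0 and 1 connect
```

A higher threshold yields sparser graphs, which explains why the optimum value depends on how tightly the ontology encoder clusters each component's embeddings.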

Figure 1 :
Figure 1: (a) an ontology segment for zero-shot image classification where Zebra is an unseen class while the other animals are seen classes; and (b) an ontological schema segment for zero-shot KG completion where has_office_in_city is an unseen relation while the other relations are seen.The unseen class (or relation) connects itself to different seen classes (or relations) in different semantic aspects.

Figure 2 :
Figure 2: Illustration of DOZSL with K = 3. Different colors indicate different semantic aspects. One ZSL learner is generation-based, generating samples for classes (each of which corresponds to an ontology concept); the other is propagation-based, propagating features among classes based on the disentangled graphs generated from the original ontology.

3.2.1 Generation-based. We first get the embedding of each class by concatenating all K component embeddings of its corresponding ontology concept (i.e., e = [e^1; e^2; …; e^K]).
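The concatenation above can be sketched directly; the component dimension (200) and K = 3 are illustrative values.

```python
import numpy as np

K, DIM = 3, 200  # number of components and per-component dimension (illustrative)

# Component embeddings e^1 ... e^K for one ontology concept (toy values)
components = [np.full(DIM, float(k)) for k in range(K)]

# The class embedding is their concatenation e = [e^1; ...; e^K],
# which is then fed to the generator together with a noise vector z
class_emb = np.concatenate(components)  # shape (K * DIM,) = (600,)
```

This concatenated vector is what conditions the GAN so that generated features reflect all semantic aspects at once.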

Figure 3 :
Figure 3: Cases of relations in NELL-ZS. Best viewed in color.

Figure 4 :
Figure 4: Results of GCN-based DOZSL variants using different ontology encoders with different similarity thresholds and different classifier fusion functions. Best viewed in color.
2.1.1 Zero-shot Image Classification (ZS-IMGC). ZSL has been thoroughly studied in Computer Vision for image classification with new classes whose images are not seen during training. Formally, let D_tr = {(x, y) | x ∈ X_s, y ∈ Y_s} be the training set, where x is the CNN feature of a training image and y is its class in Y_s, the set of seen classes; and let D_te = {(x, y) | x ∈ X_u, y ∈ Y_u} be the testing set, where Y_u, the set of unseen classes, has no overlap with Y_s. Given D_tr and some auxiliary information A for describing the relationships between seen and unseen classes, ZS-IMGC aims to learn a classifier for each unseen class. There are often two evaluation settings: standard ZSL, which recognizes the testing samples in X_u by only searching in Y_u, and generalized ZSL, which recognizes the testing samples in X_s ∪ X_u by searching in Y_s ∪ Y_u.
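The difference between the two evaluation settings can be sketched with a toy nearest-classifier prediction: both settings score a test feature against class vectors, but standard ZSL restricts the candidates to unseen classes while generalized ZSL searches seen and unseen classes together. The classifier vectors and classes below are toy stand-ins, not learned values.

```python
import numpy as np

def predict(feat, classifiers, candidates):
    """Return the candidate class whose classifier vector scores highest."""
    scores = {c: float(feat @ classifiers[c]) for c in candidates}
    return max(scores, key=scores.get)

classifiers = {"zebra": np.array([1.0, 0.0]),   # unseen class
               "horse": np.array([0.9, 0.4])}   # seen class
feat = np.array([0.8, 0.5])  # a test feature of a zebra image (toy)

standard = predict(feat, classifiers, ["zebra"])             # search Y_u only
generalized = predict(feat, classifiers, ["zebra", "horse"])  # search Y_s ∪ Y_u
```

Here the standard setting gets the unseen class right, while the generalized setting is drawn to the similar seen class, which is why generalized ZSL is the harder benchmark.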

Table 1 :
Statistics of the benchmarks in the two ZSL tasks and their ontologies. Trip./Conc./Prop. in the column # Ontologies denotes the number of triples/concepts/properties. S/U denotes seen/unseen classes. Tr/V/Te is short for training/validation/testing.