Representing SNOMED CT RF2 Distribution in OWL

This document describes how a SNOMED CT RF2 distribution should be represented in OWL. It is a generalization of an earlier specification written in Perl, called the "Spackman Perl Script", and will work with any reasonable RF2 content.

Introduction

For some time now, the SNOMED CT International RF2 distribution has included a Perl script named "tls2_StatedRelationshipsToOwlKRSS_INT_<date>.pl", which can be used to generate an OWL representation of the distribution in RDF/XML, OWL Functional or KRSS syntax. This script is the closest that the SNOMED International organization has come to defining an official OWL representation for SNOMED CT. This approach has a number of shortcomings, however, including:

The transformation only works with the SNOMED CT International release. It cannot be applied unchanged to other distributions.
The transformation can only emit descriptions in one language
There are subtle but significant differences between the various output formats

In addition, while this Perl script could be viewed as a formal specification, one needs considerable coding expertise to understand the actual transformation rules expressed within the script.
The intent of this specification is to "reverse engineer" the Perl transformation and document the transformation rules in such a way that:

They can be consistently applied to any SNOMED CT release or extension
They apply to all languages
Multiple language descriptions can appear in one OWL rendering
They are described in such a way that they can be implemented in any target language
The semantics of the output is independent of the format

To the best of the authors' knowledge, this specification describes the intent of the Perl transformation, as it exists today. One of the goals of this specification is that it can serve as a baseline where subsequent changes and enhancements to the OWL representation of SNOMED CT can be clearly identified and where tooling developers can clearly understand the ramifications of the changes to both the output and the transformation tools themselves.

Differences Between this Specification and the Spackman Transform

The following list summarizes the intentional differences between this specification and the Spackman transformation:

The Spackman transformation wraps all of the rdfs:subClassOf assertions in an owl:intersectionOf wrapper. This specification does not.
The Spackman transformation does not take the language refset into account when emitting text definitions. This frequently results in the wrong definition(s) being generated. This specification treats text definitions in the same way as descriptions with the exception that it does not distinguish between preferred and acceptable language refset entries.
The Spackman transformation generates ObjectProperty definitions for all active descendants of 410662002 | Concept model attribute |, with the exception of 116680003|is a|. This specification emits an additional ObjectProperty definition for 410662002 | Concept model attribute | itself and defines all descendants as its direct or indirect subproperty.
The Spackman transformation uses the following predicates for descriptions and definitions:
1. sctf:Description.term.{specific language}.preferred "{text}"@{general language} (e.g. sctf:Description.term.en-us.preferred "Due to"@en)
2. sctf:Description.term.{specific language}.synonym "{text}"@{general language}
3. sctf:Description.TextDefinition.term "{text}"@{general language}

This specification uses the SKOS predicates instead:

Approach

SQL Notation

This specification describes how to transform the SNOMED CT RF2 distribution content into an OWL equivalent. As part of this process, we need a way to specify which RF2 files are used, which columns are transformed and how they are selected. We have chosen to use the SQL query syntax for this purpose, as its syntax and semantics are well understood and it makes it possible for us to test and verify the correctness of the specification. It should be noted that this specification does not require that SQL be used in an actual implementation. Any implementation that produces the same results as described by the SQL in this document would be considered "conformant".

Turtle Notation

A second part of the transformation process requires a way to specify the OWL statements that are generated and their relationship to the RF2 tables. We have chosen the Turtle RDF syntax to represent the results, where substitutions are represented with a "$" prefix and italics. As an example, the assertion that the variable named "subject" is declared to be an instance of an OWL Class would be asserted as:

sct:$subject rdf:type owl:Class .

This specification does not require that the output of an actual implementation be in the Turtle RDF format, but whatever output format is used, it must be semantically equivalent to the Turtle as specified in this document.
A short synopsis of the subset of the Turtle notation appears in Appendix B: RDF Turtle Notation.
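The "$" substitution convention can be tried out with a minimal Python sketch. It uses the standard library's string.Template, which happens to use the same "$name" syntax; the helper name render is an assumption made for this illustration and is not part of the specification.

```python
from string import Template

# Template text copied from the example above; "$subject" is the substitution variable.
CLASS_DECLARATION = Template("sct:$subject rdf:type owl:Class .")


def render(template: Template, **values: object) -> str:
    """Fill in every $variable; raises KeyError if a variable has no value."""
    return template.substitute({name: str(value) for name, value in values.items()})


# 74400008 | Appendicitis | is the example concept used elsewhere in this document.
line = render(CLASS_DECLARATION, subject=74400008)
print(line)
```

An implementation that emits another syntax would replace the template text, not the substitution step.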

Notation

RF2 files are represented in bold. Examples: Concept, StatedRelationship

RF2 fields are represented in a monospace font. Examples: id, acceptabilityId

Context variables are represented in bold monospace. Examples: LANGUAGE_MAP, VERSION, RIGHT_IDS

Turtle output is represented in monospace, with substitution variables as monospace italics. Example:
sct:$subject rdfs:subClassOf sct:$destinationId .

SNOMED CT concepts are represented using the conceptReference production as defined in the SNOMED CT Compositional Grammar Specification v2.3.1. For the sake of brevity, this specification uses the (US) English preferred name rather than the complete FSN. Example: 74400008 | Appendicitis |

Lists are represented as comma-separated values inside square brackets. Example: LANGUAGES:["en-us", "en-gb"]

Maps are represented as comma-separated entries within curly braces, with a colon between the key and value. Example:
LANGUAGE_MAP: {"en-us": 900000000000509007,
"en-gb": 900000000000508004}

Map lookup is indicated via "$MAP(key)". Example:
$LANGUAGE_MAP("en-gb") = 900000000000508004

SNOMED CT RF2 Files

The transformations in this document apply to the Snapshot representation of a SNOMED CT RF2 distribution. Transformations of the Full or Delta representations are not defined in this document.

The table below shows the Release Format 2 (RF2) distribution files and corresponding fields that are used in the RF2 to OWL transformation. Fields identified as "FILTER" are used to determine whether a given entry (row) is used to generate output, but are not represented in the output itself.

File	Field	Purpose
Concept	`id`	Used to generate the subject IRI
	`active`	FILTER - only active concepts are included in the OWL output
	`moduleId`	FILTER - a separate ontology is generated for each module
	`definitionStatusId`	determines whether the OWL definition uses `owl:equivalentClass` or `rdfs:subClassOf`
Description	`active`	FILTER - only active descriptions are included in the OWL output
	`moduleId`	FILTER - a separate ontology is generated for each module
	`conceptId`	Link to Concept.`id`
	`languageCode`	Language facet of RDF literal string
	`typeId`	Determines the specific predicate for description text (One of: 900000000000013009 \| Synonym \| or 900000000000003001 \| Fully specified name \|)
	`term`	Literal text of label or description
TextDefinition	`active`	FILTER - only active definitions are included in the OWL output
	`moduleId`	FILTER - a separate ontology is generated for each module
	`conceptId`	Link to Concept.`id`
	`languageCode`	Language facet of RDF literal string
	`typeId`	Determines the specific predicate for definition text (900000000000550004 \| Definition \|)
	`term`	Definition text
StatedRelationship (Note: the Perl transformation states that it only works with the StatedRelationship file. Nothing in the transformation rules themselves prevents them from being applied to the Relationship file as well, but caution should be used, as multiple modules and their inferences could get mixed in the latter file.)	`active`	FILTER - only active relationships are included
	`sourceId`	Used to generate IRI of the subject
	`destinationId`	Used to generate IRI of the object
	`relationshipGroup`	Definition nesting
	`typeId`	Used to generate IRI of the predicate
	`characteristicTypeId`	FILTER - only descendants of 900000000000006009 \| Defining relationship \| (i.e. stated, inferred) are included in the transformation output.
	`modifierId`	FILTER - only rows with the existential modifier, 900000000000451002 \| Some \|, are included in the transformation output.
Language	`active`	FILTER - only active language entries are included
	`moduleId`	FILTER - only language entries for the target module are included in the output.
	`acceptabilityId`	Used to determine whether a description is preferred or acceptable. (One of: 900000000000548007 \| Preferred \| or 900000000000549004 \| Acceptable \|)
	`referencedComponentId`	FILTER - id of associated description row.
Transitive	`concept`	Concept identifier
	`ancestor`	Concept identifier of parent or parent's parent, etc.
DescriptionAndDefinition	`active`	FILTER - only active descriptions and definitions are included in the OWL output
	`moduleId`	FILTER - a separate ontology is generated for each module
	`conceptId`	Link to Concept.`id`
	`typeId`	Determines the specific predicate for description text (One of: 900000000000013009 \| Synonym \| or 900000000000003001 \| Fully specified name \| or 900000000000550004 \| Definition \|)
	`languageCode`	Language facet of RDF literal string
	`term`	Description or TextDefinition text
LanguageNames	`languageText`	ISO 639-1 language name from LANGUAGES
	`refsetId`	Corresponding SNOMED CT `refsetId`

Note that moduleId is not used as a filter on the Relationship file. While it is theoretically possible for multiple modules to collectively define the meaning of a concept, the ramifications of a meaning that shifts depending on which module is referenced are problematic. For this reason, all active qualifying relationship rows are used in the definition of a concept.
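The relationship row filters described in the table can be sketched in Python as follows. This is an illustration only: the Relationship tuple, the qualifying helper and the sample rows are hypothetical, and the two identifiers in DEFINING_TYPES are an assumption about which characteristic types descend from 900000000000006009 | Defining relationship |. A real implementation would derive that set from the release content.

```python
from typing import Iterable, List, NamedTuple

SOME = 900000000000451002  # | Some |, the existential modifier named in the table

# Assumption: the stated and inferred characteristic types are the descendants
# of 900000000000006009 | Defining relationship |. Hard-coded here for brevity.
DEFINING_TYPES = frozenset({900000000000010007, 900000000000011006})


class Relationship(NamedTuple):
    active: int
    moduleId: int
    sourceId: int
    destinationId: int
    relationshipGroup: int
    typeId: int
    characteristicTypeId: int
    modifierId: int


def qualifying(rows: Iterable[Relationship]) -> List[Relationship]:
    """Keep active, defining, existential rows; moduleId is deliberately ignored."""
    return [
        r for r in rows
        if r.active == 1
        and r.characteristicTypeId in DEFINING_TYPES
        and r.modifierId == SOME
    ]


# Illustrative rows only; module and destination identifiers are hypothetical.
rows = [
    Relationship(1, 111, 74400008, 1001, 0, 116680003, 900000000000010007, SOME),
    Relationship(0, 111, 74400008, 1002, 0, 116680003, 900000000000010007, SOME),  # inactive
    Relationship(1, 222, 74400008, 1003, 1, 363698007, 900000000000010007, SOME),  # other module
    Relationship(1, 111, 74400008, 1004, 0, 116680003, 900000000000227009, SOME),  # not defining
]
kept = qualifying(rows)
print([r.destinationId for r in kept])
```

The third row is kept even though it comes from a different module, which matches the statement that moduleId is not a filter for relationships.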

Transitive File

The Transitive file represents the transitive closure of the "isA" relationship in the StatedRelationship table. In SQL, this table could be created via the following steps:

CREATE TABLE Transitive AS
SELECT sourceId concept, destinationId ancestor
FROM StatedRelationship
WHERE active=1 AND
typeId = 116680003;

ALTER TABLE Transitive ADD UNIQUE k1 (concept, ancestor);

Followed by repeated execution of the statement below until no new rows are inserted:

INSERT INTO Transitive
SELECT DISTINCT t1.concept, t2.ancestor
FROM Transitive t1, Transitive t2
WHERE t1.ancestor = t2.concept AND
      NOT EXISTS (SELECT 1 FROM Transitive t3
                  WHERE t3.concept = t1.concept AND
                        t3.ancestor = t2.ancestor);
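The "repeat until no new rows are inserted" loop can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not a reference implementation: SQLite stands in for the SQL dialect used above (so the unique key is created with CREATE UNIQUE INDEX), DISTINCT is added to the initial load so duplicate pairs cannot violate that key, and the function name build_transitive and the toy data are hypothetical.

```python
import sqlite3

IS_A = 116680003  # | is a |


def build_transitive(conn: sqlite3.Connection) -> int:
    """Create Transitive from StatedRelationship and return its final row count."""
    conn.execute("DROP TABLE IF EXISTS Transitive")
    conn.execute(
        "CREATE TABLE Transitive AS "
        "SELECT DISTINCT sourceId AS concept, destinationId AS ancestor "
        "FROM StatedRelationship WHERE active = 1 AND typeId = 116680003"
    )
    conn.execute("CREATE UNIQUE INDEX k1 ON Transitive (concept, ancestor)")

    def count() -> int:
        return conn.execute("SELECT COUNT(*) FROM Transitive").fetchone()[0]

    before, after = -1, count()
    while after != before:  # stop once a pass adds no new rows
        before = after
        conn.execute(
            "INSERT INTO Transitive "
            "SELECT DISTINCT t1.concept, t2.ancestor "
            "FROM Transitive t1 JOIN Transitive t2 ON t1.ancestor = t2.concept "
            "WHERE NOT EXISTS (SELECT 1 FROM Transitive t3 "
            "                  WHERE t3.concept = t1.concept "
            "                    AND t3.ancestor = t2.ancestor)"
        )
        after = count()
    return after


# Toy hierarchy 1 -> 2 -> 3 -> 4, plus one inactive row and one non-"is a" row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE StatedRelationship "
    "(active INTEGER, sourceId INTEGER, destinationId INTEGER, typeId INTEGER)"
)
conn.executemany(
    "INSERT INTO StatedRelationship VALUES (?, ?, ?, ?)",
    [(1, 1, 2, IS_A), (1, 2, 3, IS_A), (1, 3, 4, IS_A),
     (0, 4, 5, IS_A), (1, 1, 9, 363698007)],
)
total = build_transitive(conn)
closure = set(conn.execute("SELECT concept, ancestor FROM Transitive"))
```

Comparing row counts before and after each pass avoids relying on driver-specific "rows affected" reporting.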

DescriptionAndDefinition File

The DescriptionAndDefinition file represents the union of the Description and TextDefinition files. These files are distributed separately in the RF2 distribution because the maximum size of a Description term entry is considerably smaller than that of a Definition. With this exception, the two files are structurally identical.

CREATE TABLE DescriptionAndDefinition AS SELECT * FROM TextDefinition;

INSERT INTO DescriptionAndDefinition SELECT * FROM Description;

LanguageNames File

The LanguageNames file combines the information from the LANGUAGES and LANGUAGE_MAP metadata entries into a form that can be referenced within the SQL syntax used in this document. It contains a row for each LANGUAGE_MAP entry that has an entry in the LANGUAGES table. In the following example we create a table that carries a single entry that corresponds to the example in the next section.

CREATE TABLE LanguageNames (languageText char(36) NOT NULL,
                            refsetId bigint NOT NULL,
                            PRIMARY KEY (refsetId));

INSERT INTO LanguageNames VALUES ('en-us', 900000000000509007) ;
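The rule that LanguageNames holds one row per LANGUAGE_MAP entry that also appears in LANGUAGES can be sketched in Python. The helper name language_name_rows is hypothetical, and SQLite is used here only as a convenient stand-in for whatever SQL engine an implementation chooses.

```python
import sqlite3
from typing import Dict, List, Tuple

# Values taken from the examples in this document.
LANGUAGES = ["en-us"]
LANGUAGE_MAP = {"en-us": 900000000000509007, "en-gb": 900000000000508004}


def language_name_rows(languages: List[str],
                       language_map: Dict[str, int]) -> List[Tuple[str, int]]:
    """One (languageText, refsetId) row per map entry whose key is requested."""
    wanted = set(languages)
    return [(code, refset) for code, refset in language_map.items() if code in wanted]


rows = language_name_rows(LANGUAGES, LANGUAGE_MAP)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LanguageNames ("
    "languageText CHAR(36) NOT NULL, "
    "refsetId BIGINT NOT NULL, "
    "PRIMARY KEY (refsetId))"
)
conn.executemany("INSERT INTO LanguageNames VALUES (?, ?)", rows)
stored = conn.execute("SELECT languageText, refsetId FROM LanguageNames").fetchall()
print(stored)
```

With LANGUAGES set to ["en-us"], the "en-gb" map entry is skipped, which reproduces the single-row table shown above.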

OWL Transformation Context

The OWL transformation requires several contextual variables. Some of these (MODULE, VERSION, LANGUAGES) are inputs that must be provided by the user. Many of the others, however, should really be part of a distribution. Appendix B includes recommendations on where and how the variables below might be included in future RF2 distributions.

Identifier	Description	Type	Example
`MODULE`	The SNOMED CT concept identifier of the module being transformed.	Input	900000000000207008
`MODULE_LABEL`	The formal textual name of the module	Metadata	"SNOMED Clinical Terms, International Release, Stated Relationships in OWL RDF"
`MODULE_DESCRIPTION`	A textual description of module including its purpose and derivation	Metadata	"Generated as OWL RDF/XML from SNOMED CT release files by Perl transform script Input concepts file was ..."
`MODULE_COPYRIGHT`	Copyright information for the module	Metadata	"Copyright 2015 The International Health Terminology Standards Development Organisation (IHTSDO). ... "
`VERSION`	The version of the module being transformed. Typical format yyyymmdd	Input	`20160131`
`VERSION_DESCRIPTION`	A textual description of the specific version	Metadata	"International Release, Core Module, Release Date: 20160131"
`LANGUAGES`	A list of ISO 639-1 language identifiers. Only descriptions and definitions in the specified language(s) will be emitted.	Input	["en-us"]
`LANGUAGE_MAP`	A map from ISO 639-1 identifier(s) in LANGUAGES to the corresponding SNOMED CT Language Refset Identifier.	Metadata	{"en-us": 900000000000509007, "en-gb": 900000000000508004, "zh": 722128001, "es": 448879004}
`NEVER_GROUPED_LIST`	The set of concept identifiers that are guaranteed to never appear within a role group	Metadata	[123005000, 272741003, 127489000, 411116001]
`RIGHT_IDS`	A map from a set of concept identifiers to a list of their "right identifiers" or property chains	Metadata	{363701004 : 127489000}
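
One way an implementation might carry these context variables is a single immutable record, sketched below in Python. The class name TransformContext, its field names and the language_refsets helper are assumptions made for this illustration; the specification only defines the variables themselves. RIGHT_IDS is typed as a map to lists, following the description column, so the single value in the example is written as a one-element list.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class TransformContext:
    # Inputs that must be supplied by the user.
    module: int
    version: str
    languages: List[str]
    # Metadata; empty defaults are a convenience of this sketch only.
    module_label: str = ""
    module_description: str = ""
    module_copyright: str = ""
    version_description: str = ""
    language_map: Dict[str, int] = field(default_factory=dict)
    never_grouped: FrozenSet[int] = frozenset()
    right_ids: Dict[int, List[int]] = field(default_factory=dict)

    def language_refsets(self) -> List[int]:
        """Resolve each requested language; raises KeyError if it is unmapped."""
        return [self.language_map[code] for code in self.languages]


# Example values copied from the table above.
ctx = TransformContext(
    module=900000000000207008,
    version="20160131",
    languages=["en-us"],
    language_map={"en-us": 900000000000509007, "en-gb": 900000000000508004},
    never_grouped=frozenset({123005000, 272741003, 127489000, 411116001}),
    right_ids={363701004: [127489000]},
)
print(ctx.language_refsets())
```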

SQL

The SQL language is used in this document to specify the particular rows, columns and linkages between the various RF2 files. It is used as a convenient shorthand for selection and linkage criteria. Implementations may use any technology they choose to realize this specification as long as the results are consistent with what is described in the SQL below.

Transformation

Transformation Namespaces

The following namespaces are used in the transformation. The namespace URIs are normative and must be used exactly as specified. The namespace names listed below are not required, but are strongly recommended for readability.

Namespace Name	Namespace URI	Description
rdf:	http://www.w3.org/1999/02/22-rdf-syntax-ns#	RDF built-in vocabulary
rdfs:	http://www.w3.org/2000/01/rdf-schema#	RDF Schema vocabulary
owl:	http://www.w3.org/2002/07/owl#	Web Ontology Language (OWL)
sctf:	http://snomed.info/field/	Description and Definition predicates
sct:	http://snomed.info/id/	SNOMED CT Concept identifiers
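
The namespace table can be carried in code as a simple prefix map, sketched below in Python. The URIs are copied from the table; the helper names turtle_prefixes and expand are hypothetical and only show how the prefixes might be declared and resolved.

```python
from typing import Dict

# Prefix names are recommendations; the URIs are normative (copied from the table).
NAMESPACES: Dict[str, str] = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "sctf": "http://snomed.info/field/",
    "sct": "http://snomed.info/id/",
}


def turtle_prefixes(namespaces: Dict[str, str]) -> str:
    """Render one Turtle @prefix declaration per namespace."""
    return "\n".join(f"@prefix {name}: <{uri}> ." for name, uri in namespaces.items())


def expand(prefixed_name: str, namespaces: Dict[str, str] = NAMESPACES) -> str:
    """Expand e.g. 'sct:74400008' to a full IRI; raises KeyError on unknown prefixes."""
    prefix, local = prefixed_name.split(":", 1)
    return namespaces[prefix] + local


header = turtle_prefixes(NAMESPACES)
iri = expand("sct:74400008")
print(header)
print(iri)
```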

Representation of SNOMED in OWL.v0.9