Situation

Due to a historical need to support systems that could only store and display ASCII characters, descriptions that one would normally expect to contain superscript and subscript characters have employed some sort of workaround. The RF2 specification states that SNOMED CT supports UTF-8 encoding and so, in principle, we should be able to store and display any character in the Unicode character set (including superscript and subscript characters) where appropriate. See https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Superscripts_and_subscripts_block

This was raised to SI in a FreshDesk ticket (#34218) which included various examples shown in the right hand panel.

Clarification of Original Enquiry

The original query has since been clarified: it was more a concern about the inconsistent use of carets and angle brackets to indicate superscript and subscript markup than a request to move to a more direct representation. However, the fact that SNOMED CT does support UTF-8 encoding means that someone might want to do this, so the discussion is still relevant.

Questions

Is this something we want to do? Is the benefit of "looking right" outweighed by the possible impact to vendor systems? Would we need to produce a technical preview containing some examples so that vendors can test out their systems?

What do we do about character folding, i.e. would we expect a search on H2O to match H₂O?

Presumably we need to leave existing synonyms in place if more stylish variants can be added.

What other considerations are there?
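On the character folding question above, one possible approach is Unicode compatibility normalisation (NFKC), which maps subscript and superscript digits to their plain equivalents. The following is a minimal sketch only; the function names `fold` and `matches` are hypothetical and this is not how any particular terminology server is known to work.

```python
import unicodedata


def fold(term: str) -> str:
    """Fold compatibility characters (e.g. subscript digits) and case."""
    return unicodedata.normalize("NFKC", term).casefold()


def matches(query: str, term: str) -> bool:
    """Substring match on the folded forms, so it works in both directions."""
    return fold(query) in fold(term)


print(matches("H2O", "H\u2082O"))   # plain query against subscript term
print(matches("H\u2082O", "H2O"))   # subscript query against plain term
print(fold("10\u00b3"))             # superscript three becomes a plain 3
```

Note that folding 10³ gives 103, which loses the exponent, so folded text is probably best kept as a search index alongside the original term rather than replacing it.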

Tooling Investigation

Attempting to add a synonym of H₂O to 11713004 |Water (substance)|, I tried looking for the Unicode character U+2082 in the Windows Character Map to copy it into SI's Authoring Platform. However, the characters in this map apparently jump from U+207F to U+2090.

Stack Overflow suggested that this was an issue with the font and that a font that specifically targets "Unicode" should include those characters...and so it does:

So now that I can get hold of a subscript 2, let's try to paste that into a description.

Well the tooling seems a little unhappy and the display isn't very readable....and in fact when attempting to save we hit further problems:

OK so whatever else we do, we need the tooling to error out gracefully.
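As an aside to the Character Map hunt above, the character can also be produced directly from its code point. This sketch, which assumes nothing about the Authoring Platform itself, builds the term and shows the bytes that a UTF-8 encoded RF2 file would contain for it.

```python
# U+2082 is SUBSCRIPT TWO; build the term from the code point directly.
SUBSCRIPT_TWO = "\u2082"
term = "H" + SUBSCRIPT_TWO + "O"

# Encode as UTF-8: the subscript takes three bytes, the letters one each.
encoded = term.encode("utf-8")
print(term, encoded.hex(" "))

# A system that handles UTF-8 correctly should round-trip the term unchanged.
assert encoded.decode("utf-8") == term
```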

Work Done

A report was written (see this java class) to list existing uses of super/subscript workarounds, and this led to an update to the Editorial Guide via https://projects.jira.snomed.org/browse/GC-906
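The report itself is the Java class referenced above and its logic is not reproduced here. Purely as an illustration of the kind of check involved, the sketch below flags terms that contain a caret followed by a character, or a short token wrapped in reversed angle brackets. The pattern, the function name `find_workarounds` and the sample list are assumptions; the sample terms are taken from examples elsewhere on this page.

```python
import re

# Assumed pattern: "^x" (caret markup) or ">x<" (reversed angle brackets
# around one to three non-space characters). The real report may differ.
WORKAROUND = re.compile(r"\^\S|>[^<>\s]{1,3}<")


def find_workarounds(terms: list[str]) -> list[str]:
    """Return the terms that appear to use an ASCII super/subscript workaround."""
    return [term for term in terms if WORKAROUND.search(term)]


sample_terms = [
    "alpha^+^ thalassemia",
    "normal Hb A>2<, type 2",
    "10^3",
    "Water (substance)",
    "H2O",
]
for hit in find_workarounds(sample_terms):
    print(hit)
```

A term such as H2O is not flagged, which is exactly the identification gap discussed under Items of Concern.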

2024 Update

This topic was discussed again at the October 2024 Business Meetings in Seoul. It was noted that the SI tooling issues mentioned above have since been resolved and, in fact, we discovered that there are already some subscript and superscript characters in the International Edition. So, with SNOMED CT having been advertised as supporting UTF-8 since 2002, there is no technical issue holding us back from using super/subscript characters. This has always been primarily a question of what vendors are able to display in their systems and, as noted above, that depends on ensuring the use of a Unicode font.

Editorial Guidance

Current Editorial Guidance (Punctuation and Symbols) assumes that super/subscript characters can only be represented using an ASCII workaround such as the caret symbol, e.g. 10^3 for 10³, but it also specifically says: "Current guidance for substance and product hierarchies is to not create new instances containing symbols for superscript and subscript." So this discussion needs to include the Editorial Advisory Group, to agree how and where these characters can be used.

Potential Solutions

Ideally the preferred term would be the one to feature the super/subscript characters where that representation is the one that is most generally accepted. However, we need to cater for the fact that systems that are less technologically adaptable (and which would struggle to display such characters correctly) are also likely to be systems that would struggle to pivot to any additional complexity, such as a new language refset which identifies descriptions containing exotic characters. Here follow some potential solutions, with discussion:

Solution	Discussion
International Edition vs Country Extensions	Since the International Edition of SNOMED CT is inherited by all extensions, it must necessarily be more cautious when introducing changes that would have unknown consequences to downstream implementations. However, country extensions may be in a better position to contact organisations which use their extension, and gauge their reaction to introducing special characters. Country extensions may be able to advance the use of special characters more quickly than the International Edition.
Use super/subscript characters in the Preferred Term	This is the most straightforward solution which doesn't require any added complexity around extensions or additional language reference sets. It is the solution most likely to result in pushback from the community and unforeseen consequences, and it would require robust Editorial Guidance to be in place before the introduction of such descriptions.
Specify as acceptable synonym and allow countries to specify own preferences	This is perhaps the gentlest option that allows us to move forward with more advanced notation. In this solution the description containing the super/subscript symbols would be a normal acceptable synonym and therefore less likely to cause a problem for downstream implementations as they would likely display the Preferred Term in the first instance. Country extensions could then choose to specify the super/subscript description as their Preferred Term - if they were confident that this would not cause a problem for users of their extension.
Phased Approach	A compromise might be to start softly and just start introducing special characters in acceptable synonyms with no other complexity. If this were to propagate for a year or two, then it would give implementers a chance to start adapting to the situation and perhaps enhance their existing display capabilities, before any further changes were made such as using these characters in a preferred term.
Language Reference Set to indicate preferred term in capable systems	This approach extends the previous solution of using these characters in acceptable synonyms, by identifying them in a new "Symbol Friendly" language reference set. This language reference set could then be used by implementers who were capable of displaying symbols, by using this language reference set first, and then falling back to the standard language reference sets if a "symbol" description was not present for that concept. This is a complex solution, unlikely to be popular, but it does have the advantage that we would then have an easy way to identify these "special" descriptions in any system.
Community Extension to add more capable terms and override language reference set entries	A community content extension could be set up, which would add descriptions featuring super/subscript characters (and potentially other symbols), and override existing language reference set preferences so that the symbol friendly terms become the preferred terms. This would be a costly solution for SNOMED International to implement and requires the completion of the "pick'n'mix" SNOMED Edition Packaging solution. However, it does keep the symbol variant descriptions away from systems that would not be able to consume them and ensures that only systems that want to see these symbols would have them.
Additional Description Type	It has been discussed that a new Description Type could be created to ensure that descriptions containing special characters are kept very distinct. This increase in complexity and the fact that it would force implementers to make code changes for relatively minor benefit seems likely to be an unpopular approach.
National Extension Override	Use the correct characters in the Preferred Term and invite extensions to replace them with the alternatives of their choice.
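The "Symbol Friendly" language reference set option in the table above relies on a fallback lookup. The sketch below shows that lookup order only. It is heavily simplified: real language reference sets record acceptability per description identifier, whereas here each refset is modelled as a plain mapping from concept ID to a display term. The function name, the refset contents and the concept ID 999000000 are hypothetical.

```python
from typing import Optional


def preferred_term(concept_id: str, refsets: list[dict[str, str]]) -> Optional[str]:
    """Return the term from the first refset that has an entry for the concept."""
    for refset in refsets:
        term = refset.get(concept_id)
        if term is not None:
            return term
    return None


# Hypothetical content: a symbol-friendly refset covering only some concepts,
# and a standard refset covering all of them.
symbol_friendly = {"11713004": "H\u2082O"}
standard = {"11713004": "Water", "999000000": "Example term"}

# A symbol-capable system consults the symbol-friendly refset first.
print(preferred_term("11713004", [symbol_friendly, standard]))
print(preferred_term("999000000", [symbol_friendly, standard]))

# A system that cannot display symbols simply omits that refset.
print(preferred_term("11713004", [standard]))
```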

Items of Concern

Concern	Discussion
Searching	It is important that searching works both ways, so for example that searching for H2O would match H₂O and vice versa. We will need to ensure either that terms are created that also contain those characters in a 'normal' style, or that character folding techniques are used in back end storage systems so that these two characters are considered to be equivalent when searching.
Validation	Similar to the question of searching, we would need to address whether a term submitted for validation, e.g. in the FHIR $validate-code operation, should be considered a match where the only difference is a normal versus a sub/superscript version of a character.
Drug Safety / Pharmacovigilance	Specific to substances and medicinal products, there is a strong push from the pharmaceutical community to avoid using super/subscript characters as they are harder to read, may be confused or misinterpreted and could lead to identification or dosing mistakes.
Identification	We have a problem in identifying which descriptions might benefit from super/subscript characters where some ASCII workaround like ^ or >N< has not been employed. For example, take the existing acceptable synonym H2O: we all know that it might be written H₂O, but what about less well known cases, and how could we programmatically detect these? This might be a good task for AI, to look through SNOMED CT descriptions and point out the terms that would more commonly be written using super or subscript characters.
Multiple characters	Unicode is a very broad church and there are many symbols that look similar and that could be used inappropriately. Fortunately there only appears to be a single instance of super and subscript 2 and 3, but we could have a problem setting the rules for the prime / apostrophe / single quote character.
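To illustrate the "Multiple characters" concern in the table above, the sketch below lists a few look-alike characters with their official Unicode names and checks whether NFKC compatibility normalisation folds each one to the plain apostrophe. The candidate list is an assumption chosen for illustration and is not exhaustive; `describe` is a hypothetical helper.

```python
import unicodedata

# Assumed candidates: apostrophe, right single quotation mark, prime,
# and modifier letter prime.
CANDIDATES = ["\u0027", "\u2019", "\u2032", "\u02b9"]


def describe(ch: str) -> tuple[str, str, bool]:
    """Return (code point, Unicode name, whether NFKC folds it to U+0027)."""
    code_point = "U+%04X" % ord(ch)
    folds_to_apostrophe = unicodedata.normalize("NFKC", ch) == "\u0027"
    return code_point, unicodedata.name(ch), folds_to_apostrophe


for ch in CANDIDATES:
    print(describe(ch))
```

Unlike the subscript and superscript digits, the prime and the right single quotation mark are not folded to an apostrophe by NFKC, so any equivalence between them would need an explicit rule.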

Status	DISCUSSION

Original Enquiry

We came across the fact that subscript and superscript seem to be handled the following way (SNOMED Translation Guideline page 29):

Depending on the tools used to assist the translation, tags can also be an issue, since they might be used to indicate for example superscript or subscript: alpha^+^ thalassemia or beta^+^ thalassemia, normal Hb A>2<, type 2.

So superscript is indicated by placing the character between two “^^” signs and subscript is indicated by placing the character between “>” and “<”.

Subscript Example:

Superscript Example:

However we also found that this is not always the case, e.g.:

I know there is a quality initiative ongoing, so not all the concepts follow the most current guidelines yet. We just want to make sure to give the most current/accurate guidance for the German Translation Guide.

In that regard should we recommend to write all superscripts and subscripts as outlined above (^^ and ><) or are there exceptions from this rule?
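For reference, the markup convention described in this enquiry (carets around superscripts, reversed angle brackets around subscripts) can be mapped mechanically onto Unicode characters for digits and plus/minus signs. The following is a sketch under that assumption only; `to_unicode` is a hypothetical helper, and it deliberately leaves letters untouched because Unicode does not provide a complete set of superscript and subscript letters.

```python
import re

PLAIN = "0123456789+-"
# Superscript 1, 2 and 3 live in Latin-1 (U+00B9, U+00B2, U+00B3); the rest
# are in the Superscripts and Subscripts block.
SUPERSCRIPT = str.maketrans(
    PLAIN, "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079\u207a\u207b"
)
SUBSCRIPT = str.maketrans(
    PLAIN, "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089\u208a\u208b"
)


def to_unicode(term: str) -> str:
    """Convert ^x^ to superscript and >x< to subscript for digits and +/-."""
    term = re.sub(r"\^([0-9+\-]+)\^", lambda m: m.group(1).translate(SUPERSCRIPT), term)
    term = re.sub(r">([0-9+\-]+)<", lambda m: m.group(1).translate(SUBSCRIPT), term)
    return term


print(to_unicode("alpha^+^ thalassemia"))
print(to_unicode("normal Hb A>2<, type 2"))
```

Terms that use angle brackets in their ordinary greater-than or less-than sense could be caught by the second pattern, so any real conversion would need manual review.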

Super and Subscript