RF2 to RF1 Conversion Process - Technical Notes
Background
As per the 2015 Deprecation Notice, IHTSDO will cease to provide twice yearly releases of RF1 files as of 2017, and will instead provide a conversion tool which can be used to produce RF1 for all future releases.
For the July 2016 International Edition, we are running this tool in-house and providing the files in the usual packaged format. Feedback would therefore be appreciated to ensure the new conversion tool is fit for purpose.
The main requirements for the conversion tool were:
Ability to run on an end user's computer
Compatible with both Windows and Macintosh Operating Systems
Ability to produce RF1 based purely on an RF2 zipped archive
Ability to remain as consistent as possible with previously published RF1 releases
In the case of Laterality Qualifiers, this last requirement did not prove entirely possible because the information required is not present in the RF2 files. A laterality reference file that is specific to the release must therefore be provided if these laterality qualifiers are to be generated. The conversion tool asks the user if Laterality Qualifiers are required, and if so prompts for the laterality reference file as an input to the process.
Date | 20160602 |
Version | 1.0 |
Status | PRODUCTION |
Availability
The RF2 to RF1 Conversion tool is available for use under an Open Source Apache 2.0 licence here: https://github.com/IHTSDO/rf2-to-rf1-conversion
Differences from previous RF1 Releases
Optional Qualifying Relationships are now restricted to Laterality Qualifiers
The RF1 Deprecation document stated that Qualifying Relationships are "not provided in RF2 and that there are issues with completeness and quality of the current optional qualifiers in RF1." Because of these issues, plus the ongoing maintenance burden these RF1 specific relationships represent, it was decided to restrict their generation to the ones specifying Laterality. A fuller MRCM based solution is still intended at some point in the future, which would be presented first in RF2 and could then be reliably and programmatically translated into RF1.
Relationship Ids are now Blank
It was decided that Relationship Ids should be blank in the new process. This is due both to the Relationship Ids not being "reliably persistent", and to the fact that releasing a conversion tool capable of generating new qualifying relationships would cause SCTIDs to be consumed by every user who runs the process. The primary key in this component is therefore the Source, Type and Target of the relationship (arguably plus the group number).
Due to issues reported by a Terminology Centre, an unsupported option has been added to include Relationship Ids. Because Laterality Qualifiers don't exist in RF2, which therefore cannot be used as a source of identifiers for them, a file of 10,000 Partition 02 identifiers is included in the conversion tool. To use this unsupported option, provide the previous RF1 archive using parameter -p.
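As an illustration of matching rows when Relationship Ids are blank, the following minimal Python sketch builds the composite key described above from one RF1 relationship row. It assumes the standard tab-separated RF1 relationship column order (RelationshipId, ConceptId1, RelationshipType, ConceptId2, CharacteristicType, Refinability, RelationshipGroup). The names Rf1RelationshipKey and relationship_key are hypothetical and are not part of the conversion tool.

```python
from typing import NamedTuple


class Rf1RelationshipKey(NamedTuple):
    """Composite key: source, type, target, plus the group number."""
    source: str
    rel_type: str
    target: str
    group: str


def relationship_key(row: str) -> Rf1RelationshipKey:
    """Build the composite key from one tab-separated RF1 relationship row."""
    # Strip only the line ending so a blank first column (the Id) is preserved.
    cols = row.rstrip("\r\n").split("\t")
    # cols[0] is the RelationshipId, which is blank in the new process.
    return Rf1RelationshipKey(
        source=cols[1], rel_type=cols[2], target=cols[3], group=cols[6]
    )


# Hypothetical row with a blank RelationshipId and characteristic type 1.
example_row = "\t714983005\t272741003\t182353008\t1\t0\t0\n"
example_key = relationship_key(example_row)
```

Keys built this way can be collected into a set for each release so that two relationship files can be compared without relying on the Id column.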
Subset Ids and Version are now Deterministic
The RF2 to RF1 Conversion tool now contains a file of 1000 Partition 03 identifiers which are used to provide subset ids. If the previous RF1 archive is supplied then the next larger SCTID will be used. If a previous RF1 archive is not supplied, then an SCTID is selected from the file using an index based on the release date. This same date based index is also used to increment the subset version if the previous RF1 archive is not supplied, or the version is just incremented by 1 if it is.
Additionally RealmId and ContextId are no longer maintained and have been set to 0.
RF1 Description now Status 7 if REFERS_TO association exists
In the January 2016 RF1 release, descriptions were given the inactive status 7 (Inappropriate) only where there was a corresponding currently active RF2 Inactivation indicator of 900000000000494007 | Inappropriate component (foundation metadata concept) |. If no inactivation reason could be found, the description would receive status 1 - "withdrawn without a specified reason". However, for the 20160430 Spanish RF1 release, the view was taken that, additionally, the presence of an active historical association REFERS_TO entry should cause the relevant inactive description to show status 7 even if the inactivation indicator itself no longer applied. This same approach has been applied to the new RF2 to RF1 conversion process and is reflected in the resulting description file.
RF1 Ref: www.snomed.org/rf1?t=trg_app_table_struct_descriptions_table_data_fields_descriptionstatus
RF2 Ref: www.snomed.org/tig?t=trg2rfs_spec_attrval_eg
RF2 Ref: www.snomed.org/tig?t=trg2rfs_spec_assoc_metadata
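The status rule described in this section can be sketched in Python as follows. This is an illustration only, not the conversion tool's code: the function name and parameters are hypothetical, and it covers just the two outcomes discussed here (status 7 and status 1). Mapping of other inactivation indicators is deliberately left out of scope.

```python
from typing import Optional

# Inactivation indicator value quoted in the text above.
INAPPROPRIATE = "900000000000494007"


def inactive_description_status(
    active_indicator_id: Optional[str], has_active_refers_to: bool
) -> int:
    """Return the RF1 status for an inactive description (partial sketch).

    active_indicator_id: the currently active RF2 inactivation indicator
        value for the description, or None if there is none.
    has_active_refers_to: True if an active REFERS_TO historical
        association exists for the description.
    """
    if active_indicator_id == INAPPROPRIATE:
        return 7  # Inappropriate
    if active_indicator_id is None:
        # No active indicator: REFERS_TO alone is enough for status 7,
        # otherwise "withdrawn without a specified reason".
        return 7 if has_active_refers_to else 1
    # Assumption: other indicators map to other statuses, not covered here.
    raise ValueError(f"indicator {active_indicator_id} is outside this sketch")
```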
File types now removed from the RF1 package
The following files will be removed from the RF1 package, as they are incompatible with the new algorithmic RF1 conversion process, which (as detailed above) needs to be self-contained going forward:
res1_DualKeyIndex_Concepts-en-US_INT_[date].txt
res1_DualKeyIndex_Descriptions-en-US_INT_[date].txt
res1_WordKeyIndex_Concepts-en-US_INT_[date].txt
res1_WordKeyIndex_Descriptions-en-US_INT_[date].txt
res1_Canonical_Core_INT_[date].txt
der1_CrossMaps_ICDO_INT_[date].txt
der1_CrossMapSets_ICDO_INT_[date].txt
der1_CrossMapTargets_ICDO_INT_[date].txt
Test coverage
A primary requirement for the RF1 Conversion process is that it be as consistent as possible with the previously published RF1 Releases. For this reason (and to avoid having the same assumptions appear in the tests as in the conversion program itself) the primary method of testing has been generation and comparison against a previous known release, in this case 20160131.
IHTSDO had previously developed a command line script for comparing two release archives and listing the differences between them. This script can be found in our GitHub repository (https://github.com/IHTSDO/snomed-release-service/tree/develop/compare-packages). Because this script is primarily for our own internal use, it is currently only expected to be run on a Mac. Additionally, to make the comparison faster, the GNU utility "Parallel" should be installed.
Because a number of expected differences exist between the new RF2->RF1 conversion process and previous releases, the comparison script makes a number of changes to the files in order to 'whitelist' these differences, which are detailed here:
File | Change Made for Comparison |
|---|---|
sct1_Relationships_Core_INT_<date>.txt | Qualifying Relationships (i.e. with characteristic type 1) are stripped out, with the exception of Laterality qualifiers. |
sct1_Relationships_Core_INT_<date>.txt | Column 6 - Refinability - is stripped out, as it is out of scope. |
sct1_Relationships_Core_INT_<date>.txt | Column 1 - RelationshipId - is stripped out, as it has been set to blank for the new process. |
sct1_ComponentHistory_Core_INT_<date>.txt | Reason column is converted to upper case. Semicolons are converted to commas. Commas are normalised to have a single space after them. |
sct1_ComponentHistory_Core_INT_<date>.txt | "LANGUAGECODE CHANGE, DESCRIPTIONTYPE CHANGE" is normalised to "DESCRIPTIONTYPE CHANGE, LANGUAGECODE CHANGE" |
sct1_ComponentHistory_Core_INT_<date>.txt | "INITIALCAPITALSTATUS CHANGE, DESCRIPTIONTYPE CHANGE" is normalised to "DESCRIPTIONTYPE CHANGE, INITIALCAPITALSTATUS CHANGE" |
der1_CrossMapTargets_ICDO_INT_<date>.txt | ICD Maps are provided in RF2 and are out of scope for the RF2 -> RF1 conversion process. |
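The ComponentHistory normalisation rules in the table above can be illustrated with a short Python sketch. The actual comparison is done by the shell script in the repository linked earlier; this is not that script, and the names normalise_reason and CANONICAL_ORDER are hypothetical.

```python
import re

# The two reorderings listed in the table, keyed by the unnormalised form.
CANONICAL_ORDER = {
    "LANGUAGECODE CHANGE, DESCRIPTIONTYPE CHANGE":
        "DESCRIPTIONTYPE CHANGE, LANGUAGECODE CHANGE",
    "INITIALCAPITALSTATUS CHANGE, DESCRIPTIONTYPE CHANGE":
        "DESCRIPTIONTYPE CHANGE, INITIALCAPITALSTATUS CHANGE",
}


def normalise_reason(reason: str) -> str:
    """Apply the Reason column normalisation used for comparison."""
    text = reason.upper()              # convert to upper case
    text = text.replace(";", ",")      # semicolons become commas
    text = re.sub(r",\s*", ", ", text)  # exactly one space after each comma
    text = text.strip()
    # Reorder the two known pairs so both releases use the same order.
    return CANONICAL_ORDER.get(text, text)
```

For example, normalise_reason("LanguageCode change;DescriptionType change") yields "DESCRIPTIONTYPE CHANGE, LANGUAGECODE CHANGE", so the same change compares as equal regardless of how each release happened to write it.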
With those expected differences filtered out, the remaining variances between the published 20160131 RF1 release and the same release obtained using the new conversion process are detailed here: https://docs.google.com/spreadsheets/d/1_wIAXAEJM9_8Gyjix9H9mV0PnaSgEdAlul4xW3l06E8/edit?usp=sharing
Testing Laterality Qualifiers
The production of laterality qualifiers is dependent on a laterality reference text file which is produced specifically for each release. The changes in this file since 20160131 were compared against the changes in the RF2 Concept files to give a list of concepts that we might expect to gain or lose a laterality qualifying relationship. In a comparison of laterality qualifiers in 20160131 to those in 20160731, we lost 1 laterality qualifier and gained 2. The one lost was caused by a change to the laterality reference file entry for 10013000, which no longer had the "Y" indicator required. The following list cross-references the changes in the reference file against the 20160731 Concept Snapshot, with the result of the investigation shown for each one:
181694005 20140131 0 - inactive so no qualifier expected
182075002 20150731 0 - inactive so no qualifier expected
609390000 20130731 1 - showed up as a reference file change because the flag was previously lower case y and was corrected to Y. Qualifying relationship remains present.
609391001 20130731 1 - ditto
609392008 20130731 1 - ditto
609393003 20130731 1 - ditto
715648002 20160731 1 - new concept but already has a defining Laterality attribute, so no need to add duplicate qualifying one. No qualifier expected.
715649005 20160731 1 - ditto
715650005 20160731 1 - ditto
715651009 20160731 1 - ditto
717798006 20160731 1 - this new concept has a new laterality qualifier as expected.
Additionally, 714983005 turned up with a laterality qualifier which was not new to the reference file. On investigation, it turns out that this concept had its "Laterality - Side" defining attribute inactivated in the 20160731 release and so was eligible to receive it as a qualifying attribute instead:
6489664028 20160731 0 900000000000207008 714983005 182353008 0 272741003 900000000000011006 900000000000451002
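To make the row above easier to read, the following minimal Python sketch labels its fields using the standard RF2 relationship file column order. The class name Rf2Relationship is hypothetical, and the row is split on whitespace as shown here (the RF2 files themselves are tab-separated).

```python
from typing import NamedTuple


class Rf2Relationship(NamedTuple):
    """One row of an RF2 relationship file, in column order."""
    id: str
    effective_time: str
    active: str
    module_id: str
    source_id: str
    destination_id: str
    relationship_group: str
    type_id: str
    characteristic_type_id: str
    modifier_id: str


ROW = (
    "6489664028 20160731 0 900000000000207008 714983005 182353008 "
    "0 272741003 900000000000011006 900000000000451002"
)

# Unpacking fails loudly if the row does not have exactly ten fields.
rel = Rf2Relationship(*ROW.split())

# The defining relationship on concept 714983005 was inactivated
# (active == "0") in the 20160731 release.
was_inactivated = rel.active == "0" and rel.effective_time == "20160731"
```

Read this way, the row shows source concept 714983005 with type 272741003 and destination 182353008, inactive as of 20160731, which is consistent with the "Laterality - Side" attribute described above.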