This section of the website explains how to write a mapping. It will also be useful for reference if you need to modify an existing mapping file.
Identifying your MultiTerm fields
If you have your termbase open in MultiTerm, view the termbase definition (Catalog view -> Definition, or Termbase menu -> Modify Termbase Definition). The entry structure lists all the descriptive fields in your termbase. Copy them out to three working lists of your own: one for the entry/concept level, one for the index/language level, and one for the term level. Be sure to copy them exactly, including capitalization.
If you only have access to the exported XML, you will find it full of <descrip> elements, each with a type attribute that names a field.
Searching for "descrip type" will help you find all the names of the fields, or you can enlist your local scripting expert to automate the process further. Running the XML through a pretty-printer will make it easier to determine which list to put the field on: a <descrip> indented three times is at concept level. If it is indented four times it is either at language level or subordinate to another data category at concept level; likewise five indents is either at term level or subordinate to a four-indents data category. If you miss any fields, we'll catch them later.
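The search for field names can be automated with a short script. The following is a minimal sketch in Python, not part of the converter. It assumes the export parses as well-formed XML and that every field appears as a <descrip> element with a type attribute, as described above. The function name collect_fields and the sample element names other than descrip (mtf, conceptGrp, languageGrp, termGrp, descripGrp) are assumptions for illustration; check them against your own export. Depth is counted from the root element, mirroring the indentation rule above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature export; the element names other than <descrip>
# are assumptions and may differ in your file.
SAMPLE = """<mtf>
  <conceptGrp>
    <descripGrp><descrip type="Subject">Finance</descrip></descripGrp>
    <languageGrp>
      <descripGrp><descrip type="Definition">A sum of money.</descrip></descripGrp>
      <termGrp>
        <term>payment</term>
        <descripGrp><descrip type="Status">approved</descrip></descripGrp>
      </termGrp>
    </languageGrp>
  </conceptGrp>
</mtf>"""


def collect_fields(xml_text):
    """Return {field name: set of nesting depths} for every <descrip type="...">."""
    depths = defaultdict(set)

    def walk(elem, depth):
        # Record the depth of each field element, then descend into children.
        if elem.tag == "descrip" and "type" in elem.attrib:
            depths[elem.attrib["type"]].add(depth)
        for child in elem:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 0)
    return dict(depths)


fields = collect_fields(SAMPLE)
for name in sorted(fields):
    print(name, sorted(fields[name]))
```

For a real file, ET.parse(path).getroot() would take the place of ET.fromstring. A depth of four or five is ambiguous, as noted above, so the output is a starting point for the three working lists rather than a final answer.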
If any fields take their values from a picklist, you will need a copy of the list. In MultiTerm, these are listed with the descriptive fields. If you have only the XML, again we recommend that you enlist a local scripting expert to extract them. As before, capitalization matters, and as before, we'll later catch any values that you missed.
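Candidate picklists can be pulled from the XML in a similar way. The sketch below is one possible approach, not the converter's own method: it gathers the distinct values of each field and keeps only fields with few distinct values, on the assumption that picklist fields repeat a small set of values. The function name collect_values, the threshold max_distinct, and the sample element names other than descrip are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment; only <descrip type="..."> is taken from the text above.
SAMPLE = """<mtf>
  <conceptGrp>
    <descripGrp><descrip type="Status">approved</descrip></descripGrp>
    <descripGrp><descrip type="Status">Approved</descrip></descripGrp>
    <descripGrp><descrip type="Status">rejected</descrip></descripGrp>
    <descripGrp><descrip type="Status">approved</descrip></descripGrp>
  </conceptGrp>
</mtf>"""


def collect_values(xml_text, max_distinct=20):
    """Return {field name: sorted distinct values} for fields with few distinct values."""
    values = defaultdict(set)
    for elem in ET.fromstring(xml_text).iter("descrip"):
        name = elem.get("type")
        if name is not None:
            # itertext() also gathers text inside any nested markup.
            values[name].add("".join(elem.itertext()).strip())
    # Keep only fields whose value set is small enough to look like a picklist.
    return {
        name: sorted(vals)
        for name, vals in values.items()
        if len(vals) <= max_distinct
    }


picklists = collect_values(SAMPLE)
print(picklists)
```

The comparison is case-sensitive, so "approved" and "Approved" are reported as two values, which makes inconsistent capitalization visible. A free-text field that happens to have few distinct values would also pass the threshold, so the result still needs a human check.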
Finally, you need to know the meaning of every field and picklist value. If you have questions about any of them, study the data and/or ask the person who created or maintains the termbase. You will need to be clear on their meanings in order to choose TBX data categories to correspond to them.
Choosing TBX data categories
TBX is an extensible format. Any kind of terminological data can be formalized as a TBX data category. Many data categories have already been defined and standardized, and it is best to use these whenever possible, because they are widely understood and because they clearly document the meaning of your data. We will discuss how to find the ones that correspond to your MultiTerm fields. In case no standardized data category is suitable, we will also discuss how to define your own.
The note data category is built into TBX, so it is always available for general purposes. Many other data categories are described in the TBX specification; the section and page references below refer to it.
For each of your MultiTerm fields:
- Scan Section 9.3 ('Data-categories specialized from meta data-categories through the default XCS file', pp. 17-21), which is a list of TBX data categories, sorted by function. Take note of any data-category whose name suggests it may be equivalent to your MultiTerm field. Note also its meta data-category and its levels.
- Check the description of the data category in Annex D, part 5 ('Default data-categories', pp. 61-72). The list is sorted by meta data-category. You want the description to match or include the meaning of your MultiTerm field. Don't worry if picklist values differ, so long as they express the same information. Merely make a note of how your values correspond to the standard ones.
- The levels in TBX correspond to the structure of MultiTerm, and each data category can be used only on certain levels. Verify that the TBX data category is available where you need it. MultiTerm's concept level corresponds to TBX's termEntry (terminological entry), and its index level corresponds to TBX's langSet (language section). The term level has the same name in both. If you have a term-level field that describes only part of a term, such as a morpheme or syllable, any fields subordinated to it are on the equivalent of TBX's termComponent level.
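The level check in the last step can be kept as a small lookup alongside your working lists. This is only a sketch of one way to record the notes. The names LEVELS, ALLOWED_LEVELS and is_available are hypothetical, and the two entries in ALLOWED_LEVELS are placeholders that should be replaced with the levels the specification actually lists for each data category.

```python
# MultiTerm level -> TBX level, as described in the list above.
LEVELS = {"concept": "termEntry", "index": "langSet", "term": "term"}

# Hypothetical working notes: TBX data category -> levels where it is allowed.
# These entries are placeholders; copy the real levels from Section 9.3.
ALLOWED_LEVELS = {
    "definition": {"termEntry", "langSet", "term"},
    "partOfSpeech": {"term"},
}


def is_available(data_category, multiterm_level):
    """True if the noted TBX data category may appear at the equivalent TBX level."""
    tbx_level = LEVELS[multiterm_level]
    return tbx_level in ALLOWED_LEVELS.get(data_category, set())


print(is_available("definition", "concept"))
print(is_available("partOfSpeech", "index"))
```

A result of False marks a field that falls under the mismatch case discussed below: either the data category is extended, or the field is matched to a different one.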
Ideally, your MultiTerm field will have been anticipated among the TBX default data categories. If this is the case, simply take note of it. This also suffices if the TBX data category is broader than your MultiTerm field.
You may find that your MultiTerm field corresponds to more than one TBX data category. For example, a field named 'Grammar' might contain values that pertain to both grammaticalGender and grammaticalNumber in TBX. In this case, record both, and list the values that belong to each (some may belong to both).
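The notes for such a split field might be recorded as follows. This is a sketch of working notes only, not the converter's mapping format. The MultiTerm values on the left ("masc", "fem", "sing", "pl", "fem pl") are invented for illustration, and the TBX values on the right should be checked against the picklists in the specification.

```python
# Hypothetical working notes for a MultiTerm field named 'Grammar'.
# Left: invented MultiTerm values. Right: TBX values, to be verified in the spec.
GRAMMAR_NOTES = {
    "grammaticalGender": {"masc": "masculine", "fem": "feminine", "fem pl": "feminine"},
    "grammaticalNumber": {"sing": "singular", "pl": "plural", "fem pl": "plural"},
}


def tbx_equivalents(value, notes=GRAMMAR_NOTES):
    """Return every (TBX data category, TBX value) pair recorded for a field value."""
    return [
        (category, mapping[value])
        for category, mapping in notes.items()
        if value in mapping
    ]


print(tbx_equivalents("fem pl"))
print(tbx_equivalents("masc"))
```

A value such as "fem pl" appears under both data categories, which is the "some may belong to both" case mentioned above.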
You may find that the TBX data category slightly mismatches your MultiTerm field: you need a value TBX does not provide, you have the field at a level where TBX does not allow it, etc. You must decide whether to extend the TBX data category, or match some values of your field to the unextended TBX data category and other values to a new one that you create. Either approach will require some private documentation about a data category's meaning.
The ISOcat data registry contains numerous standardized data categories beyond those provided in TBX, which may help you meet additional needs. It can be searched or browsed by application area, and is intended to be the continuing authority on data categories for the future. It also contains definitions and explanations of TBX's standard data categories, in case you are in doubt whether they match.
Failing all of these, you can define your own data category. Determine and document what it means and what values it can take, and assign it an ID of some kind. Recipients of your data will need to consult this documentation. One approach is to put the documentation online and use its URL for the ID.
TBX uses a file format called XCS to document which data categories are in use. If all of your data categories are found in the TBX spec, you can simply use the default XCS file, or you can make a subset of it that only covers the data categories you need. If you extend a data category, adopt one from ISOcat, or create one from scratch, you will need your own XCS file. The TBX spec documents the XCS format.
Do I need to restructure my MultiTerm fields?
Both MultiTerm and TBX provide a hierarchical organization for your termbase: Concept entries contain language sections, which contain term sections; descriptive or administrative data can be attached at any of these levels. Both architectures carry the hierarchy further by allowing data to be attached to other data. For example, if you provide a definition, you might need to cite the source of that definition. In MultiTerm, you would create a nested entry structure with a Source field subordinate to the Definition field, creating a parent/child relationship. TBX uses grouping elements for the same purpose.
Often, however, a termbase definition in MultiTerm does not use this structure. Instead, there will be a field with a name like Source_of_Definition, placed not underneath Definition but alongside it—a sibling, not a child.
It is possible to carry this sibling structure over into TBX, but it is not desirable. One reason is that it misrepresents the facts about the data. The source citation is really information about the definition, not about the concept itself. The parent/child structure makes this relationship explicit and usable, whereas the sibling structure leaves it implicit. Another reason is that the standardized data categories do not differentiate sources of definitions from sources of contextual examples etc., but provide one data category to represent this single concept. Of course, one could define a new data category corresponding to Source_of_Definition, but this would impede blind interchange by requiring an explanation of its meaning.
Our TBX conversion program accounts for this problem by allowing you to restructure sibling relationships into parent/child relationships. If you have any data categories that fit this description, note their names, which should be the parent, and which should be the child. Later we will incorporate this information into the mapping file.
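To make the restructuring concrete, here is a minimal sketch of the idea in Python. It illustrates the transformation only and is not how the converter is implemented. The function name nest_siblings and the representation of fields as (name, value) pairs are assumptions made for the example.

```python
def nest_siblings(fields, child_to_parent):
    """Move each listed child field underneath its parent field.

    fields: list of (name, value) pairs found at one level of an entry.
    child_to_parent: for example {"Source_of_Definition": "Definition"}.
    Returns a list of dicts with the keys "name", "value" and "children".
    """
    # First pass: keep every field that is not a designated child.
    nested = [
        {"name": name, "value": value, "children": []}
        for name, value in fields
        if name not in child_to_parent
    ]
    by_name = {item["name"]: item for item in nested}

    # Second pass: attach each designated child to its parent.
    for name, value in fields:
        if name not in child_to_parent:
            continue
        item = {"name": name, "value": value, "children": []}
        parent = by_name.get(child_to_parent[name])
        if parent is not None:
            parent["children"].append(item)
        else:
            # No parent at this level: leave the field where it was.
            nested.append(item)
    return nested


flat = [
    ("Definition", "A sum of money."),
    ("Source_of_Definition", "Glossary X"),
    ("Subject", "Finance"),
]
tree = nest_siblings(flat, {"Source_of_Definition": "Definition"})
print(tree)
```

The sketch simplifies one point: if a level holds several fields with the same parent name, every child is attached to the last of them. Real data with repeated Definition fields would need a rule for pairing each source with the right definition.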
Writing templates for TBX elements
A large part of the mapping file consists of templates for TBX elements. When processing your data, the converter program will refer to these templates to determine what kind of element to create, how to derive its content from the content of your MultiTerm fields, where to place it in the new TBX document, etc. This information can be collected under five headings, whose initials spell TEASP, and therefore we have adopted the name teasp for these templates.
You will need to write at least one teasp for each field. If one field (or multiple fields with the same name) is used on different structural levels (concept, index, term), you will need a teasp for each level. Finally, if one field corresponds to two or more TBX data categories, each TBX data category will also need its own teasp, and you may choose to write an additional teasp as that field's default (or fallback) mapping.
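These rules can be turned into a checklist of the teasps still to be written. The sketch below assumes that the rules combine, so that a field needs one teasp per level for each TBX data category it corresponds to; that reading, the function name teasps_needed, and the "(default)" and "(undecided)" markers are assumptions for illustration, not part of the converter.

```python
def teasps_needed(field_levels, field_targets, with_default=False):
    """List the (field, level, TBX data category) combinations that need a teasp.

    field_levels: {field name: list of levels where the field occurs}
    field_targets: {field name: list of TBX data categories it corresponds to}
    """
    checklist = []
    for field, levels in field_levels.items():
        # A field with no data category chosen yet still needs one teasp per level.
        targets = field_targets.get(field) or ["(undecided)"]
        for level in levels:
            for target in targets:
                checklist.append((field, level, target))
            # Optional fallback teasp for fields split across data categories.
            if with_default and len(targets) > 1:
                checklist.append((field, level, "(default)"))
    return checklist


levels = {"Definition": ["concept", "index"], "Grammar": ["term"]}
targets = {"Definition": ["definition"], "Grammar": ["grammaticalGender", "grammaticalNumber"]}
for row in teasps_needed(levels, targets, with_default=True):
    print(row)
```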
The pages below will show how to collect the information for a teasp and how to encode it in JSON format for our converter. The letters of TEASP stand for Target, Element, Attributes, Substitution, and Placement, but we will alter this order slightly for easier learning:
- Substitution of picklist values, and other transformations of your field's content.
- Element and Attributes used in TBX for your data category.
- Placement of your (transformed) data within the TBX element.
- Target of the TBX element within its containing structure.
- Assembling one teasp.
- Assembling one field's teasps.
Last updated: February 9, 2017 at 15:28 pm