
Denormalizer

Introduction

This guide describes how to use the denormalizer, a service for inverse normalization of Slovenian text.

Please contact us to obtain credentials (username & password) for testing purposes.

Inverse normalization

The text must be provided in the request body as a JSON object:

{  "content" : "devetnajsti četrti tisoč devetsto štiriindevetdeset"}

The response contains the denormalized text:

{
  "denormalizedContent": [
    { "text": "19.", "index": [0] },
    { "text": "4.", "index": [1] },
    { "text": "1994", "index": [2, 3, 4] }
  ],
  "denormalizedString": "19. 4. 1994"
}

denormalizedContent is a list of objects, each holding a denormalized form (text) and the indices of the original tokens it was built from (index). denormalizedString is the denormalized sentence; it is returned only if the input content is a string.
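The index lists make it possible to trace each denormalized item back to the words it replaced. The following is a minimal sketch, not part of the service: it reuses the sample response shown above and assumes that the indices are zero-based positions in the whitespace-split input, which matches this example. The helper name align is hypothetical.

```python
import json

# Sample response copied from the example above.
RESPONSE = json.loads("""
{
  "denormalizedContent": [
    {"text": "19.", "index": [0]},
    {"text": "4.", "index": [1]},
    {"text": "1994", "index": [2, 3, 4]}
  ],
  "denormalizedString": "19. 4. 1994"
}
""")


def align(original_tokens, response):
    """Pair each denormalized text with the original tokens it replaced.

    Assumption: "index" holds zero-based positions into original_tokens.
    """
    pairs = []
    for item in response["denormalizedContent"]:
        source = [original_tokens[i] for i in item["index"]]
        pairs.append((item["text"], source))
    return pairs


tokens = "devetnajsti četrti tisoč devetsto štiriindevetdeset".split()
aligned = align(tokens, RESPONSE)
for text, source in aligned:
    print(f"{text!r} <- {' '.join(source)!r}")
```

An alignment like this is useful when the original tokens carry extra data, such as timestamps, that should be transferred to the denormalized output.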

Types of input

Three types of input are supported.

String

The text to be denormalized can be sent as a string.

{  "content":"Danes je lep sončen dan."}

List

The text can be sent as a list of tokens, should you wish to use a different tokenizer.

{  "content":[    "Danes",    "je",    "lep",    "sončen",    "dan",    "."  ]}
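As an illustration of the list input, the sketch below builds such a request body with a custom tokenizer. The regular expression and the function names tokenize and list_payload are hypothetical; any tokenizer that produces a list of strings could be used instead.

```python
import json
import re


def tokenize(text):
    """Split text into words and single punctuation marks.

    Hypothetical stand-in for whatever tokenizer you prefer.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def list_payload(text):
    """Build a request body whose content is a list of tokens."""
    return {"content": tokenize(text)}


payload = list_payload("Danes je lep sončen dan.")
# ensure_ascii=False keeps Slovenian characters such as "č" readable.
print(json.dumps(payload, ensure_ascii=False))
```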

Dictionary

To denormalize a transcript of an audio or video file, you can also send the content as a list of dictionary objects, one per token.

The structure must be as follows:

{
  "content": [
    { "text": "Danes" },
    { "text": "je" },
    { "text": "lep" },
    { "text": "dan" }
  ]
}

Optionally, the fields startTime and endTime may be added to each token. They affect how multi-word numbers are constructed when two consecutive numbers can be split in more than one way.

{
  "content": [
    { "text": "Danes", "startTime": 0.67, "endTime": 1.23 },
    { "text": "je", "startTime": 1.24, "endTime": 1.34 },
    { "text": "lep", "startTime": 1.89, "endTime": 2.03 },
    { "text": "dan", "startTime": 2.24, "endTime": 2.78 }
  ]
}

If no start and end times are provided, numbers are constructed linearly, from left to right. For example: "dva tisoč tri tisoč" -> "2003 1000". If start and end times are provided, consecutive numbers are split where the pause between the tokens in question is longer. For example: "dva tisoč (0.03 s pause) tri (0.02 s pause) tisoč" -> "2000 3000".
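To see which pauses the timing fields imply, the gaps between consecutive tokens can be computed on the client before sending the request. The sketch below is only an inspection aid using the timed example from this section; it does not reproduce the service's parsing, whose exact decision rule is not documented here. The function name pauses is hypothetical.

```python
def pauses(content):
    """Return the silence in seconds between each pair of consecutive tokens.

    Each gap is the next token's startTime minus the current token's endTime,
    rounded to milliseconds to hide floating-point noise.
    """
    gaps = []
    for current, following in zip(content, content[1:]):
        gaps.append(round(following["startTime"] - current["endTime"], 3))
    return gaps


content = [
    {"text": "Danes", "startTime": 0.67, "endTime": 1.23},
    {"text": "je", "startTime": 1.24, "endTime": 1.34},
    {"text": "lep", "startTime": 1.89, "endTime": 2.03},
    {"text": "dan", "startTime": 2.24, "endTime": 2.78},
]

gaps = pauses(content)
for token, gap in zip(content, gaps):
    print(f"after {token['text']!r}: {gap} s")
```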

Configuration

You can choose between three preset configs (default, everyday, and technical). Each of them consists of the following 10 parameters:

With the setting punctIsIncluded (default False) you can specify whether sentence punctuation is included in the input text or not. This affects the denormalization of other categories.

With the setting includeSlash (default False, technical: True) you can choose whether numbers and units should be joined into one token with a slash where appropriate (for example: 120 skozi 80 -> 120/80; kilometrov na uro -> km/h).

With the setting includeNumbers (default True) you can choose whether you want to denormalize numbers or not.

The setting includeNumbersPartToken (default True) applies to numbers that are part of words (for example: enajstletni -> 11-letni).

With the setting includeUnits (default True, everyday: False), you can choose whether measurement units should be denormalized into their corresponding abbreviation or symbol.

The following units are included: meter, gram, liter, tona, bar, newton, kelvin, hec, joule, stopinja, Celzija, promil, odstotek, procent, evro, dolar; the following prefixes are included: piko, nano, mikro, mili, centi, deci, deka, hekto, kilo, mega, giga, tera.

The setting includeEmail (default True) applies to email addresses in the .si and .com domains.

With the setting includeTitle (default True) you can choose whether you want titles to be written as abbreviations.

The following titles are included: doktor/-ica, profesor/-ica, diplomiran/-a, gospa, gospodična, gospod, docent, specialist/-ka, primarij/-ka, magister/-ica, redni, izredni, univerzitetni.

The setting includeAbbr (default True) applies to abbreviations that are not titles.

The following abbreviations are included: oziroma, in tako dalje, in tako naprej, in podobno, tako imenovan.

With the setting includeStylistic (default True, technical: False), you can choose whether stylistic changes should apply or not. For example, if this option is set to True, numbers smaller than 11 that are not followed by a unit will be written as words, not digits.

The setting properTokenization (default True) controls how string inputs are tokenized. Set it to False if you are using special characters for annotation that may not be tokenized properly; string inputs will then be split at whitespace only.

The default config is used unless you specify otherwise. To use a different preset, add a config field to the JSON object with the name of the preset you want to use:

{  "content":"dvajseti dvanajsti dva tisoč enaindvajset",  "config": "technical"}

You can also set individual parameters if you do not want to use any of the preset configs:

{   "content":"dvajseti dvanajsti dva tisoč enaindvajset",   "config":{      "includeNumbers":"False"   }}

The values of the parameters that you do not set will be taken from the default config.
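The preset and override rules can be mirrored on the client, for example to catch misspelled parameter names before a request is sent. The sketch below is an assumption-laden illustration, not the service's implementation: the tables are transcribed from the defaults listed in this section, and the names DEFAULT, PRESETS, effective_config and build_payload are hypothetical. It uses Python booleans for local reasoning, while the example above passes "False" as a string; build_payload forwards values unchanged, so either form can be sent.

```python
# Defaults as listed in this section (client-side copy, may drift from the service).
DEFAULT = {
    "punctIsIncluded": False,
    "includeSlash": False,
    "includeNumbers": True,
    "includeNumbersPartToken": True,
    "includeUnits": True,
    "includeEmail": True,
    "includeTitle": True,
    "includeAbbr": True,
    "includeStylistic": True,
    "properTokenization": True,
}

# Only the values that differ from the default config, per the text above.
PRESETS = {
    "default": {},
    "everyday": {"includeUnits": False},
    "technical": {"includeSlash": True, "includeStylistic": False},
}


def effective_config(config="default"):
    """Return all 10 parameters for a preset name or a dict of overrides."""
    if isinstance(config, str):
        if config not in PRESETS:
            raise ValueError(f"unknown preset: {config}")
        overrides = PRESETS[config]
    else:
        unknown = set(config) - set(DEFAULT)
        if unknown:
            raise ValueError(f"unknown parameters: {sorted(unknown)}")
        overrides = config
    # Parameters that are not set fall back to the default config.
    return {**DEFAULT, **overrides}


def build_payload(content, config=None):
    """Build a request body; config is a preset name, a dict, or None."""
    payload = {"content": content}
    if config is not None:
        effective_config(config)  # validates names only, values pass through
        payload["config"] = config
    return payload


payload = build_payload("dvajseti dvanajsti dva tisoč enaindvajset", "technical")
print(payload)
print(effective_config({"includeNumbers": False}))
```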