Module parse.date_array in plugin tabular v0.5.2
Create an array of date objects from an array of strings.
This module is very simplistic at the moment; more functionality and options will be added in the future.
At its core, this module uses the standard parser from the dateutil package to parse strings into dates. As this parser can't handle complex strings, the input strings can be pre-processed in the following ways:
- 'cut' non-relevant parts of the string (using 'min_index' & 'max_index' input/config options)
- remove matching tokens from the string, and replace them with a single whitespace (using the 'remove_tokens' option)
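The two pre-processing steps above can be sketched in a few lines of plain Python. This is only an illustration, not the module's code: `preprocess` is a hypothetical helper name, and the sketch assumes non-negative indices that refer to positions in the original string, with `max_index` not smaller than `min_index`.

```python
from typing import Iterable, Optional


def preprocess(
    text: str,
    min_index: Optional[int] = None,
    max_index: Optional[int] = None,
    remove_tokens: Optional[Iterable[str]] = None,
) -> str:
    """Cut the string to [min_index, max_index) and blank out tokens."""
    # A missing (or zero) bound means "do not cut on this side".
    start = min_index or 0
    end = max_index if max_index else None
    text = text[start:end]

    # Each occurrence of a token is replaced with a single space.
    for token in remove_tokens or ():
        text = text.replace(token, " ")
    return text


# Keep only the characters at positions 11..20 of the input.
cleaned = preprocess("Published: 2021-03-04 [draft]", min_index=11, max_index=21)

# Turn underscores into spaces before handing the string to a parser.
joined = preprocess("2021_03_04", remove_tokens=["_"])
```

The cleaned string is what would then be handed to the date parser.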
By default, if an input string can't be parsed, this module will raise an exception. This can be prevented by setting this module's 'force_non_null' config option or input to 'False', in which case unparsable strings will appear as 'NULL' values in the resulting array.
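The difference between the strict and the lenient mode can be illustrated with a small sketch. `parse_all` is a hypothetical helper, not part of the plugin, and the parser is passed in as a parameter. The module itself uses the dateutil parser; the demo below substitutes `datetime.fromisoformat` from the standard library purely to stay dependency-free, so it accepts far fewer formats.

```python
from datetime import datetime
from typing import Callable, Iterable, List, Optional


def parse_all(
    texts: Iterable[str],
    parse: Callable[[str], datetime],
    force_non_null: bool = True,
) -> List[Optional[datetime]]:
    """Parse every string; raise or emit None for unparsable items."""
    results: List[Optional[datetime]] = []
    for text in texts:
        try:
            results.append(parse(text))
        except Exception as exc:
            if force_non_null:
                # Strict mode: the first unparsable string aborts the run.
                raise ValueError(f"Can't parse date from string: {text}") from exc
            # Lenient mode: keep the position, store a null value instead.
            results.append(None)
    return results


# Stand-in parser (assumption): ISO-8601 only, unlike dateutil.
dates = parse_all(
    ["2021-03-04", "not a date"],
    datetime.fromisoformat,
    force_non_null=False,
)
```

In lenient mode the output keeps the same length as the input, so each result can still be matched to its source string by position.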
Author(s) | Markus Binsteiner (markus@frkl.io) |
Tags | tabular |
Python class | kiara_plugin.tabular.modules.array.ExtractDateModule |
Module configuration options
Configuration class: kiara_plugin.tabular.modules.array.ExtractDateConfig
Name | Description | Type | Required? | Default |
---|---|---|---|---|
constants | Value constants for this module. | object | false | null |
defaults | Value defaults for this module. | object | false | null |
input_fields | If not empty, only add the fields specified in here to the module inputs schema. | array | false | null |
max_index | The maximum index until which to parse the string(s). | anyOf: [{'type': 'integer'}, {'type': 'null'}] | false | null |
min_index | The minimum index from where to start parsing the string(s). | anyOf: [{'type': 'integer'}, {'type': 'null'}] | false | null |
remove_tokens | A list of tokens/characters to replace with a single white-space before parsing the input. | array | false | null |
add_inputs | If set to 'True', parse options will be available as inputs. | boolean | false | true |
force_non_null | If set to 'True', raise an error if any of the strings in the array can't be parsed. | boolean | false | true |
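
As a rough illustration of how the parse-related options combine, the sketch below merges user overrides onto the defaults listed in the table. `DEFAULTS` and `resolve_config` are hypothetical names used only for this example; the real validation is done by the configuration class, and how the resulting mapping is handed to kiara is not shown here.

```python
from typing import Any, Dict

# Defaults copied from the table above (parse-related options only).
DEFAULTS: Dict[str, Any] = {
    "min_index": None,
    "max_index": None,
    "remove_tokens": None,
    "add_inputs": True,
    "force_non_null": True,
}


def resolve_config(**overrides: Any) -> Dict[str, Any]:
    """Return the defaults with the given overrides applied."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Reject misspelled option names early.
        raise KeyError(f"Unknown option(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


# Cut each string to positions 11..20, blank out underscores,
# and emit null values instead of raising on unparsable strings.
module_config = resolve_config(
    min_index=11,
    max_index=21,
    remove_tokens=["_"],
    force_non_null=False,
)
```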
Module source code
    class ExtractDateModule(AutoInputsKiaraModule):
        """Create an array of date objects from an array of strings.

        This module is very simplistic at the moment, more functionality and options will be added in the future.

        At its core, this module uses the standard parser from the [dateutil](https://github.com/dateutil/dateutil) package to parse strings into dates. As this parser can't handle complex strings, the input strings can be pre-processed in the following ways:

        - 'cut' non-relevant parts of the string (using 'min_index' & 'max_index' input/config options)
        - remove matching tokens from the string, and replace them with a single whitespace (using the 'remove_tokens' option)

        By default, if an input string can't be parsed this module will raise an exception. This can be prevented by setting this modules 'force_non_null' config option or input to 'False', in which case un-parsable strings will appear as 'NULL' value in the resulting array.
        """

        _module_type_name = "parse.date_array"
        _config_cls = ExtractDateConfig

        def create_inputs_schema(
            self,
        ) -> ValueMapSchema:

            inputs = {"array": {"type": "array", "doc": "The input array."}}
            return inputs

        def create_outputs_schema(
            self,
        ) -> ValueMapSchema:

            return {
                "date_array": {
                    "type": "array",
                    "doc": "The resulting array with items of a date data type.",
                }
            }

        def process(self, inputs: ValueMap, outputs: ValueMap, job_log: JobLog):

            import polars as pl
            import pyarrow as pa
            from dateutil import parser

            force_non_null: bool = self.get_data_for_field(
                field_name="force_non_null", inputs=inputs
            )
            min_pos: Union[None, int] = self.get_data_for_field(
                field_name="min_index", inputs=inputs
            )
            if min_pos is None:
                min_pos = 0
            max_pos: Union[None, int] = self.get_data_for_field(
                field_name="max_index", inputs=inputs
            )
            remove_tokens: Iterable[str] = self.get_data_for_field(
                field_name="remove_tokens", inputs=inputs
            )

            def parse_date(_text: str):

                text = _text
                if min_pos:
                    try:
                        text = text[min_pos:]  # type: ignore
                    except Exception:
                        return None
                if max_pos:
                    try:
                        text = text[0 : max_pos - min_pos]  # type: ignore
                    except Exception:
                        pass

                if remove_tokens:
                    for t in remove_tokens:
                        text = text.replace(t, " ")

                try:
                    d_obj = parser.parse(text, fuzzy=True)
                except Exception as e:
                    if force_non_null:
                        raise KiaraProcessingException(e)
                    return None

                if d_obj is None:
                    if force_non_null:
                        raise KiaraProcessingException(
                            f"Can't parse date from string: {text}"
                        )
                    return None

                return d_obj

            value = inputs.get_value_obj("array")
            array: KiaraArray = value.data

            series = pl.Series(name="tokens", values=array.arrow_array)
            job_log.add_log(f"start parsing date for {len(array)} items")
            result = series.apply(parse_date)
            job_log.add_log(f"finished parsing date for {len(array)} items")
            result_array = result.to_arrow()

            # TODO: remove this cast once the array data type can handle non-chunked arrays
            chunked = pa.chunked_array(result_array)
            outputs.set_values(date_array=chunked)