# Mapping Specification
The mapping specification is at the core of data pipeliner. Here we will discuss how to create your own mapping specification. To aid the demonstration there is an example mapping specification you can download and follow along with. There is also a mapping specification template for when you're ready to create your own.
- **Example Specification**: This mapping spec contains an example raw-to-curated data mapping for the Adventure Works Database, a sample database provided by Microsoft.
- **Template Specification**: This is a template mapping spec, ready for mappings to be defined.
## Overview Of Sheets
The mapping spec is made up of several sheets. The first three have fixed names: `_source_control`, `_source_tables` and `_target_tables`. The `_source_control` sheet logs changes to the specification, while `_source_tables` and `_target_tables` define our input and output tables in one place to save repetition across the mappings. The remaining sheets are the mappings, which define the source-to-target logic. There is no upper limit on the number of mapping sheets that can be added; however, if performance issues are found it may be necessary to split the mappings over several files.
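Because the spec is a workbook, the fixed-name sheets and the mapping sheets can be separated mechanically. Below is a minimal sketch, assuming the spec is loaded with pandas; the file name is a placeholder and data pipeliner's own loading code may differ.

```python
import pandas as pd

# Load every sheet at once; sheet_name=None returns a dict of
# sheet name -> DataFrame. "mapping_spec.xlsx" is a placeholder path.
sheets = pd.read_excel("mapping_spec.xlsx", sheet_name=None)

# The three fixed-name sheets.
source_control = sheets["_source_control"]
source_tables = sheets["_source_tables"]
target_tables = sheets["_target_tables"]

# Every other sheet is a mapping, named after its target table.
mappings = {name: df for name, df in sheets.items() if not name.startswith("_")}
```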
## Source Control

- Sheet Name: `_source_control`
- Sheet Fields:

| Field | Description |
| --- | --- |
| Author | Name of the author of the change. |
| Version | The version number; we recommend using semantic versioning. |
| Change Description | A description of the change made since the last version. |
The source control sheet is there so analysts can track the changes they make to the mapping specification over time. Example changes include updating a column data type or adding a new table.
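As a rough illustration of how the Version field might be sanity-checked, the sketch below validates that entries look like semantic versions and increase over time. The `check_versions` helper is hypothetical, not part of data pipeliner.

```python
import re

SEMVER = re.compile(r"^\d+\.\d+\.\d+$")  # basic major.minor.patch pattern

def check_versions(versions):
    """Raise if any entry is not basic semver or versions do not increase."""
    parsed = []
    for v in versions:
        if not SEMVER.match(v):
            raise ValueError(f"not a semantic version: {v}")
        parsed.append(tuple(int(part) for part in v.split(".")))
    if parsed != sorted(parsed):
        raise ValueError("versions are not in increasing order")

check_versions(["0.1.0", "0.2.0", "1.0.0"])  # passes silently
```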
## Source Tables

- Sheet Name: `_source_tables`
- Sheet Fields:

| Field | Description |
| --- | --- |
| database_name | The name of the database in the source system; for example, in AWS Glue it will be the Glue Database name. |
| table_name | The name of the table in the source system; for example, in AWS Glue it will be the Glue Table name. |
| bucket_name | The name of the S3 bucket where the data is held. |
| prefix | The path part that comes before the table in the S3 hierarchy. For example, `internal/db_name/table_name/<data files>` would have prefix `internal/db_name`. |
| alias | An alias for the table for use in SQL queries. This is to ensure consistent aliasing of the table across mappings. |
| file_type | The type of file; currently supported values are `parquet`, `csv` and `json`. |
| file_options | Spark read options in JSON form to apply when reading the data, for example `{"compression": "snappy"}`. |
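To make the fields concrete, here is a sketch of how one `_source_tables` row could drive a Spark read. The row values and view registration are illustrative assumptions, not data pipeliner's actual internals.

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A hypothetical row from _source_tables.
row = {
    "database_name": "adventure_works_raw",   # Glue Database name
    "table_name": "sales_order_header",       # Glue Table name
    "bucket_name": "my-raw-bucket",
    "prefix": "internal/adventure_works_raw",
    "alias": "soh",
    "file_type": "parquet",
    "file_options": '{"compression": "snappy"}',
}

# bucket_name, prefix and table_name together locate the data files in S3.
path = f"s3://{row['bucket_name']}/{row['prefix']}/{row['table_name']}/"

df = (
    spark.read.format(row["file_type"])           # parquet, csv or json
    .options(**json.loads(row["file_options"]))   # file_options is a JSON string
    .load(path)
)

# Registering the alias keeps SQL in the mapping sheets consistent.
df.createOrReplaceTempView(row["alias"])
```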
## Target Tables

- Sheet Name: `_target_tables`
- Sheet Fields:

| Field | Description |
| --- | --- |
| database_name | The name of the database in the target system; for example, in AWS Glue it will be the Glue Database name. |
| table_name | The name of the table in the target system; for example, in AWS Glue it will be the Glue Table name. |
| bucket_name | The name of the S3 bucket where the data is held. |
| prefix | The path part that comes before the table in the S3 hierarchy. For example, `internal/db_name/table_name/<data files>` would have prefix `internal/db_name`. |
| alias | An alias for the table for use in SQL queries. This is to ensure consistent aliasing of the table across mappings. |
| file_type | The type of file; currently supported values are `parquet`, `csv` and `json`. |
| file_options | Spark write options in JSON form to apply when writing the data, for example `{"compression": "snappy"}`. |
| is_active | A flag to determine whether pipeliner should produce outputs for this table: `y` to process, `n` to skip. |
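As a counterpart to the read sketch above, a write driven by one `_target_tables` row might look like the following. The row values and the stand-in DataFrame are assumptions for illustration only.

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the transformed output of a mapping sheet.
df = spark.createDataFrame([(1, "Alice")], ["customer_id", "customer_name"])

# A hypothetical row from _target_tables.
row = {
    "database_name": "adventure_works_curated",
    "table_name": "dim_customer",
    "bucket_name": "my-curated-bucket",
    "prefix": "internal/adventure_works_curated",
    "file_type": "parquet",
    "file_options": '{"compression": "snappy"}',
    "is_active": "y",
}

if row["is_active"] == "y":  # inactive tables are skipped entirely
    path = f"s3://{row['bucket_name']}/{row['prefix']}/{row['table_name']}/"
    (
        df.write.format(row["file_type"])
        .options(**json.loads(row["file_options"]))
        .mode("overwrite")
        .save(path)
    )
```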
## Mappings

- Sheet Name: This should be the name of the target table.

Warning

Excel has a maximum of 31 characters for a sheet name, so if your table name is longer than this you should truncate it from the start of the table name.
- Sheet Fields:

| Field | Description |
| --- | --- |
| source_database_name | The name of the database in the source system, for example in AWS Glue it will be the Glue Database name. This should match the entry in the `_source_tables` sheet. |
| source_table_name | The name of the table in the source system, for example in AWS Glue it will be the Glue Table name. This should match the entry in the `_source_tables` sheet. |
| source_column_name | The name of the column in the source table which will map to the target column. Spark SQL expressions such as `max()` are supported here. |
| source_data_type | The data type of the source column; this should be the Spark-compatible data type, lower case and without the `Type` suffix, for example `string` not `StringType`. |
| source_length | The length of the source column; this is not enforced but is useful metadata. For example, if the source comes from a relational database and is of `char(25)` type we can specify `25`. |
| source_precision | The precision of the source column, if the source is decimal type. |
| source_scale | The scale of the source column, if the source is decimal type. |
| source_column_comment | A description of the source column field; this could explain why the field was chosen. |
| target_database_name | The name of the database in the target system, for example in AWS Glue it will be the Glue Database name. This should match the entry in the `_target_tables` sheet. |
| target_table_name | The name of the table in the target system, for example in AWS Glue it will be the Glue Table name. This should match the entry in the `_target_tables` sheet, and there should only be one `target_table_name` value per sheet. |
| target_column_name | The name of the column in the target table. Each target column name should be unique within the sheet. |
| target_data_type | The data type of the target column; this should be the Spark-compatible data type, lower case and without the `Type` suffix, for example `string` not `StringType`. |
| target_length | The length of the target column; this is not enforced but is useful metadata. For example, if the target is of `char(25)` type we can specify `25`. |
| target_precision | The precision of the target column, if the target is decimal type. |
| target_scale | The scale of the target column, if the target is decimal type. |
| target_column_comment | A description of the target column field. This could include business definition, allowed values, example values and constraint information (constraints are not enforced). It will be added to applicable target systems such as AWS Glue. |
| target_column_tags | Tags to apply to the target column and/or table. Some tags to add are: `partition_order`, a numeric value giving the column's position in the partition order if it is a partition column, and `contains_pii`, a boolean value indicating whether the column contains PII. |
| query | A query to apply to the table. This should be used when a new table is introduced in the `source_table_name` field; it can contain the join logic. The final query cell can additionally contain filter logic. |
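To show how the mapping fields fit together, the sketch below turns two hypothetical mapping rows into a Spark SQL statement, casting each source expression to its target type and aliasing it to the target column name. The rows, the `build_select` helper and the generated SQL are illustrative assumptions, not data pipeliner's actual code.

```python
# Two hypothetical mapping rows for a dim_customer target table.
rows = [
    {"source_column_name": "c.customer_id", "target_column_name": "customer_id",
     "target_data_type": "int", "query": "raw.customer c"},
    {"source_column_name": "upper(c.last_name)", "target_column_name": "last_name",
     "target_data_type": "string", "query": ""},
]

def build_select(rows):
    """Cast each source expression to its target type and alias it to the
    target column name; the first query cell supplies the from/join logic."""
    cols = ",\n  ".join(
        f"cast({r['source_column_name']} as {r['target_data_type']}) "
        f"as {r['target_column_name']}"
        for r in rows
    )
    return f"select\n  {cols}\nfrom {rows[0]['query']}"

print(build_select(rows))
# select
#   cast(c.customer_id as int) as customer_id,
#   cast(upper(c.last_name) as string) as last_name
# from raw.customer c
```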