Mapping Specification

The mapping specification is at the core of data pipeliner. Here we will discuss how to create your own mapping specification. To aid the demonstration there is an example mapping specification you can download to follow along, and a mapping specification template for when you're ready to create your own.

Overview Of Sheets

The mapping spec is made up of several sheets. The first three have fixed names: _source_control, _source_tables and _target_tables. The _source_control sheet logs changes to the specification, while _source_tables and _target_tables define our input and output tables once, to save repetition across the mappings. The remaining sheets are the mappings themselves, which define the source-to-target logic. There is no upper limit on the number of mapping sheets that can be added; however, if performance issues arise it may be necessary to split the mappings over several files.
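To make that layout concrete, here is a minimal sketch of how the workbook's sheets could be loaded and separated into the fixed sheets and the mapping sheets. This is illustrative only: pipeliner's own loading code may differ, and the file name mapping_spec.xlsx is an assumption.

```python
import pandas as pd

FIXED_SHEETS = {"_source_control", "_source_tables", "_target_tables"}

# sheet_name=None reads every sheet, returning {sheet_name: DataFrame}.
# Reading .xlsx files requires the openpyxl package.
sheets = pd.read_excel("mapping_spec.xlsx", sheet_name=None)  # hypothetical file name

source_control = sheets["_source_control"]
source_tables = sheets["_source_tables"]
target_tables = sheets["_target_tables"]

# Every other sheet is a mapping sheet, named after its target table.
mappings = {name: df for name, df in sheets.items() if name not in FIXED_SHEETS}
```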

Source Control

  • Sheet Name: _source_control
  • Sheet Fields:
| Field | Description |
|-------|-------------|
| Author | Name of the author of the change. |
| Version | The version number; we recommend using semantic versioning. |
| Change Description | A description of the change made since the last version. |

The source control sheet is there so analysts can track the changes they make to the mapping specification over time. Example changes include updating a column data type or adding a new table.
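For illustration, a _source_control sheet might contain entries like the following (authors, versions and changes are invented, with versions following the major.minor.patch pattern of semantic versioning):

| Author | Version | Change Description |
|--------|---------|--------------------|
| A. Analyst | 1.0.0 | Initial version of the specification. |
| A. Analyst | 1.1.0 | Added the orders mapping sheet. |
| B. Analyst | 1.1.1 | Changed order_id from int to bigint. |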

Source Tables

  • Sheet Name: _source_tables
  • Sheet Fields:
| Field | Description |
|-------|-------------|
| database_name | The name of the database in the source system; for example, in AWS Glue it will be the Glue Database name. |
| table_name | The name of the table in the source system; for example, in AWS Glue it will be the Glue Table name. |
| bucket_name | The name of the S3 bucket where the data is held. |
| prefix | The path part that comes before the table in the S3 hierarchy. For example, internal/db_name/table_name/<data files> would have prefix internal/db_name. |
| alias | An alias for the table for use in SQL queries. This ensures consistent aliasing of the table across mappings. |
| file_type | The type of file; currently supported values are parquet, csv and json. |
| file_options | Spark read options in JSON form to apply when reading the data, for example {"compression": "snappy"}. |
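As a rough sketch of how a _source_tables row might be consumed, the fields combine into an S3 path and a Spark read. This is not pipeliner's actual code; the row values below are illustrative assumptions.

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative values for one _source_tables row.
row = {
    "database_name": "sales_db",
    "table_name": "orders",
    "bucket_name": "my-data-bucket",
    "prefix": "internal/sales_db",
    "alias": "o",
    "file_type": "parquet",
    "file_options": '{"compression": "snappy"}',
}

# bucket + prefix + table name give the full S3 location of the data files.
path = f"s3://{row['bucket_name']}/{row['prefix']}/{row['table_name']}/"

df = (
    spark.read
    .options(**json.loads(row["file_options"]))  # file_options parsed from JSON
    .format(row["file_type"])                    # parquet, csv or json
    .load(path)
)

# Register the table for SQL; mappings refer to it via its alias, e.g. "orders o".
df.createOrReplaceTempView(row["table_name"])
```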

Target Tables

  • Sheet Name: _target_tables
  • Sheet Fields:
| Field | Description |
|-------|-------------|
| database_name | The name of the database in the target system; for example, in AWS Glue it will be the Glue Database name. |
| table_name | The name of the table in the target system; for example, in AWS Glue it will be the Glue Table name. |
| bucket_name | The name of the S3 bucket where the data is held. |
| prefix | The path part that comes before the table in the S3 hierarchy. For example, internal/db_name/table_name/<data files> would have prefix internal/db_name. |
| alias | An alias for the table for use in SQL queries. This ensures consistent aliasing of the table across mappings. |
| file_type | The type of file; currently supported values are parquet, csv and json. |
| file_options | Spark write options in JSON form to apply when writing the data, for example {"compression": "snappy"}. |
| is_active | A flag to determine whether pipeliner should produce outputs for this table: y to process, n to skip. |
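A correspondingly rough sketch of the write side, under the same caveats (illustrative values, not pipeliner's actual code): only tables with is_active set to y are processed. Here df stands for the Spark DataFrame produced by the mapping.

```python
import json

# Illustrative values for one _target_tables row.
target = {
    "table_name": "orders_summary",
    "bucket_name": "my-curated-bucket",
    "prefix": "curated/reporting_db",
    "file_type": "parquet",
    "file_options": '{"compression": "snappy"}',
    "is_active": "y",
}

if target["is_active"] == "y":  # tables flagged n are skipped
    out_path = f"s3://{target['bucket_name']}/{target['prefix']}/{target['table_name']}/"
    (
        df.write
        .options(**json.loads(target["file_options"]))  # write options from JSON
        .format(target["file_type"])
        .mode("overwrite")  # illustrative save mode; not specified by the spec
        .save(out_path)
    )
```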

Mappings

  • Sheet Name: This should be the name of the target table

Warning

Excel has a maximum of 31 characters for a sheet name, so if your table name is longer than this you should truncate from the start of the table name, as in the sketch below.
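A one-line sketch of that truncation (the table name is invented):

```python
table_name = "a_very_long_target_table_name_indeed"
sheet_name = table_name[-31:]  # keep the last 31 characters, dropping the start
```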

  • Sheet Fields:
| Field | Description |
|-------|-------------|
| source_database_name | The name of the database in the source system; for example, in AWS Glue it will be the Glue Database name. This should match the entry in the _source_tables sheet. |
| source_table_name | The name of the table in the source system; for example, in AWS Glue it will be the Glue Table name. This should match the entry in the _source_tables sheet. |
| source_column_name | The name of the column in the source table which will map to the target column. Spark SQL expressions such as max() are supported here. |
| source_data_type | The data type of the source column. This should be the Spark-compatible data type, lower case and with the Type suffix removed, for example string not StringType. |
| source_length | The length of the source column; this is not enforced but is useful metadata. For example, if the source comes from a relational database and is of char(25) type, we can specify 25. |
| source_precision | The precision of the source column, if the source is decimal type. |
| source_scale | The scale of the source column, if the source is decimal type. |
| source_column_comment | A description of the source column field; this could explain why the field is chosen. |
| target_database_name | The name of the database in the target system; for example, in AWS Glue it will be the Glue Database name. This should match the entry in the _target_tables sheet. |
| target_table_name | The name of the table in the target system; for example, in AWS Glue it will be the Glue Table name. This should match the entry in the _target_tables sheet, and there should only be one target_table_name value per sheet. |
| target_column_name | The name of the column in the target table. This should be unique within the sheet. |
| target_data_type | The data type of the target column. This should be the Spark-compatible data type, lower case and with the Type suffix removed, for example string not StringType. |
| target_length | The length of the target column; this is not enforced but is useful metadata. For example, if the target is of char(25) type, we can specify 25. |
| target_precision | The precision of the target column, if the target is decimal type. |
| target_scale | The scale of the target column, if the target is decimal type. |
| target_column_comment | A description of the target column field. This could include business definition, allowed values, an example value and constraint information (constraints are not enforced). It will be added to applicable target systems such as AWS Glue. |
| target_column_tags | Tags to apply to the target column and/or table, for example: partition_order, a numeric value giving the column's position in the partitioning if it is a partition column; contains_pii, a boolean value indicating whether the column contains PII. |
| query | A query to apply to the table. This should be used when a new table is added in the source_table_name field, and can contain the join logic. The final query cell can additionally contain filter logic. |
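To tie the fields together, here is a hedged sketch of how mapping rows might be assembled into a Spark SQL statement: each row becomes a cast-and-alias expression, and the query cells supply join and filter logic. The rows, aliases and generated SQL are invented for illustration; pipeliner's actual query generation may differ.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative mapping rows (one per target column).
rows = [
    {"source_column_name": "o.order_id", "target_column_name": "order_id",
     "target_data_type": "bigint", "query": ""},
    {"source_column_name": "max(o.updated_at)", "target_column_name": "last_updated",
     "target_data_type": "timestamp",
     "query": "left join customers c on c.id = o.customer_id group by o.order_id"},
]

# Each row becomes "cast(<expression> as <type>) as <target column>".
select_list = ",\n    ".join(
    f"cast({r['source_column_name']} as {r['target_data_type']}) as {r['target_column_name']}"
    for r in rows
)

# The final non-empty query cell supplies the join/filter logic; "o" is the
# alias given to the orders table in _source_tables.
tail = next((r["query"] for r in reversed(rows) if r["query"]), "")
sql = f"select\n    {select_list}\nfrom orders o\n{tail}"

result = spark.sql(sql)  # assumes the source views are registered, as above
```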