# Mapping Specification
The mapping specification is at the core of data pipeliner. Here we will discuss how to create your own mapping specification. To aid the demonstration there is an example mapping specification you can download and follow along with. There is also a mapping specification template for when you're ready to create your own.
- **Example Specification**: This mapping spec contains an example raw-to-curated data mapping for the Adventure Works Database, a sample database provided by Microsoft.
- **Template Specification**: This is a template mapping spec, ready for mappings to be defined.
## Overview Of Sheets
The mapping spec is made up of several sheets. The first three have fixed names: `_source_control`, `_source_tables` and `_target_tables`. The `_source_control` sheet logs changes to the specification, while `_source_tables` and `_target_tables` define our input and output tables in one place to save repetition across the mappings. The remaining sheets are the mappings, which define the source-to-target logic. There is no upper limit on the number of mapping sheets that can be added; however, if performance issues are found it may be necessary to split the mappings over several files.
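Because the spec is a workbook, the fixed-name sheets and the mapping sheets can be separated mechanically. Below is a minimal sketch, assuming the spec is loaded with pandas; the file name is a placeholder and data pipeliner's own loading code may differ.

```python
import pandas as pd

# Load every sheet at once; sheet_name=None returns a dict of
# sheet name -> DataFrame. "mapping_spec.xlsx" is a placeholder path.
sheets = pd.read_excel("mapping_spec.xlsx", sheet_name=None)

# The three fixed-name sheets.
source_control = sheets["_source_control"]
source_tables = sheets["_source_tables"]
target_tables = sheets["_target_tables"]

# Every other sheet is a mapping, named after its target table.
mappings = {name: df for name, df in sheets.items() if not name.startswith("_")}
```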
## Source Control

- Sheet Name: `_source_control`
- Sheet Fields:

| Field | Description |
| --- | --- |
| Author | Name of the author of the change. |
| Version | The version number; we recommend using semantic versioning. |
| Change Description | A description of the change made since the last version. |
The source control sheet is there so analysts can track the changes they make to the mapping specification over time. Example changes include updating a column data type or adding a new table.
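As a rough illustration of how the Version field might be sanity-checked, the sketch below validates that entries look like semantic versions and increase over time. The `check_versions` helper is hypothetical, not part of data pipeliner.

```python
import re

SEMVER = re.compile(r"^\d+\.\d+\.\d+$")  # basic major.minor.patch pattern

def check_versions(versions):
    """Raise if any entry is not basic semver or versions do not increase."""
    parsed = []
    for v in versions:
        if not SEMVER.match(v):
            raise ValueError(f"not a semantic version: {v}")
        parsed.append(tuple(int(part) for part in v.split(".")))
    if parsed != sorted(parsed):
        raise ValueError("versions are not in increasing order")

check_versions(["0.1.0", "0.2.0", "1.0.0"])  # passes silently
```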
## Source Tables

- Sheet Name: `_source_tables`
- Sheet Fields:

| Field | Description |
| --- | --- |
| database_name | The name of the database in the source system; for example, in AWS Glue it will be the Glue Database name. |
| table_name | The name of the table in the source system; for example, in AWS Glue it will be the Glue Table name. |
| bucket_name | The name of the S3 bucket where the data is held. |
| prefix | The path part that comes before the table in the S3 hierarchy. For example, `internal/db_name/table_name/<data files>` would have prefix `internal/db_name`. |
| alias | An alias for the table for use in SQL queries. This is to ensure consistent aliasing of the table across mappings. |
| file_type | The type of file; currently supported values are `parquet`, `csv` and `json`. |
| file_options | Spark read options in JSON form to apply when reading the data, for example `{"compression": "snappy"}`. |
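To make the fields concrete, here is a sketch of how one `_source_tables` row could drive a Spark read. The row values and view registration are illustrative assumptions, not data pipeliner's actual internals.

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A hypothetical row from _source_tables.
row = {
    "database_name": "adventure_works_raw",   # Glue Database name
    "table_name": "sales_order_header",       # Glue Table name
    "bucket_name": "my-raw-bucket",
    "prefix": "internal/adventure_works_raw",
    "alias": "soh",
    "file_type": "parquet",
    "file_options": '{"compression": "snappy"}',
}

# bucket_name, prefix and table_name together locate the data files in S3.
path = f"s3://{row['bucket_name']}/{row['prefix']}/{row['table_name']}/"

df = (
    spark.read.format(row["file_type"])           # parquet, csv or json
    .options(**json.loads(row["file_options"]))   # file_options is a JSON string
    .load(path)
)

# Registering the alias keeps SQL in the mapping sheets consistent.
df.createOrReplaceTempView(row["alias"])
```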
## Target Tables

- Sheet Name: `_target_tables`
- Sheet Fields:

| Field | Description |
| --- | --- |
| database_name | The name of the database in the target system; for example, in AWS Glue it will be the Glue Database name. |
| table_name | The name of the table in the target system; for example, in AWS Glue it will be the Glue Table name. |
| bucket_name | The name of the S3 bucket where the data is held. |
| prefix | The path part that comes before the table in the S3 hierarchy. For example, `internal/db_name/table_name/<data files>` would have prefix `internal/db_name`. |
| alias | An alias for the table for use in SQL queries. This is to ensure consistent aliasing of the table across mappings. |
| file_type | The type of file; currently supported values are `parquet`, `csv` and `json`. |
| file_options | Spark write options in JSON form to apply when writing the data, for example `{"compression": "snappy"}`. |
| is_active | A flag to determine whether pipeliner should produce outputs for this table: `y` to process, `n` to skip. |
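As a counterpart to the read sketch above, a write driven by one `_target_tables` row might look like the following. The row values and the stand-in DataFrame are assumptions for illustration only.

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the transformed output of a mapping sheet.
df = spark.createDataFrame([(1, "Alice")], ["customer_id", "customer_name"])

# A hypothetical row from _target_tables.
row = {
    "database_name": "adventure_works_curated",
    "table_name": "dim_customer",
    "bucket_name": "my-curated-bucket",
    "prefix": "internal/adventure_works_curated",
    "file_type": "parquet",
    "file_options": '{"compression": "snappy"}',
    "is_active": "y",
}

if row["is_active"] == "y":  # inactive tables are skipped entirely
    path = f"s3://{row['bucket_name']}/{row['prefix']}/{row['table_name']}/"
    (
        df.write.format(row["file_type"])
        .options(**json.loads(row["file_options"]))
        .mode("overwrite")
        .save(path)
    )
```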
## Mappings

- Sheet Name: This should be the name of the target table.

Warning

Excel has a maximum of 31 characters for a sheet name, so if your table name is longer than this you should truncate it from the start of the table name.
- Sheet Fields:

| Field | Description |
| --- | --- |
| source_database_name | The name of the database in the source system, for example in AWS Glue it will be the Glue Database name. This should match the entry in the `_source_tables` sheet. |
| source_table_name | The name of the table in the source system, for example in AWS Glue it will be the Glue Table name. This should match the entry in the `_source_tables` sheet. |
| source_column_name | The name of the column in the source table which will map to the target column. Spark SQL expressions such as `max()` are supported here. |
| source_data_type | The data type of the source column; this should be the Spark-compatible data type, lower case and without the `Type` suffix, for example `string` not `StringType`. |
| source_length | The length of the source column; this is not enforced but is useful metadata. For example, if the source comes from a relational database and is of `char(25)` type we can specify `25`. |
| source_precision | The precision of the source column, if the source is decimal type. |
| source_scale | The scale of the source column, if the source is decimal type. |
| source_column_comment | A description of the source column field; this could explain why the field was chosen. |
| target_database_name | The name of the database in the target system, for example in AWS Glue it will be the Glue Database name. This should match the entry in the `_target_tables` sheet. |
| target_table_name | The name of the table in the target system, for example in AWS Glue it will be the Glue Table name. This should match the entry in the `_target_tables` sheet, and there should only be one `target_table_name` value per sheet. |
| target_column_name | The name of the column in the target table. Each target column name should be unique within the sheet. |
| target_data_type | The data type of the target column; this should be the Spark-compatible data type, lower case and without the `Type` suffix, for example `string` not `StringType`. |
| target_length | The length of the target column; this is not enforced but is useful metadata. For example, if the target is of `char(25)` type we can specify `25`. |
| target_precision | The precision of the target column, if the target is decimal type. |
| target_scale | The scale of the target column, if the target is decimal type. |
| target_column_comment | A description of the target column field. This could include business definition, allowed values, example values and constraint information (constraints are not enforced). It will be added to applicable target systems such as AWS Glue. |
| target_column_tags | Tags to apply to the target column and/or table. Some tags to add are: `partition_order`, a numeric value giving the column's position in the partition order if it is a partition column, and `contains_pii`, a boolean value indicating whether the column contains PII. |
| query | A query to apply to the table. This should be used when a new table is introduced in the `source_table_name` field; it can contain the join logic. The final query cell can additionally contain filter logic. |
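To show how the mapping fields fit together, the sketch below turns two hypothetical mapping rows into a Spark SQL statement, casting each source expression to its target type and aliasing it to the target column name. The rows, the `build_select` helper and the generated SQL are illustrative assumptions, not data pipeliner's actual code.

```python
# Two hypothetical mapping rows for a dim_customer target table.
rows = [
    {"source_column_name": "c.customer_id", "target_column_name": "customer_id",
     "target_data_type": "int", "query": "raw.customer c"},
    {"source_column_name": "upper(c.last_name)", "target_column_name": "last_name",
     "target_data_type": "string", "query": ""},
]

def build_select(rows):
    """Cast each source expression to its target type and alias it to the
    target column name; the first query cell supplies the from/join logic."""
    cols = ",\n  ".join(
        f"cast({r['source_column_name']} as {r['target_data_type']}) "
        f"as {r['target_column_name']}"
        for r in rows
    )
    return f"select\n  {cols}\nfrom {rows[0]['query']}"

print(build_select(rows))
# select
#   cast(c.customer_id as int) as customer_id,
#   cast(upper(c.last_name) as string) as last_name
# from raw.customer c
```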