In computer science, data transformation refers to the process of converting data from one format or structure to another.
Since I joined the BBC as a Platform Software Engineer, data transformation has become part of my daily work. I’m currently part of the Programme Metadata team, which was recently renamed to Universal Content for exactly this reason: we provide other teams with solutions for building Data Transformation Workflows easily.
What is a workflow?
A workflow is a system that feeds data from a producer into other workflows or into data stores (e.g. S3 or a database) from which a client consumes the data.
Let’s take an example of a workflow:
- It fetches data from a REST API
- document-filter -> filters out the documents that are not needed and sends the rest through to the next component
- document-converter -> converts the document into the required format
- s3-publisher -> publishes the consumed document to an S3 bucket and to the final topic
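To make the example concrete, here is a minimal sketch of the same four stages in Python. Everything in it is an assumption made for illustration: the document fields, the "keep only episodes" rule, and the in-memory list and dict that stand in for the REST API, the S3 bucket and the final topic. It is not the code we run in production.

```python
import json
from typing import Iterable, Iterator

# Hypothetical stand-ins: a list plays the REST API, a dict plays the
# S3 bucket, and another list plays the final topic.
API_RESPONSE = [
    {"id": "p1", "type": "episode", "title": "Pilot"},
    {"id": "p2", "type": "advert", "title": "Buy now"},
    {"id": "p3", "type": "episode", "title": "Finale"},
]
bucket: dict[str, str] = {}
final_topic: list[str] = []


def fetch_documents() -> Iterator[dict]:
    """Stand-in for the REST API call."""
    yield from API_RESPONSE


def document_filter(docs: Iterable[dict]) -> Iterator[dict]:
    """Drop the documents that are not needed (assumed rule: keep episodes)."""
    return (doc for doc in docs if doc["type"] == "episode")


def document_converter(docs: Iterable[dict]) -> Iterator[dict]:
    """Convert each document into the (assumed) required format."""
    for doc in docs:
        yield {"pid": doc["id"], "name": doc["title"].upper()}


def s3_publisher(docs: Iterable[dict]) -> None:
    """Write each document to the bucket and to the final topic."""
    for doc in docs:
        body = json.dumps(doc)
        bucket[f"documents/{doc['pid']}.json"] = body
        final_topic.append(body)


# Each stage consumes the output of the previous one.
s3_publisher(document_converter(document_filter(fetch_documents())))
print(sorted(bucket))
```

Because every stage only takes documents in and hands documents out, any of them could be swapped or fed into a different workflow without touching the others.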
This was a simple example workflow, but sometimes you might need multiple workflows, or a single workflow that diverges into many others, which can produce a sort of octopus diagram.
How do I build a Data Transformation Workflow?
Input and Output
Before you start building a Data Transformation Workflow, you need to know what your input and output will look like.
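One way to pin the two shapes down is to write them as types before writing any transformation logic. The sketch below does that with Python dataclasses. The field names (`id`, `title`, `duration_seconds`, `pid`, `duration_minutes`) are hypothetical and only there to show the idea.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InputDocument:
    """Shape of a document as the (hypothetical) source delivers it."""
    id: str
    title: str
    duration_seconds: int


@dataclass(frozen=True)
class OutputDocument:
    """Shape of a document as the (hypothetical) client expects it."""
    pid: str
    title: str
    duration_minutes: float


def transform(doc: InputDocument) -> OutputDocument:
    """Map one input document onto the output shape."""
    return OutputDocument(
        pid=doc.id,
        title=doc.title.strip(),
        duration_minutes=round(doc.duration_seconds / 60, 1),
    )


result = transform(InputDocument(id="p1", title=" Pilot ", duration_seconds=1500))
print(asdict(result))
```

With both shapes written down, the transformation is just a function from one type to the other, which makes it easy to test on its own.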
Source and Destination
Unless you have your own source of data, such as a database, you will have to fetch the data from an external source, which can be an API, a Kafka Topic or an S3 bucket. The same applies to the destination: the client can consume from a database with an API in front of it, a Kafka Topic or an S3 bucket.
Workflow Architecture Design
Depending on how complex your transformation is, at this point you can start designing your workflow architecture, which can be split into 3 steps:
- document fetcher
- data transformation
- document publisher
Let’s go through each of them.
The Document Fetcher
This is the component that fetches your documents from the source.
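As a rough sketch, a fetcher can be written so that it does not care where the raw data comes from. Here `read_source` is a hypothetical callable that returns a JSON array as text. In a real workflow it might wrap an HTTP call, a Kafka consumer or an S3 read, while in this example a fake function plays that role.

```python
import json
from typing import Callable, Iterator


def document_fetcher(read_source: Callable[[], str]) -> Iterator[dict]:
    """Yield documents from a source that returns a JSON array as text."""
    for doc in json.loads(read_source()):
        yield doc


def fake_api() -> str:
    """Hypothetical source used only for this example."""
    return '[{"id": "p1"}, {"id": "p2"}]'


fetched = list(document_fetcher(fake_api))
print(fetched)
```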
For the data transformation itself, depending on how complex it is, you might want to split it into multiple steps/components.
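Splitting a transformation into steps can be sketched as small functions that each take a document and return a new one, chained together into a single transformation. The helper `compose` and the two steps below are hypothetical names chosen for this example.

```python
from functools import reduce
from typing import Callable

Step = Callable[[dict], dict]


def compose(*steps: Step) -> Step:
    """Chain steps left to right into a single transformation."""
    return lambda doc: reduce(lambda acc, step: step(acc), steps, doc)


def rename_id(doc: dict) -> dict:
    """Replace the 'id' field with 'pid' without mutating the input."""
    renamed = {key: value for key, value in doc.items() if key != "id"}
    renamed["pid"] = doc["id"]
    return renamed


def normalise_title(doc: dict) -> dict:
    """Trim whitespace and title-case the 'title' field."""
    return {**doc, "title": doc["title"].strip().title()}


transform_document = compose(rename_id, normalise_title)
converted = transform_document({"id": "p1", "title": "  the pilot "})
print(converted)
```

Each step stays small enough to test on its own, and a step can later be moved into a separate component if the workflow grows.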
The Document Publisher
The last step of the workflow is to publish the document wherever the client will consume it from. In the workflows that I’ve built, I used either a Kafka Topic or an S3 bucket as the destination for the transformed documents.
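Since the destination can vary, one option is to hide it behind a small interface so the rest of the workflow stays the same. The sketch below assumes a hypothetical `Publisher` interface with two in-memory implementations that stand in for a Kafka Topic and an S3 bucket. No real Kafka or AWS client is used here.

```python
import json
from typing import Protocol


class Publisher(Protocol):
    def publish(self, key: str, document: dict) -> None: ...


class InMemoryTopicPublisher:
    """Stand-in for a Kafka Topic: an append-only list of (key, value) records."""

    def __init__(self) -> None:
        self.records: list[tuple[str, str]] = []

    def publish(self, key: str, document: dict) -> None:
        self.records.append((key, json.dumps(document, sort_keys=True)))


class InMemoryBucketPublisher:
    """Stand-in for an S3 bucket: a mapping from object key to body."""

    def __init__(self) -> None:
        self.objects: dict[str, str] = {}

    def publish(self, key: str, document: dict) -> None:
        self.objects[f"{key}.json"] = json.dumps(document, sort_keys=True)


def publish_all(publisher: Publisher, documents: list[dict]) -> None:
    """Publish every document, keyed by its (assumed) 'pid' field."""
    for doc in documents:
        publisher.publish(doc["pid"], doc)


topic = InMemoryTopicPublisher()
bucket = InMemoryBucketPublisher()
docs = [{"pid": "p1", "title": "Pilot"}]
publish_all(topic, docs)
publish_all(bucket, docs)
print(topic.records, bucket.objects)
```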