Data Transformation Workflows

In computer science, data transformation refers to the process of converting data from one format or structure to another.

Since I joined the BBC as a Platform Software Engineer, data transformation has become part of my daily work. I’m currently part of the Programme Metadata team, which was recently renamed Universal Content for exactly this reason: we provide other teams with solutions that make it easy to build Data Transformation Workflows.

What is a workflow?

A workflow is a system that feeds data from a producer into other workflows or data stores (e.g. S3, a database), from which a client consumes the data.

Let’s take an example of a workflow:

Data Transformation Workflow Example Diagram
  • It fetches data from a REST API
  • document-filter -> filters out the documents that are not needed and sends the others through to the next component
  • document-converter -> converts the document into the required format
  • s3-publisher -> publishes the document it consumes to an S3 bucket and to the final topic
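
The example above can be sketched in a few lines of Python. This is only an in-memory illustration, not the actual implementation: the document fields (`id`, `type`, `title`), the filtering rule and the dictionary standing in for the S3 bucket are all assumptions made for the example.

```python
import json


def document_filter(documents):
    """Keep only the documents the workflow cares about (hypothetical rule)."""
    return [doc for doc in documents if doc.get("type") == "episode"]


def document_converter(document):
    """Convert a document into the required output format (hypothetical shape)."""
    return {"pid": document["id"], "title": document["title"].strip()}


def s3_publisher(document, bucket):
    """Store the converted document; `bucket` is a plain dict standing in for S3."""
    key = f"{document['pid']}.json"
    bucket[key] = json.dumps(document, sort_keys=True)
    return key


def run_workflow(documents, bucket):
    """Chain the components: filter -> convert -> publish."""
    return [
        s3_publisher(document_converter(doc), bucket)
        for doc in document_filter(documents)
    ]


# Pretend these documents were fetched from the REST API.
fetched = [
    {"id": "a1", "type": "episode", "title": " Pilot "},
    {"id": "b2", "type": "advert", "title": "Buy now"},
]
fake_bucket = {}
published_keys = run_workflow(fetched, fake_bucket)
```

In a real workflow each function would typically be its own component connected by topics, but the order of the steps stays the same.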

This was a simple example, but sometimes you need multiple workflows, or a single workflow that branches into many others, which can produce a sort of octopus diagram:

Complex Data Transformation Workflow Diagram

How do I build a Data Transformation Workflow?

Input and Output

Before you start building a Data Transformation Workflow, you need to know what your input and output will look like:

Input and desired Output
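
One way to pin this down is to write one sample input document next to the output you want for it, and treat the pair as a first test. The field names below are invented purely for illustration and do not come from any real schema.

```python
# A sample input document as it might arrive from the source (hypothetical fields).
sample_input = {
    "programme_id": "p001",
    "titles": {"primary": "Example Programme", "secondary": "Series 1"},
    "duration_seconds": 1800,
}

# The shape the client is assumed to want.
desired_output = {
    "id": "p001",
    "title": "Example Programme: Series 1",
    "duration_minutes": 30,
}


def transform(document):
    """Map the input shape onto the desired output shape."""
    titles = document["titles"]
    return {
        "id": document["programme_id"],
        "title": f"{titles['primary']}: {titles['secondary']}",
        "duration_minutes": document["duration_seconds"] // 60,
    }


# The input/output pair doubles as a check on the transformation.
assert transform(sample_input) == desired_output
```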

Source and Destination

Unless you have your own source of data, such as a database, you will have to fetch the data from a source, which can be an API, a Kafka topic or an S3 bucket. The same goes for the destination: the client can read from a database with an API in front of it, a Kafka topic or an S3 bucket.

Workflow Architecture Design

Depending on how complex your transformation is, at this point you can start designing your workflow architecture, which can be split into three steps:

  • document fetcher
  • data transformation
  • document publisher

Let’s look at each of them.

The Document Fetcher

This is the component that fetches your documents from the source.
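
As a rough sketch, a fetcher for an HTTP source could look like the following. It assumes the source returns a JSON array of documents; the URL is a placeholder, and the `open_url` parameter is a hypothetical hook added here so the function can be exercised without a network call.

```python
import io
import json
from urllib.request import urlopen


def fetch_documents(url, open_url=urlopen):
    """Fetch a JSON array of documents from `url`.

    `open_url` defaults to urllib's urlopen; passing a replacement makes the
    fetcher testable offline.
    """
    with open_url(url) as response:
        return json.loads(response.read().decode("utf-8"))


def fake_open(url):
    """Stand-in for the network: returns a canned JSON response body."""
    return io.BytesIO(b'[{"id": "a1"}]')


documents = fetch_documents("https://example.invalid/documents", open_url=fake_open)
```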

Data Transformation

Depending on how complex your transformation is, you might want to split it into multiple steps or components.
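
One possible way to split a transformation is to write each step as a small function and apply them in order. The three steps and the field names below are made up for the example; in a real workflow each step could just as well be a separate component.

```python
from functools import reduce


def drop_empty_fields(document):
    """Remove fields whose value is None or an empty string."""
    return {k: v for k, v in document.items() if v not in (None, "")}


def rename_fields(document):
    """Rename source field names to the names the client expects (hypothetical)."""
    renames = {"programme_id": "id"}
    return {renames.get(k, k): v for k, v in document.items()}


def add_schema_version(document):
    """Tag the document with an assumed output schema version."""
    return {**document, "schema_version": 1}


STEPS = [drop_empty_fields, rename_fields, add_schema_version]


def run_steps(document, steps=STEPS):
    """Feed the output of each step into the next one."""
    return reduce(lambda doc, step: step(doc), steps, document)


result = run_steps({"programme_id": "p001", "synopsis": "", "title": "Example"})
```

Keeping the steps small makes each one easy to test and to reorder when requirements change.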

The Document Publisher

The last step of the workflow is to publish the document wherever the client will consume it from. In the workflows that I’ve built, I used either a Kafka topic or an S3 bucket as the destination for the transformed documents.
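
Because the destination can vary, it can help to hide it behind a small interface. The sketch below uses in-memory stand-ins rather than real Kafka or S3 clients; the `Publisher` protocol, the class names and the `id` field are all assumptions for illustration.

```python
import json
from typing import Protocol


class Publisher(Protocol):
    """Anything that can publish a document under a key."""

    def publish(self, key: str, document: dict) -> None: ...


class InMemoryTopicPublisher:
    """Stand-in for a Kafka topic: appends (key, message) pairs to a list."""

    def __init__(self):
        self.messages = []

    def publish(self, key, document):
        self.messages.append((key, json.dumps(document)))


class InMemoryBucketPublisher:
    """Stand-in for an S3 bucket: keeps one JSON object per key."""

    def __init__(self):
        self.objects = {}

    def publish(self, key, document):
        self.objects[f"{key}.json"] = json.dumps(document)


def publish_all(documents, publisher: Publisher):
    """Publish every transformed document through the given publisher."""
    for document in documents:
        publisher.publish(document["id"], document)


transformed = [{"id": "p001", "title": "Example"}]
topic = InMemoryTopicPublisher()
bucket = InMemoryBucketPublisher()
publish_all(transformed, topic)
publish_all(transformed, bucket)
```

With this shape, swapping the destination means swapping the publisher object, while the rest of the workflow stays unchanged.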
