Due to the overwhelming success of “Top 5 Things You Need to Know as a Data Engineer” (which is worth checking out if you haven’t already), here are the 10 pillars that make a successful data engineering team.
1. Re-usable Frameworks
Frameworks are the foundation of any effective data engineering system and come in many forms. By definition, a framework should be applicable across the whole data engineering practice. These can be open-source, such as Apache Beam’s pre-existing templates, or customised in-house patterns, such as bespoke Flex Template Beam pipelines.
At what point should someone create a bespoke framework rather than use an existing one? Well, it depends. Weigh the value of having a customised framework against the cost of building it. If the opportunity cost exhibits negative returns, why bother going to the effort of creating and maintaining that framework?
Common frameworks include:
- Programmatic libraries — building on top of open-source libraries such as Dask DataFrames, custom Airflow operators
- Data pipeline templates — Apache Beam templates, common Airflow DAGs
- Automated scripts with variables passed — substitutable SQL scripts (i.e. dbt macros), Makefile deployment scripts
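To make the last category concrete, here is a minimal sketch of a substitutable SQL script using only Python’s standard library — the same idea a dbt macro expresses, reduced to string templating. The project, dataset, table and date names are illustrative assumptions, not part of any real pipeline.

```python
from string import Template

# Reusable query with substitutable placeholders; each environment or
# table supplies its own values instead of copy-pasting the SQL.
SQL_TEMPLATE = Template(
    "SELECT user_id, COUNT(*) AS events\n"
    "FROM ${project}.${dataset}.${table}\n"
    "WHERE event_date >= '${start_date}'\n"
    "GROUP BY user_id"
)

def render_sql(project: str, dataset: str, table: str, start_date: str) -> str:
    """Render the shared query for a specific environment and table."""
    return SQL_TEMPLATE.substitute(
        project=project, dataset=dataset, table=table, start_date=start_date
    )

if __name__ == "__main__":
    # Hypothetical production values, purely for illustration.
    print(render_sql("analytics-prod", "events", "page_views", "2023-01-01"))
```

The framework here is not the SQL itself but the contract: anyone on the team can reuse the query by passing variables, rather than forking the script.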
Data pipeline patterns, alongside clear guidance on expectations, standards and deliverables, will make it much easier for others to digest and consume your framework. Follow agile practices to clearly define the scope of a given framework. Scope may change over time, but ensure that you’re not building something monolithic (not necessarily a bad thing in itself, but a nightmare to manage).
“Almost all the successful microservice stories have started with a monolith that got too big and was broken up.
Almost all the cases where I’ve heard of a system that was built as a microservice system from scratch, it has ended up in serious trouble.”
Martin Fowler (2015)