Data is the most valuable asset for modern businesses. For any organization to extract valuable insights from data, that data needs to flow freely, securely, and on time across the platforms that produce and consume it. The data pipelines that connect these sources and targets need to be carefully designed and implemented. Otherwise, data consumers may be frustrated with data that is either stale (last refreshed several days ago) or simply incorrect (mismatched between source and target). That can lead to bad or inaccurate business decisions, slower insights, and lost competitive advantage.
The business data in a modern enterprise is spread across various platforms and formats. Data could reside in operational databases (e.g., MongoDB, Oracle), cloud warehouses (e.g., Snowflake), data lakes and lakehouses (e.g., Databricks Delta Lake), or even external public sources. Teams building pipelines across this variety of sources need to follow some best practices so that data consumers get high-quality data delivered to where the data apps are being built. Some of the best practices that a data pipeline process can follow are:
- Make sure that the data is delivered reliably and with high integrity and quality. The concept of “garbage in, garbage out” applies here. Data validation and correction are important aspects of ensuring that.
- Ensure that data transport is highly secure and that no data is ever written to persistent storage unencrypted.
- Data pipeline architecture needs to be flexible and able to adapt to a business’s future growth trajectory. Adding a new data source should not require a rewrite of the pipeline architecture. It should merely be an add-on. Otherwise, it will be very taxing on the data team’s productivity.
A frequent mistake that data teams make is to underestimate the complexity of data pipelines. A do-it-yourself (DIY) approach only makes sense if the data engineering team is large and capable enough to deal with the complexities of high-volume, high-velocity, and highly varied data. It would be wise to first evaluate whether a data pipeline platform would meet the team's needs before rushing to implement something in-house. There are several platforms available in the market today in the ETL/ELT/reverse ETL space.
Another pitfall is to implement a vertical solution that caters to only the first use case instead of architecting a solution that would be flexible enough to add new sources and targets without a complete rewrite. Data architects should think holistically and design solutions that are flexible and can work with a variety of data sources (relational, unstructured, etc.).
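One way to picture the "add-on, not rewrite" idea is a small registry in which sources and targets are plugged in by name. The sketch below is only an illustration under assumed names: `SOURCES`, `TARGETS`, `source`, `target`, `run`, and the sample records are all hypothetical, and the in-memory list stands in for a real warehouse.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
SourceFn = Callable[[], Iterable[Record]]
TargetFn = Callable[[Iterable[Record]], int]

# Hypothetical registries: the pipeline core only ever looks things up by name.
SOURCES: Dict[str, SourceFn] = {}
TARGETS: Dict[str, TargetFn] = {}


def source(name: str) -> Callable[[SourceFn], SourceFn]:
    """Decorator that registers a function as a named data source."""
    def register(fn: SourceFn) -> SourceFn:
        SOURCES[name] = fn
        return fn
    return register


def target(name: str) -> Callable[[TargetFn], TargetFn]:
    """Decorator that registers a function as a named data target."""
    def register(fn: TargetFn) -> TargetFn:
        TARGETS[name] = fn
        return fn
    return register


def run(source_name: str, target_name: str) -> int:
    """Move records from one registered source to one registered target."""
    return TARGETS[target_name](SOURCES[source_name]())


# Stand-in for a real warehouse table (assumption for the example).
WAREHOUSE: List[Record] = []


@source("orders_db")
def read_orders() -> Iterable[Record]:
    yield {"id": 1, "amount": 40}
    yield {"id": 2, "amount": 60}


@target("warehouse")
def write_warehouse(records: Iterable[Record]) -> int:
    before = len(WAREHOUSE)
    WAREHOUSE.extend(records)
    return len(WAREHOUSE) - before  # number of records written


# Adding a new source later is one more registered function;
# run() and the existing source and target are left untouched.
@source("public_feed")
def read_public_feed() -> Iterable[Record]:
    yield {"id": 3, "amount": 5}
```

With this shape, `run("orders_db", "warehouse")` and `run("public_feed", "warehouse")` use the same core code path, which is the property the paragraph above argues for. A production design would also need schema handling, retries, and incremental loads, none of which are shown here.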
The third mistake data pipeline creators often make is to put off any sort of data validation until a data mismatch occurs. By the time a mismatch surfaces, incorrect data has already reached consumers, and it is too late for validation or verification to prevent the damage. Data validation should be a design goal of any data pipeline process from the very outset.
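As a rough illustration of validation built in from the start, the following sketch compares source and target rows by key and by a per-row hash. It is a minimal example, not a description of any particular product: the function names, the `id` key column, and the sample rows are assumptions, and it presumes both sides fit in memory.

```python
import hashlib
import json
from typing import Dict, Iterable, List


def row_digest(row: Dict[str, object]) -> str:
    """Hash a row deterministically so two copies can be compared cheaply."""
    payload = json.dumps(row, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def compare(
    source_rows: Iterable[Dict[str, object]],
    target_rows: Iterable[Dict[str, object]],
    key: str = "id",  # assumed primary-key column
) -> Dict[str, List[object]]:
    """Report rows that are missing, unexpected, or different in the target."""
    src = {row[key]: row_digest(row) for row in source_rows}
    tgt = {row[key]: row_digest(row) for row in target_rows}
    shared = src.keys() & tgt.keys()
    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "unexpected_in_target": sorted(tgt.keys() - src.keys()),
        "mismatched": sorted(k for k in shared if src[k] != tgt[k]),
    }


source_rows = [
    {"id": 1, "amount": 40},
    {"id": 2, "amount": 60},
    {"id": 3, "amount": 5},
]
target_rows = [
    {"id": 1, "amount": 40},
    {"id": 2, "amount": 65},  # value drifted during the load
    {"id": 4, "amount": 1},   # row that never existed in the source
]

report = compare(source_rows, target_rows)
print(report)
```

Running a check like this after every load, rather than after the first complaint, is what makes a mismatch visible before consumers see it. For large tables the same idea is usually applied per partition or per key range instead of row by row in memory.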
About the Author
Rajkumar Sen is the founder and chief technology officer at Arcion, the cloud-native, CDC-based data replication platform. In his previous role as director of engineering at MemSQL, he architected the query optimizer and the distributed query processing engine. Raj also served as a principal engineer at Oracle, where he developed features for the Oracle database query optimizer, and a senior staff engineer at Sybase, where he architected several components for the Sybase Database Cluster Edition. He has published over a dozen papers in top-tier database conferences and journals and is the recipient of 14 patents.