Data Engineering Podcast

259 Episodes | Produced by Tobias Macey

Weekly deep dives on data management with the engineers and entrepreneurs who are shaping the industry

All Episodes

The Importance Of Data Contracts As The Interface For Data Integration With Abhi Sivasailam

January 23rd, 2022


Data platforms are exemplified by a complex set of connections that are subject to a set of constantly evolving requirements. In order to …

Building And Managing Data Teams And Data Platforms In Large Organizations With Ashish Mrig

January 23rd, 2022


Data engineering is a relatively young and rapidly expanding field, with practitioners having a wide array of experiences as they navigate their careers. Ashish Mrig currently leads the data analytics platform …

Automated Data Quality Management Through Machine Learning With Anomalo

January 15th, 2022


Data quality control is a requirement for being able to trust the various reports and machine learning models that are relying on the …

An Introduction To Data And Analytics Engineering For Non-Programmers

January 15th, 2022


Applications of data have grown well beyond the venerable business intelligence dashboards that organizations have relied on for decades. Now it is being used to power consumer facing services, influence …

Open Source Reverse ETL For Everyone With Grouparoo

January 8th, 2022


Reverse ETL is a product category that evolved from the landscape of customer data platforms with a number of companies offering their own …

Data Observability Out Of The Box With Metaplane

January 8th, 2022


Data observability is a set of technical and organizational capabilities related to understanding how your data is being processed and used so that you can proactively identify and fix errors in your workflows. …

Creating Shared Context For Your Data Warehouse With A Controlled Vocabulary

January 2nd, 2022


Communication and shared context are the hardest part of any data system. In recent years the focus has been on data catalogs as the means for documenting data assets, but those introduce a secondary system of …

A Reflection On The Data Ecosystem For The Year 2021

January 2nd, 2022


This has been an active year for the data ecosystem, with a number of new product categories and substantial growth in existing areas. In an attempt to capture the zeitgeist Maura Church, David Wallace, Benn …

Revisiting The Technical And Social Benefits Of The Data Mesh

December 27th, 2021


The data mesh is a thesis that was presented to address the technical and organizational challenges that businesses face in managing their analytical workflows at scale. Zhamak Dehghani introduced the concepts …

Exploring The Evolving Role Of Data Engineers

December 27th, 2021


Data Engineering is still a relatively new field that is going through a continued evolution as new technologies are introduced and new …

Fast And Flexible Headless Data Analytics With Cube.JS

December 21st, 2021


One of the perennial challenges of data analytics is having a consistent set of definitions, along with a flexible and performant API endpoint for querying them. In this episode Artom Keydunov and Pavel Tiunov …

Building A System Of Record For Your Organization's Data Ecosystem At Metaphor

December 20th, 2021


Building a well managed data ecosystem for your organization requires a holistic view of all of the producers, consumers, and processors of …

Building Auditable Spark Pipelines At Capital One

December 13th, 2021


Spark is a powerful and battle tested framework for building highly scalable data pipelines. Because of its proven ability to handle large …

Deliver Personal Experiences In Your Applications With The Unomi Open Source Customer Data Platform

December 12th, 2021


The core to providing your users with excellent service is to understand them and provide a personalized experience. Unfortunately many …

Experimentation and A/B Testing For Modern Data Teams With Eppo

December 4th, 2021


A/B testing and experimentation are the most reliable way to determine whether a change to your product will have the desired effect on your business. Unfortunately, being able to design, deploy, and validate …

Data Driven Hiring For Data Professionals With Alooba

December 4th, 2021


Hiring data professionals is challenging for a multitude of reasons, and as with every interview process there is a potential for bias to …

Creating A Unified Experience For The Modern Data Stack At Mozart Data

November 27th, 2021


The modern data stack has been gaining a lot of attention recently with a rapidly growing set of managed services for different stages of the data lifecycle. With all of the available options it is possible to …

Doing DataOps For External Data Sources As A Service at Demyst

November 27th, 2021


The data that you have access to affects the questions that you can answer. By using external data sources you can drastically increase the range of analysis that is available to your organization. The …

Exploring Processing Patterns For Streaming Data Integration In Your Data Lake

November 20th, 2021


One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. With the improvements in …

Laying The Foundation Of Your Data Platform For The Era Of Big Complexity With Dagster

November 20th, 2021


The technology for scaling storage and processing of data has gone through massive evolution over the past decade, leaving us with the …

Data Quality Starts At The Source

November 14th, 2021


The most important gauge of success for a data platform is the level of trust in the accuracy of the information that it provides. In order to build and maintain that trust it is necessary to invest in …

Eliminate Friction In Your Data Platform Through Unified Metadata Using OpenMetadata

November 10th, 2021


A significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata …

Business Intelligence Beyond The Dashboard With ClicData

November 6th, 2021


Business intelligence is often equated with a collection of dashboards that show various charts and graphs representing data for an …

Exploring The Evolution And Adoption of Customer Data Platforms and Reverse ETL

November 5th, 2021


The precursor to widespread adoption of cloud data warehouses was the creation of customer data platforms. Acting as a centralized repository of information about how your customers interact with your …

Removing The Barrier To Exploratory Analytics with Activity Schema and Narrator

October 29th, 2021


The perennial question of data warehousing is how to model the information that you are storing. This has given rise to methods as varied as …

Streaming Data Pipelines Made SQL With Decodable

October 29th, 2021


Streaming data systems have been growing more capable and flexible over the past few years. Despite this, it is still challenging to build reliable pipelines for stream processing. In this episode Eric Sammer …

Data Exploration For Business Users Powered By Analytics Engineering With Lightdash

October 23rd, 2021


The market for business intelligence has been going through an evolutionary shift in recent years. One of the driving forces for that change …

Completing The Feedback Loop Of Data Through Operational Analytics With Census

October 21st, 2021


The focus of the past few years has been to consolidate all of the organization’s data into a cloud data warehouse. As a result there have …

Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At Acryl Data

October 16th, 2021


The binding element of all data work is the metadata graph that is generated by all of the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to …

How And Why To Become Data Driven As A Business

October 14th, 2021


Organizations of all sizes are striving to become data driven, starting in earnest with the rise of big data a decade ago. With the never-ending growth in data sources and methods for aggregating and analyzing …

Make Your Business Metrics Reusable With Open Source Headless BI Using Metriql

October 8th, 2021


The key to making data valuable to business users is the ability to calculate meaningful metrics and explore them along useful dimensions. …

Adding Support For Distributed Transactions To The Redpanda Streaming Engine

October 6th, 2021


Transactions are a necessary feature for ensuring that a set of actions are all performed as a single unit of work. In streaming systems this is necessary to ensure that a set of messages or transformations are …

Building Real-Time Data Platforms For Large Volumes Of Information With Aerospike

October 2nd, 2021


Aerospike is a database engine that is designed to provide millisecond response times for queries across terabytes or petabytes. In this episode Chief Strategy Officer, Lenley Hensarling, explains how the …

Delivering Your Personal Data Cloud With Prifina

September 30th, 2021


The promise of online services is that they will make your life easier in exchange for collecting data about you. The reality is that they …

Digging Into Data Reliability Engineering

September 26th, 2021


The accuracy and availability of data has become critically important to the day-to-day operation of businesses. Similar to the practice of site reliability engineering as a means of ensuring consistent uptime …

Massively Parallel Data Processing In Python Without The Effort Using Bodo

September 25th, 2021


Python has become the de facto language for working with data. That has brought with it a number of challenges having to do with the speed …

Declarative Machine Learning Without The Operational Overhead Using Continual

September 19th, 2021


Building, scaling, and maintaining the operational components of a machine learning workflow are all hard problems. Add the work of creating the model itself, and it’s not surprising that a majority of …

An Exploration Of The Data Engineering Requirements For Bioinformatics

September 19th, 2021


Biology has been gaining a lot of attention in recent years, even before the pandemic. As an outgrowth of that popularity, a new field has …

Setting The Stage For The Next Chapter Of The Cassandra Database

September 12th, 2021


The Cassandra database is one of the first open source options for globally scalable storage systems. Since its introduction in 2008 it has been powering systems at every scale. The community recently released …

A View From The Round Table Of Gartner's Cool Vendors

September 9th, 2021


Gartner analysts are tasked with identifying promising companies each year that are making an impact in their respective categories. For businesses that are working in the data management and analytics space …

Designing And Building Data Platforms As A Product

September 4th, 2021


The term "data platform" gets thrown around a lot, but have you stopped to think about what it actually means for you and your organization? …

Presto Powered Cloud Data Lakes At Speed Made Easy With Ahana

September 2nd, 2021


The Presto project has become the de facto option for building scalable open source analytics in SQL for the data lake. In recent months the community has focused their efforts on making it the fastest possible …

Do Away With Data Integration Through A Dataware Architecture With Cinchy

August 28th, 2021


The reason that so much time and energy is spent on data integration is because of how our applications are designed. By making the software be the owner of the data that it generates, we have to go through the …

Decoupling Data Operations From Data Infrastructure Using Nexla

August 25th, 2021


The technological and social ecosystem of data engineering and data management has been reaching a stage of maturity recently. As part of …

Let Your Analysts Build A Data Lakehouse With Cuelake

August 21st, 2021


Data lakes have been gaining popularity alongside an increase in their sophistication and usability. Despite improvements in performance and …

Migrate And Modify Your Data Platform Confidently With Compilerworks

August 18th, 2021


A major concern that comes up when selecting a vendor or technology for storing and managing your data is vendor lock-in. What happens if …

Prepare Your Unstructured Data For Machine Learning And Computer Vision Without The Toil Using Activeloop

August 15th, 2021


The vast majority of data tools and platforms that you hear about are designed for working with structured, text-based data. What do you do …

Build Trust In Your Data By Understanding Where It Comes From And How It Is Used With Stemma

August 10th, 2021


All of the fancy data platform tools and shiny dashboards that you use are pointless if the consumers of your analysis don’t have trust in …

Data Discovery From Dashboards To Databases With Castor

August 7th, 2021


Every organization needs to be able to use data to answer questions about their business. The trouble is that the data is usually spread across a wide and shifting array of systems, from databases to …

Charting A Path For Streaming Data To Fill Your Data Lake With Hudi

August 3rd, 2021


Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. …

Adding Context And Comprehension To Your Analytics Through Data Discovery With SelectStar

July 31st, 2021


Companies of all sizes and industries are trying to use the data that they and their customers generate to survive and thrive in the modern …

Building a Multi-Tenant Managed Platform For Streaming Data With Pulsar at Datastax

July 28th, 2021


Everyone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology …

Bringing The Metrics Layer To The Masses With Transform

July 23rd, 2021


Collecting and cleaning data is only useful if someone can make sense of it afterward. The latest evolution in the data ecosystem is the …

Strategies For Proactive Data Quality Management

July 20th, 2021


Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions …

Low Code And High Quality Data Engineering For The Whole Organization With Prophecy

July 16th, 2021


There is a wealth of tools and systems available for processing data, but the user experience of integrating them and building workflows is …

Exploring The Design And Benefits Of The Modern Data Stack

July 13th, 2021


We have been building platforms and workflows to store, process, and analyze data since the earliest days of computing. Over that time there have been countless architectures, patterns, and "best practices" to …

Democratize Data Cleaning Across Your Organization With Trifacta

July 9th, 2021


Every data project, whether it’s analytics, machine learning, or AI, starts with the work of data cleaning. This is a critical step and benefits from being accessible to the domain experts. Trifacta is a …

Stick All Of Your Systems And Data Together With SaaSGlue As Your Workflow Manager

July 5th, 2021


At the core of every data pipeline is a workflow manager (or several). Deploying, managing, and scaling that orchestration can consume a large fraction of a data team’s energy so it is important to pick …

Leveling Up Open Source Data Integration With Meltano Hub And The Singer SDK

July 3rd, 2021


Data integration in the form of extract and load is the critical first step of every data project. There are a large number of commercial and open source projects that offer that capability but it is still far …

A Candid Exploration Of Timeseries Data Analysis With InfluxDB

June 29th, 2021


While the overall concept of timeseries data is uniform, its usage and applications are far from it. One of the most demanding applications of timeseries data is for application and server monitoring due to the …

Lessons Learned From The Pipeline Data Engineering Academy

June 26th, 2021


Data Engineering is a broad and constantly evolving topic, which makes it difficult to teach in a concise and effective manner. Despite …

Make Database Performance Optimization A Playful Experience With OtterTune

June 23rd, 2021


The database is the core of any system because it holds the data that drives your entire experience. We spend countless hours designing the …

Bring Order To The Chaos Of Your Unstructured Data Assets With Unstruk

June 18th, 2021


Working with unstructured data has typically been a motivation for a data lake. The challenge is imposing enough order on the platform to …

Accelerating ML Training And Delivery With In-Database Machine Learning

June 15th, 2021


When you build a machine learning model, the first step is always to load your data. Typically this means downloading files from object storage, or querying a database. To speed up the process, why not build …

Taking A Tour Of The Google Cloud Platform For Data And Analytics

June 12th, 2021


Google pioneered an impressive number of the architectural underpinnings of the broader big data ecosystem. Now they offer the technologies …

Make Sure Your Records Are Reliable With The BookKeeper Distributed Storage Layer

June 9th, 2021


The way to build maintainable software and systems is through composition of individual pieces. By making those pieces high quality and flexible they can be used in surprising ways that the original creators …

Build Your Analytics With A Collaborative And Expressive SQL IDE Using Querybook

June 3rd, 2021


SQL is the most widely used language for working with data, and yet the tools available for writing and collaborating on it are still clunky …

Making Data Pipelines Self-Serve For Everyone With Shipyard

June 2nd, 2021


Every part of the business relies on data, yet only a small team has the context and expertise to build and maintain workflows and data pipelines to transform, clean, and integrate it. In order for the true …

Paving The Road For Fast Analytics On Distributed Clouds With The Yellowbrick Data Warehouse

May 28th, 2021


The data warehouse has become the focal point of the modern data platform. With increased usage of data across businesses, and a diversity of locations and environments where data needs to be managed, the …

Easily Build Advanced Similarity Search With The Pinecone Vector Database

May 25th, 2021


Machine learning models use vectors as the natural mechanism for representing their internal state. The problem is that in order for the models to integrate with external systems their internal state has to be …

A Holistic Approach To Data Governance Through Self Reflection At Collibra

May 21st, 2021


Data governance is a phrase that means many different things to many different people. This is because it is actually a concept that encompasses the entire lifecycle of data, across all of the people in an …

Unlocking The Power of Data Lineage In Your Platform with OpenLineage

May 18th, 2021


Data lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic …

Building Your Data Warehouse On Top Of PostgreSQL

May 14th, 2021


There is a lot of attention on the database market and cloud data warehouses. While they provide a measure of convenience, they also require …

Making Analytical APIs Fast With Tinybird

May 11th, 2021


Building an API for real-time data is a challenging project. Making it robust, scalable, and fast is a full time job. The team at Tinybird …

Making Spark Cloud Native At Data Mechanics

May 7th, 2021


Spark is one of the most well-known frameworks for data processing, whether for batch or streaming, ETL or ML, and at any scale. Because of …

The Grand Vision And Present Reality of DataOps

May 4th, 2021


The data industry is changing rapidly, and one of the most active areas of growth is automation of data workflows. Taking cues from the DevOps movement of the past decade, data professionals are orienting around …

Self Service Data Exploration And Dashboarding With Superset

April 27th, 2021


The reason for collecting, cleaning, and organizing data is to make it usable by the organization. One of the most common and widely …

Moving Machine Learning Into The Data Pipeline at Cherre

April 20th, 2021


Most of the time when you think about a data pipeline or ETL job what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those …

Exploring The Expanding Landscape Of Data Professions with Josh Benamram of Databand

April 13th, 2021


"Business as usual" is changing, with more companies investing in data as a first class concern. As a result, the data team is growing and …

Put Your Whole Data Team On The Same Page With Atlan

April 6th, 2021


One of the biggest obstacles to success in delivering data products is cross-team collaboration. Part of the problem is the difference in …

Data Quality Management For The Whole Team With Soda Data

March 30th, 2021


Data quality is on the top of everyone’s mind recently, but getting it right is as challenging as ever. One of the contributing factors is …

Real World Change Data Capture At Datacoral

March 23rd, 2021


The world of business is becoming increasingly dependent on information that is accurate up to the minute. For analytical systems, the only way to provide this reliably is by implementing change data capture …

Managing The DoorDash Data Platform

March 16th, 2021


The team at DoorDash has a complex set of optimization challenges to deal with using data that they collect from a multi-sided marketplace. …

Leave Your Data Where It Is And Automate Feature Extraction With Molecula

March 9th, 2021


A majority of the time spent in data engineering is copying data between systems to make the information available for different purposes. This introduces challenges such as keeping information synchronized, …

Bridging The Gap Between Machine Learning And Operations At Iguazio

March 2nd, 2021


The process of building and deploying machine learning projects requires a staggering number of systems and stakeholders to work in concert. …

Self Service Open Source Data Integration With AirByte

February 23rd, 2021


Data integration is a critical piece of every data pipeline, yet it is still far from being a solved problem. There are a number of managed …

Building The Foundations For Data Driven Businesses at 5xData

February 16th, 2021


Every business aims to be data driven, but not all of them succeed in that effort. In order to be able to truly derive insights from the …

How Shopify Is Building Their Production Data Warehouse Using DBT

February 9th, 2021


With all of the tools and services available for building a data platform it can be difficult to separate the signal from the noise. One of …

System Observability For The Cloud Native Era With Chronosphere

February 2nd, 2021


Collecting and processing metrics for monitoring use cases is an interesting data problem. It is eminently possible to generate millions or …

Making It Easier To Stick B2B Data Integration Pipelines Together With Hotglue

January 26th, 2021


Businesses often need to be able to ingest data from their customers in order to power the services that they provide. For each new source that they need to integrate with it is another custom set of ETL tasks …

Using Your Data Warehouse As The Source Of Truth For Customer Data With Hightouch

January 19th, 2021


The data warehouse has become the central component of the modern data stack. Building on this pattern, the team at Hightouch have created a platform that synchronizes information about your customers out to …

Enabling Version Controlled Data Collaboration With TerminusDB

January 11th, 2021


As data professionals we have a number of tools available for storing, processing, and analyzing data. We also have tools for collaborating …

Bringing Feature Stores and MLOps to the Enterprise at Tecton

January 5th, 2021


As more organizations are gaining experience with data management and incorporating analytics into their decision making, their next move is to adopt machine learning. In order to make those efforts …

Off The Shelf Data Governance With Satori

December 28th, 2020


One of the core responsibilities of data engineers is to manage the security of the information that they process. The team at Satori has a background in cybersecurity and they are using the lessons that they …

Low Friction Data Governance With Immuta

December 21st, 2020


Data governance is a term that encompasses a wide range of responsibilities, both technical and process oriented. One of the more complex …

Building A Self Service Data Platform For Alternative Data Analytics At YipitData

December 15th, 2020


As a data engineer you’re familiar with the process of collecting data from databases, customer data platforms, APIs, etc. At YipitData …

Proven Patterns For Building Successful Data Teams

December 7th, 2020


Building data products is complicated by the fact that there are so many different stakeholders with competing goals and priorities. It is …

Streaming Data Integration Without The Code at Equalum

November 30th, 2020


The first stage of every good pipeline is to perform data integration. With the increasing pace of change and the need for up to date analytics the need to integrate that data in near real time is growing. With …

Keeping A Bigeye On The Data Quality Market

November 23rd, 2020


One of the oldest aphorisms about data is "garbage in, garbage out", which is why the current boom in data quality solutions is no surprise. …

Self Service Data Management From Ingest To Insights With Isima

November 17th, 2020


The core mission of data engineers is to provide the business with a way to ask and answer questions of their data. This often takes the form of business intelligence dashboards, machine learning models, or …

Building A Cost Effective Data Catalog With Tree Schema

November 10th, 2020


A data catalog is a critical piece of infrastructure for any organization who wants to build analytics products, whether internal or external. While there are a number of platforms available for building that …

Add Version Control To Your Data Lake With LakeFS

November 3rd, 2020


Data lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits there are some additional …

Cloud Native Data Security As Code With Cyral

October 26th, 2020


One of the most challenging aspects of building a data platform has nothing to do with pipelines and transformations. If you are putting your workflows into production, then you need to consider how you are …

Better Data Quality Through Observability With Monte Carlo

October 19th, 2020


In order for analytics and machine learning projects to be useful, they require a high degree of data quality. To ensure that your pipelines …

Rapid Delivery Of Business Intelligence Using Power BI

October 12th, 2020


Business intelligence efforts are only as useful as the outcomes that they inform. Power BI aims to reduce the time and effort required to …

Self Service Real Time Data Integration Without The Headaches With Meroxa

October 5th, 2020


Analytical workloads require a well engineered and well maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data …

Speed Up And Simplify Your Streaming Data Workloads With Red Panda

September 29th, 2020


Kafka has become a de facto standard interface for building decoupled systems and working with streaming data. Despite its widespread …

Cutting Through The Noise And Focusing On The Fundamentals Of Data Engineering With The Data Janitor

September 22nd, 2020


Data engineering is a constantly growing and evolving discipline. There are always new tools, systems, and design patterns to learn, which …

Distributed In Memory Processing And Streaming With Hazelcast

September 15th, 2020


In memory computing provides significant performance benefits, but brings along challenges for managing failures and scaling up. Hazelcast …

Simplify Your Data Architecture With The Presto Distributed SQL Engine

September 7th, 2020


Databases are limited in scope to the information that they directly contain. For analytical use cases you often want to combine data across …

Building A Better Data Warehouse For The Cloud At Firebolt

September 1st, 2020


Data warehouse technology has been around for decades and has gone through several generational shifts in that time. The current trends in …

Metadata Management And Integration At LinkedIn With DataHub

August 25th, 2020


In order to scale the use of data across an organization there are a number of challenges related to discovery, governance, and integration …

Exploring The TileDB Universal Data Engine

August 17th, 2020


Most databases are designed to work with textual data, with some special purpose engines that support domain specific formats. TileDB is a data engine that was built to support every type of data by using …

Closing The Loop On Event Data Collection With Iteratively

August 10th, 2020


Event based data is a rich source of information for analytics, unless none of the event structures are consistent. The team at Iteratively …

A Practical Introduction To Graph Data Applications

August 4th, 2020


Finding connections between data and the entities that they represent is a complex problem. Graph data models and the applications built on top of them are perfect for representing relationships and finding …

Build More Reliable Distributed Systems By Breaking Them With Jepsen

July 28th, 2020


A majority of the scalable data processing platforms that we rely on are built as distributed systems. This brings with it a vast number of subtle ways that errors can creep in. Kyle Kingsbury created the …

Making Wind Energy More Efficient With Data At Turbit Systems

July 21st, 2020


Wind energy is an important component of an ecologically friendly power system, but there are a number of variables that can affect the …

Open Source Production Grade Data Integration With Meltano

July 13th, 2020


The first stage of every data pipeline is extracting the information from source systems. There are a number of platforms for managing data integration, but there is a notable lack of a robust and easy to use …

DataOps For Streaming Systems With

July 6th, 2020


There are an increasing number of use cases for real time data, and the systems to power them are becoming more mature. Once you have a streaming platform up and running you need a way to keep an eye on it, …

Data Collection And Management To Power Sound Recognition At Audio Analytic

June 30th, 2020


We have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are …

Bringing Business Analytics To End Users With GoodData

June 23rd, 2020


The majority of analytics platforms are focused on use internal to an organization by business stakeholders. As the availability of data …

Accelerate Your Machine Learning With The StreamSQL Feature Store

June 15th, 2020


Machine learning is a process driven by iteration and experimentation which requires fast and easy access to relevant features of the data being processed. In order to reduce friction in the process of …

Data Management Trends From An Investor Perspective

June 8th, 2020


The landscape of data management and processing is rapidly changing and evolving. There are certain foundational elements that have remained …

Building A Data Lake For The Database Administrator At Upsolver

June 2nd, 2020


Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is …

Mapping The Customer Journey For B2B Companies At Dreamdata

May 25th, 2020


Gaining a complete view of the customer journey is especially difficult in B2B companies. This is due to the number of different individuals …

Power Up Your PostgreSQL Analytics With Swarm64

May 18th, 2020


The PostgreSQL database is massively popular due to its flexibility and extensive ecosystem of extensions, but it is still not the first …

StreamNative Brings Streaming Data To The Cloud Native Landscape With Pulsar

May 11th, 2020


There have been several generations of platforms for managing streaming data, each with their own strengths and weaknesses, and different …

Enterprise Data Operations And Orchestration At Infoworks

May 4th, 2020


Data management is hard at any scale, but working in the context of an enterprise organization adds even greater complexity. Infoworks is a …

Taming Complexity In Your Data Driven Organization With DataOps

April 28th, 2020


Data is a critical element to every role in an organization, which is also what makes managing it so challenging. With so many different …

Building Real Time Applications On Streaming Data With Eventador

April 20th, 2020


Modern applications frequently require access to real-time data, but building and maintaining the systems that make that possible is a complex and time consuming endeavor. Eventador is a managed platform …

Making Data Collection In Your Code Easy With Rookout

April 14th, 2020


The software applications that we build for our businesses are a rich source of data, but accessing and extracting that data is often a slow and error-prone process. Rookout has built a platform to separate the …

Building A Knowledge Graph Of Commercial Real Estate At Cherre

April 7th, 2020


Knowledge graphs are a data resource that can answer questions beyond the scope of traditional data analytics. By organizing and storing data to emphasize the relationship between entities, we can discover the …

The Life Of A Non-Profit Data Professional

March 30th, 2020


Building and maintaining a system that integrates and analyzes all of the data for your organization is a complex endeavor. Operating on a shoe-string budget makes it even more challenging. In this episode …

Behind The Scenes Of The Linode Object Storage Service

March 23rd, 2020


There are a number of platforms available for object storage, including self-managed open source projects. But what goes on behind the scenes of the companies that run these systems at scale so you don’t have …

Building A New Foundation For CouchDB

March 17th, 2020


CouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and an HTTP …

Scaling Data Governance For Global Businesses With A Data Hub Architecture

March 9th, 2020


Data governance is a complex endeavor, but scaling it to meet the needs of a complex or globally distributed organization requires a well considered and coherent strategy. In this episode Tim Ward describes an …

Easier Stream Processing On Kafka With ksqlDB

March 2nd, 2020


Building applications on top of unbounded event streams is a complex endeavor, requiring careful integration of multiple disparate systems …

Shining A Light on Shadow IT In Data And Analytics

February 25th, 2020


Misaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The …

Data Infrastructure Automation For Private SaaS At Snowplow

February 18th, 2020


One of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage …

Data Modeling That Evolves With Your Business Using Data Vault

February 9th, 2020


Designing the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data …

The Benefits And Challenges Of Building A Data Trust

February 3rd, 2020


Every business collects data in some fashion, but sometimes the true value of the collected information only comes when it is combined with …

Pay Down Technical Debt In Your Data Pipeline With Great Expectations

January 27th, 2020


Data pipelines are complicated and business critical pieces of technical infrastructure. Unfortunately they are also complex and difficult …

Replatforming Production Dataflows

January 20th, 2020


Building a reliable data platform is a neverending task. Even if you have a process that works for you and your business there can be unexpected events that require a change in your platform architecture. In …

Planet Scale SQL For The New Generation Of Applications With YugabyteDB

January 13th, 2020


The modern era of software development is identified by ubiquitous access to elastic infrastructure for computation and easy automation of deployment. This has led to a class of applications that can quickly …

Change Data Capture For All Of Your Databases With Debezium

January 6th, 2020


Databases are useful for inspecting the current state of your application, but inspecting the history of that data can get messy without a …

Building The DataDog Platform For Processing Timeseries Data At Massive Scale

December 30th, 2019


DataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to …

Building The Materialize Engine For Interactive Streaming Analytics In SQL

December 23rd, 2019


Transactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex …

Solving Data Lineage Tracking And Data Discovery At WeWork

December 16th, 2019


Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and …

SnowflakeDB: The Data Warehouse Built For The Cloud

December 9th, 2019


Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage …

Organizing And Empowering Data Engineers At Citadel

December 3rd, 2019


The financial industry has long been driven by data, requiring a mature and robust capacity for discovering and integrating valuable sources of information. Citadel is no exception, and in this episode Michael …

Building A Real Time Event Data Warehouse For Sentry

November 26th, 2019


The team at Sentry has built a platform for anyone in the world to send software errors and events. As they scaled the volume of customers …

Escaping Analysis Paralysis For Your Data Platform With Data Virtualization

November 18th, 2019


With the constant evolution of technology for data management it can seem impossible to make an informed decision about whether to build a …

Designing For Data Protection

November 11th, 2019


The practice of data management is one that requires technical acumen, but there are also many policy and regulatory issues that inform and …

Automating Your Production Dataflows On Spark

November 4th, 2019


As data engineers the health of our pipelines is our highest priority. Unfortunately, there are countless ways that our dataflows can break or degrade that have nothing to do with the business logic or data …

Build Maintainable And Testable Data Applications With Dagster

October 28th, 2019


Despite the fact that businesses have relied on useful and accurate data to succeed for decades now, the state of the art for obtaining and …

Data Orchestration For Hybrid Cloud Analytics

October 22nd, 2019


The scale and complexity of the systems that we build to satisfy business requirements is increasing as the available tools become more …

Keeping Your Data Warehouse In Order With DataForm

October 15th, 2019


Managing a data warehouse can be challenging, especially when trying to maintain a common set of patterns. Dataform is a platform that helps you apply engineering principles to your data transformations and …

Fast Analytics On Semi-Structured And Structured Data In The Cloud

October 8th, 2019


The process of exposing your data through a SQL interface has many possible pathways, each with their own complications and tradeoffs. One of the recent options is Rockset, a serverless platform for fast SQL …

Ship Faster With An Opinionated Data Pipeline Framework

October 1st, 2019


Building an end-to-end data pipeline for your machine learning projects is a complex task, made more difficult by the variety of ways that …

Open Source Object Storage For All Of Your Data

September 23rd, 2019


Object storage is quickly becoming the unifying layer for data intensive applications and analytics. Modern, cloud oriented data warehouses and data lakes both rely on the durability and ease of use that it …

Navigating Boundless Data Streams With The Swim Kernel

September 18th, 2019


The conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for …

Building A Reliable And Performant Router For Observability Data

September 10th, 2019


The first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data …

Building A Community For Data Professionals at Data Council

September 2nd, 2019


Data professionals are working in a domain that is rapidly evolving. In order to stay current we need access to deeply technical …

Building Tools And Platforms For Data Analytics

August 26th, 2019


Data engineers are responsible for building tools and platforms to power the workflows of other members of the business. Each group of users …

A High Performance Platform For The Full Big Data Lifecycle

August 19th, 2019


Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One …

Digging Into Data Replication At Fivetran

August 12th, 2019


The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most …

Solving Data Discovery At Lyft

August 5th, 2019


Data is only valuable if you use it for something, and the first step is knowing that it is available. As organizations grow and data …

Simplifying Data Integration Through Eventual Connectivity

July 29th, 2019


The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a …

Straining Your Data Lake Through A Data Mesh

July 22nd, 2019


The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the …

Data Labeling That You Can Feel Good About With CloudFactory

July 15th, 2019


Successful machine learning and artificial intelligence projects require large volumes of data that is properly labelled. The challenge is …

Scale Your Analytics On The Clickhouse Data Warehouse

July 8th, 2019


The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented …

Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection

July 2nd, 2019


Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems …

The Workflow Engine For Data Engineers And Data Scientists

June 25th, 2019


Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of …

Maintaining Your Data Lake At Scale With Spark

June 17th, 2019


Building and maintaining a data lake is a choose your own adventure of tools, services, and evolving best practices. The flexibility and …

Managing The Machine Learning Lifecycle

June 10th, 2019


Building a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are …

Evolving An ETL Pipeline For Better Productivity

June 4th, 2019


Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, …

Data Lineage For Your Pipelines

May 27th, 2019


Some problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform for …

Build Your Data Analytics Like An Engineer With DBT

May 20th, 2019


In recent years the traditional approach to building data warehouses has shifted from transforming records before loading, to transforming …

Using FoundationDB As The Bedrock For Your Distributed Systems

May 7th, 2019


The database market continues to expand, offering systems that are suited to virtually every use case. But what happens if you need something customized to your application? FoundationDB is a distributed …

Running Your Database On Kubernetes With KubeDB

April 29th, 2019


Kubernetes is a driving force in the renaissance around deploying and running applications. However, managing the database layer is still a separate concern. The KubeDB project was created as a way of providing …

Unpacking Fauna: A Global Scale Cloud Native Database

April 22nd, 2019


One of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a cloud native database built by the engineers behind Twitter’s …

Index Your Big Data With Pilosa For Faster Analytics

April 15th, 2019


Database indexes are critical to ensure fast lookups of your data, but they are inherently tied to the database engine. Pilosa is rewriting that equation by providing a flexible, scalable, performant engine for …

Serverless Data Pipelines On DataCoral

April 8th, 2019


How much time do you spend maintaining your data pipeline? How much end user value does that provide? Raghu Murthy founded DataCoral as a way to abstract the low level details of ETL so that you can focus on …

Why Analytics Projects Fail And What To Do About It

April 1st, 2019


Analytics projects fail all the time, resulting in lost opportunities and wasted resources. There are a number of factors that contribute to …

Building An Enterprise Data Fabric At CluedIn

March 25th, 2019


Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grows. Enterprise organizations feel this acutely due to the silos that occur …

A DataOps vs DevOps Cookoff In The Data Kitchen

March 18th, 2019


Delivering a data analytics project on time and with accurate information is critical to the success of any business. DataOps is a set of practices to increase the probability of success by creating value early …

Customer Analytics At Scale With Segment

March 4th, 2019


Customer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are …

Deep Learning For Data Engineers

February 25th, 2019


Deep learning is the latest class of technology that is gaining widespread interest. As data engineers we are responsible for building and …

Speed Up Your Analytics With The Alluxio Distributed Storage System

February 19th, 2019


Distributed storage systems are the foundational layer of any big data stack. There are a variety of implementations which support different specialized use cases and come with associated tradeoffs. Alluxio is …

Machine Learning In The Enterprise

February 11th, 2019


Machine learning is a class of technologies that promise to revolutionize business. Unfortunately, it can be difficult to identify and …

Cleaning And Curating Open Data For Archaeology

February 4th, 2019


Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, …

Managing Database Access Control For Teams With strongDM

January 29th, 2019


Controlling access to a database is a solved problem… right? It can be straightforward for small teams and a small number of storage …

Building Enterprise Big Data Systems At LEGO

January 21st, 2019


Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process …

TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65

January 14th, 2019


The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this …

Performing Fast Data Analytics Using Apache Kudu - Episode 64

January 7th, 2019


The Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast moving data. To fill …

Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63

December 31st, 2018


As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream …

Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62

December 24th, 2018


Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams …

Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61

December 17th, 2018


Every business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this …

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

December 10th, 2018


Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the …

Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59

December 3rd, 2018


Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on …

Set Up Your Own Data-as-a-Service Platform On Dremio with Tomer Shiran - Episode 58

November 26th, 2018


When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions …

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

November 19th, 2018


Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for …

How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56

November 11th, 2018


A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting …

Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55

November 5th, 2018


Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they …

Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54

October 29th, 2018


Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. …

Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53

October 22nd, 2018


As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical …

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

October 15th, 2018


With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal …

Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov - Episode 51

October 9th, 2018


One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed …

Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50

October 1st, 2018


There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in …

A Primer On Enterprise Data Curation with Todd Walter - Episode 49

September 24th, 2018


As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access …

Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48

September 17th, 2018


Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions …

Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47

September 10th, 2018


Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become …

An Agile Approach To Master Data Management with Mark Marinelli - Episode 46

September 3rd, 2018


With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more …

Protecting Your Data In Use At Enveil with Ellison Anne Williams - Episode 45

August 27th, 2018


There are myriad reasons why data should be protected, and just as many ways to enforce it in transit or at rest. Unfortunately, there is …

Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44

August 20th, 2018


The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a …

Putting Airflow Into Production With James Meickle - Episode 43

August 13th, 2018


The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning …

Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42

August 6th, 2018


One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus …

Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41

July 30th, 2018


With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data …

Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40

July 16th, 2018


When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that …

Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39

July 8th, 2018


Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The …

Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38

July 2nd, 2018


Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning …

Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37

June 25th, 2018


Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every …

User Analytics In Depth At Heap with Dan Robinson - Episode 36

June 17th, 2018


Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize …

CockroachDB In Depth with Peter Mattis - Episode 35

June 11th, 2018


With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to …

ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34

June 4th, 2018


Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a …

The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33

May 28th, 2018


Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new …

PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32

May 21st, 2018


Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across …

Brief Conversations From The Open Data Science Conference: Part 2 - Episode 31

May 14th, 2018


The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up …

Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30

May 7th, 2018


The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up …

Metabase Self Service Business Intelligence with Sameer Al-Sakran - Episode 29

April 30th, 2018


Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer …

Octopai: Metadata Management for Better Business Intelligence with Amnon Drori - Episode 28

April 23rd, 2018


The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in …

Data Engineering Weekly with Joe Crobak - Episode 27

April 15th, 2018


The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident as so many of us do. After being engrossed with …

Defining DataOps with Chris Bergh - Episode 26

April 8th, 2018


Managing an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be …

ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25

April 1st, 2018


Cloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. …

MarketStore: Managing Timeseries Financial Data with Hitoshi Harada and Christopher Ryan - Episode 24

March 25th, 2018


The data that is used in financial markets is time oriented and multidimensional, which makes it difficult to manage in either relational or …

Stretching The Elastic Stack with Philipp Krenn - Episode 23

March 19th, 2018


Search is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the …

Database Refactoring Patterns with Pramod Sadalage - Episode 22

March 12th, 2018


As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and …

The Future Data Economy with Roger Chen - Episode 21

March 5th, 2018


Data is an increasingly sought after raw material for business in the modern economy. One of the factors driving this trend is the increase in applications for machine learning and AI which require large …

Honeycomb Data Infrastructure with Sam Stokes - Episode 20

February 26th, 2018


One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, …

Data Teams with Will McGinnis - Episode 19

February 19th, 2018


The responsibilities of a data scientist and a data engineer often overlap and occasionally come to cross purposes. Despite these challenges …

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

February 11th, 2018


As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not …

Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17

February 4th, 2018


One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have …

Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16

January 29th, 2018


Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a …

Snorkel: Extracting Value From Dark Data with Alex Ratner - Episode 15

January 22nd, 2018


The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, so …

CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14

January 15th, 2018


As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources the requirement to …

Citus Data: Distributed PostgreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13

January 8th, 2018


PostgreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has …

Wallaroo with Sean T. Allen - Episode 12

December 25th, 2017


Data oriented applications that need to operate on large, fast-moving streams of information can be difficult to build and scale due to the …

SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11

December 18th, 2017


Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in …

Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10

December 10th, 2017


To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple …

with Bryon Jacob - Episode 9

December 3rd, 2017


We have tools and platforms for collaborating on software projects and linking them together; wouldn’t it be nice to have the same …

Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8

November 22nd, 2017


With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, …

Buzzfeed Data Infrastructure with Walter Menendez - Episode 7

November 14th, 2017


Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This lets them produce new content that will continue to be well-received. To …

Astronomer with Ry Walker - Episode 6

August 6th, 2017


Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform …

Rebuilding Yelp's Data Pipeline with Justin Cunningham - Episode 5

June 18th, 2017


Yelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their …

ScyllaDB with Eyal Gutkind - Episode 4

March 18th, 2017


If you like the features of Cassandra DB but wish it ran faster with fewer resources then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it …

Defining Data Engineering with Maxime Beauchemin - Episode 3

March 5th, 2017


What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this …

Dask with Matthew Rocklin - Episode 2

January 22nd, 2017


There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how …

Pachyderm with Daniel Whitenack - Episode 1

January 14th, 2017


Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for …

Introducing The Show - Episode 0

January 8th, 2017

  • Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
  • Go to to subscribe …