As data professionals we have a number of tools available for storing, processing, and analyzing data. We also have tools for collaborating …
As more organizations are gaining experience with data management and incorporating analytics into their decision making, their next move is to adopt machine learning. In order to make those efforts …
One of the core responsibilities of data engineers is to manage the security of the information that they process. The team at Satori has a background in cybersecurity and they are using the lessons that they …
Data governance is a term that encompasses a wide range of responsibilities, both technical and process oriented. One of the more complex …
As a data engineer you’re familiar with the process of collecting data from databases, customer data platforms, APIs, etc. At YipitData …
Building data products is complicated by the fact that there are so many different stakeholders with competing goals and priorities. It is …
The first stage of every good pipeline is data integration. With the increasing pace of change and the demand for up-to-date analytics, the need to integrate that data in near real time is growing. With …
One of the oldest aphorisms about data is "garbage in, garbage out", which is why the current boom in data quality solutions is no surprise. …
The core mission of data engineers is to provide the business with a way to ask and answer questions of their data. This often takes the form of business intelligence dashboards, machine learning models, or …
A data catalog is a critical piece of infrastructure for any organization that wants to build analytics products, whether internal or external. While there are a number of platforms available for building that …
Data lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits there are some additional …
One of the most challenging aspects of building a data platform has nothing to do with pipelines and transformations. If you are putting your workflows into production, then you need to consider how you are …
In order for analytics and machine learning projects to be useful, they require a high degree of data quality. To ensure that your pipelines …
Business intelligence efforts are only as useful as the outcomes that they inform. Power BI aims to reduce the time and effort required to …
Analytical workloads require a well engineered and well maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data …
Kafka has become a de facto standard interface for building decoupled systems and working with streaming data. Despite its widespread …
Data engineering is a constantly growing and evolving discipline. There are always new tools, systems, and design patterns to learn, which …
In-memory computing provides significant performance benefits, but it brings challenges for managing failures and scaling up. Hazelcast …
Databases are limited in scope to the information that they directly contain. For analytical use cases you often want to combine data across …
Data warehouse technology has been around for decades and has gone through several generational shifts in that time. The current trends in …
In order to scale the use of data across an organization there are a number of challenges related to discovery, governance, and integration …
Most databases are designed to work with textual data, with some special purpose engines that support domain specific formats. TileDB is a data engine that was built to support every type of data by using …
Event-based data is a rich source of information for analytics, but only if the event structures are consistent. The team at Iteratively …
Finding connections between data and the entities that they represent is a complex problem. Graph data models and the applications built on top of them are perfect for representing relationships and finding …
A majority of the scalable data processing platforms that we rely on are built as distributed systems. This brings with it a vast number of subtle ways that errors can creep in. Kyle Kingsbury created the …
Wind energy is an important component of an ecologically friendly power system, but there are a number of variables that can affect the …
The first stage of every data pipeline is extracting the information from source systems. There are a number of platforms for managing data integration, but there is a notable lack of a robust and easy to use …
There are an increasing number of use cases for real time data, and the systems to power them are becoming more mature. Once you have a streaming platform up and running you need a way to keep an eye on it, …
We have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are …
The majority of analytics platforms are focused on internal use by business stakeholders within an organization. As the availability of data …
Machine learning is a process driven by iteration and experimentation which requires fast and easy access to relevant features of the data being processed. In order to reduce friction in the process of …
The landscape of data management and processing is rapidly changing and evolving. There are certain foundational elements that have remained …
Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is …
Gaining a complete view of the customer journey is especially difficult in B2B companies. This is due to the number of different individuals …
The PostgreSQL database is massively popular due to its flexibility and extensive ecosystem of extensions, but it is still not the first …
There have been several generations of platforms for managing streaming data, each with their own strengths and weaknesses, and different …
Data management is hard at any scale, but working in the context of an enterprise organization adds even greater complexity. Infoworks is a …
Data is a critical element to every role in an organization, which is also what makes managing it so challenging. With so many different …
Modern applications frequently require access to real-time data, but building and maintaining the systems that make that possible is a complex and time consuming endeavor. Eventador is a managed platform …
The software applications that we build for our businesses are a rich source of data, but accessing and extracting that data is often a slow and error-prone process. Rookout has built a platform to separate the …
Knowledge graphs are a data resource that can answer questions beyond the scope of traditional data analytics. By organizing and storing data to emphasize the relationship between entities, we can discover the …
Building and maintaining a system that integrates and analyzes all of the data for your organization is a complex endeavor. Operating on a shoe-string budget makes it even more challenging. In this episode …
There are a number of platforms available for object storage, including self-managed open source projects. But what goes on behind the scenes of the companies that run these systems at scale so you don’t have …
CouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and an HTTP …
Data governance is a complex endeavor, but scaling it to meet the needs of a complex or globally distributed organization requires a well considered and coherent strategy. In this episode Tim Ward describes an …
Building applications on top of unbounded event streams is a complex endeavor, requiring careful integration of multiple disparate systems …
Misaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The …
One of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage …
Designing the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data …
Every business collects data in some fashion, but sometimes the true value of the collected information only comes when it is combined with …
Data pipelines are complicated and business critical pieces of technical infrastructure. Unfortunately they are also complex and difficult …
Building a reliable data platform is a never-ending task. Even if you have a process that works for you and your business there can be unexpected events that require a change in your platform architecture. In …
The modern era of software development is characterized by ubiquitous access to elastic infrastructure for computation and easy automation of deployment. This has led to a class of applications that can quickly …
Databases are useful for inspecting the current state of your application, but inspecting the history of that data can get messy without a …
DataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to …
Transactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex …
Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and …
Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage …
The financial industry has long been driven by data, requiring a mature and robust capacity for discovering and integrating valuable sources of information. Citadel is no exception, and in this episode Michael …
The team at Sentry has built a platform for anyone in the world to send software errors and events. As they scaled the volume of customers …
With the constant evolution of technology for data management it can seem impossible to make an informed decision about whether to build a …
The practice of data management is one that requires technical acumen, but there are also many policy and regulatory issues that inform and …
As data engineers the health of our pipelines is our highest priority. Unfortunately, there are countless ways that our dataflows can break or degrade that have nothing to do with the business logic or data …
Despite the fact that businesses have relied on useful and accurate data to succeed for decades now, the state of the art for obtaining and …
The scale and complexity of the systems that we build to satisfy business requirements is increasing as the available tools become more …
Managing a data warehouse can be challenging, especially when trying to maintain a common set of patterns. Dataform is a platform that helps you apply engineering principles to your data transformations and …
The process of exposing your data through a SQL interface has many possible pathways, each with their own complications and tradeoffs. One of the recent options is Rockset, a serverless platform for fast SQL …
Building an end-to-end data pipeline for your machine learning projects is a complex task, made more difficult by the variety of ways that …
Object storage is quickly becoming the unifying layer for data intensive applications and analytics. Modern, cloud oriented data warehouses and data lakes both rely on the durability and ease of use that it …
The conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for …
The first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data …
Data professionals are working in a domain that is rapidly evolving. In order to stay current we need access to deeply technical …
Data engineers are responsible for building tools and platforms to power the workflows of other members of the business. Each group of users …
Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One …
The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most …
Data is only valuable if you use it for something, and the first step is knowing that it is available. As organizations grow and data …
The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a …
The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the …
Successful machine learning and artificial intelligence projects require large volumes of data that is properly labelled. The challenge is …
The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented …
Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems …
Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of …
Building and maintaining a data lake is a choose your own adventure of tools, services, and evolving best practices. The flexibility and …
Building a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are …
Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, …
Some problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform for …
In recent years the traditional approach to building data warehouses has shifted from transforming records before loading, to transforming …
The database market continues to expand, offering systems that are suited to virtually every use case. But what happens if you need something customized to your application? FoundationDB is a distributed …
Kubernetes is a driving force in the renaissance around deploying and running applications. However, managing the database layer is still a separate concern. The KubeDB project was created as a way of providing …
One of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a cloud native database built by the engineers behind Twitter’s …
Database indexes are critical to ensure fast lookups of your data, but they are inherently tied to the database engine. Pilosa is rewriting that equation by providing a flexible, scalable, performant engine for …
How much time do you spend maintaining your data pipeline? How much end user value does that provide? Raghu Murthy founded DataCoral as a way to abstract the low level details of ETL so that you can focus on …
Analytics projects fail all the time, resulting in lost opportunities and wasted resources. There are a number of factors that contribute to …
Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grow. Enterprise organizations feel this acutely due to the silos that occur …
Delivering a data analytics project on time and with accurate information is critical to the success of any business. DataOps is a set of practices to increase the probability of success by creating value early …
Customer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are …
Deep learning is the latest class of technology that is gaining widespread interest. As data engineers we are responsible for building and …
Distributed storage systems are the foundational layer of any big data stack. There are a variety of implementations which support different specialized use cases and come with associated tradeoffs. Alluxio is …
Machine learning is a class of technologies that promise to revolutionize business. Unfortunately, it can be difficult to identify and …
Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, …
Controlling access to a database is a solved problem… right? It can be straightforward for small teams and a small number of storage …
Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process …
The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this …
The Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast moving data. To fill …
As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream …
Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams …
Every business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this …
Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the …
Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on …
When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions …
Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for …
A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting …
Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they …
Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. …
As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical …
With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal …
One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed …
There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in …
As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access …
Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions …
Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become …
With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more …
There are myriad reasons why data should be protected, and just as many ways to enforce it in transit or at rest. Unfortunately, there is …
The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a …
The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning …
One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus …
With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data …
When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that …
Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The …
Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning …
Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every …
Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize …
With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to …
Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a …
Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new …
Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across …
The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up …
Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer …
The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in …
The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident as so many of us do. After being engrossed with …
Managing an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be …
Cloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. …
The data that is used in financial markets is time oriented and multidimensional, which makes it difficult to manage in either relational or …
Search is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the …
As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and …
Data is an increasingly sought after raw material for business in the modern economy. One of the factors driving this trend is the increase in applications for machine learning and AI which require large …
One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, …
The responsibilities of a data scientist and a data engineer often overlap and occasionally come to cross purposes. Despite these challenges …
As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not …
One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have …
Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a …
The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, so …
As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources the requirement to …
PostgreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has …
Data oriented applications that need to operate on large, fast-moving streams of information can be difficult to build and scale due to the …
Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in …
To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple …
We have tools and platforms for collaborating on software projects and linking them together; wouldn’t it be nice to have the same …
With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, …
Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This lets them produce new content that will continue to be well-received. To …
Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform …
Yelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their …
If you like the features of Cassandra DB but wish it ran faster with fewer resources then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it …
What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this …
There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how …
Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for …