Cover art for podcast Data Engineering Podcast

Data Engineering Podcast

419 EpisodesProduced by Tobias MaceyWebsite

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

54:25

Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59

Summary

Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather then re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and EtcD, and what is in store for the future.

Preamble
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Zookeeper is and how the project got started?
    • What are the main motivations for using a centralized coordination service for distributed systems?
  • What are the distributed systems primitives that are built into Zookeeper?
    • What are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of Zookeeper?
    • What are some of the types of system level features that application developers will need which aren’t provided by Zookeeper?
  • Can you discuss how Zookeeper is architected and how that design has evolved over time?
    • What have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper?
  • What are the scaling factors for Zookeeper?
    • What are the edge cases that users should be aware of?
    • Where does it fall on the axes of the CAP theorem?
  • What are the main failure modes for Zookeeper?
    • How much of the recovery logic is left up to the end user of the Zookeeper cluster?
  • Since there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services?
  • In recent years we have seen projects such as EtcD which is used by Kubernetes, and Consul. How does Zookeeper compare with those projects?
    • What are some of the cases where Zookeeper is the wrong choice?
  • How have the needs of distributed systems engineers changed since you first began working on Zookeeper?
  • If you were to start the project over today, what would you do differently?
    • Would you still use Java?
  • What are some of the most interesting or unexpected ways that you have seen Zookeeper used?
  • What do you have planned for the future of Zookeeper?
Contact Info Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Educational emoji reaction

Educational

Interesting emoji reaction

Interesting

Funny emoji reaction

Funny

Agree emoji reaction

Agree

Love emoji reaction

Love

Wow emoji reaction

Wow

Listen to Data Engineering Podcast

RadioPublic

A free podcast app for iPhone and Android

  • User-created playlists and collections
  • Download episodes while on WiFi to listen without using mobile data
  • Stream podcast episodes without waiting for a download
  • Queue episodes to create a personal continuous playlist
RadioPublic on iOS and Android
Or by RSS
RSS feed
https://www.dataengineeringpodcast.com/rss

Connect with listeners

Podcasters use the RadioPublic listener relationship platform to build lasting connections with fans

Yes, let's begin connecting
Browser window

Find new listeners

  • A dedicated website for your podcast
  • Web embed players designed to convert visitors to listeners in the RadioPublic apps for iPhone and Android
Clicking mouse cursor

Understand your audience

  • Capture listener activity with affinity scores
  • Measure your promotional campaigns and integrate with Google and Facebook analytics
Graph of increasing value

Engage your fanbase

  • Deliver timely Calls To Action, including email acquistion for your mailing list
  • Share exactly the right moment in an episode via text, email, and social media
Icon of cellphone with money

Make money

  • Tip and transfer funds directly to podcastsers
  • Earn money for qualified plays in the RadioPublic apps with Paid Listens