February 18, 2025

AI and Data Movement: Trends and Best Practices with Estuary’s Daniel Pálma

In this episode of The Data Engineering Show, the bros sit down with Daniel Pálma, Head of Marketing at Estuary, to delve into the intersection of data engineering and marketing. Daniel shares his journey from data engineering into marketing and how his technical background is leveraged to market to engineers. The conversation covers the role of AI in data movement, the future of data engineering, real-time data integration challenges, and the evolution of data integration.

Listen on Spotify or Apple Podcasts


Intro/Outro - 00:00:04:

The Data Engineering Show is brought to you by Firebolt, the cloud data warehouse for low-latency analytics. Get $200 credits and start your free trial at firebolt.io.

Benjamin - 00:00:15:

Hi, everyone, and welcome back to The Data Engineering Show. Today, we are super happy to have Dani join us from Estuary. Great to have you on the show, Dani.

Daniel - 00:00:23:

Yeah. Hi, everyone. Thanks for having me. I'm happy to be here.

Benjamin - 00:00:25:

Perfect. So your job title at Estuary says data engineer and marketing. Tell us about how these two things connect and what you spend your days doing.

Daniel - 00:00:35:

Yeah, sure. So I actually totally forgot to update my title on LinkedIn. It should say head of marketing as of three months ago.

Benjamin - 00:00:42:

Nice. Congratulations.

Daniel - 00:00:45:

Thank you, thank you. Initially, when I joined almost a year ago, it was data engineering and marketing. It started as a content-creator-focused role that needed a lot of data engineering expertise. So I come from the data engineering world: I worked as a data engineer for pretty much a decade before going into marketing, across startups, enterprises, and most recently about five years in consulting. So my job was to know the target customer, the target audience, which for Estuary is data engineers, know their problems, know what they look for, and provide them with useful content that ideally helps them onboard to Estuary. But if not, it still solves at least part of their problem. And even as head of marketing, that's still my main goal: to help Estuary users and prospects with all kinds of data engineering issues.

Benjamin - 00:01:41:

Awesome. That's super cool. For the listeners who have never heard of Estuary, do you want to say a few sentences about what you guys actually do?

Daniel - 00:01:48:

Yeah, for sure. So Estuary is a data integration platform specializing in real-time data movement. If you work in the data world, you're probably familiar with Fivetran or Confluent or Hevo; there are a lot of similar vendors in this space. Estuary is an all-in-one platform, unified in the sense that we do both streaming real-time data integration and batch data integration for analytics use cases. So the shortest pitch is: if you have any kind of data movement challenge, you want to move data from point A to point B on your own cadence and your own budget, then Estuary can do it for you.

Benjamin - 00:02:30:

Nice. That's awesome. Cool. So, moving from data engineering to marketing, tell us about that shift. What's difficult? What's harder than you expected? What's easier than you expected? Take us through that journey, basically.

Daniel - 00:02:45:

Yeah, it's been an interesting journey for sure. Even before going full-time into marketing, I did a lot of content creation. I was always writing blog posts, for my own blog or for other companies' blogs: mainly technical how-tos, tutorials, and my thoughts about the data industry in general. So that was my introduction to marketing, and I started liking it more and more. At the same time, I was getting a little bit tired of the actual data engineering, the individual contributor work. So when I saw that this role was open, I thought it was a perfect combination: data engineering skills that I could use, and creating useful content for fellow data engineers. So I decided to give it a go, and it turns out that I really like doing it, so I decided it would be a good idea to do even more marketing. I've been learning a lot, obviously; marketing as an industry or as a vertical was fairly new to me, so there's been a lot to catch up on. But so far, it's been an amazing experience.

Benjamin - 00:03:56:

How much do you think your data engineering background actually helps you now as a marketer? Because obviously, these are very technical products, with a lot of moving pieces, a lot of complexity. Tell us a bit about that.

Daniel - 00:04:09:

Yeah, a ton, honestly. I couldn't imagine doing this without my data engineering background. I think the biggest challenge in marketing to data engineers or data practitioners is the same challenge marketers face when marketing to software engineers, or anybody in tech, really: people in tech hate being marketed to. And I know that because I used to work as an engineer, and I hated being directly marketed to. It's super easy to see through lame marketing attempts. That's where the biggest challenge is: to market your product in a way that's not in-your-face, not running an ad that says this is the best service you can ever imagine and it solves all your problems, but to actually show how it solves their issue and to provide useful content. It's a very fine line between being obnoxious and being useful. If you find that sweet spot, that's what I call successful marketing to engineers.

Eldad - 00:05:07:

Maybe another way of putting it is looking at marketing versus product marketing, right? You can drill into product marketing and say, okay, let's start product marketing out of marketing. Or you can say, let's start marketing out of product. If you're into data engineering and you've been educating mostly, right? If you've been involved not just in solving problems in real time or building and delivering projects, but actually educating about data engineering, then being able to market that education turns it into marketing. And it's a good thing. It comes bottom-up, from data engineering up to marketing rather than the other way around. Therefore, it's authentic, and it's all about solving actual problems at scale. It's super interesting to see more and more people get into marketing by pulling it into their domain: you're a data engineer, you pull in marketing, and that becomes something else, something bigger. Tell us how you apply that in your daily work. How do you take that approach to your customers, who are always having challenges moving data? How does it work?

Daniel - 00:06:13:

Yeah, so at Estuary, we are focused on real-time data integration. That's where a lot of the technical moat of the platform is. Historically, real-time streaming data movement has been very complex, hard to implement, and expensive as well. That's where the education aspect comes in. A lot of data engineers don't know much about real-time use cases or how to implement real-time or streaming data pipelines, because they come from the batch analytics world, where you have a daily or weekly report and you need the data to be ingested every day, and that's good enough. But as you go into use cases where you need to react to something in as close to real time as possible, like fraud detection or calculating customer-facing metrics, there's obviously no way to wait until the next day to compute those. So we have to do a lot of education on how these real-time pipelines work and how they differ from traditional batch pipelines, and a lot of our content is centered around that. We teach generic pipeline building on our blog and in our videos. But we also have to do a lot of education about the platform itself, because it introduces new terminology and new concepts. It's not built on the most popular real-time frameworks like Kafka that most people in the space are familiar with; it's built on a completely new streaming backbone. That introduces extra complexity. So yeah, it's hard, and it's a challenge we face pretty much every day in our content: how much to focus on education versus how much to focus on the solution we're trying to provide. It's something we definitely want to get better at, and it will probably never stop.
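
To make the batch-versus-streaming contrast Daniel describes concrete, here is a minimal sketch in Python. The function names (fetch_rows_since, stream_changes, and so on) are hypothetical stand-ins, not Estuary APIs:

```python
import time

# A minimal, hypothetical sketch of the two pipeline modes. All callables
# passed in are illustrative stand-ins, not any vendor's API.

def batch_pipeline(fetch_rows_since, load_to_warehouse, interval_s=86400):
    """Classic batch: wake up on a cadence, pull everything new, bulk load."""
    last_run = 0.0
    while True:
        rows = fetch_rows_since(last_run)  # e.g. WHERE updated_at > :last_run
        load_to_warehouse(rows)            # bulk insert into the warehouse
        last_run = time.time()
        time.sleep(interval_s)             # daily cadence: fine for reports

def streaming_pipeline(stream_changes, apply_event):
    """Streaming/CDC: react to each change as it happens, within seconds,
    for use cases like fraud detection or customer-facing metrics."""
    for event in stream_changes():         # e.g. a database change feed
        apply_event(event)                 # update downstream state per event
```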

Benjamin - 00:08:04:

So if you actually pitch Estuary to someone, what are you better at as a tool than your competitors? What's your unique value proposition, basically?

Daniel - 00:08:14:

Yeah. So there are multiple things, and it really depends on the use case because, as I said, we do a lot of things. We compete with big platforms that specialize in one use case, like real-time or batch data integration. So depending on what the prospect is looking for, that's how I usually tailor the value proposition. But in a sense, the biggest difference is that our platform is built for the cloud, as opposed to other platforms that utilize a framework built for on-premise servers and databases a few decades ago. Because we are built for the cloud, it's pretty much infinitely scalable and way cheaper than the alternatives, because we use object storage as our primary backend, and we can move way faster than organizations that use Kafka, for example, as their backend. So the shortest pitch, I think, is that it's faster, cheaper, and more efficient.

Benjamin - 00:09:15:

Gotcha. And how hard is it? At Firebolt, we build a data warehouse, so we think a lot about people moving from other data warehouses to Firebolt. And migrating from one data warehouse to another is usually a big undertaking for organizations, right? Different SQL dialects, different ecosystem integrations, and so on. So we spend a lot of our time thinking about how to make that as easy as possible. When you build a tool like Estuary, how hard is it to make it easy to move, let's say, from Fivetran to Estuary?

Daniel - 00:09:46:

So moving from a competitor like Fivetran to Estuary, we're trying to make it as easy as possible. For the actual data pipelines, you can spin up an Estuary pipeline from, I don't know, a Postgres database to Snowflake in five minutes. And we make sure to get all of the data that is in your source and put it in your destination without anything missing and without any duplicates. So in that sense, switching is super easy, because there's nothing that Fivetran does that we can't do. The complexity comes in auxiliary things. Fivetran has, for example, a lot of very cool dbt packages that do transformations in the destination. That's something we currently don't have, but we get a lot of requests for it from our customers. As for the platform itself, yeah, it's super easy, and we do a lot of these migrations from all kinds of competitors' customers who are unhappy with price or latency or anything else.
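
For illustration, here is a schematic sketch in Python of what such a Postgres-to-Snowflake pipeline has to declare. This is not Estuary's actual configuration format; every field and value here is hypothetical:

```python
# Hypothetical pipeline spec, NOT Estuary's real config format; it just
# illustrates the pieces a migration-friendly pipeline has to declare.
pipeline = {
    "source": {
        "type": "postgres-cdc",             # change data capture from the WAL
        "host": "db.example.internal",      # hypothetical connection details
        "database": "app",
        "tables": ["public.orders", "public.customers"],
    },
    "destination": {
        "type": "snowflake",
        "account": "example_account",
        "schema": "RAW",
    },
    "guarantees": {
        "delivery": "exactly-once",         # nothing missing, no duplicates
        "mode": "realtime",                 # or "batch" on a chosen cadence
    },
}
```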

Benjamin - 00:10:44:

Gotcha. And then the longer-term goal is basically taking on more and more of the whole ELT pipeline, so that you don't just do the data movement into the sink, but can also orchestrate data transformations there. Is that right? When you talked about the dbt packages, I'm not sure I understood that correctly.

Daniel - 00:11:02:

Oh, yeah. So those dbt packages specifically are open-sourced by Fivetran and available on GitHub, and they're meant to be executed in the destination. But they're tailored to the schema that Fivetran manages from its own connectors, and a lot of Fivetran customers use them as the first layer in their dbt project. So everything downstream of those expects that structure. Obviously, when we do a migration from Fivetran to Estuary, our schema is a little bit different, which is a good thing in some cases, because it lets us be a fraction of the cost: we don't normalize the schema into many, many tables like Fivetran does. But it's one of the drawbacks that we have to reimplement some of those transformations. The end goal, as you said, is to do everything in data movement. If you have to move data from a source to a data warehouse, we can do that. We can also trigger transformations, and we can move data from a data warehouse back to SaaS tools or a transactional database, reverse-ETL style.

Eldad - 00:12:05:

I have a question on AI, actually. If you think about databases from an interface perspective, it's very easy, right? You write in English, that generates SQL queries, and everyone is happy. But when you go to data movement, the story becomes more complicated, because there's so much more context needed: you're actually moving between different systems. A data warehouse has the data, the schema, the SQL; it has everything baked in. But data movement is much more complicated from a migration or integration perspective. A big part of the challenge with ETL or ELT tools was always actually understanding context, migrating context and metadata and knobs and configurations. You need to understand the source and the destination so well, right? Your data transfer can be amazing, but if you don't understand the source or destination, you're in trouble. So having the right tool that knows how to properly connect in and out becomes very relevant. My question is: how is AI affecting the user experience? Thinking about ELT tools going forward, I would want to say, listen, I have this source, I have this destination. Sometimes I also have a different product, Fivetran or any other one of the great products out there. Can you use those things to generate my pipeline, clone that setup, and just run it? Fill in all the workflows, do all the steps, and get me started very easily. That would democratize or simplify data movement and enable so many new use cases. It's not just a person anymore owning data movement by using a tool; it opens up many new use cases. I'm curious to get your thoughts. As everyone is rethinking how AI can simplify things and make them more productive, what's your take on data movement and AI?

Daniel - 00:14:15:

Yeah, I think it's a big topic, both data movement itself and how AI can help there. We are currently working on opening up the platform for AI-driven use cases, and there's a lot we have to change for that. But the end goal, or one of the goals, is that anyone should be able to spin up a data pipeline by typing, or saying, a few words in English instead of having to fill out a bunch of configurations. The hard part is that data movement is just super complex. There are so many edge cases, and even if we cover all of those, the systems we move data from and to are constantly innovating and changing as well. That always introduces new edge cases and new features we have to keep up with, so the work on connector functionality will never stop. Enabling all of that from a ChatGPT-like interface is possible, I think, but it's probably a bit further away. It's just a super hard problem. As for using AI to actually build data integration pipelines, it can work up to a certain complexity. But after a while, the source and destination systems become so complex that I haven't yet seen an AI-generated data movement script or application cover all of the challenges.

Benjamin - 00:15:41:

For customers now shifting towards AI applications, are you seeing shifts in the type of data people are moving around? Or is it basically still the same data, now powering AI applications? And is it then more about how you can give plain-text, human interfaces for people to spin up data pipelines easily?

Daniel - 00:16:07:

Yeah, so most of the AI-based use cases we see are people capturing data from a source and moving it into a vector-based representation. It used to be a dedicated vector store, but nowadays people use their usual data warehouses, because everybody started supporting vector data types, so dedicated vector stores are becoming less of a thing lately. As for the sources, I think the most common source is still an operational database that people just want to get their data out of, meaning a backend database for an actual web application, usually Postgres or Mongo or MySQL, those kinds of guys. In addition to those, we see more and more use cases of people trying to enable AI for their Salesforce data, their HubSpot data, their NetSuite data. More and more enterprises realize they have these super strong services that store a lot of data, but it's locked inside, and they're trying to get more value out of it. It's now so easy to spin up a chat application on top of these, but a prerequisite is that you have to get your data out, create those vector embeddings, and store them in a database that can actually plug into a RAG chat application.
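
As a minimal sketch of that capture-embed-load pattern, assuming sentence-transformers as the embedding model and hypothetical records and column names, the middle step might look like this:

```python
# A sketch of the capture -> embed -> load pattern Daniel describes.
# sentence-transformers is just one embedding option; the records and the
# warehouse-side vector column are hypothetical.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors

def embed_records(records):
    """records: iterable of (id, text) pairs pulled from e.g. Postgres,
    Salesforce, or HubSpot. Returns rows ready for a vector column."""
    ids, texts = zip(*records)
    vectors = model.encode(list(texts))          # numpy array, shape (n, 384)
    return [
        {"id": i, "text": t, "embedding": v.tolist()}
        for i, t, v in zip(ids, texts, vectors)
    ]

# These rows can then be inserted into a warehouse vector column, where a
# RAG chat application can run similarity search over them.
rows = embed_records([(1, "Acme Corp renewal closed"), (2, "Support ticket 42")])
```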

Benjamin - 00:17:22:

Right. As a data platform, we're obviously also thinking a lot about that, because we want to power these types of RAG AI applications. And one thing we've been seeing more and more is people trying to figure out how to make these AI applications work well on structured data. Vector embeddings and vector search were the original RAG pipelines. But the kind of data you mentioned, Salesforce data and so on, is inherently structured, and building powerful AI applications on top of that is something we're also working to figure out with our prospects and customers. I think it will be super interesting to see how that evolves over 2025, as people really figure out how to build amazing AI data applications on top of not just unstructured vector databases but also structured data.

Daniel - 00:18:13:

Yeah, for sure. Does Firebolt support some kind of vector embedding generation internally? Or is the use case more about non-vectorized data serving as the backend for a chat or AI application?

Benjamin - 00:18:26:

We support vector search and all of that, right? As you said, all cloud data warehouses are now pushing to support these more unstructured use cases as well. But what we're actually thinking about the most, and seeing the most with customers and prospects, is how we can make this work on structured data. In those cases, you don't want to take your 12-byte text and turn it into a 20-kilobyte vector just to do nearest-neighbor search, right? People still want to build efficient data applications on top of their structured data, and at least currently we don't think vector search will be what powers these types of applications in the future. One thing we're seeing, which I think is quite interesting, is a lot of text search use cases, because all of a sudden there are a lot of text-based human interfaces to data. And this is something that we, for example, have been focusing on.
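
Back-of-the-envelope, the storage blow-up Benjamin describes is easy to check; the embedding width below is hypothetical, chosen so a float32 vector lands at roughly 20 KB:

```python
text = "ACME-12345-B"                 # a 12-byte structured value
dims = 5120                           # hypothetical embedding width
vector_bytes = dims * 4               # float32 = 4 bytes per dimension
print(len(text.encode()), "bytes ->", vector_bytes / 1024, "KB")
# 12 bytes -> 20.0 KB: roughly 1700x larger, just for nearest-neighbor search
```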

Daniel - 00:19:17:

Yeah, I think that's super interesting. It's weird that the tech scene somehow collectively decided that vector search, similarity search, is the de facto way to build chat applications. In fact, you just have to somehow stuff things into a prompt in the end anyway, so any kind of search works. I think that's definitely the way to go: expand the horizon and look into which search solution is actually the best for those applications.

Benjamin - 00:19:42:

If you look at database history, this is a very common theme, right? A new type of workload emerges, systems emerge that try to redefine how to build or power these applications, but in the end, and this is also what we believe now, relational SQL databases just absorb it.

Eldad - 00:20:01:

We had a guest a few episodes ago who had a very nice angle on this. He said, nobody cares about writing code; everyone cares about being able to reason about it. So if AI generates the query, we want to reason about it. We want to be able to audit it, to understand whether it asks the right thing. So SQL is suddenly getting that AI love, not because people like SQL. People hate writing SQL, right? They use BI.

Benjamin - 00:20:28:

I love writing SQL, Daniel. Don't just say that.

Eldad - 00:20:32:

Of course, Benjamin, you love writing physical plans, but yes, sometimes with SQL on top. But you know, Benji, people have been using all sorts of tools just to avoid writing SQL, and for all the good reasons. Now that you have new kinds of tools generating SQL, it's just a great way to reason. We see many startups, as Benji mentioned, that apply SQL to all sorts of data, not necessarily through a database: they try to abstract different kinds of data through SQL without a database in the middle, and the database becomes a component. So from an AI perspective, if you're an AI engineer, you're not necessarily thinking about the databases you're used to; you're thinking about databases as just access to great data, and the AI takes care of a lot of stuff. Again, this is very forward-looking, and that episode was a few months back. We love catching up with our guests to see where those theories got us and what really happened. It's always funny to see how bad we are at predicting the future, yet we're always following the right path. So I love it. I love to see how data movement and AI come together. By the way, I don't think it's one prompt saying move from here to there. I think it needs to be a conversation where you're actually trying to tweak things. You said it: there's no way for something to just do it automagically. There's iteration, understanding, figuring out the data and the quality and the metadata. Everything needs to fit. From a user perspective, instead of going through a workflow, a Visio kind of experience, learning how to do pop-ups and everything, taking that same thing and applying it through a conversation means the user just becomes much smarter, right? The same user now becomes a genius. They can use the right knobs, tune it, exploit the system. If you are better at using specific features of a specific driver, and you've spent hours, days, months perfecting it to become better than everyone else, then your biggest challenge is exposing that to users, because the user would need to drill in so deep to actually find that advantage that they never will. What we're seeing is that with AI, users become geniuses. They're now able to figure out the knobs, figure out the weights. And again, it's untouched territory, no one has figured it out, and that's what makes it interesting. But definitely my biggest takeaway is that it's not replacing users, it's making them super smart. If you serve it in the right way, and you have the right product with those knobs, then suddenly you see amazing results. And I think data movement, just like any part of the data funnel, needs efficiency, optimization, specifics. So I hope to see that gap bridged: being specific about solving problems, with AI helping you get there. Of course, we're all generalists, right? We don't have the time to drill into all the details.

Daniel - 00:23:52:

Yeah, I think that's a very eloquent way to put it. On the other side of the coin, I've seen people feel dumb when they're trying to solve a problem and they're met with this huge list of configuration options, no idea what to do, all this new terminology. That's definitely a gap AI can fill, actually empowering the user by asking the right questions and providing a friendly interface. So yeah, for sure, I can definitely see that happening.

Benjamin - 00:24:19:

Great. So connecting to industry trends, we talked a lot about AI. I saw one of your recent LinkedIn posts about whether Iceberg is the new Hadoop. Tell us about that, maybe as we close.

Eldad - 00:24:29:

What an insult. What an insult to Iceberg.

Benjamin - 00:24:32:

What an insult. No.

Daniel - 00:24:34:

To what?

Benjamin - 00:24:35:

Let's maybe close out on this. Especially for you as a data movement tool, right? I guess Iceberg is also changing how companies, organizations, and data engineers think about data movement between different platforms. So yeah, how is Estuary, how are you, thinking about Iceberg these days?

Daniel - 00:24:53:

So we see Iceberg as something that is very popular; I think it's more popular than actually used. A lot of people are talking about it, and not that many people are actually implementing it as a lakehouse solution yet. But I think it will get there; it's definitely on track. There were the table format wars between the different formats, like Hudi and Delta tables, and they kind of ended in Iceberg's favor, which I think is a good step for the industry: standardizing on one solution means it gets more attention and more development work. Because it's open source, many contributors are able to put work into it, and because of this, it will become, in my eyes, the de facto solution for data lakehouses in the near future. I think lakehouses themselves are the obvious next step in the evolution of a lot of data stacks, for a lot of use cases. It's the peak of decoupling compute from storage, in my opinion, which is a good trend for flexibility and cost management, and a lot of use cases are covered by it. So we see that a lot of people want to migrate, or are interested in migrating, to Iceberg. They have a Snowflake database or data warehouse or...

Eldad - 00:26:14:

You know why they all want to migrate to Iceberg? Because they want to try Firebolt without their boss snooping around and asking questions. Is it secure, did we move data out? Right? That's the worst nightmare, moving data out. You're right: Iceberg is an enabler waiting to explode, waiting to enable. And in 2025, I'm super confident we will see Iceberg go from big promise to actual enabler. The reason is AI. People have spent so much on building the modern data stack, connecting all the tools, figuring everything out. Now AI comes along and pulls the personas with it. Engineering is back in the house. Companies take their best teams and say, you work on the next AI infrastructure stack while all the legacy keeps running. Some of them even call their cloud data warehouses legacy, okay? And the reason, again, is focus shifting. The most valuable data sits in those repositories, because it's the business data. It's well-cleaned, it's accurate, it's what they call great data: fresh, accurate. Now flipping the Iceberg magic button means many other products can come in and use that data. If you're into AI and you want to build a new stack, you don't necessarily need to think, oh, I'm using Snowflake, so let's just go and register for Snowflake AI. You can use your own AI, any AI, one of twenty other options, but you need the data, and you need that contract with the data warehouse, in that case Snowflake. So that clones the outcome of those previously modern data warehouses and allows a new stack to grow. And obviously, when it comes to AI, efficiency is king, moving data is king, because in many cases you don't need all the cleansing; it's AI. Sometimes people say, I'm willing to get it less structured, less cleansed, less modeled, because I can figure it out as part of the AI feature. There are a lot of interesting questions about what that means for centralized schema management, but Iceberg is just an enabler. It allows engineers and data engineers to expand that experimentation without restarting, without losing their previous investment. So I think it's a great opportunity for the ecosystem to evolve around it. That's what we've seen over the last six months: a lot of focus on trying to figure it out. And I hope everyone will figure it out in 2025.

Daniel - 00:28:57:

Hopefully. I think the only issue is that it's currently too complicated. You have to deal with catalogs, maintenance, all kinds of stuff that nobody wants to deal with and shouldn't have to deal with. So once those are ironed out...

Eldad - 00:29:08:

A new Hadoop. This is what Benji said.

Daniel - 00:29:11:

Exactly.

Benjamin - 00:29:12:

Exactly. As an engineering team building Iceberg support into a system, it is very complicated. Yes.

Daniel - 00:29:20:

Yeah, it's rough. But once those are ironed out, it's going to be really the democratization of data for many, many organizations.
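
To give a flavor of the catalog plumbing Daniel mentions, here is a small sketch using pyiceberg, one of several Iceberg clients; the catalog endpoint, warehouse path, and table identifier are made up:

```python
# A taste of Iceberg's catalog-centric workflow, sketched with pyiceberg.
# The endpoint, warehouse path, and table identifier are hypothetical.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "prod",
    **{
        "uri": "https://catalog.example.com",          # REST catalog endpoint
        "warehouse": "s3://example-bucket/warehouse",  # object storage root
    },
)

# Every read goes through the catalog, which resolves the table's current
# metadata and snapshot before any data files are touched.
table = catalog.load_table("analytics.events")
arrow_table = table.scan(row_filter="event_type = 'purchase'").to_arrow()
```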

Benjamin - 00:29:26:

Awesome. Cool. Dani, it was amazing hearing your perspective from Estuary and connecting it to how we think about data warehousing. So many exciting things happening in tech right now. Any closing words from your side?

Daniel - 00:29:43:

I think this is pretty much the golden age for data-related work and data engineering. If anybody watching this is thinking about jumping into a data role, I would say do it. A lot of people ask me if this is the right time, and yes, this is the right time. Data is only going to get bigger and bigger, and we need more and more data engineers, data analysts, and other practitioners. So yeah, go for it.

Benjamin - 00:30:07:

Awesome. Thanks for being on the show, Dani. We look forward to catching up later this year to see which of our predictions from today actually come true.

Daniel - 00:30:15:

Yeah. Thank you so much. This was great.

Eldad - 00:30:17:

Thank you. Take care. Bye-bye.

Daniel - 00:30:18:

Bye-bye.

Intro/Outro - 00:30:22:

The Data Engineering Show is brought to you by Firebolt, the cloud data warehouse for low-latency analytics. Get $200 credits and start your free trial at firebolt.io.
