DataOps & The Data Catalog

with Michele Goetz, VP, Principal Analyst, Forrester

Michele Goetz

VP, Principal Analyst, Forrester

With more than 20 years’ experience in data management, business intelligence, and analytics, Michele Goetz helps data leaders navigate the complexities of data. Her research covers AI, semantic technology, data management strategy, data governance, and data integration.

Satyen Sangani

Co-founder & CEO of Alation

As the Co-founder and CEO of Alation, Satyen lives his passion of empowering a curious and rational world by fundamentally improving the way data consumers, creators, and stewards find, understand, and trust data. Industry insiders call him a visionary entrepreneur. Those who meet him call him warm and down-to-earth. His kids call him “Dad.”

Satyen Sangani: (00:03)
Ops is, well, having a moment. The front office has sales ops and marketing ops. Engineers have DevOps. Data scientists run ML ops and it seems like everyone is building an ops team. But why? Most of the time the needs for operations arise from the need to be able to drive efficiency within a function and to be able to react to the inevitable problems that arise from day to day. In data, cloud computing and machine learning have given rise to real-time 24/7 operations. And the result is that for many use cases, the days of “set it and forget it,” ETL, models, and dashboards are basically over. For these use cases — where you have to constantly iterate and constantly improve — you need a more continuous approach, much like the transition from waterfall to agile. Small wonder then that many data teams are embracing DataOps. So today we're sitting down with Michele Goetz, an expert in DataOps and an analyst at Forrester. Michele and I are going to explore this exciting new trend and the ways it'll impact the entire data intelligence market.

Producer: (01:22)
Welcome to Data Radicals, a show about the people who use data to see things that nobody else can. This episode features an interview with guest speaker Michele Goetz, vice president and principal analyst at Forrester. In this episode, she and Satyen discuss the origin of DataOps, the complexities of centralization, the evolution of the data catalog, and much more. This podcast is brought to you by Alation. Our platform makes data easy to find, understand, use, and govern so analysts are confident they're using the best data to build reports the C-suite can trust. The best part: data governance is woven into the interface so it becomes part of the way you work with data. Learn more about Alation at alation.com.

Satyen Sangani: (02:07)
Hello, everyone. Today I'm welcoming Michele Goetz, who is a vice president and principal analyst at Forrester Research. Michele's work covers artificial intelligence, semantic technology, data management strategy, data governance, and data integration, and her research has been featured in the Wall Street Journal and Forbes. Michele, welcome to Data Radicals.

Michele Goetz: (02:27)
Thank you. It's good to be here.

Introduction to DataOps

Satyen Sangani: (02:28)
We'll start with the meat of what you've recently done, which is work around DataOps, and this is a new term, or at least a relatively new term. Tell us about what DataOps is and how you got interested in it.

Michele Goetz: (02:43)
DataOps is really the engineering and practices of designing and developing data capabilities, launching them out to production and ensuring that they're providing value and delivering on the outcomes that businesses expect and being able to use that data. It's the technical side of data, which I haven't always from a practice perspective concentrated on in my research. And yet in conversations, and certainly as we've been going through new thought processes, new ways of operating, new technologies, the mantra of the “business owns the data,” or “business owns data governance,” it really obfuscated the challenges that the IT organization had and was going through to keep up with the business demand to be a data-driven or an insight-driven organization. And it was causing a lot of sort of introspection within the data management organizations on what is really the best approach to democratizing data, to enabling the self-service, bringing shadow IT around data and analytics out into the open and use that as a formal operating model, but how do you collaborate and coordinate in there to assure your data and create clarity and value from it? And so it was really sort of backfilling what I felt was a gap in my own understanding of there's still this big IT thing sitting out there. It's the 500-pound gorilla that we need to talk about again, and that's where the DataOps research really came into being.

Satyen Sangani: (04:31)
Give us the history around this idea of DataOps. Ten years ago, when nobody was talking about DataOps, what were the things that existed that were DataOps-y?

Michele Goetz: (04:42)
You could see it sort of developing as organizations were advancing their analytics and their data science practices, and they were so demanding in terms of what they needed and the nuance of the data that it made sense for them to have their own environments and build their own pipelines and run their own analysis. Really, all they wanted was, “Could you give me a platform to make this easy to work?” And so I think that that was really the genesis, which the data management team just turned out to be spinning up and spinning down environments to help with all of this analysis and iteration. But the reality was when you get to the point where you have a great analytic model — whether that's machine learning, deep learning, wherever you're going to take that — eventually you want it in production. It's not good enough just to see a pretty picture on a screen.

You want that insight to drive an action in real time so you create a great customer experience or you automate and speed up capabilities in your business processes. So what happens with that model? How does that get deployed? And oftentimes the data doesn't look the same way in the analytic environment and how it flows within an operational environment, and so you have to have the data engineers come back in and ensure that you can deploy that model in a new ecosystem. And that's not easy. And I think what this sort of demonstrated was, first, there needed to be better collaboration, coordination, and communication between the data engineers or those technical data roles and the analyst data science roles so that data scientists and analysts weren't wasting their time in building models that couldn't be used.

And, then, a recognition that you need those technical teams to ensure that those models are continuously running at the performance levels and at the value that is expected. And so they're still having to think about, “Well, if I change something in a data flow or I have a new data source coming into this environment, that can have an impact on the model that's already there.” And so when you had ModelOps capabilities starting to evolve so that data scientists and machine learning engineers could see how those models were running, there was always somebody picking up the phone or on Slack and getting to the data engineer and saying, "I know I've got a problem with my data. Go figure it out. What's going on?" And so DataOps was starting to give a way to work better together.

And then the second thing is in order to operate in that more fluid, flexible, adaptable method to deploy data and insight capabilities, they had to be thinking much more about composable architectures. So rather than building monoliths, they had to break that up into different types of data capabilities and components that could be composed, strung together to speed up the ability to serve those data capabilities and the data itself out into the rest of the organization and the systems.
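As a rough illustration of the composable idea described above (not something shown in the conversation), the Python sketch below strings small, independent data steps together instead of building one monolith. The step names and record fields are hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable

Record = dict
Step = Callable[[Iterable[Record]], Iterable[Record]]


def compose(*steps: Step) -> Step:
    """Chain small steps into one pipeline, applied left to right."""
    def pipeline(records: Iterable[Record]) -> Iterable[Record]:
        return reduce(lambda acc, step: step(acc), steps, records)
    return pipeline


# Hypothetical component: drop records that have no amount.
def drop_missing_amount(records: Iterable[Record]) -> Iterable[Record]:
    return (r for r in records if r.get("amount") is not None)


# Hypothetical component: standardize country codes to upper case.
def normalize_country(records: Iterable[Record]) -> Iterable[Record]:
    return ({**r, "country": r["country"].upper()} for r in records)


# Each component can be reused in other pipelines or swapped out alone.
clean_orders = compose(drop_missing_amount, normalize_country)

rows = [
    {"id": 1, "amount": 10.0, "country": "us"},
    {"id": 2, "amount": None, "country": "de"},
]
cleaned = list(clean_orders(rows))
```

Because every step has the same shape (records in, records out), a team can add, remove, or reorder steps without rewriting the rest of the pipeline.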

Satyen Sangani: (08:02)
Got it. So basically, I'm going to paraphrase and just make sure that I got it right. In the beginning, there was kind of IT, and they would just provide reports and data arbitrarily. Then all of a sudden this self-service trend comes along and all of a sudden IT's role changes pretty dramatically to being providers of tooling and infrastructure and basically raw data sets as it were. And now all of a sudden with all this self-service, you got a whole bunch of, as somebody that I know refers to it, analysts running around with sharp scissors, and they're able to do things and create things, which is awesome and wonderful and great, but then when the time comes to productionalize these models and these algorithms and these reports, there's real sort of engineering that needs to come into play. And this idea of both understanding those models and hardening them as well is really where the DataOps, in your view, that that discipline is now falling out. Did I get that right or did I miss it?

Michele Goetz: (09:04)
Yeah, no, I think that that's spot on. I think about it in sort of the TikTok paradigm. You have those that are producing content, you have those that are consuming content, and — oh, by the way — those that are consuming content are actually producing content as well. And it's the same thing that's happening within our organizations today. We're taking that same social media paradigm into the way that we create insight and the way that we consume insight. We need a more flexible way of working together to make that possible because technical skills span across the entire environment, and the goal is speeding up insights so that you get to those business outcomes faster and at scale. And you can't do that if you're only thinking about everything has to flow through the bottleneck of a data team.

The Adoption of DataOps

Satyen Sangani: (10:01)
What percentage of your customers have adopted a DataOps approach, either explicitly or implicitly?

Michele Goetz: (10:07)
I would say that the majority of the large enterprises that I work with and speak with regularly, there is DataOps somewhere within their organization. I think the better way to look at it is what percent, or the way to answer it I guess, is what percentage of organizations that's the only way that they operate? It's fully rolled out, and that's more in a 15 percent area. Those that have a strong center of gravity to progressing with DataOps, you're looking at somewhere between 50 and 60 percent of large enterprises.

Satyen Sangani: (10:51)
You mentioned something else today that was really cool, which is like, "Look, there's this idea that we need to be able to have an adaptable method to deploy insights and capabilities," and when you said that, the first thing that I thought of is this concept of data mesh. How does this DataOps trend play in with this idea of data mesh? I know you've thought a little bit about that and I'd love for you to draw the linkages because I think a lot of people might not have or maybe don't connect them as explicitly.

Michele Goetz: (11:18)
What's really important to understand is that data mesh is really much more of a strategic approach and practice to ensure that you get the most value out of your data and that that's scaled. Data mesh is composed of four key principles. One is domain ownership. That is where you're really not only defining your data contextually and semantically, but you are able to understand who really is the subject matter expert for that. Who do you go to when you really need to understand that? Customer could be a domain, but really what it means is customer in the context of, say, they are a credit card customer or they're a mortgage loan customer or they're a banking customer, for example, and there's going to be different ownership for that. We talk about federated computational governance. That's a mouthful, but what does that actually mean? It means governance as code.

So that where you need the controls, where those policies have to execute and fire isn't just back down at your database environment, but they're logically firing up where data is flowing to and how that data is consumed. You're looking at the self-service concept that we were talking about before where you're able to apply no-code/low-code capabilities so that anybody can work with the data.

And then lastly, the concept of data as a product, which is we build a lot of these components. I talked about composable architecture before. Well, your product could be a dashboard, it could be a model combined with an API deployed within a business process. The product is defined by what is the business outcome that you're trying to achieve. It's a value metric. And then the components underneath, what did it take to create that? It's the data, it's the data models, it's pipelines, it's services, it's controls, it's all those things that go together. So data mesh is really revolving around the domain owner, federated computational governance, self-service, and data as a product. And if you bring the right subject matter expertise first, you're able to define the value ahead of what you're actually doing with the data versus starting with the data and trying to figure out and pushing its way up and swimming toward “What do I want to do with this? What is the use case?” It's really being much more thoughtful and strategic about the understanding of why you know data is valuable in designing your data in that context.
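To make "governance as code" and "data as a product" a little more concrete, here is a minimal Python sketch, offered as an assumption rather than anything described in the interview. It models a data product that names its domain, owner, and business outcome, and that applies its policies at the point where data is served to a consumer rather than only in the database. The class names, roles, and masking rule are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Policy:
    """A governance rule expressed as code (hypothetical structure)."""
    name: str
    applies_to: str                              # column the rule governs
    transform: Callable[[object, str], object]   # (value, consumer_role) -> value


def mask_unless_steward(value: object, role: str) -> object:
    # Hypothetical rule: only data stewards see the raw value.
    return value if role == "steward" else "***"


@dataclass
class DataProduct:
    name: str
    domain: str      # for example, customer in a credit card context
    owner: str       # the subject matter expert to go to
    outcome: str     # the business outcome the product exists for
    policies: list = field(default_factory=list)

    def serve(self, rows: list, role: str) -> list:
        """Apply every policy where the data is consumed."""
        served = []
        for row in rows:
            out = dict(row)
            for policy in self.policies:
                if policy.applies_to in out:
                    out[policy.applies_to] = policy.transform(out[policy.applies_to], role)
            served.append(out)
        return served


product = DataProduct(
    name="card_churn_scores",
    domain="credit_card_customer",
    owner="cards-analytics",
    outcome="reduce credit card churn",
    policies=[Policy("mask_email", "email", mask_unless_steward)],
)
rows = [{"customer_id": 7, "email": "a@example.com"}]
analyst_view = product.serve(rows, role="analyst")
steward_view = product.serve(rows, role="steward")
```

The point of the design is that the policy travels with the product, so the same rule fires for every consumer regardless of which tool or environment the data flows into.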

How Does This Concept of Data Mesh Fit In With DataOps?

Satyen Sangani: (14:13)
To me, the running theme through all of this was this idea of decentralization that before you were in this world where we were going to bring everybody together in this single source of truth that would unite everybody and clean data would exist forevermore. And here there's almost like a capitulation to this notion that says, "Yeah, it's kind of messier than that, and it's constantly changing." How does this concept of data mesh then fit in with DataOps?

Michele Goetz: (14:41)
When I was describing DataOps and sort of the pressure points of why it's evolving the way that it's evolving, it is about that decentralization. Self-service is decentralization, but that can absolutely cause chaos. I think what organizations really have to address is not just decentralization for decentralization's sake so you can do self-service. What the underlying theme of data mesh is when we talk about decentralization is really federation because there are standards either being developed in decentralized pockets or there are global standards that we have to adhere to because of regulatory pressures, for example.

When you've got the EU and their privacy acts, when you've got California and their privacy acts, those are going to be global standards and they have to be applied and somebody needs to manage them centrally and ensure that all of that is getting out. So there's constantly, with data mesh, this balancing act between “What does decentralization mean?” and decentralization isn't just everybody has the keys to the kingdom and they can do whatever they want and they maverick their data all over the place. It's a much more thoughtful, coordinated effort because one of those decentralized nodes is still a centralized core that [is] where global standards and global policies and scalable reusability has to occur. There's going to be a pod that concentrates on that and ensures you can harden your data for business benefit.

Satyen Sangani: (16:18)
People want centralization, but we've historically thought about centralizing the wrong things. We're trying to centralize the data and the tools and the infrastructure and all of those things are actually quite fine to be totally distributed. But the things that we need to centralize are the rules and the frameworks and the policies and the operating models and the coordination facilities because people are just not talking to each other on some level.

Michele Goetz: (16:44)
The hub-and-spoke is dead. It just doesn't work. The early footprints of our big data environments just didn't make a whole lot of sense, and it's really where data fabric starts to come up, which is handling the decentralization of data. It's handling the distributed nature of the environments that data is flowing into, flowing out of, how insights are being developed. And data fabrics tend to gravitate toward those hub-and-spoke models with that old thinking of centralization, whereas what it was intended to do, and I think what data mesh is helping us with is guiding the architectural principles and capabilities that help us to achieve what that decentralization principle is intending to do. You're going to have multiple clouds. I know even in our own Forrester data, we're seeing most organizations have two to three clouds, cloud environments. So data fabric is sort of the technical mechanism that's going to be applied once you have your data mesh strategy concepts and architecture established.

Are Data Prep and Data Cataloging the Same Thing?

Satyen Sangani: (18:09)
I want to take one little detour, which is, there was this idea, and you and I when we started 10 years ago talking about this entire space, one of the spaces that was really big was this concept of data prep. And you had gone into researching that area, deep into that area, and there was this question: Are data prep and data cataloging the same thing? Should they all be done by the same vendor? And I remember that time, what happened in that space? Because I think that's part of the fabric and I think also just maybe instructive to think about technologies as one is investing in them.

Michele Goetz: (18:41)
So I think the value of those technologies was it embraced self-service. It knew that there was a different data experience to gather, prepare, and use data by analysts and data scientists. And so that was super positive. And I know what got me interested was in my analyst days like, "Oh, I wish I had a tool like that." I didn't want to code. I didn't want to use Excel. This was a really cool thing. But it created some confusion because was it a pipelining tool, like a no-code/low-code non-technical pipelining tool? Or was it just Excel on steroids and it saved everything for you? And I think once that started to sort itself out, what most data and analysts recognized was its functionality and that functionality is better served closer to the environment where you're working with the data. So what you saw is, if prep was being supported more from a pipeline perspective, then you've got, say, analytic companies like Qlik or Tableau, who said, "Well, I'm either going to build or I'm going to buy one of these data prep firms, so I have no-code/low-code data pipelining."

That's one way of handling it. And then on the other side was, “Well, I have to better support the munging of the data and speed up what my data scientists can do with the data so they're spending less time on preparing data and labeling it and classifying it and doing all those things and more time on the analysis itself.” And so then you've got some of the other analytic firms like a DataRobot, for example, that kind of assumed those data prep vendors out there. So it went from about 2014, 2015 to 2019 with just about every major data prep player being acquired. And then eventually, I think it was 2020 when the last big player was acquired. So I think it was more of an example of: Somebody had a really great idea to solve a very niche challenge, and that niche challenge was something that all either data management platforms or analytic platforms needed to incorporate and it was just easier to buy it and integrate it.

Satyen Sangani: (21:12)
So you and I first met... so the history here is really fun because I remember when I first, in 2015 when we launched the product and called it a catalog, I remember you were just like, I don't know if you were being nice. You were probably just being nice, but you were like, "Oh my god, this is totally different," and I just remember that conversation and even where I was so vividly, and I think that touches really well on DataOps. Now, some people might not know the history. You wrote a paper on machine learning data catalogs, and more recently you've written a set of papers around data catalogs for data operations. Are those two different things or are they... and then, by the way, there's also, I know you've also spoken about, other Forrester analysts have spoken about this idea of data catalogs for data governance. And so how do you see these things shaping out? Are there three different catalogs that an organization will run or how does that work?

Michele Goetz: (22:09)
When I talk about machine learning data catalogs, that was the new big thing. Certainly Alation was one of those companies that really kind of launched that forward. And so what I think is sort of evolved there is, machine learning does something. What is it doing? You're training it to either support governance activities, you're training it to support analytic activities, or you're training it to support data management activities. And as machine learning data catalogs evolved their intelligence and usefulness for different types of tasks and responsibilities, that also means that it's gravitated toward different roles. And at Forrester we're really very focused on knowing who is the user of that environment and what are they trying to get out of that environment. And so when we — rather than just going holistically and saying, "Oh, we got these catalogs that have machine learning in them" — it's like we wanted to really help our clients understand why do you even care and what are you going to do with this? And so three scenarios: you've got a marketplace and an environment to help you prepare your data and share and publish your insights and analytics and models and things of that nature. But the data governance piece is still pretty significant. And there are very specific types of capabilities, user interfaces, and user experiences that catalogs provide and that the machine learning is trained for that to support those things. But the DataOps one was starting to gravitate back down to — well, data mesh was helping everybody do all these great things for data from a business perspective, but you still have to instrument the data and you need to do that at scale and you need to satisfy at scale the demands coming from the business and the self-service activities. And how do you keep that all straight? How do you stay coordinated?

How do you collaborate? How do you communicate with each other? You talked earlier about CI/CD and having multiple versions of a model or a pipeline or a data set out there. How are you going to manage all that from a technical perspective and not accidentally deploy something that has no business being out in the wild? And DataOps needs their own single pane of glass. They need to understand their data estate. They need to be effective and fast and at scale in data provisioning and automation, and data catalogs have evolved to provide stronger support for those data operations responsibilities, in particular, supporting data engineers.
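One way to picture the version-management concern raised here is a simple deployment gate. The Python sketch below is an assumption, not a description of any catalog product. It keeps a record of each version of a model or pipeline and only ever offers the newest version that has been approved and has passed its checks. The field names and the dotted numeric version format are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactVersion:
    name: str            # hypothetical model or pipeline name
    version: str         # dotted numeric version such as "1.10.0"
    approved: bool       # signed off by whoever owns the artifact
    checks_passed: bool  # automated tests and data checks succeeded


def version_key(version: str) -> tuple:
    """Turn "1.10.0" into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_deployable(versions: list) -> dict:
    """Return the newest approved, passing version for each artifact name."""
    best = {}
    for candidate in versions:
        if not (candidate.approved and candidate.checks_passed):
            continue  # never offer an unapproved or failing version
        current = best.get(candidate.name)
        if current is None or version_key(candidate.version) > version_key(current.version):
            best[candidate.name] = candidate
    return best


registry = [
    ArtifactVersion("churn_model", "1.9.0", approved=True, checks_passed=True),
    ArtifactVersion("churn_model", "1.10.0", approved=True, checks_passed=True),
    ArtifactVersion("churn_model", "2.0.0", approved=False, checks_passed=True),
    ArtifactVersion("orders_pipeline", "0.3.0", approved=True, checks_passed=False),
]
ready = latest_deployable(registry)
```

Comparing versions as tuples of integers avoids the string-comparison trap where "1.9.0" would sort after "1.10.0".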

Satyen Sangani: (24:55)
I agree that there's definitely a need for specialized interfaces for particular roles, and then there's the common interfaces. Do you think that there's a need for multiple different catalogs within an organization — one for governance, one for ops, one for more casual analytics users or data scientists — or how do you see that evolving?

Michele Goetz: (25:15)
If somebody wants to build the nirvana environment where it's like one catalog to the entire data world, fine. But I think that that's a little bit challenging and I think there's still some maturing and growing up that has to happen as we modernize. What I see more is that if the analyst and data steward community have a catalog environment that is tuned toward their particular needs and the data operations team and engineers have a catalog that's tuned for more of their needs, really the requirement is just that they can talk to each other. You need interoperability. So you keep things clean and focused for one side to the other so that you're not muddying up what that experience is and what the insight and intelligence is to do your job. You don't need to overcomplicate things, but sometimes you need to overcomplicate things depending on the level of sophistication.

So I don't necessarily see those worlds coming together anytime soon. I also see a lot of data catalog capabilities bubbling up inside data management tools. Your ingestion and pipelining tools have catalog functionality. Your data virtualization capabilities have catalog functionality. You've got Kafka products out there that have schema registries. That's another form of a catalog. And what ultimately you want to be able to do is whether you’re in your data consumer and stewardship environment or whether you're in your more technical environment, you should be able to tap into those tool-based metadata environments and make it easier to harmonize that and make sense of your data estate, depending on which side you're trying to come forward on.

I wouldn't necessarily [say] it's a true hub-and-spoke capability, but there is something to the fact that we are just going to be living within a multitude of catalog environments by the very nature that everything we build from a data perspective or any network or any piece of hardware that's out there has a metadata repository. And if you really want to understand how to tune data and insight for your business, you're going to have to find some manner, shape, or form to get at that metadata.
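The idea of tapping into several tool-based metadata stores and harmonizing them can be sketched in a few lines of Python. This is an illustrative assumption only: the source names, the entry layout, and the choice to match data sets by lower-cased name are hypothetical, and real catalogs and schema registries expose their metadata through their own APIs.

```python
from collections import defaultdict


def harmonize(*sources: tuple) -> dict:
    """Merge per-tool metadata into one view keyed by data set name.

    Each source is a (source_name, entries) pair, where every entry is a
    dict with a "dataset" name and an optional list of "fields".
    """
    merged = defaultdict(lambda: {"sources": [], "fields": set()})
    for source_name, entries in sources:
        for entry in entries:
            # Assumption: the same data set may be cased differently per tool.
            view = merged[entry["dataset"].lower()]
            view["sources"].append(source_name)
            view["fields"].update(entry.get("fields", []))
    return {
        name: {"sources": view["sources"], "fields": sorted(view["fields"])}
        for name, view in merged.items()
    }


# Hypothetical extracts from two tools that each keep their own metadata.
schema_registry = ("schema_registry", [{"dataset": "Orders", "fields": ["order_id", "amount"]}])
bi_tool = ("bi_tool", [{"dataset": "orders", "fields": ["amount", "region"]}])

estate = harmonize(schema_registry, bi_tool)
```

The merged view records which tools know about each data set, which is the kind of cross-catalog picture both the technical and the stewardship side could read from.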

Satyen Sangani: (27:42)
You won't be surprised. I agree with some things there and I maybe might not see others exactly the same way. We talk a lot about the catalog being a platform and we have so many customers who come to us and basically say, "I need the metadata or the information in your catalog for X, or I need to populate information from name your privacy tool, name your data observability tool into Alation because I need to be able to coordinate across different rules." What I worry about in a sort of — and I think all catalogs aren't necessarily the same so I agree with the idea that there needs to be functional interfaces between catalogs. We integrate with AWS Glue, which is a catalog of sorts really for data engineers and certainly do the same with Tableau, which is a catalog for data analysts and people who are preparing dashboards.

I think the one thing that I would worry about, and I guess I'd love your feedback on, is to me the entire point of a catalog is, back to the beginning idea, is really collaboration and communication because you want to get that really ... the person that has that business question often will ask a question in a way that doesn't map to the data. The tradeoff seems to be how much am I giving somebody their perfect technical interface (or whatever their interface might be, whether it's technical or not) versus the overhead behind communication and semantics. That's the tradeoff I see. I would love your perspective.

Michele Goetz: (29:07)
So certainly perfect is the enemy of the good. So I would definitely agree with you there.

Satyen Sangani: (29:14)
I say that all the time.

Michele Goetz: (29:15)
Yeah, so that Silicon Valley thing. I'm generally in agreement with that. I think that the challenge right now, which is why I don't see those worlds truly harmonizing yet, is — so, first of all, the ideas that you put forward, I'm 100 percent behind that and it's kind of where I've been trying to go. But the reality is there is still so much to understand and unpack around what the data operations needs are, and they're very different than what the data governance and analytics needs are.

And there is going to have to be some good research, good design thinking, good UI/UX efforts so that it doesn't have to be perfect. Many of the catalog vendors today are just starting to get reacquainted and understand data operations and the data engineers. And so I think, until you can find where that sort of sweet spot of the good versus the perfect is, it's then how do you start to harmonize those together? My ideal world would be like, I've got my single pane of glass, and it's basically a collaboration platform and we're moving forward. But I also know that we do not need another SharePoint brought down to our data.

How Do You Think the Market Will Evolve?

Satyen Sangani: (30:52)
So, Michele, I guess maybe with all of that, how do you think about the market evolving over the next five or 10 years? Do you see consolidation? Do you see specialization? Are there markets that need to exist that don't exist? What are the big problems that are yet unsolved? And if you were giving advice to the general cataloging vendor landscape, where do we all need to focus our time and energy and effort?

Michele Goetz: (31:17)
I think it's gravitating toward the data mesh concept of data as a product. We're building capabilities because data is supposed to be useful and valuable and ensuring that we are orienting what we know about our data and the policies about our data in the construct of “How does that connect directly to our business outcomes or our business goals and objectives?” If you have data as a product at the top of the pyramid, everything starts to go together. So I'd really like to see the data catalog market and the environment start to really orient toward that, that you define everything by “What is the value you're trying to create and how are you representing that as an object?” I think that's number one.

Number two is: I think that there is going to be some consolidation or acquisition absolutely within the next five years. I think you're seeing an evolution of the data fabric vendors and what they're trying to incorporate within their platforms and transitioning from just being only about data virtualization, let's say, and moving into other forms of data services.

And they're going to need a catalog to make that possible. But on the other end, even from an analytics or AI perspective, I think there's definitely a requirement for more strength in having that marketplace of data and machine learning and apps and harmonizing that a little bit more. So I think that's going to drive some of the acquisition process, but I do see that there will be maybe two or three big players today that are going to survive and continue to push forward and creating much more of that single pane of glass, harmonizing those roles, creating that collaboration environment, having that overlay, and so on.

Satyen Sangani: (33:25)
My relationship with Michele goes way back. She was the first major industry analyst to understand and embrace machine learning data catalogs, which we invented here at Alation, so we often talk about market trends and how we see the world of data evolving. DataOps is clearly on the rise. One of the strongest proof points of this is the increased demand for data engineers — a role that, as Michele noted, is even more sought after than data scientists. It's easy to see why. The rise of self-service analytics has pushed IT departments to shift their focus. Today, IT must deliver trusted data, not prepackaged reports.

And yet the practice of DataOps produces, you guessed it, more data. Where does that metadata live? I'd argue that it needs to be in one accessible platform like a data catalog, just not one that's specific to a single team. The last thing you'd want to do with an ops team is segment them off from the rest of your data community. So I'm super excited to see how the world of catalogs and DataOps plays out and even more excited that we have folks like Michele thinking about these trends. This is Satyen Sangani, CEO and co-founder of Alation. Thanks to Michele for a fascinating conversation, and thank you for listening.

Producer: (34:40)
This podcast is brought to you by Alation. Are you curious to know how data governance might actually be good for your business? This webinar with Gene Leganza, research director from Forrester, explains how to align people, process, and technology for a growth-oriented governance initiative. Check it out at alation.com/youtube.
