The future of LLMs, ELMs and the semantic layer
Episode 18 • 1st November 2023 • Data Science Conversations • Damien Deighan and Philipp Diesinger

Shownotes

In this episode Tarush Aggarwal, formerly of Salesforce and WeWork, is back on the podcast to discuss the evolution of the semantic layer and how it can help practitioners get results from LLMs. We also discuss how smaller ELMs (expert language models) might be the future when it comes to consistent, reliable outputs from generative AI, and the impact of all of this on traditional BI tools.

Transcripts

DD: This is the Data Science Conversations Podcast with Damien Deighan and Dr. Philipp Diesinger. We feature cutting-edge data science and AI research from the world's leading academic minds and industry practitioners so you can expand your knowledge and grow your career. This podcast is sponsored by Data Science Talent, the data science recruitment experts. Welcome to the Data Science Conversations Podcast. My name is Damien Deighan, and I'm here with my co-host Philipp Diesinger. How's it going, Philipp?

PD: Good. Thanks, Damien.

DD: Great. And back on the podcast today is our good friend, Tarush Aggarwal. Welcome again, Tarush.

TA: Hey, Damien, hey Philipp. Thank you so much for having me. Excited to be back.

DD: [...]ering from Carnegie Mellon in [...]

TA: Yeah. Again, Damien, Philipp, excited to be here. Thank you for having me. I think it's a great question. Obviously, unless you've been living under a rock, generative AI has been massively transformational. And when we think about data and generative AI, there haven't been a lot of very obvious use cases. AI has transformed areas like content and marketing, in terms of coming up with content, and even code generation through things like Copilot. There are a lot of people with AI anxiety, really trying to figure out how AI enters the data space. There are a few companies working on a few different things. Obviously there's text-to-SQL, which a bunch of different players are thinking about. It's this notion of being able to ask any question of your data, and it's gonna use AI to generate the SQL needed to run that query. Personally, I'm not very bullish on that. I think AI does extremely well in certain domains which are very closed-ended, like writing code or content. When you think about answering a business question, this is extremely open-ended, so I don't think AI has proven to be very effective over there. What I am very bullish on, and again, we're seeing a bunch of players start to do this, is being able to plug in generative AI, or an ELM, or a private LLM on top of your semantic layer. So, being able to put this on top of your metric definitions and then ask very open-ended questions like: What was revenue like last month? Where did it come from? Is this channel going up? Is this channel going down? Is this increase in revenue actually statistically significant, or is it just a one-time blip? So I think using AI on top of the semantic layer is the most exciting application of AI in the data space today.
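Illustration (not from the episode): the "statistically significant or one-time blip" question can be pictured with a very small check that compares the latest month against earlier months using a z-score. The function name, the sample figures and the threshold of two standard deviations are assumptions made for this sketch, and a z-score is only a rough heuristic rather than a full statistical test.

```python
from statistics import mean, stdev


def is_unusual(history: list[float], latest: float, threshold: float = 2.0) -> bool:
    """Return True if `latest` lies more than `threshold` sample standard
    deviations away from the mean of `history` (a simple z-score heuristic)."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    spread = stdev(history)
    if spread == 0:
        # Flat history: any change at all is a departure from it.
        return latest != history[0]
    z_score = (latest - mean(history)) / spread
    return abs(z_score) > threshold


# Hypothetical monthly revenue figures, in thousands.
monthly_revenue = [100.0, 102.0, 98.0, 101.0, 99.0]
print(is_unusual(monthly_revenue, 110.0))  # far outside the usual range
print(is_unusual(monthly_revenue, 101.0))  # within normal month-to-month noise
```

In practice the history would come from the revenue metric as defined in the semantic layer, so that the check and the dashboards agree on what "revenue" means.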

DD: So Tarush, can you define for us what you mean by the semantic layer?

TA: Yeah, absolutely. That's a great question. The semantic layer, although it's become popular in the last year or two, has been around forever. At a very high level, you have raw data coming into your data warehouse or into any sort of data storage layer. This data is messy, it's unstructured, and it's typically not built in a way to answer business questions. What we typically do is we model this layer: we clean it, we join it, we reframe it in a way which makes sense for the business. The semantic layer is essentially just a definition layer. It allows us to interpret what that data is. So the semantic layer would contain what your definitions are, and this could be your definition of revenue, of what you call an active user, or of revenue from a particular channel. It allows us to compute a business metric in a way which is very consistent. So if anyone wants to compute this metric, they refer to the semantic layer, and the semantic layer gives us the definition of that metric in SQL or in code. And then all the consumers who want that metric, instead of reading it directly from the raw data, read it from the semantic layer. This means everyone has the same definition of that metric. And if you ever wanted to change it, you would just change it in one place and it would propagate to all of your different consumers.
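Illustration (not from the episode): the "define it once, every consumer reads the same definition" idea can be sketched as a tiny metric registry that compiles a metric name into SQL. The table names, column names and the `compile_metric` helper are hypothetical; real semantic layers such as Cube or dbt use their own configuration formats, which this sketch does not try to reproduce.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metric:
    """One business metric, defined once for every consumer."""
    name: str
    expression: str  # SQL aggregate, e.g. "SUM(amount)"
    table: str       # modeled (cleaned) table; hypothetical name
    where: str = ""  # optional filter that is part of the definition


# Hypothetical definitions standing in for a semantic layer's config.
SEMANTIC_LAYER = {
    "revenue": Metric("revenue", "SUM(amount)", "orders", "status = 'paid'"),
    "active_users": Metric("active_users", "COUNT(DISTINCT user_id)", "events"),
}


def compile_metric(name: str, group_by: Optional[str] = None) -> str:
    """Turn a metric name into SQL using the single shared definition."""
    metric = SEMANTIC_LAYER[name]  # KeyError if the metric is not defined
    select = f"{metric.expression} AS {metric.name}"
    if group_by:
        select = f"{group_by}, {select}"
    sql = f"SELECT {select} FROM {metric.table}"
    if metric.where:
        sql += f" WHERE {metric.where}"
    if group_by:
        sql += f" GROUP BY {group_by}"
    return sql


# A dashboard and a data science notebook would both call the same function.
print(compile_metric("revenue"))
print(compile_metric("revenue", group_by="channel"))
```

Changing the `revenue` entry in one place changes the SQL every consumer receives, which is the propagation behaviour described above.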

PD: And Tarush, in what sense do you see LLMs stepping into this? What role would they play in the future in navigating data, or in utilizing the semantic layer to do so?

TA: I think that's very interesting, because if you look at the data space, many years ago the semantic layer was really part of BI. It was part of what we call BI tools or reporting tools. So these are your Tableaus, and more recently your Lookers, and open source ones like Preset and Lightdash, and Sigmas. So for a while the metrics used to exist inside your BI tool. What's happened in the last few years is we realized that if they exist inside your BI tools, then every time you wanna do data science, or every time you wanna push this data back inside your production systems, you're gonna have to duplicate this logic, because it only exists inside reporting. So we pulled this metrics layer out of BI and had it as a standalone layer. So now you have dbt, obviously one of the most famous modeling tools. We have a bunch of companies like Cube and Transform, which have built semantic layers. Metrics in the semantic layer have become an individual layer, and BI is one of the consumers of the semantic layer. So I think what becomes really interesting now, as we think about the future, is being able to plug in an LLM or an ELM on top of your semantic layer and use interfaces like speaking to it, or Slack, or much more intuitive interfaces. Is that a 100X better experience than what we are doing today with BI tools in building dashboards or self-service analytics? Obviously there's a lot of money in BI, and there are massive, massive players. But one of the questions which I'm thinking about is, are we really approaching the end of BI tools, and are they just gonna be replaced by having an ELM or an LLM on top of your semantic layer? Because that is just a much, much better interface. So that's one area which I think is becoming extremely interesting. And we're starting to see a few players, like, there are BI tools like ThoughtSpot doing this, but we're also seeing a bunch of new startups like Delphi Labs, which I think is based out of the UK, which are building very special products on top of your semantic layer and sort of very [inaudible 0:07:54] like that.

PD: Which makes a lot of sense. So the idea is basically that the end user directly leverages an LLM to get around having to engage with a data science department, and just gets reports, visualizations or insights from the data on demand, when they need them, through the LLM, directly from the semantic layer and from the source data.

TA: Exactly right. The whole thing about BI is you are building a dashboard to answer questions, and that just means that you need to know what those questions are. But the way you think about information is that gaining access to information leads to asking better questions. I get some information, I think about it, and then because of that I now have a new question. And when I have a new question, I don't wanna go to a data team and be like, "Great, I have a new question now." The issue with putting your LLMs or your ELMs on top of your raw data is that you aren't sure if the ELM or the LLM has just made up a business definition, right? It's just hallucinating, which is why I like having it on the semantic layer, so you can be assured that whenever it computes a metric, it's computed in the right way. So now I can get very, very specific. I'm not just asking what revenue was last month and what revenue by channel was. If I see some channels going up and down, I can ask: I see that this channel went up, is this significant? Has this been happening periodically? Or: I know revenue went up, what percentage of that increase came from existing customers versus new customers? Or what percentage came from existing customers in this new channel, and compared to how much money we spend on this channel, is this channel worth it, or am I better off using another channel? So highly, highly contextual questions, where I'm building on top of things, and I can start doing that. And that is just not possible in the old paradigm of building reporting inside BI tools.
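Illustration (not from the episode): one way to picture the guardrail described above is to let the model only choose among metrics and dimensions that the semantic layer already defines, and to reject anything else before a query is run. No real LLM or semantic layer API is called here; `ALLOWED_METRICS`, `validate_request` and the JSON request shape are all hypothetical.

```python
import json

# Hypothetical catalogue: metric name -> dimensions it may be grouped by.
ALLOWED_METRICS = {
    "revenue": {"channel", "customer_type", "month"},
    "marketing_spend": {"channel", "month"},
}


def validate_request(raw_model_output: str) -> dict:
    """Parse the model's JSON reply and reject anything the semantic layer
    does not define, so the model cannot invent its own business definition."""
    request = json.loads(raw_model_output)
    metric = request.get("metric")
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"unknown metric: {metric!r}")
    group_by = list(request.get("group_by", []))
    unknown = [d for d in group_by if d not in ALLOWED_METRICS[metric]]
    if unknown:
        raise ValueError(f"unknown dimensions for {metric}: {unknown}")
    return {"metric": metric, "group_by": group_by}


# A reply that sticks to defined metrics passes through unchanged.
print(validate_request('{"metric": "revenue", "group_by": ["channel"]}'))

# A made-up metric is rejected instead of being silently computed.
try:
    validate_request('{"metric": "adjusted_net_growth"}')
except ValueError as error:
    print(error)
```

The design choice is that the model translates a question into a request, while the definitions themselves stay in the semantic layer, which is what makes follow-up questions safe to chain.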

PD: Are there any specific changes you already see happening to semantic layers to get them fit for LLMs? You already mentioned some topics, which could be compliance, quality assurance, regulatory and safety topics and so on. That could be part of that pipeline, of course. But are there any specific changes directly to a semantic layer that you would anticipate?

TA: No. I think, at a high level, pulling the semantic layer out from BI and having it as a standalone layer has been the work needed. It's not complicated to go plug in an ELM or an LLM on top. I think two things come to mind. In general, the semantic layer is very, very early, right? dbt launched a semantic layer last year and that was a big failure. They acquired Transform, and they're in some ways deprecating their semantic layer and replacing it with Transform's. So one of the biggest companies behind data modeling got it wrong. And because of that, we've seen companies like Cube and a bunch of others launch semantic layers. So we're very, very early in the adoption life cycle of the semantic layer. This is not a readily available part of the stack for most companies, so there is work in actually going and deploying this layer. But deploying this layer all of a sudden went from a nice-to-have to something truly interesting. Because if I am able to deploy this layer and I do it right, it's not so much about BI tools, it's that I can truly build an organization where anyone in the company can ask any question, and that gets really interesting. Maybe some work is needed on security and privacy. First of all, something to keep in mind is that using a public version of an LLM, like using ChatGPT, is probably not a good idea, 'cause as soon as you plug something in there, it's now available for everyone, right? It's not your instance of it. I think they now have a version of ChatGPT which can be deployed for your company, or you can use an open source ELM. The main difference between ELMs and LLMs is that ELMs are built on top of your data sets. They're open source and they're way more focused on your business, whereas a large-scale language model has got a lot more information. So I think either works. What you don't want to use is a public LLM. You either use a private LLM or your own ELM. But as we think about deploying these inside companies, what BI tools have done really well through many, many years is that you can control the data, right? A very easy to understand example of this is that a sales manager of a store only sees data for their store, and their manager can see data for a bunch of stores, and their manager can see it for the city, and for the state, and all the way up. And we can filter all of these things inside a BI tool by having different layers of access control. As we think about how we do this, there are a few different ways. Semantic layers in general are pretty new, but semantic layers are actually starting to build out access control, as in role-based access control and who can get access to what information. And then there's this other question of, do you want this in the semantic layer, which probably makes sense in terms of enterprise-wide adoption, or do you want it inside your AI layer, with your AI layer enforcing it? So I think there are some open questions. I don't think any of these are really hard problems to solve, but it shows that this is relatively in its infancy, 'cause the best practices of how to do this haven't been very clearly defined as yet.
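Illustration (not from the episode): the store, city and state example of access control can be sketched as a scope filter that is applied to rows before any answer is computed. The `SalesRow` and `Role` types, the sample data and the `visible_rows` helper are hypothetical, and the sketch takes no position on whether this check should live in the semantic layer or in the AI layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesRow:
    store: str
    city: str
    state: str
    revenue: float


@dataclass(frozen=True)
class Role:
    """A user's scope: which level of the hierarchy they manage, and which value."""
    level: str  # one of "store", "city", "state"
    value: str


def visible_rows(role: Role, rows: list[SalesRow]) -> list[SalesRow]:
    """Keep only the rows that fall inside the role's scope."""
    if role.level not in ("store", "city", "state"):
        raise ValueError(f"unknown level: {role.level}")
    return [row for row in rows if getattr(row, role.level) == role.value]


# Hypothetical sample data.
ROWS = [
    SalesRow("S1", "Austin", "TX", 120.0),
    SalesRow("S2", "Austin", "TX", 80.0),
    SalesRow("S3", "Dallas", "TX", 200.0),
    SalesRow("S4", "Denver", "CO", 150.0),
]

store_manager = Role("store", "S1")
city_manager = Role("city", "Austin")
state_manager = Role("state", "TX")

for role in (store_manager, city_manager, state_manager):
    total = sum(row.revenue for row in visible_rows(role, ROWS))
    print(role.level, role.value, total)
```

Whichever layer enforces it, the point is that the filter runs before the question is answered, so a store manager asking about "revenue" only ever sees their own store.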

PD: This idea of having an LLM that is able to navigate an enterprise data landscape is also something that has been brought up by Google and Microsoft a couple of times already. So it seems like they potentially have a strong vision of pushing that into the future, and I think everybody can follow that a little bit and see it. But how about setting this up from the side of the LLM? If you have a GPT or a bot or a model like that, what is required to connect it with the semantic layer to make sure that it can navigate the data landscape in a robust way?

TA: So I think obviously the first version of this is gonna be, you know, being able to export data to a Google Sheet and just put this inside your own private LLM, and then ask questions on top of that, right? The issue over there is, again, you don't have a semantic layer on top. So there's always some fear that it's hallucinating a definition of your metrics. If you wanna do it the correct way, I wouldn't recommend going down that route. You want to have it plugged in on top of your semantic layer. Now, either you have the expertise where you can use these APIs to actually connect this on top of your semantic layer, or, as I mentioned, there are some companies that are doing this, right? So Delphi Labs is an example. I think you can sign up for their service online, and you can just connect your Slack on top of your semantic layer, and then you can just start asking questions of that stack out of the box. And I know there are a few others. But I think what we're starting to see, and we've seen this more inside content, we've seen this more inside marketing, we've seen this inside writing code with Copilot, is applications built on top of the generative AI technology stack, right? So we are seeing more and more of these vertical applications built on top. And I think in the next tier of these, the tier two of them, there are gonna be a ton of applications like Delphi, which are basically able to connect to your semantic layer and give you a very easy interface, whether it's Slack, or whether it's text-to-speech, or whether it's another very intuitive type of interface. I think we'll start seeing a lot more of those.

PD: Obviously there are a lot of components to setting up an LLM, especially when you think about an enterprise-level setup where you have huge amounts of data, many stakeholders, and lots and lots of complexity. So we're talking about prompt engineering, pipeline engineering, fine-tuning and so forth. There are a lot of things to consider to get an LLM to really work properly. Do you see a clear path already for the connection between the LLM and the semantic layer, in a very specific, kind of heads-down sense?

TA: [...] nothing to do with [inaudible 00:17:34] LLMs.

PD: Can you talk a little bit about the advantages, disadvantages of ELMs versus LLMs?

TA: [...]ld an AWS to build [inaudible]

PD: Where would you draw the line between a private LLM and an ELM? With a private LLM, obviously you do a lot of work on adapting it to a certain type of data or business. You have a lot of prompt engineering, which can be quite extensive already these days, right? You can define many different agents that are use-case specific. You can define all of these tools. You have APIs and so on. So you're doing a lot of work in the periphery of the LLM already to adapt it, and it specializes more and more the more you do that, even though you're not fine-tuning or pre-training or training the model at all, right? Do you think an ELM is something that always has to do with training or changing the weights of the network, or where would you see the line?

TA: Look, I think that 99% of companies will be consumers of AI and 1% of companies will actually go create AI, right? Everyone is now saying they're AI-driven, but only a small percentage of companies will actually go build the infrastructure for it, and the majority of companies will go use it, right? So if you go build it, obviously you have much more of a competitive advantage. But having said that, five years ago we saw this wave where every single e-commerce company wanted to build their own recommendation engine. And the reality of it is, 99% of these companies should not be building their own recommendation engine. You just go use a recommendation tool for e-commerce, and that makes a ton of sense, right? So I see that same analogy in ELMs versus LLMs. We're still very, very early, right? We haven't yet defined who the category winners are in many of these areas. So if you are able to go do this yourself, you have more flexibility on what this looks like, and you can actually customize it in a way which makes sense for your business. And it's not easy to do either way, right? There's no off-the-shelf solution which does a much, much better job. Again, it depends on the problem statement, but for problem statements where you're able to plug this on top of your company's own data sets, it's not very clear that an LLM is 100X better than an ELM. So we're just too early to actually see what happens. I know that's not answering your question directly. I don't think there is a correct answer. I don't think anybody can foresee this at the moment.

PD: Yeah. So as you are rightly pointing out, all of these technologies are super new. I think the feeling that lots of people have regarding LLMs is that they have been thrown onto the market too soon. We've seen so many changes already just in the last couple of months, and the pace of change is just insane still. And as you pointed out already, the concept and utilization of semantic layers is also something that is still evolving a lot. So this is very, very hard to predict, but of course we're still interested in your point of view on the future. I think any predictions can only be wrong at this point, but nevertheless, they're interesting to entertain a little bit.

TA: Exactly.

DD: So Tarush, one final question on the semantic layer. How does a company get started with deploying a semantic layer? What advice have you got for them?

TA: You know, I think two of my favorite ones are... I think Cube is our favorite one right now. It's actually based on LookML, which was the semantic layer for Looker. I think Cube is largely based on that. It's an open source project. Transform, which dbt acquired, has also got an open source semantic layer. So these are really good. If you want a deployed version, Cube has this, and dbt is deploying the new version now. I don't think you can use a managed version of Transform at the moment; it will be available later this year. So if you're willing to deploy open source, Cube and Transform both work. If you want a cloud solution, I would go with Cube. And if you need more expertise, then obviously at 5X we assemble the entire data stack for you, so this is something which we could help out with as well.

DD: Yeah, I was gonna ask you about that. Is your platform at the forefront of this deployment of LLMs and/or the semantic layer? Is there something that people can use there?

TA: So this is something which we're actively working on. We have this going for a small subset of our customers who've been early adopters. But in the next few weeks we are looking to roll this out for any 5X customer. The issue, again, is not so much being able to deploy the LLM and use things like Slack and text and voice to get this. At the moment, we're constrained by how few companies have really invested in a semantic layer. But again, whether you have no data stack or you have an existing stack, if these are capabilities you need, and you need to deploy a semantic layer and then get the tools on top of it, this is something which we can help with.

::

PD: Tarush, you mentioned already that the role of BI in businesses will probably change a lot. I think this is also an expectation a lot of people agree on. Especially when you look into industry, a lot of BI teams, advanced analytics teams and data science teams are already anticipating this change and trying to prepare for it. What do you think are the biggest changes that will affect BI in the future?

TA: I don't think BI is going away, right? There are hundreds of billions of dollars of market cap across your big players, and they aren't in the business of putting their hands up and surrendering. I look at it more as an evolution, right? I think your very standard world of what BI does today is: I want a report, I want a dashboard, I wanna let people slice and dice data. It's just the fact that I think generative AI will be much, much better at that, because it can build on top of this and answer nuanced questions, whereas with BI dashboards you basically need to know the question beforehand. What I'm actually very bullish on, and what I think the evolution of BI is gonna be, is that it's gonna get a lot more niche. If you look at the architecture, right, you have ingestion, which pulls data from your sources and puts it into your warehouse. You have a warehouse. You have a modeling or semantic layer, which cleans it. And then you have BI, which is essentially reporting. And then you typically have other layers on top, which are recommendations and insights. Until now, BI has really meant a reporting layer, right? Reporting means that it doesn't ever change data. If you wanna change data, you go back into your source system and you change it there. BI is a read-only state, and that's been the architecture. Now, with the role of traditional BI in question, where if you're just building dashboards it's no longer that valuable, I think in some ways it's gonna force BI tools to get more niche, right? Like, what is the BI tool for a fintech company, or what is the BI tool for e-commerce? And that's not just the metrics, but also the insights and recommendations on top of that. So I think we'll have ingestion as a separate layer. You'd have data warehouses. And I think in some ways parts of the modeling, parts of the reporting and parts of the recommendations all merge, and this gets deployed for different industries. I am very bullish on this idea of niche BI, which goes much deeper than just reporting. I think if BI is to survive and needs to beat an actual LLM or generative AI, then it needs to go much deeper. And the one way we can get there is by having a lot of industry context.

PD: I don't see why there must necessarily be a conflict, right? Internal BI teams, or even BI tools, could also be the owners of specialized LLMs that could provide on-demand mini reports in the future.

TA: Yeah, but I think that raises the question of the value of an enterprise BI tool, right? Your enterprise tools cost hundreds of thousands and up to millions of dollars to implement. And all of a sudden you can have a lightweight ELM deployed on top of your semantic layer, and you can do this for 100X cheaper. Ultimately these tools can only charge what they're charging because the lift of trying to do this yourself is very high. So they have a really valuable business and they can charge a lot of money. When all of a sudden anyone can answer questions on top of the semantic layer relatively easily, the value of your traditional BI tool starts to go down. I'm not saying they won't still be able to charge some customers millions of dollars, but in general, now that there's an option available which is 100X better and 100X cheaper, do you really have a very long-term business? And that's really what kills the industry. Not so much that these large tools can't do this, they can, but will they still be able to justify 100X costs compared to a much cheaper provider doing this, because it becomes relatively easy to do now? Does that make sense?

PD: It makes a lot of sense. I'm just thinking, cost drivers in businesses can also be compliance, regulatory, safety and ethical issues and so on. I think that's a little bit underdeveloped in the LLM space at the moment. But when this ramps up, that could also drive up the cost, potentially by an order of magnitude or something like that.

TA: Yeah, exactly right. If you look at an enterprise tool today, it's not just enterprisey in terms of the technology, which in this case could be slicing and dicing data and pivoting and reading and having beautiful dashboards. It's also got everything you mentioned: the security and the compliance and the role-based access control and audit logs and all of this, right? Until now, the moat has been both the visualizations, the dashboarding and the technology capabilities, plus the security and compliance. All of a sudden, what we're now saying is, hey, the new technology capability is actually quite easy. The moat over there becomes a lot less defensible. So now it's kind of only the enterprise security piece, right? So essentially all I'm saying is that the barrier to entry for a new player who's building the next generation of a BI tool has all of a sudden become a lot lower. And some of the competitive advantage which these larger tools had in BI is starting to fade away. With this pressure on both sides, the open question is, are they going to be able to hold on to their valuations and their impact as much over the next few years?

PD: That makes a lot of sense. Now, in large organizations, a lot of that is still deployed or provided by internal teams for data science, advanced analytics and so on. In large global organizations, even individual markets or individual business functions have their own teams for that. How do you see those roles changing in the future? What kind of skill sets or talent or roles will they need in the future?

TA: Yeah, that's a great question. I think it's very clear now that we've entered an era of do more with less, right? Two years ago when we had this conversation, we had data scientists and data engineers and analytics engineers and, I don't know, whatever new combination we made of this, and data platform and all of this. And I think last year, this year, was interesting, because for the first time companies are like, "Hey, I'm not getting as much of an ROI from data, and we need to do better with less," and do more with less has become a theme, right? And especially now, if you look at data engineers or analysts or BI engineers, whichever SKU you wanna look at, folks like this don't have to go all the way and build dashboards, and don't have to do every analysis themselves. Their work has become a lot easier, because they're just figuring out what your semantic layer is and what your metric definitions are. And just by nature, the scope of these data engineers, analysts, BI engineers and analytics engineers becomes a lot smaller. So number one, financially, just where we are with what's happened in the last 18 months, and number two, with the advent of this technology, I think from both sides there's a lot of pressure inside businesses, where we're gonna need smaller data teams in order to be as effective.

DD: So as we close the podcast, Tarush, how can people find out more about what you're up to and the latest advancements that are happening with your company, 5X?

::

TA: Yeah, absolutely. Just for context, as we speak about do more with less, if you look at the entire data space, and we deep-dived into modeling and BI, it happens to be one of the most fragmented ecosystems, right? The analogy which we use is that all of these vendors are really selling car parts. Imagine walking into Honda and, instead of selling you a brand new Civic, they sell you an engine and you have to build your own car. That sounds a little crazy, but that's really what's happening in the data space. So what we're up to is, we help you build an end-to-end platform across all of these different layers. We spoke about some of them today, like ingestion, warehousing, modeling and BI. We help you build your end-to-end platform based on your use cases, your industry, your size and your budget, so from day one you can focus on your ROI and how you're gonna use this, instead of worrying about building your car. So that's really what we're up to. Would love to chat. You can reach us at 5X, that's 5x.co, or you can reach out to me personally. I'm sure we'll be sharing my LinkedIn and Twitter, and my email is just tarush@5x.co.

DD: Awesome. Yeah, we will include links to Tarush's LinkedIn profile and the 5X website in the show notes. That concludes today's episode. Folks, before we leave you, I just want to quickly mention the forthcoming issue of our quarterly magazine, "The Data Scientist". Issue V is an AI special issue. It's out in early September and it'll feature contributions from Philipp, Tarush and many others, and from companies such as AstraZeneca, Snowflake, Bayer and NatWest Bank, to name but a few. It's packed full of insights, like today's, into what enterprises are doing now and in the future with generative AI. And you can subscribe for free to get the next issue as soon as it is out. The address for that is datasciencetalent.co.uk/media. Tarush, thank you so much for joining us today. As always, it was an absolute pleasure talking to you.

TA: Damien, thank you so much for having me on the show, and hopefully we've added some value for your listeners.

DD: You certainly did. Thanks also to my co-host Philipp Diesinger and to you for listening. Do check out our other episodes at datascienceconversations.com and we look forward to having you with us on the next show.
