Solving the GPU shortage: software to 2-4X existing GPU utilization


GPUs are rare and expensive right now. Every company doing AI model training needs more, and NVIDIA can’t build enough, especially of the NVIDIA H100 GPU. The result: a massive GPU shortage.

Run AI CEO and co-founder Omri Geller says he has a software solution to this hardware problem.


The key: GPUs are mostly idle, even in high-demand settings. According to Geller, his software delivers 2X to 4X the GPU capacity on existing hardware, simply by streamlining workloads and maximizing GPU usage time.

In this TechFirst, we chat about:
– the GPU shortage
– how many GPUs we need
– what OpenAI is using right now
– whether OpenAI is getting dumber or not
– and much more



Transcript: solving the GPU shortage

Note: this is an AI-generated transcript. It will have some errors. – Omri Geller

John Koetsier: Is OpenAI losing its advantage in the AI arms race, and is one of the biggest challenges in the global AI battle the chips that power intelligence? Hello and welcome to TechFirst. My name is John Koetsier. OpenAI and ChatGPT are pretty amazing. I use them regularly. But there’s a lot of competition out there, and many think that GPT-4 is getting dumber, not smarter.

What’s going on, and is it the global GPU shortage that’s impacting AI companies in general, as well as OpenAI specifically? To dig in, we’re chatting with the co-founder of Run AI. They help you train and deploy large language models. His name is Omri Geller. Welcome, Omri.

Omri Geller: Hi, John. Nice to meet everyone. Thank you. Really excited to be here.

John Koetsier: Super pumped to have you. Thanks for taking the time. You’re in Las Vegas for a conference right now, which means that the time zones work pretty well. So that’s awesome. It’s still morning for you, not evening. I just read this morning: the AI industry faces a severe NVIDIA H100 GPU shortage, stalling major projects and affecting tech giants’ bottom lines. What’s going on here?

Omri Geller: So, we have actually been in that scenario of GPU shortage for a few quarters already. AI is heavily based on compute power, and GPUs and dedicated processors for AI are the engine for building and deploying those models. With the recent boom around ChatGPT and large language models, we actually saw two things happening in the market.

One, models like ChatGPT, large language models, require more and more compute power, and it scales very quickly. So every organization that wants to build something like ChatGPT from scratch for its own purposes needs a lot of GPUs at the same time. Two, of course, not every organization is going to build something like that, but many organizations would want to utilize capabilities like ChatGPT and others in their business, and therefore they need GPUs to deploy those models on top of them.

So in one moment we faced very significant growth in the demand for GPU compute, whether it is for building those models or for running them in production. And when everyone wants something very quickly, the balance of demand and supply changes. Right now there is significantly more demand than supply in the market.

John Koetsier: Is that impacting your business as well? Do you use them also? 

Omri Geller: So we are using GPUs, but Run AI is in the business of actually helping other organizations take advantage of their GPUs. Our software improves the utilization of the GPU compute that an organization has.

So basically we can help the organizations that right now face a GPU shortage by providing them software that boosts the utilization and the availability of GPU compute within the organization. Actually, it’s really helpful for our business, because we’re helping organizations get over the challenge of getting physical GPUs by providing more GPUs with our software.

John Koetsier: We’ll turn to OpenAI in a second and what’s happening there, how many GPUs they’re using, and whether they’re getting worse or not, but let’s finish this topic for half a moment. Do you make the GPUs 50 percent more efficient? If you have a hundred GPUs, does your software make it as if you’re using 120? How much of a boost are you getting?

Omri Geller: It varies, of course, but typically we see two to four X more availability of GPU compute for the organization. There is a lot of idle time for GPU compute, and with Run AI software, we know how to mitigate that idle time and effectively use significantly more GPUs within the organization.

It helps reduce the overall GPU footprint that the organization is using. But at this point in time, everyone wants more and more GPUs. So with Run AI, they can effectively have many more GPUs in the organization, even if they don’t have them physically.
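One way to read the two to four X figure is as simple arithmetic on idle time. The sketch below is an illustration only: the function name and the 25 percent and 75 percent utilization figures are assumptions for the example, not numbers from the interview or from Run AI.

```python
def effective_gpus(physical_gpus: int, util_before: float, util_after: float) -> float:
    """Busy compute after raising utilization, expressed as the number of
    GPUs you would need at the old utilization level to do the same work."""
    if not (0 < util_before <= 1 and 0 < util_after <= 1):
        raise ValueError("utilization must be in (0, 1]")
    return physical_gpus * util_after / util_before


# Hypothetical cluster: 100 GPUs that are busy 25% of the time.
# If scheduling keeps them busy 75% of the time, the same hardware
# does the work that 300 GPUs did before: a 3X gain.
print(effective_gpus(100, 0.25, 0.75))  # 300.0
```

Under this reading, a 2X to 4X gain is only possible when the starting utilization is at or below 50 percent, which matches the claim that GPUs sit idle much of the time.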

John Koetsier: Two to four X is impressive. Okay. Back to OpenAI. How many GPUs are they running? Does anybody know?

Omri Geller: I don’t know, but it’s in the vicinity of tens of thousands of GPUs. I don’t know the exact number, but it’s growing all the time because of the high demand and the requirements.

John Koetsier: What’s your thought? There’s been a lot of talk over the last month, maybe two months, that GPT-4 is getting dumber and that OpenAI is not progressing. Some of that is definitely happening because they don’t want to teach people to make bombs or other things like that. There are some restrictions that they’re putting on their own AI system for safety purposes and other things like that.

Is that all that’s happening? Have they regressed? What’s your view?

Omri Geller: I don’t think that they have regressed. I think that we, as people who are actually leveraging ChatGPT, became more experienced, and we are asking more complex questions and have more complex problems to solve. So, in general, it’s a race.

When ChatGPT was introduced, the initial requests from people were simple requests, and right now people are asking much more complex things, and the requirements from models like ChatGPT went significantly higher. Take into consideration the fact that it’s also a statistical model in the end, meaning it’s improving and it’s learning all the time.

So we can expect that they will be better again in the near future, but a little bit afterwards, harder topics will be asked of ChatGPT, and vice versa. We’ll go back and forth. Did it become dumber? No, it’s just that we became smarter and it becomes smarter, but it’s like a race that we’re running.

John Koetsier: Interesting. If you look at the GPUs that they’re using, or that anybody who’s building an LLM is using, what’s the usage for training versus the usage for actually delivering answers?

Omri Geller: It’s a great question. One of the things that we see is that in the training phase, a single model can require from a few dozen GPUs to a few thousand and, in the case of ChatGPT, a few tens of thousands of GPUs, just to train a single model. And that specific task can take days, weeks, or months.

So for example, ChatGPT, there is a good chance that it would be taking 20 for an organization to train it over the course of a few months. Now, in addition to that, it’s not that you just press play and it works. You need to iterate, you need to do trial and error until it gets to a stable model that actually works in production.

So that is the training part, which again can span from a few dozen to a few tens of thousands of GPUs. When we go to deployment, meaning those models running in production, it is even more complex, because it really depends on the amount of demand that your application has in production.

For example, running a single query of ChatGPT typically takes between four and eight GPUs just to process that question, which is very, very significant. Now, if you have tens of users, hundreds of users, 10,000 users, millions of users, it just scales, right?

So potentially, over time, the amount of GPU compute that the organization would use in production can be significantly higher than the very large GPU footprint that is needed for the training part.
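To see why production demand can outgrow the training footprint, here is a rough sizing sketch based on Little's law (requests in flight equal arrival rate times time per request). Everything numeric in it is a hypothetical assumption chosen for illustration: the two-second response time, the 16 concurrent requests per model replica, and the eight GPUs per replica (the upper end of the four to eight GPUs mentioned above). None of these are figures reported for OpenAI.

```python
import math


def serving_gpus(requests_per_second: float,
                 seconds_per_request: float,
                 gpus_per_replica: int,
                 concurrent_requests_per_replica: int = 1) -> int:
    """Estimate the GPUs needed to serve a steady request rate.

    Little's law gives the average number of requests in flight;
    dividing by what one model replica can handle at once gives the
    replica count, and each replica spans several GPUs.
    """
    in_flight = requests_per_second * seconds_per_request
    replicas = math.ceil(in_flight / concurrent_requests_per_replica)
    return replicas * gpus_per_replica


# Hypothetical service: 2 s per answer, 8 GPUs per replica,
# 16 requests batched concurrently on each replica.
for rps in (1, 100, 10_000):
    gpus = serving_gpus(rps, seconds_per_request=2.0,
                        gpus_per_replica=8,
                        concurrent_requests_per_replica=16)
    print(f"{rps:>6} requests/s -> {gpus} GPUs")
```

With these assumed inputs the estimate grows from 8 GPUs at one request per second to 10,000 GPUs at ten thousand requests per second, which is the linear scaling with demand that Geller describes.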

John Koetsier: It’s shocking. I think I saw a stat, and I’m going to ask you for a sanity check on it. I think I saw a stat a few weeks ago that it’s a thousand times more computationally expensive for OpenAI to answer something using GPT-4 than for Google to answer a query. Is that in the realm of possibility?

Omri Geller: Absolutely. Yes. Now, those are things that we, as a community of AI, need to think about and solve, because obviously it can’t scale forever and use more and more compute power; it will just not work. But yes, to your question, those numbers are reasonable and make sense.

John Koetsier: We do see some projects, some of them open source, to create LLMs, train them, and then deploy them on a single CPU on a desktop, even on a mobile phone. Thoughts on that? Are those getting smarter, better, more efficient?

Omri Geller: So there are basically many people that are doing many different things, right? And all of them are important for actually pushing the boundaries of AI. One trajectory would be to use larger and more complex models that can leverage more and more compute power and that potentially can also be smarter.

But as I said, that doesn’t scale forever. There need to be methods for improving the models without exponentially growing the compute power. So there are a lot of efforts in the industry to take models and actually make them run efficiently on smaller compute footprints, whether it’s smaller GPUs or even commodity CPUs or, as you mentioned, a cell phone. This is important.

Typically you would get slightly less accurate results, but many times that’s good enough. And that is extremely important, because it will allow different organizations and different users to choose what’s right for them. Are they willing to give up some accuracy and performance for better ROI, paying less, having more availability of compute, and so on? They can do that, and there are many efforts today in the industry to do that. If they want to have the biggest and the best of everything, then it will not work. But yes, this is happening and we see it in reality.

John Koetsier: OpenAI should be a customer of yours, theoretically, and their 50,000 could act like 200,000.

Omri Geller: Theoretically, yes. 

John Koetsier: How does, how does your software make GPUs more efficient? Is it just increasing utilization? What’s going on?

Omri Geller: So there is a lot of technology that we’ve built on multiple levels, but basically you can think about it like this: many applications are not always using the GPUs. I’ll give you an example. Let’s say that we have an application in production that is waiting to answer some queries. Organizations typically allocate GPUs that wait for those requests, for example a query to ChatGPT, but there are idle times.

There are times when the GPUs are not actually processing anything because they are waiting for requests. In those times, our software knows how to take advantage of those GPUs and run other applications that are really needed at that specific time. And we know how to alternate very quickly between applications and demands from the different workloads that are running, and to take into consideration the priorities and policies of the organization.

So we know how to dynamically allocate the resources in a way that, for the end customer, feels like they have many more GPUs, because without Run AI, GPUs are allocated statically to applications, to users, and so on.
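The difference between static and dynamic allocation can be shown with a toy simulation. This is a minimal sketch under stated assumptions, not Run AI's scheduler: the demand trace, the quotas, the priority order, and both function names are hypothetical, and real schedulers also have to handle preemption cost, memory, and fairness.

```python
def static_served(demand, quota):
    """Static allocation: each app can only use its own fixed GPU quota,
    so GPUs reserved for an idle app stay idle."""
    return sum(min(wanted, quota[app])
               for step in demand
               for app, wanted in step.items())


def dynamic_served(demand, total_gpus, priority):
    """Dynamic allocation: all GPUs form one pool that is handed out at
    every time step, highest-priority app first."""
    served = 0
    for step in demand:
        free = total_gpus
        for app in priority:
            grant = min(step.get(app, 0), free)
            served += grant
            free -= grant
    return served


# Hypothetical trace: GPUs wanted per time step on an 8-GPU cluster.
# Inference traffic is bursty; training can always use more GPUs.
demand = [
    {"inference": 1, "training": 8},
    {"inference": 4, "training": 8},
    {"inference": 0, "training": 8},
    {"inference": 2, "training": 8},
]
quota = {"inference": 4, "training": 4}   # static split of the 8 GPUs
priority = ["inference", "training"]      # inference is served first

capacity = 8 * len(demand)                # 32 GPU-steps available
print(static_served(demand, quota) / capacity)         # 0.71875
print(dynamic_served(demand, 8, priority) / capacity)  # 1.0
```

In this trace inference still gets every GPU it asks for, because it has priority, while training absorbs whatever inference leaves idle.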

John Koetsier: Contextualize this for me in kind of the history of compute, because, let’s say way back when, not so long ago, and in fact on some computers even today, you have got one CPU, probably no GPU, right? And you’ve got an operating system. It’s got a kernel. It sees its resources. It’s allocating clock time, and there you go, running jobs, probably not even concurrently originally. But then you get more sophisticated operating systems and more sophisticated compute allocation algorithms, so you can run multiple jobs.

All of a sudden we have laptops with four cores, 16 cores, 32 cores, and we can distribute jobs all over, and you’ve got to get more sophisticated about handing out the work orders, right? Who does what, when, and how they get the memory allocated, long term, short term, all that stuff. Is this the next version of compute, where you basically are running a supercomputer and you really have to have some smart software to allocate resources and jobs?

Omri Geller: So basically you just gave the Run AI pitch, because everything that has been solved for CPUs, the evolution that we had, exactly as you said, from the beginning where one CPU could run some workload to where we are today, where everything is so dynamic: you have context switching, you have over-provisioning, you can run many applications on your laptop, and it just works. All of that does not exist for the AI compute world.

And more than that, there is the scale. When we talk about a computer, it’s actually a supercomputer, right? It’s a cluster of many GPUs that are connected. And then the problems of load balancing and managing the different applications that are running at scale become even more complex.

So that’s what Run AI builds. We’re building the capability to share a single GPU between many applications, and then to share many GPUs between many applications, in a way that is transparent for the users. And that’s how we operate large AI compute clusters for organizations.
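Sharing a single GPU between many applications can be pictured as a packing problem. The sketch below uses first-fit decreasing, a standard bin-packing heuristic, as a stand-in; it is an assumption for illustration and not a description of how Run AI places workloads. The job names and the percent-of-a-GPU requests are hypothetical.

```python
def pack_fractional(requests_pct, capacity_pct=100):
    """Place jobs that each need a fraction of a GPU (in percent) onto as
    few GPUs as this heuristic finds: largest job first, each on the
    first GPU that still has enough room."""
    gpus = []   # jobs placed on each GPU
    free = []   # remaining capacity of each GPU, in percent
    for job, pct in sorted(requests_pct.items(), key=lambda kv: -kv[1]):
        if pct > capacity_pct:
            raise ValueError(f"{job} needs more than one GPU")
        for i, room in enumerate(free):
            if pct <= room:
                gpus[i].append(job)
                free[i] -= pct
                break
        else:
            # No existing GPU has room, so start a new one.
            gpus.append([job])
            free.append(capacity_pct - pct)
    return gpus


# Hypothetical jobs and the share of one GPU each actually uses.
jobs = {"notebook-a": 25, "notebook-b": 25, "inference": 50,
        "batch": 60, "eval": 40}

placement = pack_fractional(jobs)
print(len(jobs), "GPUs with one job per GPU")
print(len(placement), "GPUs when shared:", placement)
```

Here five jobs that would each occupy a whole GPU under one-job-per-GPU allocation fit on two shared GPUs. In practice, sharing also has to isolate memory and compute between the jobs, which this sketch ignores.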

John Koetsier: Almost feels like a cloud kernel or something like that.

Omri Geller: Partly it is.

John Koetsier: Interesting. Okay, cool. So what’s the solution here? Obviously we want to use the compute power that we have more efficiently. That’s great. But I guess NVIDIA just has to build more GPUs. And maybe we need some competition here, right? I mean, NVIDIA is the game in town. It’s pretty impressive, but they can’t build enough for the whole world. Can they?

Omri Geller: Right. And the solution will come from multiple directions. It’s not one thing; it’s a combination of things that need to be done. One thing is really to overcome the supply and demand issue, with NVIDIA producing more, which is great, but also with more vendors entering the market and supplying their chips, which is happening as we speak. We’re definitely seeing progress there. So, more availability of compute.

The second thing, as you mentioned, is the work that is being done in the industry on making models less compute intensive. This is important. We need to focus on making sure that models can be more sustainable over time. There are many people working on those solutions.

So the first thing is to have more compute power available, and the second is to make models more sustainable over time. The third thing is to have software like Run AI that actually makes better use of the compute that is already out there. It’s always a software and hardware play. You cannot fix only the software or only the hardware; it has to come together. Those three things over time will mitigate the problem that we are facing right now, but it’s going to take some time. It’s not going to be solved tomorrow. So we’re expecting a significant period of challenges in demand and supply.

John Koetsier: I had to smile for a second there, because for a decade or two in the Wintel world, the phrase was always, what Intel gives, Microsoft takes away, right? Or, what Andy Grove gives, Bill Gates takes away. The software is growing faster than the compute power, and maybe, just maybe, in some of what you’re working on, and probably some competitors too, software is starting to give back a little bit. Exactly. Thank you for this time.
