AI and training data: If data is the new oil, where’s the refinery?

Where do Google, Microsoft, and IBM go for training data and data enrichment?

AI is driving innovation, competitive advantage, and speed to market … but what if you don’t have enough training data? And what if your data is raw, not enriched, and you have no metadata to help your AI engine make sense of it?

In this episode of TechFirst with John Koetsier, we chat with Wendy Gonzalez, President and CEO of Samasource, which supplies training data for Google, IBM, Microsoft, and a quarter of the Fortune 50.

Get the full audio, video, and transcript of our conversation below …

Subscribe to TechFirst: the new AI data refinery


Watch: the new AI data refinery

Subscribe to my YouTube channel so you’ll get notified when I go live with future guests, or see the videos later.

Read: the new AI data refinery

TF69: data refinery

John Koetsier: If data is the new oil, where’s the refinery? Welcome to TechFirst with John Koetsier.

AI is driving innovation, competitive advantage, and speed to market. But what if you don’t have enough training data?

And what if your data is raw, maybe it’s not enriched. Maybe you have no metadata to help your AI make sense of what data you have. Then I guess you turn to maybe somebody like Samasource, which supplies training data for Google, IBM, Microsoft, and apparently a quarter of the Fortune 50.

To find out where AI is going, and where maybe training data is headed as well, we’re chatting with Wendy Gonzalez who’s the President and CEO of Samasource. Welcome, Wendy! 

Wendy Gonzalez: Thank you, John. Really glad to be here. 

John Koetsier: Excellent. Happy to have you. You supply training data for some of the world’s top tech companies. And often when people think of AI, they think of computers, they think of code, they think of technology. They think of things that are horrifically complicated and complex, that aren’t understandable by maybe anybody, and especially not ordinary people who aren’t techie.

But you do training data. What is that? What’s that component? 

Wendy Gonzalez, CEO of Samasource

Wendy Gonzalez: Yeah. So training data is really the basis for AI. So at the end of the day, machines need to learn how to speak, see, and hear. And they do so much like a human learns how to speak, see, and hear.

What we do is training data, and that is really the labeled, structured data that teaches a machine or a computer how to do these things.

So, imagine you have a self-driving car application, it has cameras and sensors, it’s driving down the roadway, needs to be able to detect what is a pedestrian, what is a vehicle, what’s a bicycle, what’s a road sign, you know, what’s a drivable space. And that’s basically what we do.

John Koetsier: Excellent.

Wendy Gonzalez: We provide an object for that structured labeled data.

John Koetsier: Good, good, good. I mean, that’s obviously easily forgotten when you think about AI, but if you don’t have the data and you don’t understand what the data is, you can’t really do anything with it, correct? 

Wendy Gonzalez: That’s exactly right.

I think what’s really kind of interesting that people maybe don’t realize, is that while all these investments are being made in AI, something like nearly a hundred billion is predicted over the course of the next few years, none of this AI can be realized without training data. 

John Koetsier: Yeah.

Wendy Gonzalez: And it’s actually the majority of the reason why AI projects don’t work well, is because they’re missing high-quality structured data. 

John Koetsier: Excellent, excellent. And by the way, I full-screened you there and you mentioned right off the top that you didn’t have an office in the place where you’re staying, so you were in your daughter’s bedroom.

This is COVID times, this is what we do, right? So, I’m glad you’re able to join and you’ve got good lighting. So it’s all good, no worries. 

Wendy Gonzalez: Perfect, perfect. 

John Koetsier: There’s something cool about what you do. I mean, there’s a lot of companies that outsource and you do some outsourcing obviously, but you do some outsourcing with a purpose, and that kind of caught my eye as I checked your website. Can you talk a little bit about that? 

Wendy Gonzalez: Yeah, absolutely. So we aren’t your typical sourcing company. We actually do something called “impact sourcing.”

So we were founded in 2008 with a mission to move people out of poverty by giving them work. So the idea is that we recruit and work with people in very underserved communities.

We provide them digital skills training, and then we hire them in to work on these data labeling projects with the, you know, really powering the world’s leading technologies, provide them living wages and benefits. And it really creates a career path towards, you know, future growth. 

John Koetsier: Right, interesting. And I think the number that I saw on your site was 50,000? 50,000 people that you have given a living wage to, is that correct? 

Wendy Gonzalez: We have moved over 50,000 people out of poverty since we were founded in 2008. And to give you a sense of it, we are working in communities where people make less than $2 a day, which is the World Bank standard for poverty.

So it’s something we are extremely proud of, and it really powers everything we do, because people coming from these communities — these are not only folks who are incredibly talented, very bright, they really lack the opportunity for employment. And providing this environment, working on these very, very exciting technologies, we have great retention. And so as a result, not only are we lifting people out of poverty, but we have expert labelers, which is really neat. 

John Koetsier: Yeah. Talk about some of the areas that are hottest right now. I mean what are your customers really looking for in terms of data that they’re feeding you and asking you to explain what it is?

Wendy Gonzalez: So we see things in really a few different areas.

One in particular, we see a lot of growth in AR/VR, so augmented reality and virtual reality. And this could really include everything, it’s everything from faces, shirts, shoes, you name it, furniture. Really anything that can be detected, all the way to autonomous transportation. 

John Koetsier: Mm-hmm.

Wendy Gonzalez: So, you notice I don’t say self-driving cars, we see it in trains and planes, and, you know, some cities and how you manage traffic.

We’re also seeing a lot of really interesting growth, as you would imagine, also in e-commerce. So a lot of things I would describe as visual search, so how do you actually look up something and detect whether it’s a plaid shirt as an example, right. So how do you define plaid? We actually have to structure all that. 

John Koetsier: Interesting. I love that you said autonomous transportation. I mean, I think it was, I forget which airline it was, you probably remember, it was in the news this morning or yesterday, an airplane took off and flew somewhere and landed autonomously. That was pretty impressive.

I know of a startup in Russia that’s doing autonomous driving or transportation for farm machinery as well as trains, and other things like that. And of course, there’s so much demand for autonomous robots for delivery, last mile, right? So there’s so much that needs to be done there.

Wendy Gonzalez: Yeah, it’s incredible. We are working on a few projects specifically related to delivery robots, as an example.

John Koetsier: Interesting. Maybe go into a little more detail on some of the training that you might do for an autonomous vehicle. What needs to be labeled and how does that data get used? 

Wendy Gonzalez: Yeah, it’s pretty incredible. I wanna say … like everything ultimately in the scene needs to be labeled.

So what happens often is we’ll work with companies, they have both video data as well as sensor data, right? So there are cameras on the cars that are pulling in information at different angles, there may be LIDAR or radar. So LIDAR is kind of the light detection [and ranging] and it allows you to provide depth as well as being able to see things in a 2D fashion. And so we’ll get literally tons of these videos and highly complex scenes.

And there’s something called “semantic segmentation” to where literally, in some cases you are putting a label towards every pixel in the entire photo.

So you might be covering and outlining a vehicle, a parked vehicle, a drivable space, non-drivable space, traffic signs. Imagine how complex the scenarios are, because take all of that, apply it to local roads, freeways, all across the world and then put weather on top of it, you know. 

John Koetsier: Yes.

Wendy Gonzalez: So sunny, rainy, hail. So it’s quite complex. Yes. 

John Koetsier: Every pixel? 

Wendy Gonzalez: In some cases, yeah, that’s called semantic segmentation. Yeah. Some cases you have every single pixel.

So as you can imagine, you know, the level of accuracy and quality that’s really required is absolutely huge. Imagine a car getting on the road that cannot detect certain types of vehicles, or doesn’t know when the car is behind another car, right?

John Koetsier: Yes.

Wendy Gonzalez: So it’s a little more complicated than you actually think. 
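
To make "a label towards every pixel" concrete, here is a minimal sketch of what a semantic segmentation label can look like. The class IDs, class names, and the tiny 4x6 "image" are assumptions made up for illustration; they are not Samasource's taxonomy or data format.

```python
# Hypothetical class IDs for a driving scene; real projects define their own taxonomy.
CLASSES = {
    0: "non-drivable space",
    1: "drivable space",
    2: "vehicle",
    3: "traffic sign",
}

# A tiny 4x6 "image" where every pixel carries exactly one class ID.
mask = [
    [0, 0, 0, 3, 0, 0],
    [0, 2, 2, 0, 0, 0],
    [1, 2, 2, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]


def pixel_counts(mask):
    """Count how many pixels were assigned to each class name."""
    counts = {name: 0 for name in CLASSES.values()}
    for row in mask:
        for class_id in row:
            counts[CLASSES[class_id]] += 1
    return counts


counts = pixel_counts(mask)
for name, n in counts.items():
    print(f"{name}: {n} pixels")
```

A real camera frame has millions of pixels rather than 24, which gives a sense of why labeling full scenes across road types and weather conditions is so labor-intensive.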

John Koetsier: Or a Tesla that sees a white semi basically in front of it in the road and doesn’t identify that as a vehicle or a barrier, and slams into it and the occupant dies. I mean, this is pretty serious stuff. 

Wendy Gonzalez: Yeah, safety, and you’ll hear me say often quality. Oftentimes what people talk about in terms of training data is quality, and what that really means is how accurate is that labeling and that segmentation, because that’s certainly one example.

You could take it in a number of different ways. Imagine if vehicles couldn’t detect different sizes and shapes of people as pedestrians, that would be pretty problematic. 
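
One common way to put a number on "how accurate is that labeling and that segmentation" is intersection over union (IoU), which compares a labeler's mask against a trusted reference mask. The sketch below is an illustration only: the masks and the class ID are invented, and the interview does not say which quality metrics Samasource actually uses.

```python
def iou(mask_a, mask_b, class_id):
    """Intersection over union for one class between two same-sized label masks."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            in_a = a == class_id
            in_b = b == class_id
            if in_a and in_b:
                intersection += 1
            if in_a or in_b:
                union += 1
    # If neither mask contains the class, treat the two as in full agreement.
    return intersection / union if union else 1.0


# Hypothetical masks: 2 marks "vehicle" pixels, 0 marks everything else.
reference = [
    [0, 2, 2, 0],
    [0, 2, 2, 0],
]
labeled = [
    [0, 2, 2, 2],
    [0, 0, 2, 0],
]

score = iou(reference, labeled, class_id=2)
print(f"vehicle IoU: {score:.2f}")
```

A score of 1.0 means the two masks agree on every vehicle pixel; lower scores mean the outline drifted from the reference.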

John Koetsier: Yes, yes. And we know we already have those issues with regard to face color, for instance, for a variety of different AI things, perhaps, hopefully not transportation-related, but certainly AI systems that are built to solve problems commercially for people.

So you’re pretty in tune, you’re providing training data for AI systems for 25% of the Fortune 50. You’re pretty in tune with what’s going on in AI and what’s changing. What’s in most demand? And how has that changed over the years that you’ve been in business? 

Wendy Gonzalez: Yeah, I think that, you know, ultimately the technology is still relatively new. And so if you think about the primary adoption of AI really started 10, 11 years ago. So what’s kind of amazing that people sort of don’t think about when it comes to machine learning is that it’s still relatively new, so there’s a lot of tools and infrastructure that I think are being built right now. It’s happening, the advancement’s happening very, very quickly.

So from the perspective especially what we do, is that gone are the days of ‘Is there a car in this picture?’ you know, or ‘Is there a dog or a cat?’

John Koetsier: We would hope.

Wendy Gonzalez: Yeah we would hope, exactly. So it’s incredible how quickly things are advancing and so when I, yeah, maybe sound a little bit repetitive saying ‘quality,’ it’s because the smarter AI gets, the more you’re finding edge cases, right?

So, if everything was kind of completely labeled, well then, you know, we’d be in the singularity, John. Like all AI would be possible, right?

So what’s happening is you get into more and more edge cases, use cases, and I think one of the key things is that when you’re thinking about your AI application, one of the biggest questions that should be made, but that we also believe very strongly in, is do you have the right set of data? It’s not just about having all the volumes of data, but that example we were just talking about in terms of self-driving cars.

Imagine if you had a representation of hundreds of thousands of vehicles, but only like 10 motorcycles, right. Then you’ve got immediately kind of an inherent bias and so you have to really worry about not just to get the quality data, but do you have the right and most comprehensive representative data.
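
The motorcycle example can be checked mechanically by counting labels per class and flagging any class that falls below a minimum share of the dataset. This is a rough sketch: the 1% threshold, the function name, and the label counts are assumptions chosen for illustration, not a rule from the interview.

```python
from collections import Counter


def underrepresented(labels, min_share=0.01):
    """Return class names whose share of all labels is below min_share."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sorted(name for name, n in counts.items() if n / total < min_share)


# Hypothetical dataset: plenty of cars and trucks, almost no motorcycles.
labels = ["car"] * 100_000 + ["truck"] * 5_000 + ["motorcycle"] * 10

flagged = underrepresented(labels)
print("Classes needing more examples:", flagged)
```

A check like this only catches gaps in classes that already have a label; noticing that a category is missing altogether still takes the human judgment discussed in the conversation.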

John Koetsier: Interesting, very interesting. So you need some human expertise at some level to say, hey, you know, you’ve got great data, but in a small subset, or you’re missing something else, or something along those lines? 

Wendy Gonzalez: Yeah, I like the word last mile. So you mentioned that with delivery robots. I feel it’s the same way with quality in terms of training data, is that the human context and judgment is so key.

Imagine, I mean you can probably look, there’s all these sorts of really, really interesting examples with data sets. It’s like if you have, you know, a computer vision algorithm can identify a chair if it’s standing upright, but if you tip it over or set it on a bed, you know, the machine’s like I don’t even understand the context of this, what does this really mean?

So that human judgment in context to help the algorithm get as smart as possible is absolutely critical. 

John Koetsier: Interesting, very interesting. What’s providing the most competitive advantage right now? 

Wendy Gonzalez: I think it is, ultimately it is about quality as well as classifications. So how do you understand what is the most comprehensive set of data? How do you understand what you’re missing, right?

So that example of unrepresentative data is a key one. So, classifications sort of analytics to understand what you’re not doing well at, and then having not only a solution that can sort of do the heavy lifting.

So we do have a platform, we try to do as much machine-learning assisted annotation as possible. We leverage our humans to get to that last mile of quality. 
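
The general pattern of machine-assisted annotation with humans handling the "last mile" can be sketched as a confidence threshold: a model's confident predictions are accepted as pre-labels, and the rest go to a human review queue. Everything here is hypothetical (the function name, the 0.9 threshold, and the sample predictions); it illustrates the idea and is not a description of Samasource's platform.

```python
def route(predictions, threshold=0.9):
    """Split model predictions into auto-accepted labels and a human review queue."""
    auto_accepted, needs_review = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_accepted.append((item_id, label))
        else:
            needs_review.append((item_id, label))
    return auto_accepted, needs_review


# Hypothetical model output: (item ID, predicted label, confidence score).
predictions = [
    ("img-001", "chair", 0.98),
    ("img-002", "chair", 0.41),  # e.g. a chair tipped over on a bed
    ("img-003", "bed", 0.93),
]

auto_accepted, needs_review = route(predictions)
print("Auto-accepted:", auto_accepted)
print("Sent to human labelers:", needs_review)
```

Where the threshold sits is a trade-off: a higher value sends more items to people and costs more, while a lower value lets more machine errors into the training data.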

John Koetsier: Yes. 

Wendy Gonzalez: And ultimately that is really the differentiation.

I would also say that diversity in your labeling workforce as well as continuity is important.

Oftentimes, a lot of companies in this space they use something, you know, they use the crowd, right? And so you have somebody completely different, you don’t necessarily know their background. So kind of visibility, security and trust is something that we find particularly important with the largest enterprises. But beyond that, if it’s kind of a revolving door, getting to those really difficult edge cases that we were just talking about, if you don’t have any continuity and the use cases get more and more complex, I think that’s also quite a challenge. 

John Koetsier: Yes, yes. Interesting. So we talked about COVID-19 already once, you’re not in your office, obviously.

Wendy Gonzalez: No.

John Koetsier: How has COVID-19, various lockdowns and quarantines, affected AI and training data, and maybe even investing?

Wendy Gonzalez: Yeah, it’s really interesting. I think we’re seeing slowdowns in some areas and growth in others. You know, we have a variety of customers. As you can imagine, those clients in like hospitality or manufacturing definitely suffered some slowdowns. We also have seen some delays in data collection. So when we were talking about representative data, if you can’t get out there and drive around on the street, or your country or city is in lockdown, that definitely has caused us some repercussions.

And then by the same token, we’ve seen strong growth in other areas. You know, AR/VR is just absolutely chugging along, you know, e-commerce is as well, delivery robots.

So, yeah, it depends on the industry, but as a whole, I think one of the things we’ve seen is that especially in the space we work on, is that there’s a massive amount of data that you need. And you know, we’ve actually seen for our business some level of growth — actually real growth, because you need that. I think if it is the new oil, yeah, I think that’s a very good analogy, it’s like, they need that flow of information to continue their progress. 

John Koetsier: Very interesting. Anything in robotics has to be pretty huge. I have a friend who is building and shipping these autonomous — I wouldn’t call it a barista, it’s like a box, you buy a box and there’s a coffee shop — a robotic coffee shop, and you put it down somewhere and somebody can get a coffee, presumably a hot chocolate, latte, whatever from the robot. 

And apparently it’s pretty good. So, but yeah, lots of training data needed for all those things. I don’t know how much data he needs annotated for that business though.

Wendy Gonzalez: Yeah, there’s all sorts of kind of thoughts over, you know, how much data do you really need. And if you’re trying to prove something, is it 10,000 images or 15,000 images? If you’re trying to get something to production and how big is it? Is it a 100 or 200? It really depends on the complexity of what you’re trying to solve for.

So, you know, in the case of [a] self-driving car, imagine you may need thousands, if not maybe in the millions, of like road miles driven to be able to address all the different scenarios.

So, that kind of thing.

John Koetsier: Mm-hmm. Actually a really good question here from Doug Bennett, he says, “Well the question is how long do you expect this gold rush to last?” I mean, we’ve seen an expansion, or should we say an acceleration in retail, you know, e-commerce, m-commerce parts of retail, four to six years or something like that according to Adobe data.

We’ve seen an acceleration of other trends in terms of remote work, other things like that. Do you expect this acceleration for the need for AI and AI training data to continue?

Wendy Gonzalez: Yeah, absolutely, because it’s pervasive across every industry. I mean, ultimately anticipating that the world starts to kind of get back to normal, there’s no limit to the adoption of machine learning AI.

So, like we’ve worked on everything from sustainable fishing, to reducing elephant poaching, to financial services classification, it’s incredibly pervasive. I mean, there are many estimates that say that, you know, 80% of consumers are already using some form of AI-enabled product.

It’s only going to get, I think, more and more pervasive.

John Koetsier: Well, absolutely. I mean, if you use Facebook or if you use Google search or something like that, you’re touching AI, you’re getting benefits from AI in some way, shape or form, obviously, right?

The other one of my favorites was Google’s Project Loon.

They deliver internet via weather balloons, or balloons in the air in Kenya, and the balloons are self-driving in a sense, in that they have a machine learning algorithm that tells them to go up or down based on where the wind velocities are.

And somehow with a certain number of balloons, and some that get, you know, transitioned out of service and maybe repositioned manually, they managed to maintain coverage by going slower here, faster there, different directions. There’s a lot that we can use AI for. That kind of feeds into my next question for you.

Where do you see AI going next? What do you see being super hot over the next few years? 

Wendy Gonzalez: Yeah, so we are seeing, and I think we will continue to see, quite a bit in the AR/VR space. We’re going to continue to see tons of investments, because I mean, at the end of the day, especially in our business of training data, you know, there’s no limit to, sort of, what is included in reality or virtual or augmented reality. So the amount of objects and classifications is absolutely very, very large.

We definitely see a lot in healthcare. There’s an incredible amount that can be done in the healthcare and life sciences.

Even, for example, what’s happening obviously in this terrible pandemic. We worked on a project with Mila, one of the leading AI research facilities, to create a chat bot basically that allowed people to identify symptoms. You know, imagine taking that same application to, for example, lung scans. There are many, many different examples there as well. We’ve worked on surgical robotics, you name it, so we see a lot there too. 

John Koetsier: Very, very interesting. Wendy, I want to thank you for spending some time with us. 

Wendy Gonzalez: Thank you, John, appreciate it. 

John Koetsier: Excellent. Well, for everybody else as well. Thank you for joining us on TechFirst. My name is John Koetsier. Really appreciate you being along for the show.

You’ll be able to get a full transcript of this podcast in about a week at You’ll see the story on Forbes shortly after that. Of course, the full video will be available on my YouTube channel. And thanks for joining, maybe share with a friend.

Until next time … this is John Koetsier with TechFirst.