New York high school student builds AI framework to predict air pollution with 92% accuracy


A high school student at Jericho High School in New York has built an AI framework that can predict air pollution levels with 92% accuracy using neural networks, random forest, and other techniques.

That … could be better accuracy than most weather forecasters.

In this episode of TechFirst with John Koetsier, we chat with Richard Ren about his framework, including how he learned to code, why he got into AI and machine learning, what data he’s using, what technologies he’s implementing, and what data is most predictive of high pollution levels.

Watch, listen, and get the full transcript below!


Watch: grade 11 student builds AI for air pollution

Subscribe to my YouTube channel so you’ll get notified when I go live with future guests, or see the videos later.

Read: grade 11 student builds AI for air pollution

John Koetsier: A high school student has built an AI framework that can predict air pollution levels with 92% accuracy. Welcome to TechFirst with John Koetsier. 

Knowing when it’s a high pollution day is pretty important for a lot of different people. If you have asthma, maybe you’ve got trouble breathing, maybe you’re particularly sensitive to particulate matter in the air. Well, can AI help us predict those pollution levels, what they’re going to be? And how the heck is a high school student building machine learning models?

To find out, we’re joined by Richard Ren, who is a grade 11 student at Jericho High School in New York. Richard, welcome!

Richard Ren: Hey, it’s great to be here, John.

John Koetsier: Hey, excellent to have you. We’re going to enjoy this here. The first thing I thought when I saw what you were doing, the model that you had built and everything like that was, hey, you’re predicting pollution better than the meteorologists are predicting weather. Is that correct? 

Richard Ren: Weather prediction models have really increased in accuracy. So, but yeah, I mean, I think it’s great that I’m able to predict pollution using these machine learning methods so reliably.

John Koetsier: Talk a little bit about prediction of weather. I mean, you obviously got deep in the subject if you’re going to predict pollution levels and everything like that. How solid are weather predictions these days and what’s the general level of accuracy? 

Richard Ren: Yeah, my specialty is in pollution prediction, but from what I gather from like my literature search and so on and so forth, they’ve really improved over the last 20 years. So in my mind, there’s been like two major things. First, is that you’re starting to see a lot more data being collected. You’re starting to see the rise of big data and as well as these data sets being made publicly available, right, all of the data that I used was publicly available. 

John Koetsier: Yeah. 

Richard Ren: And I think that’s an absolutely incredible accomplishment that was able to come to fruition because of new technologies like the internet. So now it seems like the bottleneck is not really in the amount of data we have, but rather the methods for forecasting say, weather or pollution, right?

So right now the NOAA is trying to incorporate machine learning methods into their more theoretical frameworks so that way they can try to predict air pollution as well as weather far more accurately. 

John Koetsier: Mm-hmm. So you got to 92% accuracy. How long did it take to get there? 

Richard Ren: Took a long time, I would say maybe around a year. So it started off with like a Beijing dataset in which I tried to predict air pollution — more specifically PM2.5 pollution, so that’s particulate matter 2.5 microns or less in width — and I did it solely based on weather.

And that’s sort of the pattern that you see in literature, they only take one element that’s extremely important while ignoring some of the other elements, or they only use one machine learning model, right.

And so from that, so I was able to get maybe an 80 to 85% accuracy, but it’s sort of like the 80/20 rule, like 20% of the effort gets the first 80% and then after it’s the last 20% is the most difficult. 
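
As a rough illustration of the weather-only baseline Richard describes, here is a minimal one-variable least-squares fit in plain Python. The choice of wind speed as the predictor, the function names, and the numbers are all assumptions made up for the sketch; the interview does not say which weather variables or which regression form he used.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Apply the fitted line to a new weather reading."""
    return slope * x + intercept

# Made-up toy values: wind speed (m/s) vs. PM2.5 (micrograms per cubic meter).
wind_speed = [1.0, 2.0, 3.0, 4.0]
pm25 = [90.0, 70.0, 50.0, 30.0]

slope, intercept = fit_line(wind_speed, pm25)
print(slope, intercept)                 # -20.0 110.0
print(predict(slope, intercept, 2.5))   # 60.0
```

A single weather variable like this ignores the pollutant history that the later models use, which is consistent with the accuracy ceiling Richard mentions for his first attempt.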

John Koetsier: Yes.

Richard Ren: So for that I made two major modifications. First, instead of just using — so in that model, I used a very simple regression analysis, but generally you sort of want to try to incorporate more machine learning methods. So each machine learning method is special in that it has its upsides, it has its downsides.

So upsides of neural networks, they tend to be quite accurate, especially for deep learning, but they require so much data, right?

Random forests tend to be robust but they also overfit, so they might conform too much to that dataset while ignoring overall trends. 

John Koetsier: Sure.

Richard Ren: So you’re seeing like these individual strengths and weaknesses. So if you try to take a multilateral approach and use multiple machine learning methodologies, you can get a more accurate result.

John Koetsier: Cool. So let’s dive into that in a moment. I want to get into the details of which technologies you’re using, how you implemented them, and how they all improve the accuracy levels. But maybe let’s start here as well, how did you get into AI?

Maybe even, how did you start to code, where’d you start learning that?

Richard Ren: Yeah, the great thing about getting into AI — so essentially I started because my grandparents and my other extended family they live near like these large cities in China, Beijing, Shanghai, where they have like huge levels of air pollution, right? And so I created a regression analysis for them to use essentially, you know, just being able to see air quality a few days in advance.

That’s such a simple thing, but that’s also such an important thing just being able to plan ahead. Maybe I shouldn’t go on Wednesday because AQI is like 130, I don’t know, right? 
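
For context on the AQI of 130 in Richard's example, the sketch below maps an AQI value to the category bands published by the U.S. EPA. The `should_limit_outdoor_time` helper and its thresholds are a hypothetical rule of thumb added for illustration, not something from the interview.

```python
# EPA Air Quality Index bands: (upper bound of the band, category label).
AQI_BANDS = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for Sensitive Groups"),
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
]

def aqi_category(aqi):
    """Return the EPA category label for an AQI value."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper, label in AQI_BANDS:
        if aqi <= upper:
            return label
    return "Hazardous"

def should_limit_outdoor_time(aqi, sensitive):
    """Hypothetical rule: sensitive people stay in above 100, others above 150."""
    return aqi > (100 if sensitive else 150)

print(aqi_category(130))                     # Unhealthy for Sensitive Groups
print(should_limit_outdoor_time(130, True))  # True
```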

John Koetsier: Yeah. 

Richard Ren: Especially if you’re like my relatives in China tend to be a little bit older and air pollution is like this horrible disease, like it slowly messes with your lungs and cardiovascular system. So if you’re vulnerable, you don’t want to get exposed to that. 

John Koetsier: I was in Shanghai actually — I should say it correctly, Shanghai — a number of years ago, but I was very fortunate to come in right after a typhoon. So I saw Shanghai when it was amazing and clean and beautiful and clear and everything like that. But the air pollution can get really, really bad. 

Richard Ren: For sure.

John Koetsier: So you built it to help your grandparents. Where did you go from there? 

Richard Ren: Yeah. So as it sort of goes with these things, like you start with a tech project, you know, you go on Stack Overflow or you go on some other resource like YouTube and after you just start getting into the field, right? And the great thing about the internet is that it gives you so much potential, it gives you so many opportunities. If you wanted to, you could probably get the equivalent of a computer science degree just by taking these online courses at Coursera, right?

You have the world’s top universities, Harvard, Yale, Stanford, MIT, name them, they’re putting up courses online for free. It’s just great. 

John Koetsier: It’s like free money!

Richard Ren: Yeah, it’s free real estate. And so you just take advantage of it and you just acquire as much information as you can. There are great YouTube channels as well that just, they have like hour-long tutorials on just how to build AI, or like, you know, teaching you how neural networks work and sort of the linear algebra behind that. And you just sort of take advantage of them and you read up on literature, and from there you’re able to identify a gap in the literature and hopefully fill it. 

John Koetsier: Excellent, excellent. So let’s talk about your project and what you built. You talked about open datasets and you’re doing it for a variety of cities, or a number of cities now, I believe. What data are you using? 

Richard Ren: Right now I’m doing it for one city as a sort of starting prototype. That’s what I did for my conference paper. The city that I chose was Los Angeles, California and it was Los Angeles for a very specific reason.

California, ever since like the 2018’s they’ve been plagued by this problem of wildfires, right? They made the news and although they might not make the news right now, there are still wildfires going on right now in 2020, just the Apple fire …

John Koetsier: Wow.

Richard Ren: … caused like people, residents had to evacuate. So that’s one of the areas in this country in which — well, I guess you live in Canada — but one of the areas in the United States where you see air, sorry, yeah, where you see that you need these sort of air prediction models. So that way you can hopefully make sure that people can avoid the health risk to them and their family. 

John Koetsier: Mm-hmm. So you picked L.A., and where are you getting your datasets from? 

Richard Ren: They’re all publicly online. And that’s the amazing thing, you just have all these like publicly available datasets. So the ones that I use specifically, there’s a data set by some Russian company called Reliable Prognosis 5. They host weather data, so I was just able to get data from 2016 to mid-2018 from there for Los Angeles, California. And for the pollutant information, you can find that on the Environmental Protection Agency’s website, they host an air quality systems database.

John Koetsier: Nice.

Richard Ren: So you can just download, you know, any pollutant and just get that information.
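
Combining a weather source with a pollutant source, as described above, usually comes down to joining the two on date. The following is a minimal sketch of that step. The field names (`date`, `temp_c`, `pressure_hpa`, `pm25`, `so2`) and the values are hypothetical and do not reflect the real file layouts of either dataset.

```python
def join_by_date(weather_rows, pollutant_rows):
    """Inner-join two lists of dicts on their 'date' key."""
    pollutants = {row["date"]: row for row in pollutant_rows}
    joined = []
    for weather in weather_rows:
        pollutant = pollutants.get(weather["date"])
        if pollutant is not None:  # keep only days present in both sources
            joined.append({**weather, **pollutant})
    return joined

# Made-up toy rows, for illustration only.
weather = [
    {"date": "day-1", "temp_c": 24.0, "pressure_hpa": 1015.0},
    {"date": "day-2", "temp_c": 26.5, "pressure_hpa": 1018.0},
    {"date": "day-3", "temp_c": 25.0, "pressure_hpa": 1012.0},
]
pollutants = [
    {"date": "day-1", "pm25": 12.0, "so2": 1.1},
    {"date": "day-2", "pm25": 35.0, "so2": 2.4},
]

rows = join_by_date(weather, pollutants)
print(len(rows))  # 2
print(rows[1])    # {'date': 'day-2', 'temp_c': 26.5, 'pressure_hpa': 1018.0, 'pm25': 35.0, 'so2': 2.4}
```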

John Koetsier: Nice, by city and by pollution level and all that stuff. So then you can start looking at the weather and you can start looking at its correlation with air pollution on a variety of different levels. What AI technologies did you implement and which ones did you find were most effective? 

Richard Ren: Yeah. So in my prototype I used three main models. First is neural network. Second is random forest. Third is logistic regression, right? Each one has their advantages, each one has distinct disadvantages, that’s sort of the reason why I chose those three. And I was able to find that the combined model was able to have more accuracy than any of the constituent models. I found that the most effective were random forest. So random forest and any modification thereof, as well as neural networks falling close behind.
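
The interview does not say how the three models were combined. One common option is a simple majority vote over each model's class prediction, sketched here in plain Python. The prediction lists are made-up toy values, chosen so that the models err on different days. They show the mechanism only and are not Richard's results.

```python
from collections import Counter

def majority_vote(*model_predictions):
    """Combine per-day class predictions from several models by majority vote."""
    if len({len(p) for p in model_predictions}) != 1:
        raise ValueError("all models must predict the same number of days")
    # With three models and two classes there is always a strict majority.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*model_predictions)]

def accuracy(predicted, actual):
    """Fraction of days where the predicted class matches the actual class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# 1 = high-pollution day, 0 = not. All values below are invented.
actual         = [1, 0, 1, 1, 0, 0, 1, 0]
neural_network = [1, 0, 0, 1, 0, 0, 1, 1]
random_forest  = [1, 0, 1, 1, 1, 0, 1, 0]
logistic_reg   = [0, 0, 1, 1, 0, 0, 0, 0]

combined = majority_vote(neural_network, random_forest, logistic_reg)
print(accuracy(neural_network, actual))  # 0.75
print(accuracy(random_forest, actual))   # 0.875
print(accuracy(combined, actual))        # 1.0
```

The vote only helps when the models make different mistakes. If all three were wrong on the same days, the combined accuracy would be no better than the best single model.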

John Koetsier: Mm-hmm. So you said you found a gap in the literature when you studied this before actually building your systems, your frameworks. And the gap in the literature [was that] most people implementing AI to predict weather or to predict pollution were using a single methodology.

Richard Ren: Yeah.

John Koetsier: But you found that by implementing multiple you can get better predictability?

Richard Ren: Exactly. That’s, you know, you couldn’t have worded it better. I mean, you see all this, like, oh, there’s like a Bayesian model to predict air quality index using weather in Hong Kong, right. And it’s all very, very, like it only uses one method.

It only uses one predictive factor, but by leveraging all of the major predictive factors, as well as multiple machine learning methods, you’re able to get, you know, 92% prediction accuracy.

John Koetsier: Cool. Did you find any particular data that was more predictive of pollution index levels? 

Richard Ren: Yeah. So my random forest networks are — so with neural networks, it’s sort of a black box, but thankfully I started random forest models and so they were sort of able to identify the top predictors for pollution and non pollution.

So the first one is Air Quality Index a day ago. So that is no surprise. 

John Koetsier: Yes, that’s understandable. 

Richard Ren: Yeah. So that’s, you know, if you’re on day B and you’re trying to predict day B’s AQI and you have day A’s air quality index, that’s obviously going to be very helpful. It sort of stays constant. Then afterwards the next two, so number two and number three, were PM2.5 levels.

John Koetsier: Yes.

Richard Ren: So particulate matter in the air of 2.5 microns or less, and sulfur dioxide concentration on day A. So those seem to be major factors, which tells you that pollutants are extremely important. And the fourth most major factor was air pressure at sea level. So higher pressure correlates with higher amounts of pollution.
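
The "day A predicts day B" setup Richard describes amounts to building lagged features: each training example pairs one day's measurements with the following day's AQI. Here is a minimal sketch of that pairing. The field names and values are hypothetical, and using the previous day's pressure (rather than a same-day forecast) is an assumption, since the interview does not specify.

```python
def make_lagged_examples(days):
    """Pair each day's measurements (day A) with the next day's AQI (day B)."""
    examples = []
    for day_a, day_b in zip(days, days[1:]):
        features = {
            "aqi_prev": day_a["aqi"],
            "pm25_prev": day_a["pm25"],
            "so2_prev": day_a["so2"],
            "pressure_prev": day_a["pressure_hpa"],
        }
        examples.append((features, day_b["aqi"]))
    return examples

# Made-up daily records in chronological order.
days = [
    {"aqi": 60, "pm25": 15.0, "so2": 1.0, "pressure_hpa": 1012.0},
    {"aqi": 85, "pm25": 28.0, "so2": 1.8, "pressure_hpa": 1016.0},
    {"aqi": 110, "pm25": 39.0, "so2": 2.5, "pressure_hpa": 1019.0},
    {"aqi": 95, "pm25": 33.0, "so2": 2.1, "pressure_hpa": 1017.0},
]

examples = make_lagged_examples(days)
print(len(examples))  # 3
print(examples[0])    # ({'aqi_prev': 60, 'pm25_prev': 15.0, 'so2_prev': 1.0, 'pressure_prev': 1012.0}, 85)
```

Four days of records yield three examples, because the last day has no following day to serve as a target.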

John Koetsier: Interesting, interesting. Did you do anything with COVID-19 and lockdowns? I mean, a lot of us globally noticed that when we had COVID-19 and lockdown, quarantine, whatever you want to call it, our air got cleaner. People in India saw the mountains, the Himalayas for the first time, right? Cities in China were clear. Beijing air was, the sky was blue again. 

Richard Ren: Yeah.

John Koetsier: And did you do any work during that period of time? And what did you see if you did? 

Richard Ren: Yeah, so I haven’t actually run my data set on like any post-COVID data, but because of like the way that my framework is structured and because it takes into account not just weather, but also specific pollutant information, you’ll see a decrease in things like PM2.5 and sulfur dioxide. And in turn, you’ll see that the results will be able to accurately reflect the AQI. 

John Koetsier: Gotcha.

Richard Ren: I think it’s really interesting though, that air quality is obviously improving, but it’s only improving for as long as we can keep it, right. If we just return immediately back to our jobs once this pandemic is over, I’m not sure how big of an impact this will have in the long run.

John Koetsier: Yeah, exactly. So what does the future hold for you and your framework? I believe you’re building it out, you’re trying to make it publicly available. Is that correct? 

Richard Ren: Yeah. Right now I’m working on a web application and afterwards I’m planning to work on a mobile application. 

John Koetsier: Cool. And so any ETA for that?

Richard Ren: I would probably say in like three months, or two months, or one month — I’m not sure. 

John Koetsier: I guess it depends on how much homework you have to do. You’re still in grade 11, so there is, you know, teachers are annoying, they ask you to actually get some stuff done occasionally. 

Richard Ren: If any teachers are watching this, I do not endorse this message. I absolutely love your class. 

John Koetsier: Hahaha, wow. 

Richard Ren: So you know.

John Koetsier: What data would you love to have that you don’t have right now, that you think would be really, really interesting in helping AI predict pollution levels? 

Richard Ren: Probably the very interesting — a concept that I was actually thinking about was you’re starting to see the rise of keyword analytics, right?

So the whole idea of like, if you have let’s say, especially in Los Angeles, California, if you see all of a sudden that there’s a ton of news articles writing about wildfires, maybe you should take that into account for new pollution, right?

Maybe that tells you that, oh, some stuff is about to go down, right? So that’s definitely something that I want to incorporate into my model. 
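
One simple way to turn the keyword idea into a model input is to count, per day, how many headlines mention wildfire-related terms. The sketch below does that with a regular expression. The keyword list and the headlines are invented for illustration, and this is only one possible reading of "keyword analytics", not a feature Richard has built.

```python
import re
from collections import Counter

def keyword_mentions_per_day(headlines, keywords=("wildfire", "wildfires", "evacuation")):
    """Count headlines per day that mention any keyword (case-insensitive, whole words)."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    counts = Counter()
    for day, text in headlines:
        if pattern.search(text):
            counts[day] += 1
    return counts

# Invented (day, headline) pairs.
headlines = [
    ("day-1", "Wildfire forces evacuation east of the city"),
    ("day-1", "City council approves new budget"),
    ("day-2", "Crews battle wildfires as smoke spreads"),
    ("day-2", "Evacuation orders expand overnight"),
]

counts = keyword_mentions_per_day(headlines)
print(counts["day-1"])  # 1
print(counts["day-2"])  # 2
print(counts["day-3"])  # 0
```

A daily count like this could sit alongside the weather and pollutant features as one more column, with a spike acting as an early signal of smoke-driven pollution.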

John Koetsier: Interesting. Very good. Maybe something even searching Twitter or something like that.

Richard Ren: Exactly.

John Koetsier: Or recent photos would be interesting as well. Well, very interesting stuff. Very impressive, especially at your age what you’re doing, and I look forward to impressive things in the future. Are you looking, are you thinking of maybe doing a startup at some point in the future? Is that something you’re looking at or have you not looked that far out? I know you’re applying to colleges and universities right now.

Richard Ren: Yeah. I never actually thought of doing a startup, but it seems like this idea is good enough that I would actually love to work with a bunch of friends and try to make this like a reality. Get some like automatically updating web and mobile API, right. I’ll make the data publicly available and hopefully like when you open the weather app you’ll be able to see, you know, not just tomorrow’s temperature, but also tomorrow’s pollution, right? Things like that.

John Koetsier: Yes. Interesting. Well, thank you so much for your time. 

Richard Ren: Yeah. Thank you. 

John Koetsier: Excellent. For everybody else, thank you as well. Thank you for joining us on TechFirst. My name is John Koetsier. I appreciate you being along for the ride. Hey, whatever platform you’re on, please like, subscribe, share, comment, all the above. If you’re on the podcast, you like it, hey, rate it, review it, that’d be a massive help. Until next time, this is John Koetsier with TechFirst.