131: 7 Resources to Find Amazing Datasets (FREE)
Episode 131 • 16th October 2024 • Data Career Podcast: Helping You Land a Data Analyst Job FAST • Avery Smith - Data Career Coach

Shownotes

💌 Subscribe To My Newsletter

Finding quality datasets doesn't have to be hard. In this episode, we highlight seven must-know sources where you can easily grab free data for your next project. These resources are sure to inspire your work.

Get FREE access to 1M+ Datasets here: https://datacareerjumpstart.com/datasets

💌 Join 10k+ aspiring data analysts & get my tips in your inbox weekly 👉 https://www.datacareerjumpstart.com/newsletter

🆘 Feeling stuck in your data journey? Come to my next free "How to Land Your First Data Job" training 👉 https://www.datacareerjumpstart.com/training

👩‍💻 Want to land a data job in less than 90 days? 👉 https://www.datacareerjumpstart.com/daa

👔 Ace The Interview with Confidence 👉 https://www.datacareerjumpstart.com//interviewsimulator

⌚ TIMESTAMPS

00:10 Kaggle

00:52 Data.World

01:23 Reddit's r/datasets

02:35 Awesome Datasets on GitHub

03:51 Google Dataset Search

04:57 Mendeley

05:42 UCI Machine Learning Repository

🔗 CONNECT WITH AVERY

🎥 YouTube Channel

🤝 LinkedIn

📸 Instagram

🎵 TikTok

💻 Website

Mentioned in this episode:

💌 Join 10k+ aspiring data analysts & get my tips in your inbox weekly

If you enjoy this podcast, you're going to LOVE my newsletter. Every week, I send you 1 email jam-packed with tips, tricks, and resources. Don't miss it!

Transcripts

Speaker A:

Here are some of the best resources to find free datasets for your next data project. And each one gets better and better as they go, leading up to number seven, which I think is the best.

If I ever need a dataset, Kaggle is the first place I'm checking, just because it has so many different data sets and is so easy to use. I love that it has a search bar right here that allows you to search for different things.

It also has filters that allow you to narrow results by the min and max file size, the file types, different licenses, usability, and what the data is actually about. For example, if you want to do a project on, say, pigs, you type in the word pig and see all the different data sets that pop up.

Guinea pig detection, worldwide meat consumption, farm animals, so on and so forth. Super simple, but very useful. And like I said, it's my number one resource when I'm trying to find data myself.

Number two is data.world. This is a great database of different datasets that's very similar to Kaggle, where, as you can see, they have over 130,000 data sets available. I really like that they have these tags over here on the left hand side.

So for instance, if you're interested in animal husbandry, or if you're interested in the census data, or maybe you're more interested in geospatial data, you can just click there. Of course, they also have the search bar here at the top.

So if you're interested in, say, something like soccer, you can just search there and see the available data sets in the results.

Number three is actually on Reddit. Did you know that there is a subreddit called r/datasets? This is where you can go and ask for data sets and give data sets. It's basically a normal subreddit with different conversations and threads going on.

Now, I will always say when I talk about Reddit, your mileage may vary based off of what you see on Reddit. Some stuff's going to be really useful and other stuff might not be as useful.

This one is obviously not as robust as Kaggle or data.world because there's not really a search or filters to go through, but you can do different things like sort by the top posts of today or of all time. It looks like the most upvoted post is right here from nine years ago where they said, "I have publicly available Reddit comments for research. 1.7 billion comments at 250 GB compressed. Any interest in this?" So this is the type of data you will find on this subreddit. Let's look at one more.

Here's the second most popular: "I spent the last eight months during lockdown pouring my soul into a website that allows you to visualize virtually every US company's international supply chain. What products, how much, which factories, and where does Lululemon import from?" Very cool.

Obviously there's a lot more human interaction here, so if that's your thing, check this one out.

Okay, the next data set here is awesome. And that's literally the name. It's called Awesome Datasets or Awesome Data. It is basically a profile here on GitHub that has a bunch of repos of awesome data sets in this README right here. I like this one because it has a table of contents with a bunch of different categories. So if you're into esports, there's esports data.

If you're into museums, there's museum data. That one sounds a little boring to me. But hey, if it's your thing, go for it. If you're into cybersecurity, there are different cybersecurity datasets.

Let's go ahead and click on one of these. For example, let's say we're interested in the museums. I want to see what type of museum data there is.

So it looks like there's the Cooper Hewitt collection database, or the Minneapolis Institute of Arts metadata, or the Getty vocabularies. Super interesting. Let's look at one more. I chose physics.

One of the things I love about physics and science in general is a lot of those datasets are open source. So for instance, like NASA data is available to everyone. CERN has a bunch of really cool data sets available. I mean, just scroll through all of this.

Look how many different data sets you can find. All the social network data sets, all these social sciences. There's literally so much that if I went through each one of these, this video would be very long.

Let's not do that and move to the next one, which is actually the biggest search engine of them all. And that is Google. Yes, not just regular Google, but Google has a specific feature that's a dataset search.

You need to go to datasetsearch.research.google.com. And once again, awesome search bar. It works very similarly to normal Google.

Let's say I'm interested in maybe some sort of video game data, maybe something like Halo. We could type in Halo and see what pops up. So on the left hand side here, you'll have basically different data sets from different websites.

You can see there are some from Kaggle. This one right here is Halo Infinite angel video game. I'm not super familiar with the video game Halo, but this is from huggingface.co, which I actually know has a bunch of different data sets. We can click on it. It would open up the data set, and here it is in the search engine.

We can also sort and filter by things like different topics, the usage rights, the download format, and of course if it's free or not, because Google actually does provide a lot of data set options that are paid, which is probably not for this video and not for you.

The next data set is for all my super nerds out there, especially you in academia. Whether you're trying to get a master's or a PhD, or maybe you're in school right now and you just want some nerdy data, this right here is academic data from Mendeley.

Now if you've never heard of Mendeley before, it's basically an academic company that does a lot of publishing and database type things. I used it a lot when I was publishing my first academic paper. That was a while ago, though.

What's cool is they actually have this data portal that allows you to look up all these different data sets that are being used in these academic papers. When I did my research, I was using machine learning to do something called fault detection.

So we can pull that up in the search bar right here and see all of these datasets inside of fault detection. Simply click on one right here and that should give you the data set on the next page, and you would be able to download it.

The last resource here is probably the most famous and maybe the most used in actual data science contexts. And it's the UCI Machine Learning Repository. They recently just redid the website and it looks a lot better than it used to look.

This resource is going to have all sorts of different data sets that are mainly used for machine learning, but they don't have to be, and these datasets are very well known in the data world. This iris dataset, for example, is basically looking at the length and width of the petals and sepals of iris flowers.

It's probably the most famous data set of all time.

But there are also some data sets on here that I think are really interesting, like this wine data set, student performance, online retail, car evaluation. There's a lot. Once again, nice search bar on the left hand side, as well as all these different ways that you could filter these data sets.

For example, the number of rows, the number of features. Wow. 3.2 million features, 63 million rows. So there's definitely big data in this resource.
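For a feel of what working with a small dataset like iris looks like once it is downloaded, here is a minimal Python sketch using only the standard library. It assumes the classic headerless layout of four measurements in centimeters followed by the species, and it embeds a six-row excerpt rather than downloading the full file. The column names and the `mean_by_species` helper are illustrative choices, not part of the repository itself.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Six-row excerpt in the comma-separated layout of the classic iris file:
# sepal length, sepal width, petal length, petal width (cm), then species.
IRIS_SAMPLE = """\
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

# Column names are an assumption for readability; the raw file has no header.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def mean_by_species(csv_text, column):
    """Return {species: mean of `column`} for headerless iris-style CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS)
    values = defaultdict(list)
    for row in reader:
        values[row["species"]].append(float(row[column]))
    # Round to two decimals to keep the printed summary tidy.
    return {species: round(mean(nums), 2) for species, nums in values.items()}


if __name__ == "__main__":
    for species, avg in mean_by_species(IRIS_SAMPLE, "petal_length").items():
        print(f"{species}: mean petal length {avg} cm")
```

To run the same summary on a full downloaded copy, read the file's text and pass it in place of `IRIS_SAMPLE`, assuming the file keeps the same column order.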

And there you have it, seven free resources to help you find your next dataset for your next data project.

If you want more websites and more resources that have free data sets, look at the description down below and I will send you a longer list with even better data sets.
