In this week's episode, we chat with Michelle Race, Senior Technical SEO at DeepCrawl, about all things XML sitemaps.
Where to find Michelle:
Twitter: https://twitter.com/shellyweb
---
Episode Sponsor
This season has been sponsored by NOVOS. NOVOS, the eCommerce SEO agency, has won multiple awards for their SEO campaigns including Best Global SEO Agency of The Year 2 years running. Trusted by over 150 global eCommerce brands including the likes of Bloom & Wild, Patch and Thread, NOVOS provides tech eCommerce SEO expertise with a creative edge. They have been named as one of 2021's best workplaces in the UK and with a diverse, gender-balanced team are a culture-first agency. The great news is that you can join them! They're hiring senior digital PR and SEO strategists. Visit thisisnovos.com or follow on LinkedIn @thisisnovos
Where to find Novos:
Website - https://thisisnovos.com/
LinkedIn - https://www.linkedin.com/company/thisisnovos
Twitter - https://twitter.com/thisisnovos
Instagram - https://www.instagram.com/thisisnovos/
---
Episode Transcript
Areej: Hey everyone, welcome to a new episode of the Women in Tech SEO podcast, I'm Areej AbuAli. I am the founder of Women in Tech SEO. So, today's episode is one that I'm excited about. It's all about XML Sitemaps. And joining me today is the brilliant Michelle Race. She's a Senior Technical SEO at DeepCrawl.
Hey, Michelle!
Michelle: Hello. How are you?
Areej: I'm good. How are you doing?
Michelle: I'm good, thank you.
Areej: I was so excited when I saw that pitch come through because as geeky as it sounds, I love XML site maps. I love everything about them. So, I love the fact that you pitched that across. Thank you for that.
Michelle: No worries, I feel like they are often a bit forgotten about. So, I wanted to try and change that.
Areej: Yeah. So, can we get started by knowing a little bit more about you, how you got into the world of SEO?
Michelle: Yeah, sure. So, after uni I worked at a few agencies as a front-end web developer, just building website templates. My last agency then decided to stop building websites and my boss at the time said "Oh, why don't you look into technical SEO?" as they didn't currently have an expert. It was a bit daunting at first, but knowing things like HTML helped a lot with understanding the basic tags. And then over the years, I taught myself with lots of help from Google Help forums and documentation, and I was Head of Technical SEO there for quite a long time. But then in November last year, I joined DeepCrawl as a Senior Technical SEO and I work in a team of amazing technical SEOs and it's so fun. I love it.
Areej: Yeah, you work with such awesome people, like some of the best women I know in tech SEO work there so I can imagine how amazing it is to be working alongside them.
Michelle: Yeah, I feel very lucky.
Areej: How did you find the whole, like, agency versus not agency life?
Michelle: Yeah, it was a big change and at DeepCrawl I get to work with large enterprise clients and there's a lot of differences. So, it's really good.
Areej: Yeah, I love that. I used to be agency side for a little bit over five years before I moved in-house, and I prefer in-house, and I can imagine how exciting it is in DeepCrawl getting to work with all kinds of clients as well.
Michelle: Yes, definitely.
Areej: Awesome. And then like for women who are just starting in the industry, because with our audience base, we have women from all walks of life. Any advice you would give them?
Michelle: So, I found it a lot easier to understand technical SEO knowing things like HTML and how a page should be structured. So, I'd recommend trying to learn the basics of things like HTML and also set up a website, if you can, because it's really good for testing, and then obviously join communities like Women in Tech SEO. And also, Twitter is a great resource to follow experts and ask questions, and then definitely join the Google Webmaster hangouts because you just get so much useful information from those. Back when I was learning, it was mainly just the forums. So, the hangouts are just a really good way of being able to ask those questions very easily and quickly.
Areej: That's such good advice. I remember always reading a lot of the recaps of these hangouts. I think I probably felt a bit overwhelmed joining it. I wasn't sure what the setup was going to be. And if I had to make sure that I'm going to ask questions. But whenever I read the recaps or I hear about people who join them, I always hear a lot of positive things.
Michelle: Yeah, I'm a lurker on their hangouts, for sure.
Areej: Yeah, me too. So, we're here today to talk about the XML sitemap. I remember when I first discovered how important they were when I was just getting started in my technical SEO career. So, for people who are fairly new to the topic, what are XML site maps?
Michelle: So essentially XML site maps are a list of the important URLs in your website provided for search engines. They help to speed up the discovery of URLs and help Google to understand how your website is structured, a bit like a road map of your site. They normally sit at the root of a website, but you can name them anything you like, so sometimes they can be a bit hard to find. I normally check the robots.txt first for a site map reference, but then you can also check Search Console to see what has been submitted there. And then you can always try combinations like /sitemap.xml, but there's no requirement to link to them in the robots.txt. So sometimes they can be a bit tricky to find. There are a few guidelines to follow for site maps: the maximum is 50,000 URLs per site map, and it has to be under 50 megabytes uncompressed. You also need specific tags to be valid. So, I'd pull up the documentation to make sure that you've included all the required ones as well.
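As an illustration of the two checks Michelle mentions here, the following is a minimal Python sketch, not a definitive implementation. It assumes you have already downloaded the robots.txt text and the uncompressed sitemap file yourself; the function names `sitemaps_from_robots` and `check_sitemap_limits` are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_URLS = 50_000                 # limit quoted in the episode
MAX_BYTES = 50 * 1024 * 1024      # 50 MB uncompressed, as quoted in the episode


def sitemaps_from_robots(robots_txt):
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt string."""
    found = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons are kept.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def check_sitemap_limits(xml_bytes):
    """Count <loc> entries and measure the size of an uncompressed sitemap."""
    root = ET.fromstring(xml_bytes)
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    return {
        "url_count": len(locs),
        "size_bytes": len(xml_bytes),
        "within_url_limit": len(locs) <= MAX_URLS,
        "within_size_limit": len(xml_bytes) <= MAX_BYTES,
    }
```

If robots.txt lists no sitemap, that does not mean none exists, since the reference is optional; Search Console or a guess such as /sitemap.xml is the fallback described above.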
Areej: Yeah, and we can make sure that we link as well to some of those main guidelines and documentation in the show notes just so that it's helpful for others who might be approaching the topic in a fairly new way. As I mentioned at the beginning, I was excited when I saw your pitch because in the last two years, even of doing Women in Tech SEO, I've never had a workshop pitch or even a podcast one that's come through specifically about this topic. So, I'd love to dig a bit deeper and know more about why you love XML sitemaps and why you feel they're very important.
Michelle: Well, I just think they're amazing resources, not just for search engines, but also technical SEOs. And sometimes I think they can be set up and quickly forgotten about or not checked or optimised because there's a lot of different ways that you can set them up. For search engines, the benefits are that, if they're set up correctly, they are an easy-to-access list of your important URLs, and I will be saying important URLs a lot in this. They can be used to easily highlight new or updated pages for search engines to crawl and recrawl. You can also provide search engines with a lot of information for search, so you can include things like images, and you can also make special site maps for things like videos or Google News. But you have to make sure that you read the documentation carefully. So, there are some differences for Google News site maps, for example: they should only contain 1,000 URLs and only articles from the past two days. So, there are some differences in how you set them up to be aware of. And then for technical SEOs, they are really good for analysis: you can include site maps within crawls, and you can find issues like orphaned pages and errors, and diagnose poor internal linking. But you can also submit them in Search Console. And there are specific errors and excluded reports just for submitted URLs. And you can also analyse individual site maps and narrow down in depth on issues or trends, which is good for large websites when you just want a bit more focus on where to look first.
Areej: Yeah, and I can imagine, the bigger the website is, the more challenging it can potentially be to look into site maps in more detail, like analyse them audit them and so forth, as opposed to a straightforward website that has very similar templates.
Michelle: Yeah. There's a large amount of information when you look at all discovered URLs compared to submitted URLs, so you can just make sure that you look at errors just for your specific category site map, which may be more important than, say, your product one. And it helps find issues a lot quicker. So, you can see trends: if there's an issue with a product template, if they are all set to noindex, that would show a lot faster, for example.
Areej: Yeah, and I think even though Google Search Console still gets a lot of heat, in terms of it might not be as helpful or it's missing data, a lot of the more recent updates, in the past year or so, have made it so helpful to know exactly what's wrong with one site map and so forth.
Michelle: Yeah. The fact that you can just see coverage for submitted site maps, I think is brilliant.
Areej: Yeah. And do you feel that every single website needs an XML sitemap?
Michelle: So not necessarily if it's a small website. Google says in the documentation that a site with 500 URLs or fewer doesn't need one. But in my opinion, if it's simple to create one, especially if it's dynamic, then you may as well because you do get specific submitted reports for analysis in Search Console and you can provide that extra information like last modified dates and you can include things like images. And the main benefit for websites is if your website is new or if it's large and has pages that may be hard to find by crawling. So, this just gives a clear list of them.
Areej: Yeah, definitely, and sometimes even when you're analysing a website or crawling it or so forth, just crawling it in terms of seeing what's in the site map and what's not, yeah, you can get a lot of information from that and it can help you.
Michelle: Yeah, exactly, and if you just want to check your important pages for, I don't know, what the page titles are, you can easily just use that sitemap list to just analyse those pages. So having that list also benefits you as a technical SEO because you already have that important list for checking.
Areej: Yeah. So, what would you say are some of the things that should be included in an XML sitemap?
Michelle: So, your site map should include your canonical URLs, so these should be the URLs you want to be indexed, and they should return a 200 status code. They should also be absolute URLs. So, this means they should contain the domain and the preferred protocol, such as HTTP or HTTPS, and www or non-www. One thing to be aware of is that I've seen it in the past where a website is available via www and non-www and then a site map is generated for both versions. And then you get the www version and the non-www version in two site maps, and that can be very confusing for Google. So normally you'd have a redirect to the preferred version, but just be aware that sometimes you can get extra site maps generated that you may not realise. And then you want to make sure that your canonical URLs are also linked in the web crawl as well. They could be considered doorway pages if they are not linked, which obviously can get penalised. Your URLs should also be UTF-8 encoded and escaped, and also make sure that any useful information for search is included. So, you may want to include things like images or videos or make things like a Google News site map. You can also include your alternate versions, so hreflang can be placed entirely within an XML site map. This can be a good solution if you have a lot of hreflang annotations because it will take up less room on the page, and then make sure you only use one hreflang implementation method. That's the only thing I would say because I've seen it in the past where it is implemented both in the XML site map and in the head tags and then sometimes in the HTTP header as well. And although it won't be an error as such, if you have different hreflangs in each of these, it can send very confusing signals and it's often harder to analyse and fix it. So that is the only thing I would say about that. And then you can also include mobile alternates as well. If you have a separate mobile website, you can create that in your XML site map too.
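To make the hreflang-in-sitemap idea concrete, here is a minimal Python sketch under stated assumptions. The `build_sitemap` helper and the example.com URLs are hypothetical; the structure (one `<url>` entry per language version, each listing every alternate including itself via `xhtml:link`) follows the pattern in Google's documentation for sitemap-based hreflang. The standard library's XML writer escapes characters such as `&` automatically.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML = "http://www.w3.org/1999/xhtml"

# Ask ElementTree to write the sitemap namespace as the default
# and the XHTML namespace with the conventional "xhtml" prefix.
ET.register_namespace("", SM)
ET.register_namespace("xhtml", XHTML)


def build_sitemap(pages):
    """Build sitemap XML with hreflang alternates.

    pages: list of dicts, each mapping an hreflang code to the absolute,
    canonical URL of that language version of one piece of content.
    """
    urlset = ET.Element("{%s}urlset" % SM)
    for alternates in pages:
        # Every language version gets its own <url> entry...
        for url in alternates.values():
            url_el = ET.SubElement(urlset, "{%s}url" % SM)
            ET.SubElement(url_el, "{%s}loc" % SM).text = url
            # ...and each entry lists all alternates, itself included.
            for lang, href in alternates.items():
                ET.SubElement(
                    url_el,
                    "{%s}link" % XHTML,
                    rel="alternate",
                    hreflang=lang,
                    href=href,
                )
    return ET.tostring(urlset, encoding="unicode")
```

For example, calling `build_sitemap([{"en-gb": "https://www.example.com/en/page", "de-de": "https://www.example.com/de/seite"}])` yields two `<url>` entries with two alternate links each. If hreflang is generated this way, the head tags and HTTP headers should not also carry it, per the one-method advice above.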
Areej: That's such good advice about hreflang implementation. I know I've seen previously a few people kind of recommended it as one of the ways when it comes to internationalisation. So, thank you for adding the different caveats as well and what people need to be aware of and need to make sure of because this is one of the things that can easily go wrong.
Michelle: Yes. I've seen it in the past where there are different hreflang implementations in two different methods and it's just very hard to unpick it. And then Google gets very confused.
Areej: Yeah, yeah. And in terms of the things that we should make sure we avoid when it comes to XML site maps, what comes to mind there?
Michelle: So generally, you want to avoid non-indexable URLs in your site maps. So, examples of these would be URLs that are broken, noindex, canonicalised or redirecting. Avoid including paginated URLs, session ID URLs and those that are blocked by robots.txt, and then duplicate URLs as well. So, this isn't strictly an error, but you don't need to include a URL more than once in a site map or put the same URL in multiple site maps. There's no benefit to this and you're just increasing the size of your site maps. So, I would avoid that if possible. And then also don't include pages behind a login such as admin pages because Google won't be able to access those. There are exceptions, though, so it is recommended when migrating, and in the Google documentation for site moves, to upload an XML site map of your old URLs when you migrate. So, this is useful for a couple of reasons. It helps Google to crawl and find the redirects, but you can also track the old URLs as they fall from the index as well. So, you can see the index count for those drop. And it's also the same, for instance, if you have a lot of noindex pages that you're trying to remove from the index. So, if you want Google to see these faster, you can submit these via a separate site map temporarily and track their index status. Again, both of these shouldn't be long-standing site maps. They should be temporary, but they are really good for both of those reasons. And another thing to avoid is putting everything in the site map. Not everything needs to go in, so only include your important URLs. Would you want a user to find that URL in search, for example? So, some tools and plugins which automatically generate site maps will include everything by default. You should be selective and make sure that anything not necessary is not included. It doesn't mean that it won't be indexed. It just means that you're not showing it as a priority for Google. So, for example, if you have blog category listing pages, you may not want those in there.
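The exclusion rules above can be sketched as a simple filter. This is an illustrative Python example only: the `CrawledPage` record and its field names are hypothetical and do not correspond to any particular crawler's export, and the temporary migration or noindex site maps Michelle mentions would deliberately bypass a filter like this.

```python
from dataclasses import dataclass


@dataclass
class CrawledPage:
    # Hypothetical record; field names are illustrative assumptions.
    url: str
    status: int              # HTTP status code returned for the URL
    noindex: bool            # True if a noindex directive was found
    canonical: str           # canonical URL declared by the page
    blocked_by_robots: bool  # True if robots.txt disallows the URL


def exclusion_reason(page):
    """Return why a page should stay out of the sitemap, or None if it qualifies."""
    if page.blocked_by_robots:
        return "blocked by robots.txt"
    if page.status != 200:
        return "non-200 status ({})".format(page.status)
    if page.noindex:
        return "noindex"
    if page.canonical != page.url:
        return "canonicalised to another URL"
    return None


def split_for_sitemap(pages):
    """Return (urls_to_include, {rejected_url: reason}), skipping repeated URLs."""
    keep, rejected, seen = [], {}, set()
    for page in pages:
        if page.url in seen:
            continue  # a URL only needs to appear once
        seen.add(page.url)
        reason = exclusion_reason(page)
        if reason is None:
            keep.append(page.url)
        else:
            rejected[page.url] = reason
    return keep, rejected
```

A filter like this only covers indexability. Deciding which indexable URLs are important enough to include, such as leaving out blog category listings, is still a judgement call.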
Areej: Yeah, that's excellent advice. I mean, everything from potentially paginated content or tags, e-commerce sites and all the different variations of filters or search or things like that. And you're right, especially with people who are relying on plugins that automatically come in with CMSs. And then what ends up happening is it just takes every single URL on your site and outputs it in the site map. So that's excellent advice.
Michelle: Yeah, it can increase the size of your site map as well. So that's why it's always good to just crawl your site map and just double-check what's there.
Areej: And in terms of, you know, you have a brand-new site that you're investigating or you've done some auditing on it, and you can see that there are some problems with its XML site maps. What kind of tips do you have on how you can potentially improve it? What are the common types of things that might potentially come up that you can recommend to website developers and so forth to take a look at and improve?
Michelle: Yeah, sure. So, the first thing I would check is for any orphaned URLs. So, this is when a URL is found in an XML site map but not in the web crawl. So, you can find these by crawling the website with a tool like DeepCrawl or another tool like Screaming Frog and including your XML site maps within the web crawl. If an orphaned URL that is highlighted is important, it can signal internal linking issues. Maybe the crawler couldn't find it for some reason that you need to investigate, or it could just not be linked at all. And in that case, I would be trying to link that within the website. Sometimes dynamic site maps can cause unintended URLs to be accessible and indexed. So, as I said earlier, some tools can include everything which isn't set to draft or noindex. So, this means a new page in the admin, which you don't think anyone can find because it isn't linked, is now shown in Google. And I do have a funny related story to this. At my last agency, I was reviewing orphaned URLs for a client, and I found a concerning page with a URL that was someone's name followed by a rude word, which I'm not going to repeat here. And I was shocked. And I thought maybe it's an angry ex-worker who's maybe been fired, but it turns out it was the boss. He made it for a laugh in the admin and he didn't realise that it would be indexed without being linked to. So, it was a very awkward situation where we found it and we then had to tell them about it. And it wasn't ranking for much. But it's a good reminder of why you should check what's in your site maps. Yeah, it was a strange one. But the more common things that I find are when PPC landing pages are made and they're not linked on the website, but they're not set to noindex, and that means they end up being indexed because that setting keeps them in the site map. And then you find out that your organic pages are competing with your PPC pages, which have accidentally been indexed. And that's a more common issue that I find for orphaned pages.
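The orphaned-URL check described here comes down to a set difference: URLs that appear in the site map but were not reached by following links. The following is a minimal Python sketch of that idea, assuming you already have the sitemap XML as text and a list of URLs from a link-following crawl. The helper names are hypothetical, and crawling tools such as those mentioned above report this for you.

```python
import xml.etree.ElementTree as ET

# Fully qualified tag name for <loc> in the sitemaps.org namespace.
LOC = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs found in a sitemap XML string."""
    root = ET.fromstring(xml_text)
    return {el.text.strip() for el in root.iter(LOC) if el.text}


def find_orphans(xml_text, crawled_urls):
    """Return sitemap URLs that the link-following crawl never reached."""
    return sorted(sitemap_urls(xml_text) - set(crawled_urls))
```

This sketch compares URLs as exact strings, so differences such as a trailing slash or www versus non-www would show up as orphans. Each result still needs a manual look to decide whether it should be linked internally, set to noindex, or removed from the site map.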
Areej: Yeah, that's really good advice. And I think it's one as well where we need to make sure that we're working quite closely with our PPC teams and we're doing these checks and we're not just working in a silo, because you end up not noticing that stuff if you're not keeping tabs on activity from both ends.
Michelle: Yeah, definitely, because you may not know that these pages are being created, and the first time you notice them is when you see them as orphaned pages.
Areej: Yeah. And then so other than orphaned pages, what other common themes do you tend to come across or important...