[00:00:00] Mike Roberts:
So, let’s get on schedule because I hope to have time for Q&A at the end. I’m Mike Roberts, I’m the Founder and CEO of SPYFU. If you guys have never heard of SPYFU before, you can go to SPYFU.com, type in any website and see every keyword that they’ve ever bought on Google, every ad variation they’ve ever run, every organic ranking, all the backlinks. Basically, the idea is that you can spy on your competitors, learn their best secrets, and avoid their worst mistakes.
As a Founder, I wear a lot of hats. Literally done everything, from like wash the dog to like write the code, everything, right. And as a Founder, the idea is to kind of get rid of these things over time. You know, your functional roles, you want to get rid of those so that you can focus on the big picture.
One of my last remaining jobs, one of the last remaining hats that I have is that of Technical SEO. And I kind of held on to this thing a little bit longer than really, really I should. It’s like overripe fruit in a way. Umm but I hold on to it because it helps me empathize with customers, our customers are oftentimes SEOs and umm so I feel like I’m in the trenches with them. Umm, but it’s pretty much time for me to not do that so, I posted a job, Technical SEO Overlord, and I got tons of super awesome, high quality, like the best of the best. I got like 85 applicants and like maybe 25 are super great or at least really good. Big huge agencies, ones you’ve heard of, Fortune 500 Companies, actual industry speakers, people that I’ve seen or heard of before. I started interviewing people and I got to this certain subject, a certain topic and people just completely fell apart, they just, ah, like didn’t get this certain topic, and it was surprising to me because to me it’s like a really fundamental part of Technical SEO and it applies to everybody in the entire world.
[00:02:15] Duplicate Content
Like if you produce content or do anything with a website, there’s a topic that you have to understand and it’s duplicate content, duplicate content you may have heard of. And I’m not talking about ordinary duplicate content that’s the type of stuff you think of, that I think of when I think of duplicate content is umm, well somebody copied my stuff and put it on their website. Right, so it’s not like basically straight up plagiarism, I’m talking about extreme duplicate content. [screeching guitar sound] This is trademarked by me, okay, and this is anytime 2 or more different URLs on your website return the exact same thing right. Exactly the same content, okay and that’s my definition.
But first, before I go into this like extreme duplicate content, I want to talk about why duplicate content sucks or why it’s bad in case you guys don’t know. So, the idea is Google is sending you all this authority, this link juice, AI goodness or whatever that causes you to go up in ranks and if you have multiple pieces of content that are literally the same god damn thing, then you’re diluting your content. Now that may seem like maybe not that big of a deal because more content is better, right, like who doesn’t want to have more pieces of content. Maybe you’ll dominate the SERP with all the content. No, that’s not the way it works because the SERP doesn’t happen in a vacuum.
Right, what happens is instead of having you know, fifty-six hundred eighteen AI awesomeness ranks or whatever the hell, everything basically, by the way, as a computer scientist, everything boils down to a, I mean boils down to an integer. Okay, everything is an integer, that’s an integer, it’s a number okay. So, the thing is, instead of having fifty-six hundred, you have four thousand and that basically makes your competitors’ crap content outrank you, that’s no good. Okay, so it’s super fixable, the trick is to take all of that authority and all that juice and direct it at one piece of content, the one that you want. And in doing so, it restores the appropriate order in the universe, and that being, your content on top of all your pitiful competitors’ content.
Now here’s the thing, is that every website in the entire world, all of them every single one all you know hundred million or whatever websites have tons of duplicate content. Lots of it, almost infinite amounts of duplicate content, and it’s either being addressed by you or somebody on your team or it’s not being addressed. You have it, it exists.
Okay, the problem with duplicate content as I see, or one of the problems is that people talk about it and kind of like, they give these examples that really suck, like for example, oh you might have like a calendar on your website and every one of those, everyday on the calendar has like a separate URL and then that’s duplicate content. Well, I don’t have a calendar and most people don’t have a calendar with a different URL, so that’s a stupid example.
[00:05:16] WWW vs Non-WWW Sites
Okay, I’m going to give you nine examples that apply to you for sure alright, and people are failing at this. We used to fail at it, everybody fails at it at some point. We’ll all not fail eventually. The first one is the obvious one, it’s like super, super, simple, is like you’ve got your w-w-w and your non-w-w-w website right so www.SPYFU.com versus SPYFU.com so somehow you have to figure out not to have both of those happen because that would ordinarily return the exact same content right.
[00:05:49] HTTP vs HTTPS
Then you’ve got your http versus your https. In the eyes of Google, http and https websites are URLs, different URLs. Those are fundamentally, look at them, they’re different URLs returning the same content, so it meets my extreme [screeching guitar sound].
[00:06:06] Trailing Slashes
Okay so also trailing slashes, it’s kind of like a tiny little thing, it’s a stupid, insignificant slash at the end but it makes those two URLs technically different in the eyes of Google. Ah case sensitivity like, generally as humans we think, “main purchase”, like www.SPYFU.com/mainpurchase and capital Main Purchase are pretty much the same thing. But, URLs are case sensitive, they’re built that way like Linux and crap like that, they’re case sensitive and it’s annoying but this is how computers work and it’s how Google works.
[00:06:39] Domain Name
One thing that’s not case sensitive that you kind of think might be, is like the host. So, like, your domain name actually exists in a completely different world and that’s called DNS and DNS is not you know case sensitive, they just made it not case sensitive so that you can type in capital, capital, capital W-W-W.SPYFU.COM like you’re screaming at it and everything will be fine, you’ll still get there. Same thing by the way with, with the http and the https, you don’t have to like worry about those ones. It seems maybe obvious or maybe not, I don’t know like, it, it, I don’t know. So the other thing, come on.
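As a rough illustration of which parts of a URL are case sensitive, here is a minimal Python sketch using the standard library. The function name `same_resource_key` and the `/mainpurchase` path are only for illustration; the point is that the scheme and host compare equal regardless of case, while the path does not.

```python
from urllib.parse import urlsplit


def same_resource_key(url: str) -> tuple:
    """Split a URL, lowercasing only the parts that are case insensitive."""
    parts = urlsplit(url)
    # urlsplit lowercases the scheme, and .hostname lowercases the host.
    # Path and query are left untouched because their case is significant.
    return (parts.scheme, parts.hostname, parts.path, parts.query)


a = same_resource_key("HTTPS://WWW.SPYFU.COM/mainpurchase")
b = same_resource_key("https://www.spyfu.com/mainpurchase")
c = same_resource_key("https://www.spyfu.com/MainPurchase")

print(a == b)  # True: shouting the scheme and host changes nothing
print(a == c)  # False: a different path case is a different URL
```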
[00:07:17] Operational Environments
So the other thing, I built this thing on Google slides and so anyway, umm so here’s the deal you’ve got operational environments, so it’s really, this is usually like a little thing that your developers do. And you may not even know about it, that’s the crazy thing is that you don’t necessarily know about it. It’s infuriating but there’s a reason that sometimes you have like the same exact website on a different host.
So just like an example, if you have a server farm, you have lots of servers, you might want to be able to go to each individual server directly, and that is to like troubleshoot it or to make sure it doesn’t go down. There’s all kinds of reasons why you’d like to be able to do that from an operational perspective. Also, if you have like a QA environment, these things, sometimes you can put them behind a password but oftentimes it makes sense not to, to allow other hosts to do it. Also, if you have a content, see those links it says CDN – that means content distribution network. If you have a CDN, oftentimes you’ll have these extra little hosts that are out there that you don’t really think about.
[00:08:23] Marketing Parameters
Okay this is where it gets a little bit tricky, a little bit tricky, marketing parameters. Your UTM source stuff, that’s actually duplicate content. Okay, who doesn’t believe me? Anybody not believe me? Come on go ahead. Who wants to Google this right now? Yeah [laughs] Yes, google it, it’s true. You think it’s not true, it’s true okay! These, anytime, anytime you change that even though like you’re trying to track your stuff from bitly or whatever and you’re doing it like UTM source equals slideshare, cause I’m going to take this thing, I’m going to post it on slideshare and I want to track that, that’s going to be a duplicate URL issue if you don’t deal with it.
Same thing with the affiliate tracker, you think okay well you think about the UTM stuff and all that you know campaign tracking, at least I’m in control of it, well this stuff you’re not. You have affiliates that are linking to you, they are using their own stuff so you can’t really control it, you know I talk to people and a lot of times they’re like ah what I’m going to do about um you know for example the case sensitive thing is that I’m just going to be really careful. Fine, you have a tiny website and nobody links to it but the thing is the instant that somebody starts linking to you, you’re totally out of control. You can’t control case, people linking to the wrong case and people you know linking with their own tracking parameters or any random crap that they want to put in it okay. And, and, and that creates duplicate content issues.
Google just follows those paths. And here’s like maybe kind of the most technically complicated one in here and that is the order of parameters that come after the question mark. These are query parameters and realistically almost all the time those will return the same content, well at least in our case one hundred percent of the time, but almost every website in the entire world doesn’t care about the order of those things. They’re just going to say well what’s the parameters? Is this parameter in there? So the website can deal with those parameters but to google, those are completely different URLs, duplicate content once again. So there’s obviously lots of nuance types of duplicate content like we’ll get into those, briefly and that’s what I want the Q and A for. In case you guys have like some questions, I’ll definitely try to answer them. Um but these ones hopefully you all agree, definitely apply to every website everywhere. But those ones are like somewhat theoretical and I just want to like go into one like kind of actually very tangible real-life story of one of these interviews I did.
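To make the parameter-order point concrete, here is a small Python sketch. The `/overview` path and the `domain` and `tab` parameters are made up for illustration and are not taken from any real site; the helper `with_sorted_params` is likewise hypothetical. It shows two URLs that differ as strings but collapse to one once the parameters are put in a fixed order.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_sorted_params(url: str) -> str:
    """Return the same URL with its query parameters in sorted order."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


# Hypothetical URLs: same parameters, different order.
first = "https://www.spyfu.com/overview?domain=example.com&tab=seo"
second = "https://www.spyfu.com/overview?tab=seo&domain=example.com"

print(first == second)  # False: as plain strings they are different URLs
print(with_sorted_params(first) == with_sorted_params(second))  # True
```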
[00:10:58] Interview with a Lead Technical SEO
So this is the guy and he’s one of the, the lead technical SEO, the head Technical SEO at a Fortune 500 company and we were relatively deep into the interview and I was talking to him, I mean he did pretty well. In fact he actually got many of those other things, kind of right, kind of right, kind of like just not a lot of depth of knowledge. Then we started talking about you know a specific issue that he had, and what they were doing is that they had, ah they had a blog and then they had some of the, quite a bit of the content that they put on their blog ended up being duplicated in their main site right. They took some of that editorial content and they like kind of cross posted it if you will.
And, so, I was like okay what are you going to do about it and you know, what’s your solution? And he was like well, my solution is, you know the only solution is to get rid of the blog. I’m like shit, that’s harsh. Like, also it’s a huge Fortune 500 company, like that sounds like it’s never going to get done. And he’s like yeah, it’s never going to get done probably but it is my solution and I put it on paper and therefore, ah, if it would maybe ever get done, that’s what we’re going to do. And I asked him, is there any way that you could potentially solve this problem umm you know at least temporarily until your super great solution kind of like comes through. And he’s like I don’t know, I don’t know.
Alright, well here’s the thing guys, there’s a stupid, simple, ridiculously easy and like the right way to solve this really, pretty big problem. I mean remember, it’s a big like a large amount of content, for a really big company. That they’re probably going to end up spending like a few million dollars trying to get rid of their blog. Migrating content over, and all of the bureaucracy right. And like what I’m talking about is, probably cost like $8 or $9 to fix. It’s like nothing, it’s super easy like you can do it in ten minutes.
Okay so of the 23 people that I interviewed, 2 could fully handle the extreme. Alright so it’s like, this isn’t even on brand for us, like why am I even giving this presentation, I just like thought well, you know what let’s just bring it out there and maybe we’ll improve the uh you know, everyone’s knowledge a little bit.
So, when I started to drill down, when I really tried to like reflect back on why, where did people actually fail, most people got like 2 of the 3 sort of I guess, pillars or legs of the stool or whatever you like, of how to deal with duplicate content.
One of the tools that we use is 301s. Almost everybody understands 301s, how to do redirects and that solves some portions of those problems that I had. Ah and many people actually understood and could appropriately apply like index control, umm most people were on board with robots.txt and what it does. Umm fewer people were like understanding like what the meta noindex stuff does.
But it’s the third leg that gave people the biggest problems and like I’m talking like even if they understood it, it was so shallow that I’m like there’s an endemic problem in the entire world and I think it’s because the damn thing is called canonicalization. It is the stupidest word that you could ever possibly, I know that everybody in the room has like a really great vocabulary. You guys are all super smart, look you’re here. But if you were to name, as a person who knows words, you would never name anything canonicalization. It’s a stupid name. Like if you want somebody to use your product, like let’s launch a product, let’s call it canonicalization. I was telling my wife about this and I was like, yeah, yeah, yeah canonicalization, she was like conical. No, no, no it’s not conical, it’s canonical. You can’t even say it, it’s horrible. It’s a horrible, horrible word but, and the crazy thing is, it didn’t just come from nowhere.
There’s somebody to blame, there’s somebody to blame. It actually came from the IETF which is the Internet Engineering Task Force. Umm same, same entity that, that Tim Berners-Lee authored HTTP 1.1 and ah you heard it here first. The Godfather of the internet is [laughs] is a dick. Actually, it turns out Tim Berners-Lee had nothing to do with it. It ah, it was actually ah real life geniuses named some Joachim Kupke and somebody Ohye and ah actually those are not their pictures. I found, there’s like actually, these were the first runs of these people and then I actually found their pictures and I was like dang those people look way more trustworthy and attractive than those guys, so I’ll leave them in here [laughs] as my avatars of badness.
[00:15:55] RFC 6596
Anyway, these are the guys that actually authored a document called RFC 6596, which is where this term canonicalization and this whole concept of canonical, rel-canonical came from, while working at Google [laughs]. Okay, but here’s the smoking gun, the crazy thing is, if you drill in, and that’s for whatever reason when I was putting this thing together, I was like, I guess I’ll read the RFC, who wouldn’t? You read the RFC and the very first thing that they do, when they’re like talking about canonical is like they’re, the canonical or preferred version…why didn’t you just call it the preferred version? If they would have just called it rel preferred, everything would be fine. Nobody would have this problem because the truth is, this whole canonicalization thing is super straightforward. Right, hey Google, I can’t give you a redirect for now, for reasons, but this page is a duplicate. Please forward all the link juice and authority and whatever AI shit you got going on to this page. Here’s the page, best of luck with your self-driving, Mike. That’s what we’re doing, we’re basically just saying, hey just forward all the link juice to this guy but I’m not going to give you a redirect cause, reasons.
So, let’s review the tools that we have. Number one, we got redirects. I’m not going to recommend anything other than 301s today. There’s 302s and crap like that but we don’t even need to talk about that. Ah index control, we got robots.txt and we got this little guy that you put in, at the page level. You put it in at the top of the page, it goes in the header. It’s called meta robots content equals noindex. It basically tells Google, don’t put this in your index. Right, and then we have canonicalization also known as rel-canonical. Now, what is not on the list? We’re not doing anything with sitemaps. That’s not an answer, that’s not a good answer. We’re not going to do anything with nofollows. The being really, really careful thing, cool you should be really, really careful. Unfortunately, it doesn’t scale at all. Don’t, and it kind of just sounds bad. I’ll just be really, really careful forever and I’ll never make any mistakes for any reason. Also asking people to change their links, swear to God, I get this answer thirty percent of the time. Oh what if somebody links to you with like you know, the mixed case thing right. Oh well I’ll just reach out to them and see if they’ll change it. Yeah really that’s your solution. That’s a horrible solution. I have a million links.
Okay so now let’s go to the actual solution to these things okay. Umm ah who doesn’t want solutions? Alright so the www versus the non-www, ah 301 is the solution. You just redirect, you probably, you’re going to choose one. Does anybody have any questions about what you would want to choose? It’s basically like, pick one. You want to be SPYFU.com or do you want to be www.SPYFU.com, just always do a global redirect on those things.
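For reference, the two page-level signals mentioned here are short pieces of HTML that go in the page's head. The following is a minimal Python sketch that builds them as strings; the helper names `noindex_meta_tag` and `canonical_link_tag` are made up for illustration and are not part of any library.

```python
from html import escape


def noindex_meta_tag() -> str:
    """The page-level "don't put this in your index" signal."""
    return '<meta name="robots" content="noindex">'


def canonical_link_tag(preferred_url: str) -> str:
    """The rel-canonical hint pointing at the preferred version of a page."""
    # Escape the URL so characters such as & are valid inside the attribute.
    return f'<link rel="canonical" href="{escape(preferred_url, quote=True)}">'


print(noindex_meta_tag())
print(canonical_link_tag("https://www.spyfu.com/"))
```

A 301, by contrast, is not markup at all: it is an HTTP response status with a Location header, so it is normally configured in the web server or the application rather than in the page.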
[00:18:56] Solutions: HTTP vs HTTPS
Http versus https, there’s two ways to do this. It used to be like maybe let’s say three or four years ago, where it made sense to have an http and an https version of your website. Umm because I don’t know like Google was just sort of starting to do the whole https and let’s just have both. Well, if you want to have that cause you’re conservative or something then you can go canonical. Umm but nowadays, I think it’s just perfectly safe to just pick one and the right one to pick would be https.
[00:19:31] Solutions: Trailing Slashes
Trailing slashes, same solution or not the same solution but basically a 301 redirect. You choose one, the more technically correct one is the trailing slash. Ah so if you want to be more technically correct, you could do that. But the other one is more beautiful [laughs]. So, we choose between beauty, and technical, it’s kind of like scientist turned artist. Look at your badge.
[00:19:55] Solutions: Case Sensitive
Okay so the case sensitive one, probably want to do almost always you’re going to want to do the 301. You just choose, most probably almost everyone is going to choose lower case, you could choose whatever you want. It doesn’t make any difference, you just have to choose a rule and stick with it. Ah it’s possible that your server or your application is case sensitive then that would be the case where you wouldn’t be, you would just be like actually my application is case sensitive. You think of like ah, sharing a Google Doc, you know that little long thing, you know how that’s like mixed case. If you change like an “x” in there to like a lower case “x” or upper case “X”, then it’s going to not work. And so ah, there’s quite a, there’s you know, there’s enough of those, you need to know your own application.
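The "choose a rule and stick with it" idea for host, scheme, trailing slash, and case can be sketched as one function that computes where a 301 should point. This is only a sketch under stated assumptions: the preferred host `www.spyfu.com`, the "no trailing slash" rule, and lowercasing the path are example choices, and lowercasing is only safe if the application itself is not case sensitive. The names `preferred_url` and `redirect_target` are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.spyfu.com"  # assumption: www was the host that got picked


def preferred_url(url: str) -> str:
    """Apply the chosen rules: https, one host, lowercase path, no trailing slash."""
    parts = urlsplit(url)
    # Assumption: the application treats paths as case insensitive.
    path = parts.path.lower().rstrip("/") or "/"
    # The query string is passed through unchanged.
    return urlunsplit(("https", PREFERRED_HOST, path, parts.query, ""))


def redirect_target(url: str) -> Optional[str]:
    """Return the URL to 301 to, or None when the request is already preferred."""
    target = preferred_url(url)
    return None if target == url else target


print(redirect_target("http://spyfu.com/MainPurchase/"))
print(redirect_target("https://www.spyfu.com/mainpurchase"))
```

Doing all the rules in one step means a visitor gets at most one redirect, instead of hopping from http to https to www to lowercase.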
[00:20:42] Solutions: DNS
Again, DNS, you don’t have to do anything with that right, we, but, but you could because there’s actually some cases like your analytics, depending on the analytics package you choose. Umm or if you like putting things into a database or something like this. Ah where you might benefit from having everything go in as a single case. So, and also while you’re doing like the code necessary to like manage all your case sensitivity to make it all lower case, you could do it all just right there and it’s not going to hurt anything even though, you don’t have to deal with this, you could.
[00:21:21] Solutions: Operational Environments
Ah so your operational environments, are exciting! Okay, these are super exciting because they’re tricky, tricky, tricky! And they can bite you in a lot of different ways. Ah but the solution is always robots.txt or noindex. It’s one of the two or both. You might want to go with both because you definitely don’t want these operational environments to get indexed cause it’s kind of like, when those things get indexed everything on your entire website gets indexed cause right these are actually duplicates of your website. Not just a single page, the whole thing and that can really suck.
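One way to picture the operational-environment fix is to decide the robots.txt body and the noindex tag from the host that is serving the request. The sketch below assumes a single production host, `www.spyfu.com`, and uses `qa.spyfu.com` as a made-up QA host; the function names are hypothetical, and a real setup would more likely live in server or deployment configuration.

```python
PRODUCTION_HOSTS = {"www.spyfu.com"}  # assumption: the only host that should be indexed


def robots_txt_for(host: str) -> str:
    """Allow crawling on production, block everything on any other host."""
    if host.lower() in PRODUCTION_HOSTS:
        return "User-agent: *\nDisallow:\n"
    return "User-agent: *\nDisallow: /\n"


def head_tags_for(host: str) -> list:
    """Extra tags for the page head: noindex everywhere except production."""
    if host.lower() in PRODUCTION_HOSTS:
        return []
    return ['<meta name="robots" content="noindex">']


print(robots_txt_for("qa.spyfu.com"))   # hypothetical QA host
print(head_tags_for("www.spyfu.com"))
```

Keying the decision on the request host means a new host that somebody points at the same application is blocked by default rather than by remembering to configure it.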
So here’s the secret trick, sometimes you don’t even know that there’s like one of these environments. You don’t know because you just like aren’t in the loop somehow. Somebody like your network guy or whatever is like guess what I’m going to do, I’m going to do another total version on the website. All I’m going to do is put in a DNS entry and point them to the thing. And nobody knows about it till randomly somehow, and who knows how, cause we don’t ever let these things out but somehow Google discovers this whole other website and it’s like, I found a new website and it starts crawling and all of a sudden, you never know about it until one day you see it in the serps and you’re like Oh my God, no wonder our, everything just like tanked at some point and the, and you just never knew why. And then you blame the poor guy, you don’t, you know who would, it’s my fault right.
Umm but there’s a secret to this okay. Oh the secret isn’t Google Analytics and it’s not Google Search Console. I was really surprised. I tried to solve this problem and I was like this definitely seems like the right choice for Google Search Console but you have to like know about the sub-domain, you have to know about the host to put it in there, and be in the loop, in the new Search Console, to see anything happening. So, the way to find these things, ah, look I’m really familiar with my own product right. I actually don’t know of any other way to find it but by using my product and I’ll show you how to do it.
[00:23:21] Using SPYFU
Ah so I put in SPYFU.com and then I’ll just do like an export like on all of the SEO keywords, all the keywords that we ranked on. And then I load that thing into Excel and then I do an exclude ah of all the domains that I expect. So for us that would be like SPYFU.com and Resource.SPYFU.com, our blog and our own site. And then what’s left over, and I literally did this when I was putting this spreadsheet together. This is why, like, this is why I need to replace myself, this type of thing is something that I do like maybe once a quarter, but I should do it like at least once a month or maybe once a week or something like this. Ah but there you go, I didn’t even know about CDN too and I don’t know how it got in here but apparently, I have some work to do.
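The "exclude the hosts I expect and look at what's left" step can also be done in a few lines of Python instead of Excel. This sketch assumes the export has already been reduced to a plain list of ranking URLs; the real export format is not described in the talk, and `staging.spyfu.com` is a made-up host standing in for whatever surprise turns up.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hosts that are supposed to rank (the main site and the blog).
EXPECTED_HOSTS = {"spyfu.com", "www.spyfu.com", "resource.spyfu.com"}


def unexpected_hosts(ranking_urls: list) -> dict:
    """Count ranking URLs per host, keeping only hosts nobody expected."""
    counts = Counter(urlsplit(url).hostname for url in ranking_urls)
    return {host: n for host, n in counts.items() if host not in EXPECTED_HOSTS}


# Hypothetical sample of exported ranking URLs.
sample = [
    "https://www.spyfu.com/overview",
    "https://resource.spyfu.com/some-post",
    "https://staging.spyfu.com/overview",
    "https://staging.spyfu.com/keyword",
]
print(unexpected_hosts(sample))
```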
Another trick is umm when if you, if you use robots.txt to make these things not get crawled ah the trick is to use a free tool called roboto.org. So, you can go to roboto, because what can happen is you can accidentally deploy the wrong robots.txt. We did this, and we lost, we lost, we deployed the ah like the “don’t index” version of our robots.txt to our main site and we lost five hundred thousand indexed pages in umm like two days. And then we got some of it back, but we got like four hundred thousand back so we didn’t get them all back. Well at least not until like nine months. It actually destroyed like all of the cool traction that we had for like nearly an entire year. It just kind of reversed a lot of things, which is kind of like a huge problem. So, ah the trick is use roboto.org, you put in the website, your website and it will track any changes and email you if your robots.txt changes and then ah, and then you’ll know, it’s pretty awesome.
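The idea behind a robots.txt monitor can be sketched with the standard library: keep the last known copy, compare it with the current one, and raise a louder alarm if the new file blocks the whole site. This is a stand-in for illustration, not how any particular monitoring tool works; fetching the file and sending the email are left out, and the function names are hypothetical.

```python
import hashlib
from urllib.robotparser import RobotFileParser


def fingerprint(robots_txt: str) -> str:
    """Stable hash of the file contents, cheap to store and compare."""
    return hashlib.sha256(robots_txt.encode("utf-8")).hexdigest()


def blocks_everything(robots_txt: str) -> bool:
    """True when a generic crawler is not allowed to fetch the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", "/")


def check(previous: str, current: str) -> list:
    """Compare the stored copy with the live one and list anything alarming."""
    alerts = []
    if fingerprint(previous) != fingerprint(current):
        alerts.append("robots.txt changed")
    if blocks_everything(current):
        alerts.append("robots.txt blocks the whole site")
    return alerts


good = "User-agent: *\nDisallow:\n"
bad = "User-agent: *\nDisallow: /\n"
print(check(good, bad))
```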
[00:25:15] Solutions: Marketing Parameters
Okay so marketing parameters, the trick to marketing parameters is the rel-canonical right. So you basically do a rel-canonical ah for any time you see, you know any of these parameters. You basically want to say UTM source, you want to do all parameters because the query string could be anything right. What you want to do is make it so that you’re excluding any parameters that you actually use from the rel-canonical. Ah same thing, same exact trick with the affiliate tracking. Umm rel-canonical. And the parameter ordering, this is a little bit, a little bit tricky. Umm if you, what you should, what you should do, the right solution here is a 301. Umm but the easy solution, the risk-averse solution is the rel-canonical right. Umm if you don’t know, if you don’t want to take the risk about breaking something that’s unforeseen on your website, maybe don’t do the 301. Ah cause you can probably, it’s probably not like the most common issue and you can get away with the rel-canonical without any significant issue. It’s certainly not going to hurt the user by doing this right, and sometimes that’s what we want to think about.
Ah and then this one, the actual real-life solution or real-life problem that this guy had. The solution, the easy solution is just to rel-canonical ah like from the blog to all of the content that’s duplicated. So simple and legitimately right like probably $8, maybe $8.50, like so cheap. Way different than like migrating the, getting rid of the entire blog.
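One reading of the marketing-parameter rule is: since a tracking parameter could be called anything, keep only the parameters the page actually uses and drop everything else when building the canonical URL. The sketch below follows that reading. The `domain` parameter, the `/overview` path, and the `aff` affiliate parameter are made up for illustration, and `canonical_url` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the only parameter that actually changes what this page shows.
FUNCTIONAL_PARAMS = {"domain"}


def canonical_url(url: str) -> str:
    """Drop every parameter the page does not use; sort the ones that remain."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in FUNCTIONAL_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


tracked = "https://www.spyfu.com/overview?utm_source=slideshare&domain=example.com&aff=123"
print(canonical_url(tracked))
```

The visitor still lands on the tracked URL, so analytics keep working; only the rel-canonical in the page head points at the cleaned-up version.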
[00:26:56] Solutions: Nuance Cases
Okay, so let’s talk about the nuance cases. This is, people always want to get into like the weeds on this stuff. But I think that having a good foundation of like what we, what we are like the extreme cases, like what can we do, helps to build context about what the right solution should be. Umm so in general, Google wants you to do a 301. 301’s a stronger signal to the search engine. So we should start there. Ah but canonical is safer okay. Basically, if you don’t want to take like a big risk, go with canonical. But one thing to consider is the user experience right. If you don’t know the right answer it’s like oh well, what would you like to happen to you as a user? You run the search, is there any reason that you would want to go to this page. If there’s no reason you’d want to go to that page, do a 301. If there’s a reason that you might want to go to that content but it’s just not that ideal, then do the rel-canonical.
Keep in mind that Google doesn’t have to follow your recommendation for the rel-canonical. It’s kind of up to them. If you do a 301, it’s all under your control. Let’s look at some real-life examples that we actually did. So we did like a content audit last year. Ah we have like 10 years of content or something like on our blog. And I don’t know when the last time we did a content audit was. Ah so some of the things that we found were like we’d have an older piece of content. We updated it so, we have umm this is, so this is like if you have something like every year you write like an updated version of the content, like SEO tricks, 2017 SEO tricks. You could, there’s a reason that you might want to go as a user and see the 2017 SEO tricks. But, probably you want to see the 2018 SEO tricks so we would rel-canonical, something like that. Conversely, if we have like a tutorial about SPYFU, here’s how you use SPYFU, well it turns out that we changed, we changed a lot about the site between 2014 and 2018. Huge amount of change such that if you look at the 2014 tutorial, it doesn’t look like the same website at all. Ah so that’s a case where we just like straight up 301 that guy. Umm every week we do a release where we update stuff. We fix bugs, we do things, we release new features. Umm there’s not really very much value in each of these individual release notes so what we wanna do is rel-canonical the, each individual release note to the top level release notes. So those are some kinda like nuance cases of things that you could do.
Umm questions? Anybody have something, yes?
[00:29:48] Question from the Audience
[00:29:56] Mike Roberts
[00:29:57] Question from the Audience
[00:30:04] Mike Roberts
No, you don’t need to do, you definitely like…Okay so SPYFU has like a hundred million pages or like billions of pages potentially and like there’s no way we’re going to do that ah like manually right. Ah so yeah, you definitely have to do it programmatically as you’re writing the page, as your programmer’s writing the page out. You basically just say what’s this page supposed to be, and I know it’s the homepage. So, I’m going to rel-canonical the homepage to my preferred version of the homepage which is you know umm parameter free. But I’m not going to 301 it, cause if I 301’d it, my tracking parameters wouldn’t fire, so that would be bad. Right, so you basically just inject that piece of, you know what it was, the rel-canonical, the link rel-canonical tag.
[00:30:50] Question from the Audience
[00:30:55] Mike Roberts
Yeah, in fact on that page, you know on, you kind of like, you probably not going to have…If you have like let’s say a thousand pages on your site, you’re probably going to have you know, 10 types of pages. So basically, you need to create a rule or set of rules for every type of page potentially. You know, for us, we have like ah, you know, really 40 types of pages on billions of content.
Does that mean one minute? Oh okay, thank you.
So, yeah, so you basically create the rule that says this is the homepage, you know point to the preferred version of my homepage. If it’s another type, if it’s another page, you point to its preferred version.
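The "one rule per type of page" approach can be sketched as a small table of patterns, each paired with a function that builds the preferred path. Everything specific here is an assumption for illustration: the base URL, the `/release-notes/...` and `/blog/...` path shapes, and the three page types are hypothetical and are not SPYFU's actual rules.

```python
import re
from urllib.parse import urlsplit

BASE = "https://www.spyfu.com"  # assumption: the preferred scheme and host

# One entry per page type: (pattern for the path, builder of the preferred path).
RULES = [
    # Homepage: always the parameter-free homepage.
    (re.compile(r"^/$"), lambda m: "/"),
    # Individual release notes point at the top-level release notes page.
    (re.compile(r"^/release-notes/[^/]+$"), lambda m: "/release-notes"),
    # Blog posts point at themselves, minus any parameters.
    (re.compile(r"^/blog/(?P<slug>[^/]+)$"), lambda m: f"/blog/{m['slug']}"),
]


def canonical_for(url: str) -> str:
    """Pick the rule for this page type and build its preferred URL."""
    path = urlsplit(url).path.lower() or "/"
    for pattern, build in RULES:
        match = pattern.match(path)
        if match:
            return BASE + build(match)
    return BASE + path  # fallback: same path with the parameters dropped


print(canonical_for("https://www.spyfu.com/?utm_source=slideshare"))
print(canonical_for("https://www.spyfu.com/release-notes/2018-05-01"))
```

The page template would then write the result into its rel-canonical tag, so a handful of rules covers every page of that type however many pages there are.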
Any other questions? Any specific examples of nuance rel-canonical or nuance duplicate content? Awesome we did it on schedule. Thanks! Oh if, nobody’s really going to like do this but if anybody had all the answers, or you know somebody that did, I actually am still looking for that, that Technical SEO. Thanks guys! [clapping]