In this video, we are going to learn how to uncover some powerful information from the search engines with a little help from NLP. Now, if you don’t have a technical background, don’t worry! I’ll walk you through the process step-by-step and have a free tool for you to make it super easy.
🔗 What is an entity?
🔗 NLP SERP Analysis with SpaCy – Colab
As I said in the opener, we’re going to be looking at natural language processing and how we can leverage that to better understand the search engine results page. Now, you may not have a background in computer science or Python or any of that stuff, but it’s totally cool. We’re going to walk you through it step-by-step. We even have a Colab file where, really, all you have to do is add your keyword and push a few buttons and you’re going to extract a ton of really meaningful information. Before we get into the actual work that we’re going to go through in this video, I want to really just cover this once again.
We did a whole video on this, and I’ll link to that video about entities. But in this video, that’s exactly what we’re going to be extracting using spaCy’s NLP model. An entity, it’s a thing or a concept. It’s singular. It’s unique. It’s well-defined and it’s distinguishable. Entities are what Google is looking at when they’re trying to understand concepts. They understand entities. Entities have linked open data points. They have nodes that connect them to other entities. This is how Google gets meaning from our texts. It’s how they understand our text. Now, this is the foundation and the building blocks of our knowledge graph. A knowledge graph is a bunch of interconnected entities. In SEO, we know the power of linking. Linking is extremely important, both from sites outside of our site and from our own site linking internally.
Leverage Google Colab
By extracting these entities, we’ll have a better idea of the concepts that Google is looking at when they’re presenting search results. We’re going to go through a Google Colab file and I’m going to show you how you can leverage Google Colab even if you’re not a programmer. I’m not a programmer, but I am pretty decent at copy and paste and searching the internet to solve some problems. Through a number of different resources and connections, I’ve been able to make some of these tools in-house that really give us an edge and allow us to see what’s happening underneath the search results. Before we get started, I wanted to cover this really quickly. To learn more about entities, please check out the linked video. All right, here we are in Google Colab. Now Google Colab is a workspace that you can leverage to build software or to build tools.
In this case, we’re going to be leveraging Python. Now, if you have no background in Python, that’s totally okay. I’m going to give you access to this Colab file, which you can make a copy of and play around with in your own Google Colab file. It’s completely free to do that. Python is a programming language. If you’ve built websites or you’ve done anything in that world, you could probably get the grasp of Python. I’m still learning Python. I’m not a coder by any means. I don’t even claim to be one. I’m good at copying and pasting, like I said before. There are a couple of things that we’re going to do in this Colab file. We’re going to start by getting the results from Google, and then we’re going to scrape the results, get all of the data, and actually the content from the top results. Then we’re going to analyze that content and extract the most meaningful texts, the most meaningful terms and concepts.
Extract Entities From Top Five Pages
After we’re done with that, then we’re going to go into some further NLP, and we’re going to extract entities from our top five pages, and visualize that result. Then from there, we can use that data to help us inform our content, other things like that. To make this easier on everybody, I’m going to go ahead and zoom in a little bit here. That’s probably a little too far. The first thing we need to do is just run these cells. You don’t have to worry about any of the code in here. If you know Python, you want to manipulate it, go right ahead. But really, we get started by just clicking play on these play icons.
It’s going to install the necessary libraries and import all the tools that we’re going to need to achieve this. We’ll go through these. Right here, we’re just installing Google, which helps us get the search results, and Trafilatura, which helps us scrape. Then we’ve got some pretty standard Python imports here with pandas, NumPy, pretty print, things like that. Next, we’re going to install the things that are going to do most of the work, and these are the transformers. The transformers are what allow us to do SERP analysis, summarize the SERPs, do question answering, and extract the content from the web. This is where the powerhouse comes in with TensorFlow and transformers. Again, you don’t need to know a ton about it, but that’s what those things do. Now here, we’ve got some things going on with queries and that kind of thing. This is going to pull the query. It’s going to look at what kind of results we want to bring.
There’s a little bit more input here, and you can read all the documentation if you’re interested, but we made this super easy. Really, you just have to go to this side and type in your query. For instance, we can put whatever we want in here and, for the sake of this one, we’re just going to put semantic SEO. Once you’ve done that, now you run this query. Now this is going out and it’s fetching the top 10 results from Google. Here they are. These are the top 10 results from Google. Pretty easy, right? Now, we’ve got to scrape the results. Now, Trafilatura, like I said, it’s going to go out into these pages above and it’s going to scrape all of the content for us and package it into one giant corpus of text.
To do this manually, it would take a lot of time. Thankfully, because of computer science, Python, code like that, these packages that people built, you can do this relatively quickly. Hit the button and we’re off to the races. Now this is going to take some time, obviously, because it’s going to go out and it’s going to crawl all those sites, it’s going to pull all the text, and there you go. It’s pulled the 10 articles and we’re good to go.
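If you’re curious what that scraping step looks like in code, here is a minimal sketch, not the exact code from the Colab file. It assumes you already have the result URLs in ranking order. The function names `fetch_with_trafilatura` and `build_corpus` are made up for this example; `trafilatura.fetch_url` and `trafilatura.extract` are the library’s own download and text-extraction calls.

```python
from typing import Callable, Dict, Iterable, List, Optional


def fetch_with_trafilatura(url: str) -> Optional[str]:
    """Download one page and return its main text, or None on failure."""
    import trafilatura  # third-party; imported lazily so the rest runs without it

    downloaded = trafilatura.fetch_url(url)  # returns None if the download fails
    if downloaded is None:
        return None
    return trafilatura.extract(downloaded)  # returns None if no text is found


def build_corpus(
    urls: Iterable[str],
    fetch_text: Callable[[str], Optional[str]] = fetch_with_trafilatura,
    limit: int = 10,
) -> List[Dict]:
    """Return one record per unique URL: its rank, the URL, and the page text."""
    records: List[Dict] = []
    seen = set()
    for url in urls:
        if url in seen:  # skip duplicate URLs so ranks stay meaningful
            continue
        seen.add(url)
        text = fetch_text(url)
        # Keep the position even when a page could not be scraped.
        records.append({"rank": len(records) + 1, "url": url, "text": text or ""})
        if len(records) == limit:
            break
    return records
```

Passing the fetcher in as an argument is just a design choice for this sketch: it lets you try the function with a stand-in fetcher before pointing it at live pages.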
Analyze the Information
The next step is actually analyzing it. We want to see the text. We’re going to see it, in this case, in Scattertext, which is going to allow us to plot this on an HTML map and help us to see the importance of different terms based on ranking. Again, we don’t have to do anything fancy. We just need to hit the play button and this is going to start pulling that out. Notice right here, we’ve already started using spaCy. SpaCy is doing the NLP for us. We’ve got all this data and it’s actually in what’s called a data frame and it’s storing it. Now, we’re splitting that data frame. So one side of the results will be the top three and the other side will be positions four through 10.
Once we’ve run that and it’s grouped the results, now it’s time to create our visualization. This is helpful, like I said, to see what the most important terms and topics are within the top 10 results, but splitting it into what is most important in the top three versus what is most important between positions four through 10. This can take some time just depending on all of the visualization, all the data that’s being scraped. But once it’s done, it’s pretty, pretty, pretty cool with what you can do. All right, here we are. On the right-hand side, you’re going to see the top three and it’s going to show us the most meaningful terms. Then down here, we’re going to see positions four through 10 and it’s going to show us the most meaningful terms. Then it’ll give you characteristics throughout the entire corpus.
Over here, we can see terms on an axis. This axis right here, the lower is infrequent. Top is frequent. Four through 10, lower is infrequent. Over here all the way to the right is frequent. You’ll notice here in the top three, there are a couple of terms that they use that the rest of the sites aren’t using: target keyword phrase, target phrase, keyword, broader topic, target semantically. These are some interesting terms that are not being used as much in the other pages. Actually, if you click on this, this is what’s really cool, is it will actually show us how frequent these terms are. In the top three, out of 25,000 terms, this term semantically was referenced 206 times. But when we went to four through 10, out of 25,000 terms, it was only referenced 17 times.
This is highly correlated here to the top three results. If you’re writing content, you might want to think about what this term means. Over here in the top, this is what everybody’s talking about: content, semantic search, results, data, query, page, SEO, ranking. All of these are relatively connected with both one through three, as well as four through 10. If we go down here, these are terms that are used a little bit more frequently within the four through 10, but not so much in the top three.
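To make the “times per 25,000 terms” comparison concrete, here is a small sketch of that kind of normalization. It is an illustration only: Scattertext’s own tokenizing and scoring may differ. It assumes each page is a dictionary with a `rank` and a `text` key, and the names `tokenize`, `rate_per_25k`, and `compare_groups` are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def rate_per_25k(term: str, texts: Iterable[str]) -> float:
    """How often `term` appears per 25,000 tokens across `texts`."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(tokenize(text))
    total = sum(counts.values())
    if total == 0:  # no text at all, so avoid dividing by zero
        return 0.0
    return counts[term.lower()] / total * 25_000


def compare_groups(term: str, records: List[Dict], split_at: int = 3) -> Dict[str, float]:
    """Compare a term's rate in the top positions against the remaining ones."""
    top = [r["text"] for r in records if r["rank"] <= split_at]
    rest = [r["text"] for r in records if r["rank"] > split_at]
    return {"top": rate_per_25k(term, top), "rest": rate_per_25k(term, rest)}
```

Normalizing to a fixed number of terms is what makes the two groups comparable, since seven pages will almost always contain more words than three.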
When you’re creating content, let’s say you’re on maybe position seven, how can I work in some more of these terms? Am I counting or covering these terms up here as well? It also allows me to see how these are mentioned and how they’re phrased. We don’t copy and paste, but we can use this to help influence what we need to create when we’re building content. Let’s say you were trying to create content for this. You can start to get a lot of good information on the terms that are necessary for you right here within just this little corpus that we did, Schema.org. You can see all the things that are important. Now, if we go down further, it’s actually created here a way to extract the top 25 terms. You can go ahead and hit this cell and it will actually build that out and you can copy and paste the top 25 terms and you can see how frequently they’re being used within the top three or four through 10.
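A top-terms list like that can be approximated with plain word counts. The sketch below is an assumption about the general idea, not the Colab file’s method, which may rank terms by Scattertext scores instead of raw counts. It expects pages as dictionaries with `rank` and `text` keys; `STOPWORDS` is a tiny illustrative list and `top_terms` is a hypothetical name.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "and", "for", "that", "with", "this", "are", "you", "your", "from"}


def top_terms(records: List[Dict], n: int = 25, split_at: int = 3) -> List[Tuple[str, int, int]]:
    """Return (term, count in top positions, count in the rest) for the n most used terms."""
    top_counts: Counter = Counter()
    rest_counts: Counter = Counter()
    for record in records:
        tokens = [
            token
            for token in re.findall(r"[a-z]+", record["text"].lower())
            if token not in STOPWORDS and len(token) > 2
        ]
        target = top_counts if record["rank"] <= split_at else rest_counts
        target.update(tokens)
    combined = top_counts + rest_counts  # overall frequency decides the ordering
    return [(term, top_counts[term], rest_counts[term]) for term, _ in combined.most_common(n)]
```

Each row shows the same term’s count on both sides of the split, so you can spot terms the top three lean on that the other pages barely mention.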
NLP in SEO
Now there are a lot of tools out there that’ll do this today in the SEO world. NLP is becoming more and more used in the SEO space. Frase.io is an amazing content optimization tool. We will be doing a video on that. It’s a tool we use all the time. They actually do a lot of topic extraction, which makes it easy. But when you’re getting started and maybe you don’t want to buy a tool, Python is a great way to make some of these tools yourself and also customize them to your specific uses and needs. As you can see, so far, we’ve run the top 10 in the search. We’ve got tons of information on the text that’s being used and how it’s being used, and then we also have a list of the top 25 terms.
Some of them are relevant, some of them may not be. I may be like, is that really relevant? Probably not. So take it with a grain of salt. This isn’t perfect. When we’re doing these things in this world, we’ve got to just look at it and then extract the meaning that makes sense for us. If this makes sense, then yeah, let’s use it. If it doesn’t, then we toss it out. Now let’s do a little bit more natural language processing. The next thing we’re going to do is move into extracting entities. I’m actually going to take just the top five results, narrowing it down a little bit further. That way we don’t get overloaded with data, and then we are going to pull the content from the top five results. We already did that before, but now this time, we’re smashing it together, making a smaller corpus. Here’s all that fun content.
Now we’re going to extract the entities. This is where spaCy comes in. SpaCy makes NLP relatively easy. You can just check that out. I’ll put a link to the spaCy site too. Again, if you’re new to NLP or Python even, they’ve got great stuff on their website as well that can walk you through this to help explain what it is and what it does and also give you some practice. We hit the play button and spaCy’s off to the races for us. What it’s doing is looking at all the content in full body, which is this up here, and it’s going to extract both the entities as well as the types. This is going to help us organize them.
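For anyone who wants to see roughly what that cell is doing, here is a minimal sketch. It assumes a spaCy English pipeline such as `en_core_web_sm` has been installed; the model the Colab file actually uses isn’t stated here. The helper names `load_spacy_model` and `extract_entities` are hypothetical, while `doc.ents`, `ent.text`, and `ent.label_` are spaCy’s own attributes.

```python
from typing import Callable, List, Tuple


def load_spacy_model(name: str = "en_core_web_sm"):
    """Load a spaCy pipeline; the model has to be installed beforehand."""
    import spacy  # third-party; imported lazily so the helper below has no hard dependency

    return spacy.load(name)


def extract_entities(nlp: Callable, full_body: str) -> List[Tuple[str, str]]:
    """Return (entity text, entity type) pairs in the order they appear."""
    doc = nlp(full_body)  # run the pipeline over the combined page text
    return [(ent.text, ent.label_) for ent in doc.ents]
```

In a notebook you would call `nlp = load_spacy_model()` once and then `extract_entities(nlp, full_body)`, where `full_body` is the combined text of the top five pages.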
Clean It Up
The next thing we’re going to do is clean it up. We’re going to remove duplicates because in this case, we don’t need to see the duplicates. We just want to see the entities themselves. Then I like to visualize the data. It can make our life a lot easier. This is a tool called Plotly, or a plugin called Plotly. Then we go ahead and run this. You can see it’s breaking it down by different types of entities. We’ve got numbers, we’ve got people, dates, organizations, money, events, cardinals, products. The cool part about this again, is you can actually zoom in. Again, it’s not going to be perfect. Bing isn’t a person. BERT isn’t a person. So you’re going to want to go in there, and you can learn a lot more about spaCy and how to fine-tune it.
It does a pretty good job on its own, but as you can see, BERT is an important entity, and so are Bing and Google. Query is an important one. We’ve got Quora showing up here. You can go further and further. Look at the different dates. Look at the organizations that are attached to this: Search Liaison, from Twitter. We’ve got algorithms in Apple and all that other fun stuff. You can really explore and see the different concepts that are being used right here within the search results. Then once you’re done, you can go ahead and store all of this information for yourself. You can go through it and start to create better content for the search engines as well as meeting your users’ expectations.
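The clean-up step described above, dropping duplicates and then grouping what’s left by entity type, can be sketched in a few lines. This is an illustration under assumptions: it takes entities as `(text, label)` pairs, treats entities as duplicates when they match ignoring case, and uses the hypothetical name `group_unique_entities`. The Colab file may do this differently, for example with pandas.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_unique_entities(entities: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Drop repeated entities and group the rest by entity type."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    seen = set()
    for text, label in entities:
        cleaned = text.strip()
        key = (cleaned.lower(), label)  # "Google" and "google" count as one entity
        if not cleaned or key in seen:
            continue
        seen.add(key)
        grouped[label].append(cleaned)
    return dict(grouped)
```

The grouped result is also a handy place to review mislabeled entities, like BERT or Bing showing up as a person, before you chart or store anything.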
In order to mark up your text and add that extra structured data, there are a number of ways to do that. We’ve talked about it in a number of our videos. We also have some courses that you can take to learn how to do that at simplifiedsearch.net, and we’ll put those links here. We also talked about tools like WordLift, which will also allow you to do this and help automate this. But running a SERP analysis is really, really cool to do because it allows you to see underneath. This is the data within the structured data layer that you’re getting some more visibility into, and it can help guide you when you’re building an SEO strategy. Let me know if you have any questions. I know this is something a little bit more technical this time, but honestly, if you make a copy of this Colab file, you just hit the play buttons and go along with it. I think you’ll find some really interesting insights. Please comment below. I’d love to hear what you think. Until next time, happy marketing.