In this video, we are going to learn how to uncover some powerful information from the search engines with a little help from NLP. Now, if you don't have a technical background, don't worry! I'll walk you through the process step-by-step and have a free tool for you to make it super easy.
What is an entity?
NLP SERP Analysis with SpaCy - Colab
As I said in the opener, we're going to be looking at natural language processing and how we can leverage that to better understand the search engine results page. Now, you may not have a background in computer science or Python or any of that stuff, but it's totally cool. We're going to walk you through it step-by-step. We even have a Colab file where, really, all you have to do is add your keyword and push a few buttons and you're going to extract a ton of really meaningful information. Before we get into the actual work that we're going to go through in this video, I want to really just cover this once again.
We did a whole video on this, and I'll link to that video about entities. But in this video, that's exactly what we're going to be extracting using spaCy's NLP model. An entity is a thing or a concept. It's singular. It's unique. It's well-defined and it's distinguishable. Entities are what Google is looking at when they're trying to understand concepts. They understand entities. Entities have linked open data points. They have nodes that connect them to other entities. This is how Google gets meaning from our text. It's how they understand our text. Now, this is the foundation and the building blocks of our knowledge graph. A knowledge graph is a bunch of interconnected entities. In SEO, we know the power of linking. Linking is extremely important, both from sites outside of our own and from our site linking internally.
Leverage Google Colab
By extracting these entities, we'll have a better idea of the concepts that Google is looking at when they're presenting search results. We're going to go through a Google Colab file and I'm going to show you how you can leverage Google Colab even if you're not a programmer. I'm not a programmer, but I am pretty decent at copy and paste and searching the internet to solve some problems. Through a number of different resources and connections, I've been able to make some of these tools in-house that really give us an edge and allow us to see what's happening underneath the search results. Before we get started, I wanted to cover this really quickly. To learn more about entities, please check out the linked video. All right, here we are in Google Colab. Now Google Colab is a workspace that you can leverage to build software or to build tools.
In this case, we're going to be leveraging Python. Now, if you have no background in Python, that's totally okay. I'm going to give you access to this Colab file, which you can make a copy of and play around with in your own Google Colab file. It's completely free to do that. Python is a programming language. If you've built websites or you've done anything in that world, you could probably get the grasp of Python. I'm still learning Python. I'm not a coder by any means. I don't even claim to be one. I'm good at copy and pasting, like I said before. There are a couple of things that we're going to do in this Colab file. We're going to start by getting the results from Google, and then we're going to scrape the results, get all of the data, and actually the content from the top results. Then we're going to analyze that content and extract the most meaningful text, the most meaningful terms and concepts.
Extract Entities From Top Five Pages
After we're done with that, then we're going to go into some further NLP, and we're going to extract entities from our top five pages, and visualize that result. Then from there, we can use that data to help us inform our content, other things like that. To make this easier on everybody, I'm going to go ahead and zoom in a little bit here. That's probably a little too far. The first thing we need to do is just run these cells. You don't have to worry about any of the code in here. If you know Python and you want to manipulate it, go right ahead. But really, we get started by just clicking play on these play icons.
It's going to install the necessary libraries and import all the tools that we're going to need to achieve this. We'll go through these. Right here, we're just installing Google and Trafilatura. Trafilatura helps us scrape, and Google helps us get the information. Then we've got some pretty standard Python imports here with pandas, NumPy, pretty print, things like that. Next, we're going to install the things that are going to do most of the work, and these are the transformers. The transformers are what allow us to do SERP analysis, summarize the SERPs, do question answering, and extract the content from the web. This is where the powerhouse comes in with TensorFlow and transformers. Again, you don't need to know a ton about it, but that's what those things do. Now here, we've got some things going on with queries and that kind of thing. This is going to pull the query. It's going to look at what kind of results we want to bring.
There's a little bit more input here, and you can read all the documentation if you're interested, but we made this super easy. Really, you just have to go to this side and type in your query. For instance, we can put whatever we want in here and, for the joy of this one, we're just going to put semantic SEO. Once you've done that, now you run this query. Now this is going out and it's fetching the top 10 results from Google. Here they are. These are the top 10 results from Google. Pretty easy, right? Now, we've got to scrape the results. Now, Trafilatura, like I said, is going to go out into these pages above and it's going to scrape all of the content for us and package it into one giant corpus of text.
To do this manually, it would take a lot of time. Thankfully, because of computer science, Python, code like that, these packages that people built, you can do this relatively quickly. Hit the button and we're off to the races. Now this is going to take some time, obviously, because it's going to go out and it's going to crawl all those sites, going to pull all the text, and there you go. It's pulled the 10 articles and we're good to go.
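If you're curious what that scraping step could look like in code, here's a minimal sketch, not the exact code from the Colab file. It assumes Trafilatura's `fetch_url` and `extract` functions; the `build_corpus` name, the `rank`/`url`/`text` keys, and the optional `fetch` and `extract` parameters are my own, added for illustration so you can try the helper without hitting the network.

```python
from typing import Callable, Dict, Iterable, List, Optional


def build_corpus(
    urls: Iterable[str],
    fetch: Optional[Callable] = None,
    extract: Optional[Callable] = None,
) -> List[Dict]:
    """Download each URL in ranking order and keep the main text of each page."""
    if fetch is None or extract is None:
        # Imported lazily, so the helper still works with stand-in functions
        # when Trafilatura is not installed.
        import trafilatura

        fetch = fetch or trafilatura.fetch_url
        extract = extract or trafilatura.extract

    rows = []
    for rank, url in enumerate(urls, start=1):
        downloaded = fetch(url)  # raw HTML, or None when the download fails
        text = extract(downloaded) if downloaded else None
        if text:  # skip pages that returned nothing usable
            rows.append({"rank": rank, "url": url, "text": text})
    return rows
```

Calling `build_corpus(list_of_urls)` with no extra arguments would use Trafilatura directly. Each row keeps its original rank, so pages that fail to download don't shift the positions of the others.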
Analyze the Information
The next step is actually analyzing it. We want to see the text. We're going to see it, in this case, in Scattertext, which is going to allow us to plot this on an HTML map and help us to see the importance of different terms based on ranking. Again, we don't have to do anything fancy. We just need to hit the play button and this is going to start pulling that out. Notice right here, we've already started using spaCy. SpaCy is doing the NLP for us. We've got all this data and it's actually stored in what's called a data frame. Now, we're splitting that data frame. So one side of the results will be the top three and the other side will be positions four through 10.
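The notebook does this split on a data frame. Here's a minimal sketch of the same idea in plain Python, so it runs without any extra libraries. The `rank_group` and `split_by_rank` names and the `rank` key are assumptions for illustration, not names from the Colab file.

```python
def rank_group(rank: int, cutoff: int = 3) -> str:
    """Label a result as 'top' (positions 1..cutoff) or 'rest' (everything after)."""
    return "top" if rank <= cutoff else "rest"


def split_by_rank(rows, cutoff: int = 3):
    """Split scraped rows into the top results and the remaining results."""
    groups = {"top": [], "rest": []}
    for row in rows:
        groups[rank_group(row["rank"], cutoff)].append(row)
    return groups


# Ten placeholder results, ranked 1 through 10.
results = [{"rank": r, "text": f"page {r}"} for r in range(1, 11)]
groups = split_by_rank(results)
print(len(groups["top"]), len(groups["rest"]))
```

With the default cutoff, positions one through three land in `top` and positions four through 10 land in `rest`.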
Once we've run that and it's grouped the results, now it's time to create our visualization. This is helpful, like I said, to see what the most important terms and topics are within the top 10 results, but splitting it into what is most important in the top three versus what is most important between positions four through 10. This can take some time just depending on all of the visualization, all the data that's being scraped. But once it's done, it's pretty, pretty, pretty cool what you can do. All right, here we are. On the right-hand side, you're going to see the top three and it's going to show us the most meaningful terms. Then down here, we're going to see positions four through 10 and it's going to show us the most meaningful terms. Then it'll give you characteristics throughout the entire corpus.
Over here, we can see terms on an axis. This axis right here is the top three: the lower is infrequent, the top is frequent. This one is four through 10: the left is infrequent, and over here all the way to the right is frequent. You'll notice here in the top three, there are a couple of terms that they use that the rest of the sites aren't using: target keyword phrase, target phrase, keyword, broader topic, target semantically. These are some interesting terms that are not being used as much in the other pages. Actually, if you click on this, this is what's really cool, it will actually show us how frequent these terms are. In the top three, out of 25,000 terms, this term semantically was referenced 206 times. But when we went to the four through 10, out of 25,000 terms, it was only referenced 17 times.
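To make the "out of 25,000 terms" numbers a bit more concrete, here's a minimal sketch of how a rate like that can be computed. I'm assuming the chart reports each term's count scaled to 25,000 terms so that groups of different sizes can be compared; the simple regex tokenizer and the `per_25k` name are mine, not Scattertext's.

```python
import re
from collections import Counter


def tokenize(text: str):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def per_25k(term: str, texts, scale: int = 25_000) -> float:
    """How often `term` would appear if the group were exactly `scale` terms long."""
    tokens = [token for text in texts for token in tokenize(text)]
    if not tokens:
        return 0.0
    return Counter(tokens)[term.lower()] / len(tokens) * scale


top_three = ["Semantic search is semantic"]
print(per_25k("semantic", top_three))
```

Scaling both groups to the same size is what makes a comparison like 206 versus 17 fair, even though seven pages hold a lot more text than three.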
This is highly correlated here to the top three results. If you're writing content, you might want to think about what this term means. Over here in the top, this is what everybody's talking about: content, semantic search, results, data, query, page, SEO, ranking. All of these are relatively connected with both one through three, as well as four through 10. If we go down here, these are terms that are used a little bit more frequently within the four through 10, but not so much in the top three.
When you're creating content, let's say you're in maybe position seven, you can ask: how can I work in some more of these terms? Am I covering these terms up here as well? It also allows me to see how these are mentioned and how they're phrased. We don't copy and paste, but we can use this to help influence what we need to create when we're building content. Let's say you were trying to create content for this. You can start to get a lot of good information on the terms that are necessary for you right here within just this little corpus that we did. Schema.org, for instance. You can see all the things that are important. Now, if we go down further, it's actually created a way here to extract the top 25 terms. You can go ahead and hit this cell and it will actually build that out, and you can copy and paste the top 25 terms and see how frequently they're being used within the top three or four through 10.
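Here's a rough sketch of what a top-terms table could look like if you built it by hand. To be clear, this swaps in plain word counts rather than the scoring the notebook uses, and the `top_terms` name, the `min_len` filter, and the sample texts are all made up for illustration.

```python
import re
from collections import Counter


def count_terms(texts, min_len: int = 3) -> Counter:
    """Count lowercase words of at least `min_len` letters across all texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(word for word in words if len(word) >= min_len)
    return counts


def top_terms(top_texts, rest_texts, n: int = 25):
    """Return (term, count in top group, count in rest group) for the n most used terms."""
    top, rest = count_terms(top_texts), count_terms(rest_texts)
    overall = top + rest  # combined counts across both groups
    return [(term, top[term], rest[term]) for term, _ in overall.most_common(n)]


table = top_terms(
    ["semantic search semantic entity"],  # stand-in for the top three pages
    ["search results search"],            # stand-in for positions four through 10
    n=2,
)
print(table)
```

Each row shows a term next to its count in each group, which is the same side-by-side view you'd use to spot terms the top three pages lean on more than the rest.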
NLP in SEO
Now there are a lot of tools out there that'll do this today in the SEO world. NLP is becoming more and more used in the SEO space. Frase.io is an amazing content optimization tool. We will be doing a video on that. It's a tool we use all the time. They actually do a lot of topic extraction, which makes it easy. But when you're getting started and maybe you don't want to buy a tool, Python is a great way to make some of these tools yourself and also customize them to your specific uses and needs. As you can see, so far, we've run the top 10 in the search. We've got tons of information on the text that's being used and how it's being used, and then we also have a list of the top 25 terms.
Some of them are relevant, some of them may not be. I may be like, is that really relevant? Probably not. So take it with a grain of salt. This isn't perfect. When we're doing these things in this world, we've got to just look at it and then extract the meaning that makes sense for us. If this makes sense, then yeah, let's use it. If it doesn't, then we toss it out. Now let's do a little bit more natural language processing. The next thing we're going to do is move into extracting entities. I'm actually going to take just the top five results, narrowing it down a little bit further. That way we don't get overloaded with data, and then we are going to pull the content from the top five results. We already did that before, but this time, we're smashing it together, making a smaller corpus. Here's all that fun content.
Now we're going to extract the entities. This is where spaCy comes in. SpaCy makes NLP relatively easy. You can just check that out. I'll put a link to the spaCy site too. Again, if you're new to NLP or Python even, they've got great stuff on their website as well that can walk you through this to help explain what it is and what it does and also give you some practice. We hit the play button and spaCy's off to the races for us. What it's doing is looking at all of the content in the full body, which is this up here, and it's going to extract both the entities as well as their types. This is going to help us organize them.
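For anyone who wants to see roughly what that looks like in code, here's a minimal sketch. It relies on spaCy's `doc.ents`, where each entity has a `text` and a `label_`. The `en_core_web_sm` model name is an assumption, since I'm not stating which model the Colab file loads, and the function names are just for illustration.

```python
def extract_entities(doc):
    """Return (entity text, entity type) pairs from a parsed spaCy Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def entities_from_text(text: str, model: str = "en_core_web_sm"):
    """Parse raw text with spaCy and return its entities.

    Assumes spaCy and the named model are installed, e.g. via
    `python -m spacy download en_core_web_sm`.
    """
    import spacy  # imported here so extract_entities works without spaCy installed

    nlp = spacy.load(model)
    return extract_entities(nlp(text))
```

Running `entities_from_text(full_body)` on the combined text of the top five pages would give you a list of pairs you can sort, filter, or count by type. What comes back depends on the model, so expect some mislabeled entities.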
Clean It Up
The next thing we're going to do is clean it up. We're going to remove duplicates because in this case, we don't need to see the duplicates. We just want to see the entities themselves. Then I like to visualize the data. It can make our life a lot easier. This is a tool called Plotly. Then we go ahead and run this. You can see it's breaking it down by different types of entities. We've got numbers, we've got people, dates, organizations, money, events, cardinals, products. The cool part about this, again, is you can actually zoom in. Again, it's not going to be perfect. Like, Google and Bing aren't people. BERT isn't a person. So you're going to want to go in there, and you can fine-tune spaCy, learn a lot more, and learn how to fine-tune it.
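The cleanup step can be sketched in a few lines of plain Python. This is an illustration under my own assumptions, not the notebook's code: it treats entities as (text, type) pairs, ignores capitalization when spotting duplicates, and uses made-up sample data. The per-type counts are the kind of thing you could then hand to a charting library like Plotly.

```python
from collections import Counter


def dedupe_entities(entities):
    """Drop repeated (text, type) pairs, ignoring case, and keep first-seen order."""
    seen = set()
    unique = []
    for text, label in entities:
        key = (text.strip().lower(), label)
        if key not in seen:
            seen.add(key)
            unique.append((text.strip(), label))
    return unique


def count_by_type(entities) -> Counter:
    """Count how many distinct entities fall under each entity type."""
    return Counter(label for _, label in entities)


# Made-up sample, including the kind of mislabeling you should expect.
raw = [
    ("Google", "ORG"),
    ("google", "ORG"),
    ("BERT", "PERSON"),
    ("Bing", "PERSON"),
    ("2019", "DATE"),
]
unique = dedupe_entities(raw)
print(count_by_type(unique))
```

The pair is the deduplication key, so the same text tagged with two different types stays in twice. That's handy for spotting cases where the model can't decide what something is.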
It does a pretty good job on its own, but as you can see, BERT is an important entity, and so are Bing and Google. Query is an important one. We've got Quora showing up here. You can go further and further. Look at the different dates. Look at the organizations that are attached to this, like Search Liaison from Twitter. We've got algorithms and Apple and all that other fun stuff. You can really explore and see the different concepts that are being used right here within the search results. Then once you're done, you can go ahead and store all of this information for yourself. You can go through it and start to create better content for the search engines, as well as meeting your users' expectations.
In order to mark up your text and add that extra structured data, there are a number of ways to do that. We've talked about it in a number of our videos. We also have some courses that you can take to learn how to do that at simplifiedsearch.net, and we'll put those links here. We also talked about tools like WordLift, which will also allow you to do this and help automate this. But running a SERP analysis is really, really cool to do because it allows you to see underneath. This is the data within the structure layer that you're getting some more visibility into, and it can help guide you when you're building an SEO strategy. Let me know if you have any questions. I know this is something a little bit more technical this time, but honestly, if you make a copy of this Colab file, you just hit the play buttons and go along with it. I think you'll find some really interesting insights. Please comment below. I'd love to hear what you think. Until next time, happy marketing.