A data science approach to optimize internal link structure


It’s important to optimize internal links if you want the pages on your site to have enough authority to rank for their target keywords. By internal links, we mean the links that pages on your website receive from other pages on the same site.

This matters because internal links are the basis on which Google and other search engines calculate a page’s importance relative to the other pages on your website.

It also affects the likelihood that a user will discover content on your site. Content discovery is the basis of Google’s PageRank algorithm.

Today, we’re exploring a data-driven approach to improving a website’s internal linking for more effective technical SEO. The aim is to ensure that internal domain authority is distributed optimally according to the structure of the site.

Improve internal link structures with data science

Our data-driven approach will focus on just one aspect of optimizing internal link architecture: modeling the distribution of internal links by site depth, and then targeting the pages that lack links for their specific site depth.


We start by importing the libraries and data, cleaning up the column names before previewing them:

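import pandas as pd
import numpy as np

# Import the crawl data; `site_filename` is assumed to be defined earlier
crawl_data = pd.read_csv('data/' + site_filename + '_crawl.csv')
# Clean up the column names; regex=False makes '.', '(' and ')' match literally
crawl_data.columns = crawl_data.columns.str.replace(' ', '_', regex=False)
crawl_data.columns = crawl_data.columns.str.replace('.', '', regex=False)
crawl_data.columns = crawl_data.columns.str.replace('(', '', regex=False)
crawl_data.columns = crawl_data.columns.str.replace(')', '', regex=False)
crawl_data.columns = crawl_data.columns.str.lower()

# Preview the shape and the column types
print(crawl_data.shape)
print(crawl_data.dtypes)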

(8611, 104)

url                          object
base_url                     object
crawl_depth                  object
crawl_status                 object
host                         object
redirect_type                object
redirect_url                 object
redirect_url_status          object
redirect_url_status_code     object
unnamed:_103                float64
Length: 104, dtype: object
Andreas Voniatis, November 2021

The above shows an overview of the data imported from the Sitebulb desktop crawler application. There are over 8,000 rows, and not all of them belong to the domain itself, as the crawl also includes resource URLs and external outbound link URLs.

We also have over 100 columns which are superfluous to the requirements, so a selection of columns will be necessary.
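The column-name clean-up used on import can be tried in isolation. The sketch below is a minimal example on a toy frame; the column names and the helper `clean_column_names` are hypothetical and only resemble a crawler export. It uses `regex=False` so that the period and the parentheses are treated as literal characters rather than regular expression syntax.

```python
import pandas as pd

# Toy frame with column names resembling a crawler export (hypothetical names)
toy = pd.DataFrame(columns=["URL", "Crawl Depth", "No. Internal Links To URL", "Size (Bytes)"])

def clean_column_names(columns: pd.Index) -> pd.Index:
    """Lower-case names, replace spaces with underscores, drop '.', '(' and ')'."""
    cleaned = columns.str.replace(" ", "_", regex=False)
    for char in (".", "(", ")"):
        # regex=False so '.' and the parentheses are matched literally
        cleaned = cleaned.str.replace(char, "", regex=False)
    return cleaned.str.lower()

toy.columns = clean_column_names(toy.columns)
print(list(toy.columns))
```

With names cleaned this way, columns can be selected with plain attribute-style or bracket access, which is what the later filtering step relies on.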


Before we get into this area, however, we want to quickly see how many site levels there are:

0             1
1            70
10            5
11            1
12            1
13            2
14            1
2           303
3           378
4           347
5           253
6           194
7            96
8            33
9            19
Not Set    2351
dtype: int64

So from the above we can see that the site is 14 levels deep and that most of the URLs are found not in the site architecture but in the XML sitemap (the “Not Set” row).

You may notice that Pandas (the Python package for data handling) sorts the site levels alphabetically rather than numerically, so 10 comes straight after 1.

This is because the site levels at this point are strings as opposed to numeric. This will be adjusted in subsequent code, as it will affect the data visualization (“viz”).
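To illustrate the sorting issue, here is a small sketch on made-up crawl depths (the values and the names `depths` and `level_order` are assumptions for the example). It counts URLs per level twice: once with plain strings, and once with an ordered categorical that carries an explicit level order.

```python
import pandas as pd

# Hypothetical crawl depths stored as strings, as in a raw crawler export
depths = pd.Series(["1", "10", "2", "Not Set", "2", "0", "10", "3"], name="crawl_depth")

# Counting per level with string labels sorts alphabetically: '10' lands before '2'
string_counts = depths.groupby(depths).size()
print(list(string_counts.index))

# An ordered categorical with an explicit level order restores the numeric sequence
level_order = ["0", "1", "2", "3", "10", "Not Set"]
ordered = pd.Series(pd.Categorical(depths, categories=level_order, ordered=True), name="crawl_depth")
ordered_counts = ordered.value_counts().sort_index()
print(list(ordered_counts.index))
```

The same ordering carries over to plots, where the category order decides the order of the x-axis.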

Now we are going to filter the rows and select the columns.

# Select the relevant columns, then keep live URLs (HTTP status starting with '2')
redir_live_urls = crawl_data[['url', 'crawl_depth', 'http_status_code', 'indexable_status', 'no_internal_links_to_url', 'host', 'title']]
redir_live_urls = redir_live_urls.loc[redir_live_urls.http_status_code.str.startswith('2', na=False)].copy()
# Convert crawl depth to a category ordered numerically rather than alphabetically
redir_live_urls['crawl_depth'] = redir_live_urls['crawl_depth'].astype('category')
redir_live_urls['crawl_depth'] = redir_live_urls['crawl_depth'].cat.reorder_categories(['0', '1', '2', '3', '4',
                                                                                        '5', '6', '7', '8', '9',
                                                                                        '10', '11', '12', '13', '14',
                                                                                        'Not Set'])
# Keep only URLs on the audited host; `website` is assumed to be defined earlier
redir_live_urls = redir_live_urls.loc[redir_live_urls.host == website]
del redir_live_urls['host']

print(redir_live_urls.shape)

(4055, 6)
Sitebulb data (Andreas Voniatis, November 2021)

By filtering the rows for indexable URLs and selecting the relevant columns, we now have a more streamlined data frame (think Pandas’ version of a spreadsheet tab).

Explore the distribution of internal links

Now we’re ready to visualize the data and get a feel for how internal links are distributed, both overall and by site depth.

from plotnine import *
import matplotlib.pyplot as plt
pd.set_option('display.max_colwidth', None)
%matplotlib inline

# Overall distribution of internal links to URL
ove_intlink_dist_plt = (ggplot(redir_live_urls, aes(x = 'no_internal_links_to_url')) + 
                    geom_histogram(fill="blue", alpha = 0.6, bins = 7) +
                    labs(y = '# Internal Links to URL') + 
                    theme_classic() +            
                    theme(legend_position = 'none'))

Internal links to URL vs. no internal links to URL (Andreas Voniatis, November 2021)

From the above we can see that most pages have no internal links pointing to them, so improving internal linking would be a significant opportunity to improve SEO here.

Let’s look at some statistics by site level.


[Table: summary statistics of internal links to URL by site level]

The table above shows the approximate breakdown of internal links by site level, including the average (mean) and the median (50% quantile).

This is accompanied by the variation within each site level (std, for standard deviation), which tells us how close the pages are to the mean for their level; that is, how consistently internal links are distributed around that mean.

We can infer from the above that the average number of internal links per URL at each site level, with the exception of the home page (crawl depth 0) and the top-level pages (crawl depth 1), ranges from 0 to 4.
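Summary statistics of this kind can be produced with a grouped `describe()`. The sketch below uses a small made-up sample (the values are hypothetical, not taken from the crawl) and is not necessarily the call used for the article’s table.

```python
import pandas as pd

# Hypothetical sample: inbound internal link counts for URLs at three crawl depths
sample = pd.DataFrame({
    "crawl_depth": ["1", "1", "2", "2", "2", "3", "3", "3"],
    "no_internal_links_to_url": [120, 80, 4, 0, 2, 1, 0, 5],
})

# describe() per group yields count, mean, std, min, quartiles (50% is the median) and max
depth_stats = sample.groupby("crawl_depth")["no_internal_links_to_url"].describe()
print(depth_stats[["count", "mean", "std", "50%"]])
```

Comparing the mean with the 50% column per level is a quick way to spot levels where a few heavily linked pages pull the average up.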

For a more visual approach:

# Distribution of internal links to URL by site level
intlink_dist_plt = (ggplot(redir_live_urls, aes(x = 'crawl_depth', y = 'no_internal_links_to_url')) + 
                    geom_boxplot(fill="blue", alpha = 0.8) +
                    labs(y = '# Internal Links to URL', x = 'Site Level') + 
                    theme_classic() +            
                    theme(legend_position = 'none'))

intlink_dist_plt.save(filename="images/1_intlink_dist_plt.png", height=5, width=5, units="in", dpi=1000)

Internal links to URL vs. site level (Andreas Voniatis, November 2021)

The plot above confirms our previous comments that the homepage and pages directly linked to it receive the lion’s share of links.


With the scales as they are, we don’t have much of a view of the distribution of the lower levels. We’ll modify this by taking a logarithm of the y-axis:

# Distribution of internal links to URL by site level, with a log-scaled y-axis
from mizani.formatters import comma_format

intlink_dist_plt = (ggplot(redir_live_urls, aes(x = 'crawl_depth', y = 'no_internal_links_to_url')) + 
                    geom_boxplot(fill="blue", alpha = 0.8) +
                    labs(y = '# Internal Links to URL', x = 'Site Level') + 
                    scale_y_log10(labels = comma_format()) + 
                    theme_classic() +            
                    theme(legend_position = 'none'))

intlink_dist_plt.save(filename="images/1_log_intlink_dist_plt.png", height=5, width=5, units="in", dpi=1000)

Internal links to URL vs. site level, log scale (Andreas Voniatis, November 2021)

The above shows the same distribution of links with the logarithmic view, which helps us to confirm the distribution means for the lower levels. It’s much easier to visualize.

The disparity between the first two site levels and the rest of the site indicates a skewed distribution.


Accordingly, I will take a logarithm of the internal links, which will help normalize the distribution.
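The transformation itself is not shown here, so the following is only a sketch of one way to derive a `log_intlinks` column. The base (2) and the offset of 1 are assumptions chosen so that URLs with zero inbound links map to 0 rather than to negative infinity; the sample values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical inbound link counts, including a page with no links at all
sample = pd.DataFrame({"no_internal_links_to_url": [0, 1, 3, 15, 1023]})

# Assumption: log2(x + 1), so a count of 0 becomes 0 instead of -inf
sample["log_intlinks"] = np.log2(sample["no_internal_links_to_url"] + 1)
print(sample)
```

Any monotonic log transform would compress the heavy right tail in a similar way; the choice of base only rescales the axis.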

We now have the normalized number of links, which we will visualize:

# Distribution of log internal links to URL by site level
intlink_dist_plt = (ggplot(redir_live_urls, aes(x = 'crawl_depth', y = 'log_intlinks')) + 
                    geom_boxplot(fill="blue", alpha = 0.8) +
                    labs(y = '# Log Internal Links to URL', x = 'Site Level') + 
                    #scale_y_log10(labels = comma_format()) + 
                    theme_classic() +            
                    theme(legend_position = 'none'))

Log internal links to URL vs. site level (Andreas Voniatis, November 2021)

From the above, the distribution appears to be much less skewed, as the boxes (interquartile ranges) have a more gradual step change from site level to site level.

This allows us to analyze the data before diagnosing which URLs are underoptimized from an internal linking perspective.


Quantify the problems

The code below calculates the 35th percentile (the 0.35 quantile, in data science terms) of the log internal links for each site depth.

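# internal links in under/over indexing at site level
# count of URLs under indexed for internal link counts

# Helper inferred from the renamed column 'log_intlinks_quantile_lower' (assumption)
def quantile_lower(x):
    return x.quantile(0.35)

quantiled_intlinks = redir_live_urls.groupby('crawl_depth').agg({'log_intlinks': [quantile_lower]}).reset_index()
# Flatten the two-level column index, e.g. ('crawl_depth', '') becomes 'crawl_depth_'
quantiled_intlinks.columns = ['_'.join(col) for col in quantiled_intlinks.columns]
quantiled_intlinks = quantiled_intlinks.rename(columns = {'crawl_depth_': 'crawl_depth', 
                                                          'log_intlinks_quantile_lower': 'sd_intlink_lowqua'})

Crawl depth and internal links (Andreas Voniatis, November 2021)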

The above shows the calculations. The numbers are not meaningful to an SEO practitioner at this point, as they are arbitrary and serve only to provide a cutoff for under-linked URLs at each site level.

Now that we have the table, we’re going to merge it with the main dataset to determine, row by row, whether each URL is under-linked or not.


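# join quantiles to main df and then count
redir_live_urls_underidx = redir_live_urls.merge(quantiled_intlinks, on = 'crawl_depth', how = 'left')

# Assumed helper (not shown in the source): 1 if the URL falls below its depth's cutoff, else 0
def sd_intlinkscount_underover(row):
    return 1 if row['log_intlinks'] < row['sd_intlink_lowqua'] else 0

redir_live_urls_underidx['sd_int_uidx'] = redir_live_urls_underidx.apply(sd_intlinkscount_underover, axis=1)
# Treat every 'Not Set' URL as under-linked; otherwise keep the computed flag
redir_live_urls_underidx['sd_int_uidx'] = np.where(redir_live_urls_underidx['crawl_depth'] == 'Not Set', 1,
                                                   redir_live_urls_underidx['sd_int_uidx'])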


We now have a data frame with each under-linked URL marked as 1 in the ‘sd_int_uidx’ column.

This allows us to sum the number of under-linked pages by site depth:
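The same flag can also be computed without a row-wise `apply`. The sketch below is an alternative under stated assumptions: the data is a made-up sample, and the strict `<` comparison against the per-depth 0.35 quantile is a choice of this example rather than something the article specifies.

```python
import pandas as pd

# Hypothetical URLs with a log-scaled link count per crawl depth
urls = pd.DataFrame({
    "url": ["/a", "/b", "/c", "/d", "/e", "/f"],
    "crawl_depth": ["1", "1", "1", "2", "2", "2"],
    "log_intlinks": [6.0, 5.0, 1.0, 2.0, 0.0, 3.0],
})

# Per-depth cutoff at the 0.35 quantile, broadcast back onto every row of its group
urls["sd_intlink_lowqua"] = (
    urls.groupby("crawl_depth")["log_intlinks"].transform(lambda s: s.quantile(0.35))
)

# Flag URLs strictly below their depth's cutoff (the strict '<' is an assumption)
urls["sd_int_uidx"] = (urls["log_intlinks"] < urls["sd_intlink_lowqua"]).astype(int)
print(urls)
```

Using `transform` avoids the separate merge step, because the cutoff is aligned to the original rows as it is computed.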

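# Summarise int_udx by site level
intlinks_agged = redir_live_urls_underidx.groupby('crawl_depth').agg({'sd_int_uidx': ['sum', 'count']}).reset_index()
# Flatten the two-level column index so the names become e.g. 'sd_int_uidx_sum'
intlinks_agged.columns = ['_'.join(col) for col in intlinks_agged.columns]
intlinks_agged = intlinks_agged.rename(columns = {'crawl_depth_': 'crawl_depth'})
intlinks_agged['sd_uidx_prop'] = intlinks_agged.sd_int_uidx_sum / intlinks_agged.sd_int_uidx_count * 100
print(intlinks_agged)
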
  crawl_depth  sd_int_uidx_sum  sd_int_uidx_count  sd_uidx_prop
0            0                0                  1      0.000000
1            1               41                 70     58.571429
2            2               66                303     21.782178
3            3              110                378     29.100529
4            4              109                347     31.412104
5            5               68                253     26.877470
6            6               63                194     32.474227
7            7                9                 96      9.375000
8            8                6                 33     18.181818
9            9                6                 19     31.578947
10          10                0                  5      0.000000
11          11                0                  1      0.000000
12          12                0                  1      0.000000
13          13                0                  2      0.000000
14          14                0                  1      0.000000
15     Not Set             2351               2351    100.000000

We now see that although pages at site depth 1 have an above-average number of links per URL, 41 of them are still under-linked.

To be more visual:

# plot the table
depth_uidx_plt = (ggplot(intlinks_agged, aes(x = 'crawl_depth', y = 'sd_int_uidx_sum')) + 
                    geom_bar(stat="identity", fill="blue", alpha = 0.8) +
                    labs(y = '# Under Linked URLs', x = 'Site Level') + 
                    scale_y_log10() + 
                    theme_classic() +            
                    theme(legend_position = 'none'))

depth_uidx_plt.save(filename="images/1_depth_uidx_plt.png", height=5, width=5, units="in", dpi=1000)

Under-linked URLs by site level (Andreas Voniatis, November 2021)

With the exception of the XML sitemap URLs, the distribution of under-linked URLs looks normal, as indicated by its almost bell-like shape. Most of the under-linked URLs are found at site levels 3 and 4.


Exporting the list of under-linked URLs

Now that we have a handle on the under-linked URLs by site level, we can export the data and come up with creative solutions to fill the gaps at each site depth, as shown below.

# data dump of under-linked URLs, sorted by site level and internal link count
underlinked_urls = redir_live_urls_underidx.loc[redir_live_urls_underidx.sd_int_uidx == 1]
underlinked_urls = underlinked_urls.sort_values(['crawl_depth', 'no_internal_links_to_url'])

Sitebulb data (Andreas Voniatis, November 2021)
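The export step itself is not shown above. One way to write the list out is pandas’ `DataFrame.to_csv`; the sketch below uses a made-up sample and an in-memory buffer so that it runs anywhere. Passing a file path instead of the buffer (any path would be your own choice, none is given in the article) writes the file to disk.

```python
import io
import pandas as pd

# Hypothetical flagged URLs; sd_int_uidx == 1 marks an under-linked page
flagged = pd.DataFrame({
    "url": ["/x", "/y", "/z", "/w"],
    "crawl_depth": ["2", "1", "2", "1"],
    "no_internal_links_to_url": [3, 0, 1, 50],
    "sd_int_uidx": [1, 1, 1, 0],
})

# Keep the under-linked rows, least-linked first within each depth
to_export = flagged.loc[flagged["sd_int_uidx"] == 1]
to_export = to_export.sort_values(["crawl_depth", "no_internal_links_to_url"])

# Write CSV text to an in-memory buffer; a file path would work the same way
buffer = io.StringIO()
to_export.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Sorting before the export puts the pages with the fewest links at the top of each site level, which is usually where link-building recommendations start.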

Other data science techniques for internal linking

We briefly covered the motivation for improving a site’s internal linking before exploring how internal linking is distributed on the site by site level.


Next, we quantified the extent of the under-linking problem, both numerically and visually, before exporting the results for recommendations.

Of course, site level is only one aspect of internal linking that can be explored and analyzed statistically.

Other aspects of internal linking that could benefit from data science techniques include, but are not limited to:

  • Offsite page-level authority.
  • Anchor text relevance.
  • Search intent.
  • Search user journey.

What aspects would you like to see covered?

Please leave a comment below.

Featured Image: Shutterstock / Optimarc
