Optimizing internal links is important if you want the pages on your site to have enough authority to rank for their target keywords. By internal links, we mean links that pages on your website receive from other pages on the same site.
This matters because it is the basis on which Google and other search engines calculate a page’s importance relative to the other pages on your website.
It also affects how likely a user is to discover content on your site, and content discovery is the basis of Google’s PageRank algorithm.
Today, we’re exploring a data-driven approach to improving a website’s internal linking for more effective technical SEO. The goal is to ensure that internal domain authority is distributed optimally according to the structure of the site.
Improve internal link structures with data science
Our data-driven approach will focus on just one aspect of internal link architecture optimization: modeling the distribution of internal links by site depth, and then targeting the pages that lack links for their particular site depth.
We start by importing the libraries and data, and cleaning up the column names before previewing the data:
```python
import pandas as pd
import numpy as np

site_name = "ON24"
site_filename = "on24"
website = "www.on24.com"

# Import crawl data
crawl_data = pd.read_csv('data/' + site_filename + '_crawl.csv')

# Clean up column names (regex=False so '.', '(' and ')' are treated literally)
crawl_data.columns = crawl_data.columns.str.replace(' ', '_', regex=False)
crawl_data.columns = crawl_data.columns.str.replace('.', '', regex=False)
crawl_data.columns = crawl_data.columns.str.replace('(', '', regex=False)
crawl_data.columns = crawl_data.columns.str.replace(')', '', regex=False)
crawl_data.columns = map(str.lower, crawl_data.columns)

print(crawl_data.shape)
print(crawl_data.dtypes)
```

```
(8611, 104)

url                          object
base_url                     object
crawl_depth                  object
crawl_status                 object
host                         object
                              ...
redirect_type                object
redirect_url                 object
redirect_url_status          object
redirect_url_status_code     object
unnamed:_103                float64
Length: 104, dtype: object
```
The above shows an overview of the data imported from the Sitebulb desktop crawler application. There are over 8,000 rows, and not all of them are exclusive to the domain, as the crawl also includes resource URLs and external outbound link URLs.
We also have over 100 columns, most of which are superfluous to our requirements, so we will need to select a subset of columns.
Before we get into this area, however, we want to quickly see how many site levels there are:
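The command that produced the breakdown below isn’t shown in the original excerpt; a minimal sketch that would generate an equivalent count per site level (assuming the cleaned column names from the step above) is:

```python
# Count URLs at each site level (crawl depth); grouping keeps the levels
# in index order, matching the breakdown shown below.
print(crawl_data.groupby('crawl_depth').size())
```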
```
crawl_depth
0             1
1            70
10            5
11            1
12            1
13            2
14            1
2           303
3           378
4           347
5           253
6           194
7            96
8            33
9            19
Not Set    2351
dtype: int64
```
From the above, we can see that there are 14 site levels, and that most of the URLs don’t sit within the site architecture at all but are only found in the XML sitemap (the “Not Set” rows).
You may notice that pandas (the Python package for data handling) orders the site levels lexically (0, 1, 10, 11, … 2, 3) rather than numerically.
This is because, at this point, the site levels are strings rather than numbers. This will be adjusted in subsequent code, as it would otherwise affect the data visualization (“viz”).
Now we are going to filter the rows and select the columns.
```python
# Filter for redirected and live links
redir_live_urls = crawl_data[['url', 'crawl_depth', 'http_status_code', 'indexable_status',
                              'no_internal_links_to_url', 'host', 'title']]
redir_live_urls = redir_live_urls.loc[redir_live_urls.http_status_code.str.startswith(('2'), na=False)]
redir_live_urls['crawl_depth'] = redir_live_urls['crawl_depth'].astype('category')
redir_live_urls['crawl_depth'] = redir_live_urls['crawl_depth'].cat.reorder_categories(
    ['0', '1', '2', '3', '4', '5', '6', '7',
     '8', '9', '10', '11', '12', '13', '14', 'Not Set'])
redir_live_urls = redir_live_urls.loc[redir_live_urls.host == website]
del redir_live_urls['host']

print(redir_live_urls.shape)
```

```
(4055, 6)
```

By filtering the rows for indexable URLs and selecting the relevant columns, we now have a more streamlined data frame (think Pandas’ version of a spreadsheet tab).
Explore the distribution of internal links
Now we’re ready to visualize the data and get a feel for how internal links are distributed globally and by site depth.
```python
from plotnine import *
import matplotlib.pyplot as plt

pd.set_option('display.max_colwidth', None)
%matplotlib inline

# Distribution of internal links to URL by site level
ove_intlink_dist_plt = (ggplot(redir_live_urls, aes(x = 'no_internal_links_to_url')) +
                        geom_histogram(fill = "blue", alpha = 0.6, bins = 7) +
                        labs(y = '# Internal Links to URL') +
                        theme_classic() +
                        theme(legend_position = 'none'))
ove_intlink_dist_plt

From the above, we can see that most pages have little to no internal links, so improving internal linking would be a significant opportunity to improve the SEO here.
Let’s see some statistics at the site level.
(Summary statistics of internal links to URL by site level: count, mean, standard deviation, and quantiles; table not reproduced here.)
The table above shows the approximate breakdown of internal links by site level, including the mean and the median (the 50% quantile).
This is accompanied by the variation within each site level (std, the standard deviation), which tells us how close pages are to their site-level average; that is, how consistent the distribution of internal links is around the mean.
From the above, we can see that the site-level average, with the exception of the home page (crawl depth 0) and the pages directly linked from it (crawl depth 1), ranges from 0 to 4 internal links per URL.
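The command behind this summary isn’t included in the excerpt; a groupby/describe along these lines would produce an equivalent table (a sketch, not necessarily the author’s exact code):

```python
# Summary statistics (count, mean, std, and quantiles) of internal links
# per URL at each site level; to_numeric guards against the link count
# column having been read in as text.
intlink_stats = (redir_live_urls
                 .assign(intlinks = pd.to_numeric(redir_live_urls['no_internal_links_to_url']))
                 .groupby('crawl_depth')['intlinks']
                 .describe())
print(intlink_stats)
```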
For a more visual approach:
```python
# Distribution of internal links to URL by site level
intlink_dist_plt = (ggplot(redir_live_urls, aes(x = 'crawl_depth', y = 'no_internal_links_to_url')) +
                    geom_boxplot(fill = "blue", alpha = 0.8) +
                    labs(y = '# Internal Links to URL', x = 'Site Level') +
                    theme_classic() +
                    theme(legend_position = 'none'))
intlink_dist_plt.save(filename = "images/1_intlink_dist_plt.png",
                      height = 5, width = 5, units = "in", dpi = 1000)
intlink_dist_plt
```

The plot above confirms our earlier observation that the homepage and the pages directly linked from it receive the lion’s share of links.
With the scales as they are, we don’t have much of a view of the distribution of the lower levels. We’ll modify this by taking a logarithm of the y-axis:
```python
# Distribution of internal links to URL by site level
from mizani.formatters import comma_format

intlink_dist_plt = (ggplot(redir_live_urls, aes(x = 'crawl_depth', y = 'no_internal_links_to_url')) +
                    geom_boxplot(fill = "blue", alpha = 0.8) +
                    labs(y = '# Internal Links to URL', x = 'Site Level') +
                    scale_y_log10(labels = comma_format()) +
                    theme_classic() +
                    theme(legend_position = 'none'))
intlink_dist_plt.save(filename = "images/1_log_intlink_dist_plt.png",
                      height = 5, width = 5, units = "in", dpi = 1000)
intlink_dist_plt
```

The above shows the same distribution of links on a logarithmic scale, which makes the averages for the lower levels much easier to see and confirm.
Given the disparity between the first two site levels and the rest of the site, this indicates a skewed distribution.
Accordingly, I will take a logarithm of the internal links, which will help normalize the distribution.
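The transformation step itself isn’t shown in the excerpt; a minimal sketch that would create the ‘log_intlinks’ column used below (the log base and the +1 offset are assumptions, since the log of zero is undefined for pages with no links):

```python
# Log-transform the internal link counts; the +1 offset (an assumption here)
# avoids taking the log of zero for pages with no internal links.
link_counts = pd.to_numeric(redir_live_urls['no_internal_links_to_url'])
redir_live_urls['log_intlinks'] = np.log2(link_counts + 1)
```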
We now have the normalized number of links, which we will visualize:
```python
# Distribution of internal links to URL by site level
intlink_dist_plt = (ggplot(redir_live_urls, aes(x = 'crawl_depth', y = 'log_intlinks')) +
                    geom_boxplot(fill = "blue", alpha = 0.8) +
                    labs(y = '# Log Internal Links to URL', x = 'Site Level') +
                    #scale_y_log10(labels = comma_format()) +
                    theme_classic() +
                    theme(legend_position = 'none'))
intlink_dist_plt
```

From the above, the distribution appears to be much less skewed, as the boxes (interquartile ranges) have a more gradual step change from site level to site level.
This allows us to analyze the data before diagnosing which URLs are underoptimized from an internal linking perspective.
Quantify the problems
The code below calculates the lower 35th quantile (quantile is the data science term for percentile) of internal links for each site depth, which we will use as the cutoff.
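The quantile_lower helper referenced in the next block isn’t defined in the excerpt; a minimal sketch consistent with the 35th-percentile cutoff described above would be:

```python
# Hypothetical helper (not shown in the original): returns the 35th
# percentile of a series, used as the under-linking cutoff per site level.
def quantile_lower(x):
    return x.quantile(0.35)
```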
```python
# Internal link under/over indexing at site level
# Count of URLs under-indexed for internal link counts
quantiled_intlinks = redir_live_urls.groupby('crawl_depth').agg({'log_intlinks': [quantile_lower]}).reset_index()

# Flatten the multi-level column names produced by agg()
# (this step is implied by the renames below but not shown in the original)
quantiled_intlinks.columns = ['_'.join(col) for col in quantiled_intlinks.columns]

quantiled_intlinks = quantiled_intlinks.rename(columns = {'crawl_depth_': 'crawl_depth',
                                                          'log_intlinks_quantile_lower': 'sd_intlink_lowqua'})
quantiled_intlinks
```

The above shows the calculated cutoffs. The numbers won’t mean much to an SEO practitioner at this point, as they are somewhat arbitrary; their purpose is simply to provide a cutoff for under-linked URLs at each site level.
Now that we have the table of cutoffs, we’re going to merge it with the main dataset to determine, row by row, whether each URL is under-linked or not.
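The sd_intlinkscount_underover function applied in the next block isn’t defined in the excerpt either; a minimal sketch of such a row-wise flag (column names taken from the merge below) could look like this:

```python
# Hypothetical helper (not shown in the original): flags a URL as
# under-linked (1) when its log link count falls below the site-level
# cutoff, otherwise 0.
def sd_intlinkscount_underover(row):
    if row['log_intlinks'] < row['sd_intlink_lowqua']:
        return 1
    return 0
```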
```python
# Join quantiles to the main dataframe and then count
redir_live_urls_underidx = redir_live_urls.merge(quantiled_intlinks, on = 'crawl_depth', how = 'left')

redir_live_urls_underidx['sd_int_uidx'] = redir_live_urls_underidx.apply(sd_intlinkscount_underover, axis=1)
redir_live_urls_underidx['sd_int_uidx'] = np.where(redir_live_urls_underidx['crawl_depth'] == 'Not Set', 1,
                                                   redir_live_urls_underidx['sd_int_uidx'])

redir_live_urls_underidx
```
We now have a data frame in which each under-linked URL is marked with a 1 in the ‘sd_int_uidx’ column.
This allows us to sum the number of under-linked pages by site depth:
```python
# Summarise sd_int_uidx by site level
intlinks_agged = redir_live_urls_underidx.groupby('crawl_depth').agg({'sd_int_uidx': ['sum', 'count']}).reset_index()

# Flatten the multi-level column names produced by agg()
# (implied by the column references below but not shown in the original)
intlinks_agged.columns = ['_'.join(col) for col in intlinks_agged.columns]

intlinks_agged = intlinks_agged.rename(columns = {'crawl_depth_': 'crawl_depth'})
intlinks_agged['sd_uidx_prop'] = intlinks_agged.sd_int_uidx_sum / intlinks_agged.sd_int_uidx_count * 100
print(intlinks_agged)
```
```
   crawl_depth  sd_int_uidx_sum  sd_int_uidx_count  sd_uidx_prop
0            0                0                  1      0.000000
1            1               41                 70     58.571429
2            2               66                303     21.782178
3            3              110                378     29.100529
4            4              109                347     31.412104
5            5               68                253     26.877470
6            6               63                194     32.474227
7            7                9                 96      9.375000
8            8                6                 33     18.181818
9            9                6                 19     31.578947
10          10                0                  5      0.000000
11          11                0                  1      0.000000
12          12                0                  1      0.000000
13          13                0                  2      0.000000
14          14                0                  1      0.000000
15     Not Set             2351               2351    100.000000
```
We can now see that despite pages at site depth 1 having an above-average number of links per URL, there are still 41 pages that are under-linked.
To be more visual:
```python
# Plot the table
depth_uidx_plt = (ggplot(intlinks_agged, aes(x = 'crawl_depth', y = 'sd_int_uidx_sum')) +
                  geom_bar(stat = "identity", fill = "blue", alpha = 0.8) +
                  labs(y = '# Under Linked URLs', x = 'Site Level') +
                  scale_y_log10() +
                  theme_classic() +
                  theme(legend_position = 'none'))
depth_uidx_plt.save(filename = "images/1_depth_uidx_plt.png",
                    height = 5, width = 5, units = "in", dpi = 1000)
depth_uidx_plt
```

With the exception of the XML sitemap URLs, the distribution of under-linked URLs looks roughly normal, as indicated by the near bell shape. Most of the under-linked URLs sit at site levels 3 and 4.
Exporting the list of under-linked URLs
Now that we have a handle on the under-linked URLs by site level, we can export the data and come up with creative solutions to close those gaps in site depth, as shown below.
```python
# Data dump of under-linked URLs
underlinked_urls = redir_live_urls_underidx.loc[redir_live_urls_underidx.sd_int_uidx == 1]
underlinked_urls = underlinked_urls.sort_values(['crawl_depth', 'no_internal_links_to_url'])
underlinked_urls.to_csv('exports/underlinked_urls.csv')
underlinked_urls
```

Other data science techniques for internal linking
We briefly covered the motivation for improving a site’s internal linking before exploring how internal linking is distributed on the site by site level.
Next, we quantified the extent of the under-linking problem both numerically and visually, before exporting the results to inform recommendations.
Of course, site level is only one aspect of internal linking that can be analyzed statistically from crawl data.
Other aspects of internal linking to which data science techniques could be applied include, but are obviously not limited to:
- Offsite page-level authority.
- Anchor text relevance.
- Search intent.
- Search user journey.
What aspects would you like to see covered?
Please leave a comment below.
Featured Image: Shutterstock / Optimarc