For those of you who don’t know, a crawler is a piece of software that crawls a site in the same way as a search engine robot. Crawling a site allows you to understand its architecture and observe any errors.
When you launch a crawler on a website, you will have access to a large amount of very useful information: 404 errors, TITLE tag of pages, page status, page depth, weight, number of incoming and outgoing links, etc.
Crawling a site allows you to quickly get an idea of a website’s SEO health.
There are several types of software, including freeware, paid-for software and online services.
Xenu is probably the best-known crawler of all. Simple and lightweight, it is a pleasure to use, even if it is getting a bit old.
When you launch Xenu, all you have to do is enter the address of the website you want to crawl. A few minutes later, or a few hours depending on the size of the site, you’ll have access to all your data. As with all crawlers, your data is classified by URL in a table.
Crawling the site will give you access to the following data:
- URL: address of each page
- Status: page found, 404 error, server error, 301/302 redirect, domain name not found, authentication required
- Title: the anchor text of the link pointing to the page, not the content of the TITLE tag
- Date: date the file was created
- Level: depth of the page from the home page, i.e. how many clicks are needed to reach it
- Out Links: number of outgoing links from the page
- In Links: number of links pointing to the page
- Duration: access time
- Charset: the character set used
- Description: content of the meta description tag
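To make the idea concrete, here is a minimal sketch of what such a crawler does internally: a breadth-first walk that records status, title, depth and out-link count per URL. The in-memory `SITE` dictionary stands in for real HTTP fetches and is purely an illustrative assumption, not how Xenu is implemented.

```python
from collections import deque

# Hypothetical in-memory site: URL -> (status, title, linked URLs).
# A real crawler would fetch each URL over HTTP and parse the HTML instead.
SITE = {
    "/": (200, "Home", ["/about", "/blog"]),
    "/about": (200, "About", ["/"]),
    "/blog": (200, "Blog", ["/missing"]),
    "/missing": (404, "", []),
}

def crawl(start):
    """Breadth-first crawl recording status, title, depth and out-links."""
    seen = {start: 0}          # URL -> depth (clicks from the start page)
    queue = deque([start])
    report = {}
    while queue:
        url = queue.popleft()
        status, title, links = SITE.get(url, (404, "", []))
        report[url] = {"status": status, "title": title,
                       "depth": seen[url], "out_links": len(links)}
        for link in links:
            if link not in seen:         # each page is visited only once
                seen[link] = seen[url] + 1
                queue.append(link)
    return report
```

Because the walk is breadth-first, the recorded depth is exactly the "minimum number of clicks from the home page" metric that crawlers report.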
Data export:
With Xenu, you can export your data in CSV format and then process it in Excel. You can also generate a Google sitemap automatically.
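Once the CSV is exported, a few lines of Python can replace the Excel step for simple checks, such as pulling out every 404. The column names below are illustrative, not Xenu's exact export header, so check your own file first.

```python
import csv
import io

# A small sample mimicking a crawler CSV export (column names are
# assumptions; inspect the header of your actual export).
EXPORT = """Address,Status,Title,Level
http://example.com/,200,Home,0
http://example.com/old,404,not found,1
http://example.com/blog,200,Blog,1
"""

rows = list(csv.DictReader(io.StringIO(EXPORT)))
# Keep only the URLs that returned a 404 - these need fixing or redirecting.
broken = [r["Address"] for r in rows if r["Status"] == "404"]
print(broken)
```

For a real export you would pass `open("export.csv", newline="")` to `csv.DictReader` instead of the inline sample.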
It is also possible to export the data to GraphViz, a free graphing tool that runs on Windows, Mac and Linux. Once GraphViz has been installed, you can generate a graph showing the tree structure of your website.
However, as soon as the site has more than a few hundred pages, the graphs become almost unusable and GraphViz crashes constantly.
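The GraphViz export boils down to writing a DOT file from parent/child link pairs, which `dot` can then render. Here is a hedged sketch of that idea (the link pairs are made up for the example):

```python
# Sketch: build a GraphViz DOT description of a site tree from
# (parent URL, child URL) pairs, e.g. taken from a crawl report.
links = [("/", "/about"), ("/", "/blog"), ("/blog", "/post-1")]

lines = ["digraph site {"]
for parent, child in links:
    # Quote node names so slashes in URLs are valid DOT identifiers.
    lines.append(f'  "{parent}" -> "{child}";')
lines.append("}")
dot = "\n".join(lines)
print(dot)
```

Saving `dot` to `site.dot` and running `dot -Tpng site.dot -o site.png` would produce the tree image; as noted above, this stops being readable beyond a few hundred pages.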
LinkExaminer, a great alternative to Xenu, is another free, fast and lightweight crawler. It looks much like Xenu apart from a few details, and the principle is the same: you enter your address and launch the crawl. The rest takes care of itself.
Once the crawl is complete, you will have access to the following data:
- HTTP Code: code returned by the server (200, 301, 302, 404)
- HTTP Message: message returned by the server (OK, permanent redirect, etc.)
- Internal: whether the link is internal or external
- Nofollow: whether the link carries a nofollow attribute
- Dynamic: whether the URL is generated automatically
- Relative: whether the URL is relative or absolute
- SEO: indications on the TITLE or meta description (TITLE too long, description missing, meta keywords missing, etc.)
- TITLE: content of the TITLE tag
- Depth: depth of the page
- In: number of incoming links
- Out: number of outgoing links
- Last modified: date of last modification
- Link Type: type of link (a href, image, form, etc.)
- Similarity: tests the similarity of internal pages
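Several of these columns (Internal, Relative, Nofollow) are simple link classifications that you can reproduce yourself. The sketch below shows one plausible way to compute them with the standard library; the `site_host` parameter and the function itself are assumptions for illustration, not LinkExaminer's actual logic.

```python
from urllib.parse import urlparse

def classify(link, rel, site_host="example.com"):
    """Classify a link the way a crawler report might.

    `rel` is the value of the <a> tag's rel attribute (or None);
    `site_host` is an assumed parameter naming the crawled site.
    """
    parsed = urlparse(link)
    relative = parsed.netloc == ""            # no host part -> relative URL
    internal = relative or parsed.netloc == site_host
    return {
        "internal": internal,
        "relative": relative,
        "nofollow": "nofollow" in (rel or "").split(),
    }

print(classify("/contact", None))
print(classify("http://other.com/", "nofollow sponsored"))
```

A relative URL is always internal by definition; an absolute one is internal only when its host matches the crawled site.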
This crawler has the merit of providing more information than Xenu, though not all of it seems essential: knowing that the meta keywords tag is missing from a page isn't particularly useful.
Green lines show pages whose status is OK; red lines show URLs that return an error (404, server error, domain name not found, etc.).
The SEO column doesn't seem essential to me either: since I export the data into Excel for processing, checks such as TITLE length are handled there anyway.
On the other hand, you can ask for only one type of URL to be displayed (internal, 404, external, redirects, etc.), which is a very practical feature.
Data export:
As with Xenu, export is done in CSV format. You can also export to XML to generate a sitemap.
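The XML sitemap export follows the sitemaps.org format, which is easy to generate yourself from a list of OK URLs. This is a minimal sketch of that format, not the tool's exact output:

```python
import xml.etree.ElementTree as ET

# URLs that returned a 200 status in the crawl (sample data).
urls = ["http://example.com/", "http://example.com/blog"]

# The sitemaps.org namespace is required for search engines to accept it.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)
for u in urls:
    ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = u

sitemap = ET.tostring(urlset, encoding="unicode")
print(sitemap)
```

Writing `sitemap` to a `sitemap.xml` file at the site root is then enough for submission to search engines.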
Free SEO Toolkit from Microsoft
This crawler is presented a little differently: I find its interface less rustic than the other two. Unlike them, Microsoft has integrated a project manager, which is handy because the software displays your various projects directly when you launch it.
As with other crawlers, you launch a crawl by entering a web page. Then it runs.
The report is not exactly in the form of a single table. With SEO Toolkit, there are tabs giving access to different categories.
It’s true that on a small screen the display isn’t very practical.
The main advantage I see with this crawler is that it seems more robust than the others: when crawling sites with more than 100,000 pages, SEO Toolkit crashes far less often.
Another feature: SEO Toolkit displays some interesting stats, such as:
- Status Code Summary: number of pages per status (404, OK, 301, etc.)
- List of all outgoing links
- List of duplicate documents
- List of pages for each folder
- List of duplicate Title tags
- List of duplicate meta descriptions
- List of pages with dead links
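Two of these summaries, the status code breakdown and the duplicate titles list, are easy to recompute from any crawl export. The sketch below uses generic rows whose fields are assumptions, not SEO Toolkit's real export format:

```python
from collections import Counter

# Sample crawl rows (field names are illustrative assumptions).
pages = [
    {"url": "/",  "status": 200, "title": "Home"},
    {"url": "/a", "status": 200, "title": "Products"},
    {"url": "/b", "status": 404, "title": ""},
    {"url": "/c", "status": 200, "title": "Products"},
]

# Status Code Summary: how many pages returned each status.
status_summary = Counter(p["status"] for p in pages)

# Duplicate titles: any non-empty title used by more than one page.
title_counts = Counter(p["title"] for p in pages if p["title"])
duplicate_titles = [t for t, n in title_counts.items() if n > 1]

print(status_summary, duplicate_titles)
```

The same `Counter` pattern works for duplicate meta descriptions, which is the neighbouring report in the tool.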
Needless to say, this one only works on Windows!
This module must be installed on IIS 7, so you will need to install IIS first.
The perfect crawler doesn’t exist, which is why I often use two different crawlers. Personally, I use LinkExaminer for small sites and SEO Toolkit for large sites, because it crashes less. As for the interface, I find Microsoft’s visually pleasing, but not very ergonomic for small screens.