Posted 2021-02-28, updated 2024-01-11

Find Dead Links in Your Blog/Website

Links die over time. For example, some posts get deleted, some old servers are shut down, and some links break because the host moved to another platform. These dead links may hurt your website's ranking and usability, so it is important to check for them and remove them periodically.

Dead Link Checker

This is one of the friendliest and most powerful tools for checking dead links.
You can check a single page or a whole website at a time.
The free service checks the first 2000 links of a website. To check more than that, you need a subscription.

Check a single page	Check the whole website

Delete the Dead Links

## delete the page0 links generated by hexo theme-icarus
sed -i '/page\/0\//d' linsk.csv
## delete links to csdn and cnblogs blogs
sed -i  '/https:\/\/blog.csdn.net\//d' linsk.csv
sed -i  '/cnblogs.com/d' linsk.csv
## delete links to blogs from Yunqi Community (云栖社区)
sed -i  '/yq.aliyun.com/d' linsk.csv
## some links reported as dead actually work for me
sed -i  '/edrawsoft.cn/d' linsk.csv
sed -i  '/cj.weather.com.cn/d' linsk.csv


## delete every line that contains a given link
links=https://karobben.github.io/2020/07/28/Bioinfor/BioDB/
## escape the slashes so the link can be used as a sed pattern
ii=$(echo "$links"|sed 's=/=\\/=g')
sed -i "/$ii/d" example.md
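
The same clean-up can also be done without sed. Here is a minimal Python sketch, assuming the report is a plain-text file with one link per line. The helper name drop_lines and the sample data are hypothetical, made up for illustration:

```python
from pathlib import Path
import tempfile


def drop_lines(path, needles):
    """Remove every line of `path` that contains any string in `needles`.

    Plain substring matching is used, so slashes and dots in a URL
    need no escaping (unlike a sed pattern).
    Returns the number of lines removed.
    """
    file = Path(path)
    lines = file.read_text(encoding="utf-8").splitlines(keepends=True)
    kept = [line for line in lines if not any(n in line for n in needles)]
    file.write_text("".join(kept), encoding="utf-8")
    return len(lines) - len(kept)


if __name__ == "__main__":
    # Hypothetical sample data standing in for the exported report.
    with tempfile.TemporaryDirectory() as tmp:
        report = Path(tmp) / "links.csv"
        report.write_text(
            "https://blog.csdn.net/some/post\n"
            "https://example.com/page/0/\n"
            "https://example.org/still-dead\n",
            encoding="utf-8",
        )
        removed = drop_lines(report, ["page/0/", "https://blog.csdn.net/"])
        print(removed)
        print(report.read_text(encoding="utf-8"), end="")
```

Because the match is a plain substring test, the link from the last sed example could be passed as it is, with no escaping step.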

Webmaster Tools (站长工具)

This tool can only check a single page.


Local Dead Link Check

Python

Deadlinks

GitHub: butuzov

## Install
pip install deadlinks

## run
deadlinks gobyexample.com -n 10 -e -d play.golang.org -d github.com


© butuzov

I am currently running this version:

Name: deadlinks
Version: 0.3.2
Summary: Health checks for your documentation links.
Home-page: https://github.com/butuzov/deadlinks
Author: Oleg Butuzov
Author-email: butuzov@made.ua
License: Apache License 2.0
Location: ~/.local/lib/python3.7/site-packages
Requires: reppy, six, click, requests, urllib3
Required-by:

But it tends to stop working when there are too many dead links.

So, I wanted to add a timeout argument, but I failed.

Looking at the command, we can see that the entry point is the main function in __main__.py:

cat $(which deadlinks)

## -*- coding: utf-8 -*-
import re
import sys
from deadlinks.__main__ import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

So, let's look at part of __main__.py:

from .settings import Settings

def main(ctx: click.Context, url: str, **opts: Options) -> None:
    """ Check links in your (web) documentation for accessibility. """

    try:
        settings = Settings(url, **opts)
        crawler = Crawler(settings)

        driver = exporters[str(opts['export'])]

        # Instantion of the exported before starting crawling will alow us to
        # have progress report, while we crawling website.
        exporter = driver(crawler, **opts)

It looks like the crawling is done by Crawler, and the arguments are stored in **opts, which is handled by Settings.
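
Since adding a timeout inside the package did not work out, one possible workaround is to bound the whole run from outside. Below is a minimal sketch using Python's subprocess module. The helper run_with_timeout and the 600-second limit are my own assumptions, not part of deadlinks, and this limits the total run time rather than the time spent on each link:

```python
import shutil
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run `cmd` and return its exit code, or None if it ran too long.

    subprocess.run kills the child process when the timeout expires
    and then raises TimeoutExpired, which is caught here.
    """
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    if shutil.which("deadlinks") is None:
        print("deadlinks is not installed", file=sys.stderr)
    else:
        # Same arguments as the deadlinks example; 600 s is an arbitrary limit.
        cmd = ["deadlinks", "gobyexample.com", "-n", "10", "-e",
               "-d", "play.golang.org", "-d", "github.com"]
        code = run_with_timeout(cmd, 600)
        print("timed out" if code is None else f"exit code {code}")
```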

Write Your Own Tool

In my case, all my HTML files are in the public directory.

So, my strategy is:

find all <a> and <img> tags (grep)
remove the redundant parts (awk)
sort all href and src values and remove duplicates (sort|uniq)
filter out links known to be reliable

## command
cat $(find public/ -name "*.html")| \
  tr "<" "\n"|grep -E "^a|^img"| \
  tr ' ' '\n'|grep -E "href|src"| \
  awk -F">" '{print $1}'|awk -F"=" '{print $2}'| \
  awk -F"#" '{print $1}' |sed 's/"//g'| \
  sort|uniq| \
  grep -v ^\"\#|grep -v ^/ |\
  grep -vE "https://karobben.github.io/|javascri" |\
  grep -E "^http|^ftp" |sed 's/^/[1](/;s/$/)/' > link_list.md
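
The extraction steps above can also be sketched in Python with the standard html.parser module, which reads the tags properly instead of splitting text on "<" and spaces. This is only an approximation of the shell pipeline, not a drop-in replacement, and the names LinkCollector and external_links are hypothetical:

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urldefrag


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Drop the "#fragment" part, like the awk -F"#" step.
                self.links.add(urldefrag(value).url)


def external_links(html, own_site="https://karobben.github.io/"):
    """Return the sorted external http/ftp links found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return sorted(
        link for link in parser.links
        if link.startswith(("http", "ftp")) and not link.startswith(own_site)
    )


if __name__ == "__main__":
    found = set()
    for page in Path("public").rglob("*.html"):
        text = page.read_text(encoding="utf-8", errors="ignore")
        found.update(external_links(text))
    # Same output format as link_list.md
    for link in sorted(found):
        print(f"[1]({link})")
```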

Then, run a Python script to test each link in link_list.md:

import multiprocessing as mp
import time, re
import requests

## Read the links from the md file and store them in a list
Input = "link_list.md"
with open(Input,'r') as F:
  File = F.read()
pattern = re.compile(r'\(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+\)') # match pattern
List = pattern.findall(File)

## Clean the list: strip the surrounding parentheses
List_clean = []
for i in List:
  List_clean += [i.replace('(','').replace(")","")]

List = List_clean


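## Define a function to test a url
def RespTime(url,return_dict):
  # Check whether the page can be reached within 20 seconds
  # Note: requests.get does not raise on HTTP error statuses such as 404,
  # so those pages still count as reachable here
  try:
    r = requests.get(url, timeout=20)
    Result = url+"\t"+str(r.elapsed.total_seconds())
  except Exception:
    Result = url+"\tFailed"
  print(Result)
  return_dict[Result] = Result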


## run all links
if __name__ == '__main__':
    manager = mp.Manager()
    return_dict = manager.dict()
    jobs = []
    # one process per link; each one writes its result into the shared dict
    for i in List:
        p = mp.Process(target=RespTime, args=(i,return_dict))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()

    ##F_list = ['www.youtube.com', 'github.com']

    ## store the live links
    Link_ok = []
    for i in return_dict.values():
        if not i.endswith('\tFailed'):
            Link_ok += [i.split('\t')[0]]
            print(i)

    ## remove the live links; what remains are the failed ones
    print(len(List))
    for i in Link_ok:
        List.remove(i)

    ## print the failed links in the link_list.md format to run the check on them again
    for i in List:
        print("[1](",i,")",sep='')