Comprehensive Analysis of Large-Scale FastQC Results using Python © Karobben


FastQC

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. © Illumina

When dealing with a large number of samples, it’s crucial to conduct quality control (QC) and scrutinize the results to identify any outliers. Filtering out low-quality data can significantly influence subsequent processes. Below is an example illustrating how we use all QC results from FastQC for cluster analysis.

Collecting summary information

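import os
import pandas as pd
from bs4 import BeautifulSoup

def Tab_grep(Sample):
    # Parse one FastQC HTML report and return a one-row table of module results
    with open(Sample) as handle:
        soup = BeautifulSoup(handle.read(), features='lxml')
    # The first <div class="summary"> lists one <li> per FastQC module
    Summary = soup.find_all('div', {"class": "summary"})[0]
    Reu_l = [Sample]
    Cla_l = ["Sample"]
    for line in Summary.find_all("li"):
        # Column name: the module title shown in the list item
        Cla_l += [line.get_text()]
        # Value: the first double-quoted attribute value in the item's markup
        # (taken from the status icon tag)
        Reu_l += [str(line).split('"')[1]]
    Result_TB = pd.DataFrame([Reu_l], columns=Cla_l)
    return Result_TB

# Collect one row per report found in the current directory
Result_TB = pd.DataFrame()
for Sample in [i for i in os.listdir() if "fastqc.html" in i]:
    Result_TB = pd.concat([Result_TB, Tab_grep(Sample)])

Result_TB.to_csv("QC.csv")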
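
FastQC also writes a tab-separated summary.txt (status, module name, input file name) inside each *_fastqc.zip archive, so the same table can be collected without parsing HTML. The sketch below is one possible way to read it, using only the standard library. The function names are hypothetical, the example lines are made up for illustration, and the assumption that each archive holds exactly one summary.txt member should be checked against your own output.

```python
import zipfile


def parse_summary(text):
    """Turn the contents of a FastQC summary.txt into {module: status}."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Each line is: STATUS <tab> module name <tab> input file name
        status, module, _filename = line.split("\t", 2)
        statuses[module] = status
    return statuses


def read_summary_from_zip(zip_path):
    """Read summary.txt from a *_fastqc.zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: exactly one member of the archive ends with summary.txt
        name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        return parse_summary(archive.read(name).decode("utf-8"))


# Made-up example content, not taken from a real report
example = (
    "PASS\tBasic Statistics\tS41_L002_R2_001.fastq.gz\n"
    "FAIL\tPer base sequence content\tS41_L002_R2_001.fastq.gz\n"
)
summary = parse_summary(example)
```

Each returned dictionary can be turned into one row with pd.DataFrame([summary]) and concatenated in the same way as the rows produced by Tab_grep.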

Plot in R

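library(ggplot2)
library(reshape2)

# Drop the first column (the row index written by pandas)
TB <- read.csv("QC.csv")[-1]
# Reshape to long format: one row per sample and module
TB_P <- melt(TB, id.vars = "Sample")
# Heatmap of module results per sample
ggplot() + geom_tile(data = TB_P, aes(Sample, variable, fill = value))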
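
The heatmap gives a visual overview. To pick out outliers programmatically, a minimal Python sketch like the following can rank samples by how many modules are flagged and group samples that share an identical profile. The sample names and statuses are hypothetical, and the PASS/WARN/FAIL labels are an assumption about how the results are encoded. Grouping identical profiles is a simple stand-in here, not a real clustering algorithm.

```python
# Hypothetical statuses for illustration only; real values come from the QC table
qc = {
    "sample_A": {"Basic Statistics": "PASS",
                 "Per base sequence content": "FAIL",
                 "Adapter Content": "PASS"},
    "sample_B": {"Basic Statistics": "PASS",
                 "Per base sequence content": "FAIL",
                 "Adapter Content": "PASS"},
    "sample_C": {"Basic Statistics": "PASS",
                 "Per base sequence content": "WARN",
                 "Adapter Content": "FAIL"},
}


def flag_counts(qc_table):
    """Count, per sample, the modules whose status is not PASS."""
    return {
        sample: sum(status != "PASS" for status in modules.values())
        for sample, modules in qc_table.items()
    }


def group_by_profile(qc_table):
    """Group samples that have exactly the same status for every module."""
    groups = {}
    for sample, modules in qc_table.items():
        profile = tuple(sorted(modules.items()))
        groups.setdefault(profile, []).append(sample)
    return list(groups.values())


counts = flag_counts(qc)
# Samples with the most flagged modules come first
ranked = sorted(counts, key=counts.get, reverse=True)
groups = group_by_profile(qc)
```

Samples at the top of ranked, or samples that end up alone in their group, are the first candidates for a closer look.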

Save the pictures in one file


import os
from bs4 import BeautifulSoup

def Pic_save(Sample, OUT="/home/wliu15/OUT.md"):
    # Append the sample name, the heading with id "M5", and the
    # "Per base sequence content" image of one report to a single file
    with open(Sample) as handle:
        soup = BeautifulSoup(handle.read(), features='lxml')
    with open(OUT, "a") as F:
        F.write(Sample+"\n")
        F.write(str(soup.find('h2', {"id": "M5"})))
        F.write(str(soup.find('img', {"alt": "Per base sequence content"})))

for Sample in [i for i in os.listdir() if "fastqc.html" in i]:
    Pic_save(Sample)
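
Writing the raw img tag works when the viewer renders inline HTML. If separate image files are preferred, the sketch below shows one way to decode an embedded image, assuming the src attribute of the tag is a base64 data URI (data:image/png;base64,...). The function name is hypothetical and the payload used for the round-trip check is made up, not a real FastQC plot.

```python
import base64


def data_uri_to_bytes(src):
    """Decode a 'data:<mime>;base64,<payload>' URI into raw bytes."""
    header, _, payload = src.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    return base64.b64decode(payload)


# Round-trip check with a made-up payload
fake_png = b"\x89PNG\r\n\x1a\n-demo-"
uri = "data:image/png;base64," + base64.b64encode(fake_png).decode("ascii")
decoded = data_uri_to_bytes(uri)
```

With a tag found by BeautifulSoup, the bytes returned for img["src"] could then be written to a file opened in "wb" mode, one file per sample.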

Overrepresented Sequences

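import os
import pandas as pd

TB = pd.DataFrame()
for Sample in [i for i in os.listdir() if "fastqc.html" in i]:
    # Read all HTML tables from the report once
    Tables = pd.read_html(Sample)
    # When a report has more than one table, the second one is taken
    # to be the overrepresented sequences
    if len(Tables) != 1:
        TMP = Tables[1]
        TMP['Sample'] = Sample
        TB = pd.concat([TB, TMP])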

   Sequence                                            Count  Percentage   Possible Source  Sample
0  CCGGTAGTTATTAAAGAATTCTTTTCCATGCCCAAATGCGGCACGTACTC  33857  0.178685926  No Hit           S41_L002_R2_001_fastqc.html
1  CTTGATTATGTCTGTTTCTGATAACTACATTGAACACTTTAATGCTGTTA  26767  0.141267276  No Hit           S41_L002_R2_001_fastqc.html
2  GAAAGTGTCAACGATACACCCATGTGGATAAAGGAACCCATAGCCTTTAA  19126  0.100940633  No Hit           S41_L002_R2_001_fastqc.html
0  GTCCTTTCGTACTAAAATATCATAATTTTTTAAAGATAGAAACCAACCTG  24695  0.145008391  No Hit           S15_L002_R2_001_fastqc.html
1  CTCGTCTTTTAAATAAATTTTAGCTTTTTGACTAAAAAATAAAATTCTAT  17359  0.101931592  No Hit           S15_L002_R2_001_fastqc.html
0  CTCGTCTTTTAAATAAATTTTAGCTTTTTGACTAAAAAATAAAATTCTAT  19353  0.111227104  No Hit           S44_L002_R2_001_fastqc.html
1  GTCCTTTCGTACTAAAATATCACAATTTTTTAAAGATAGAAACCAACCTG  18903  0.108640828  No Hit           S44_L002_R2_001_fastqc.html
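
A natural follow-up is to check which overrepresented sequences recur across samples, since a sequence shared by many libraries may point to a common contaminant or adapter. The sketch below does this with the (sequence, sample) pairs copied from the table above. The helper name is hypothetical, and only exact matches are counted, so sequences that differ by a single base are treated as different.

```python
from collections import defaultdict

# (sequence, sample) pairs copied from the table above
rows = [
    ("CCGGTAGTTATTAAAGAATTCTTTTCCATGCCCAAATGCGGCACGTACTC", "S41_L002_R2_001_fastqc.html"),
    ("CTTGATTATGTCTGTTTCTGATAACTACATTGAACACTTTAATGCTGTTA", "S41_L002_R2_001_fastqc.html"),
    ("GAAAGTGTCAACGATACACCCATGTGGATAAAGGAACCCATAGCCTTTAA", "S41_L002_R2_001_fastqc.html"),
    ("GTCCTTTCGTACTAAAATATCATAATTTTTTAAAGATAGAAACCAACCTG", "S15_L002_R2_001_fastqc.html"),
    ("CTCGTCTTTTAAATAAATTTTAGCTTTTTGACTAAAAAATAAAATTCTAT", "S15_L002_R2_001_fastqc.html"),
    ("CTCGTCTTTTAAATAAATTTTAGCTTTTTGACTAAAAAATAAAATTCTAT", "S44_L002_R2_001_fastqc.html"),
    ("GTCCTTTCGTACTAAAATATCACAATTTTTTAAAGATAGAAACCAACCTG", "S44_L002_R2_001_fastqc.html"),
]


def samples_per_sequence(pairs):
    """Map each sequence to the set of samples in which it was reported."""
    seen = defaultdict(set)
    for sequence, sample in pairs:
        seen[sequence].add(sample)
    return seen


# Keep only sequences reported in more than one sample
shared = {
    sequence: sorted(samples)
    for sequence, samples in samples_per_sequence(rows).items()
    if len(samples) > 1
}
```

With the full TB table, the same pairs can be obtained from its Sequence and Sample columns, for example with zip(TB["Sequence"], TB["Sample"]).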


https://karobben.github.io/2022/07/20/Python/fastqc_crawl/

Author

Karobben

Posted on

2022-07-20

Updated on

2024-01-11
