DNA Seq Pipeline and General Infor statistics

DNA seq Pipeline

Quality Control

Quality control can use fastQC. In DNA-seq data, we may find lots of overrepresented sequences from mitochondria and it is totally normal. Not like RNA-seq, we do not expect errors in GC content.

mkdir fastQC/
fastqc -o ./fastQC/ -t 7 reads.fq

for loop for slumn system
Example file:

GFP-1_L1_1.fq.gz
GFP-1_L1_2.fq.gz
FQ_DB={dir for fastq}

for i in $(ls $FQ_DB/* );
do SAMPLE=$(echo $i|sed 's=FQ/==;s/.fq.gz//')
sed "s/Hi/$SAMPLE\_fQC/" Model.sh > script/$SAMPLE\_fQC.sh
echo fastqc -o fastQC -t 7 $i >> script/$SAMPLE\_fQC.sh
sbatch script/$SAMPLE\_fQC.sh
done

Align to the Genome

Choose appropriate short reads tools for your data. I heard that bwa was more suitable for short reads align, so it is frequently used in the miRNA-Seq pipeline. Bowtie is faster and can do better on longer reads alignment. According to this, if your reads are short like 50~70bp, you can choose bwa. If your reads are paired from 150~300bp, bowtie2 would work better.

In the post from David, 2015; the best short-reads alignment tool is segemehl. Meanwhile, bwa was more sensitive than bowtie2. According to the Benchmark, the best tools for DNA-Seq is segemehl or BWA-MEM, the best tools for RNA-Seq is segemehl

© ecSeq Bioinformatics

Scripts for align all your fq

Genome index

bwa index Genome.fa

Single end reads

In this Part, we will map all reads from the same directory to the reference genome by using bwa. Because programs would fail sometimes and we need to run them again. So, we can check and run only when the results don’t exist.

FQ_DB={Your Directory}
DB={Your Reference Genome}

for SAMPLE in $(ls $FQ_DB/*L001*|awk -F/ '{print $NF}'| awk -F_ '{print $1"_"$2}'| sort |uniq); do
if [ -f $SAMPLE.sorted.bam ]; then
echo $SAMPLE is done
else
echo $SAMPLE can not be find
sed "s/Hi/$SAMPLE/" Model.sh > script/$SAMPLE.sh
echo cat $FQ_DB/$SAMPLE* \| bwa mem $DB - -t 64 \|samtools view -S -b -\|samtools sort \> $SAMPLE.sorted.bam >> script/$SAMPLE.sh
echo samtools index $SAMPLE.sorted.bam >> script/$SAMPLE.sh
sbatch script/$SAMPLE.sh
fi
done

## Check unexpected size of file
rm $(du *.sorted.bam | awk '$1<=100'| awk '{print $2}')
## After removed failed files, we can excute the for loop again

for SAMPLE in $(ls $FQ_DB/*L001*|awk -F/ '{print $NF}'| awk -F_ '{print $1"_"$2}'| sort |uniq); do
if [ -f $SAMPLE.sorted.bam ]; then
echo $SAMPLE is done
else
echo $SAMPLE can not be find
sed "s/Hi/$SAMPLE/" Model.sh > script/$SAMPLE.sh
echo mv $SAMPLE.sorted.sam $SAMPLE.sorted.bam >> script/$SAMPLE.sh
echo samtools index $SAMPLE.sorted.bam >> script/$SAMPLE.sh
sbatch script/$SAMPLE.sh
fi
done

Then, move every thing to one directory.

mkdir Bam
mv *bam* Bam
rm *.err *.out # Clear logs

Paired end reads

Name Example of my Paired multiple-lines reads:

S40_L001_R1_001.fastq.gz
S40_L001_R2_001.fastq.gz
S40_L002_R1_001.fastq.gz
S40_L002_R2_001.fastq.gz
S41_L001_R1_001.fastq.gz
S41_L002_R1_001.fastq.gz

S40 is two lines paired-ends sample, S41 is two lines single-ends sample.

module load samtools bwa

mkdir script
mkdir Log
mkdir Bam
mkdir tmp

FQ_DB={Your Directory}
DB={Your Reference Genome}
#FQ_DB="../../RNA_RAW/*/*/"
#DB=DB/Genome.fa

# Get the unique sample names:
# ls $Directory/*.gz| awk -F"/" '{print $NF}'| sed 's/_L00[12]_R[12]_001.fastq.gz//'| sort |uniq

for SAMPLE in $(ls $FQ_DB/*.gz| awk -F"/" '{print $NF}'| sed 's/_L00[12]_R[12]_001.fastq.gz//'| sort |uniq); do
echo $SAMPLE;
# check the exist of the result
if [ -f Bam/$SAMPLE.sorted.bam ]; then
echo $SAMPLE is done
else
# check the Number of the files. 4 means paired and 2 means unpaireds
if [ $(ls $FQ_DB/$SAMPLE*|wc -l) -eq 4 ]; then
echo Paired
sed "s/Hi/$SAMPLE/" Model.sh > script/$SAMPLE.sh
echo cat $FQ_DB/$SAMPLE*L00[12]_R1_* \> tmp/$SAMPLE.R1.fq.gz >> script/$SAMPLE.sh
echo cat $FQ_DB/$SAMPLE*L00[12]_R2_* \> tmp/$SAMPLE.R2.fq.gz >> script/$SAMPLE.sh
echo bwa mem $DB tmp/$SAMPLE.R[12].fq.gz -t 64 \|samtools view -S -b -\|samtools sort \> $SAMPLE.sorted.bam >> script/$SAMPLE.sh
echo samtools index $SAMPLE.sorted.bam >> script/$SAMPLE.sh
echo mv $SAMPLE.sorted.bam $SAMPLE.sorted.bam.bai Bam >> script/$SAMPLE.sh
echo rm tmp/$SAMPLE.R[12].fq.gz >> script/$SAMPLE.sh
else
echo Single
sed "s/Hi/$SAMPLE/" Model.sh > script/$SAMPLE.sh
echo cat $FQ_DB/$SAMPLE* \| bwa mem $DB - -t 64 \|samtools view -S -b -\|samtools sort \> $SAMPLE.sorted.bam >> script/$SAMPLE.sh
echo samtools index $SAMPLE.sorted.bam >> script/$SAMPLE.sh
echo mv $SAMPLE.sorted.sam $SAMPLE.sorted.bam >> script/$SAMPLE.sh
echo samtools index $SAMPLE.sorted.bam >> script/$SAMPLE.sh
echo mv $SAMPLE.sorted.bam $SAMPLE.sorted.bam.bai Bam >> script/$SAMPLE.sh
fi
sbatch script/$SAMPLE.sh
fi
done

After checking the size of bam and re-run the failed files, each file has:

  • Bam: all bam results and bam index
  • Log: log files generated by sbatch. (Disposable)
  • tmp: temporarily combined reads from different lines and should be cleaned. (Disposable)
  • scripts: scripts for each sample from aligning to sort. (Disposable)

Basic information Statistics

.
└── Bam
│   ├── GFP-1.sorted.bam
│   └── GFP-2.sorted.bam
└── Model.sh
mkdir Bam_stats script

for i in $(ls Bam/*.bam); do
SAMPLE=$(echo $i| sed 's=Bam/==;s/.sorted.bam//');
if [ -f Bam_stats/$SAMPLE.stats.csv ]; then
echo $SAMPLE is done
else
sed "s/Hi/$SAMPLE\_BamStat/" Model.sh > script/$SAMPLE\_BamStat.sh
echo samtools stats $i \> Bam_stats/$SAMPLE.stats.csv >> script/$SAMPLE\_BamStat.sh
sbatch script/$SAMPLE\_BamStat.sh
fi
done

After jobs are done, we can grep the SN infor from all samples into one:

grep ^SN Bam_stats/*| sed 's/.stats.csv:SN//;s=Bam_stats/==' | awk -F"\t" '{OFS="\t"; print $1,$2,$3}'| sed 's/://' > SN_infor.csv

Insert size statistics

What is insert size
© Frances S. Turner, 2014

Why there are negative insert sizes?

Negative insert sizes can happen when the first read is mapped to the reverse strand. we can change it by using awk '{print sqrt(\$0^2)}'

Reference: accio, 2020
It might be help in ChIP-seq data.

for i in $(ls Bam/SJ_*.bam); do
SAMPLE=$(echo $i| sed 's=Bam/==;s/.sorted.bam//');
if [ -f Bam_stats/$SAMPLE.insert.csv ]; then
echo $SAMPLE is done
else
sed "s/Hi/$SAMPLE\_BamStat/" Model.sh > script/$SAMPLE\_BamInser.sh
echo samtools view -f66 $i \| cut -f9 \| awk "'{print sqrt(\$0^2)}'" \> Bam_stats/$SAMPLE.insert.csv >> script/$SAMPLE\_BamInser.sh
bash script/$SAMPLE\_BamInser.sh &
fi
done
grep . Bam_stats/*.insert.csv| sed 's=Bam_stats/==;s/.insert.csv:/ /' > Insert.csv

Check the number of lines from each file and remove incorrect files.

Visualization by ggplot2

Creat a directory, img, for storing the results.

mkdir img
library(ggplot2)
library(stringr)
library(reshape2)
library(pheatmap)
library(ggplotify)

TB <- read.table("SN_infor.csv", sep = '\t')
Norm_list <- TB$V2[TB$V1==TB$V1[[1]]][TB$V3[TB$V1==TB$V1[[1]]] >= 1000]

## Bar plot 1: sort by total reads

TB$V1 <- factor(TB$V1, levels=as.character(na.omit(TB$V1[TB$V2=='raw total sequences'][order(TB$V3[TB$V2=='raw total sequences'])])))
ggplot(TB, aes(V1,V3)) + geom_bar(stat = 'identity') +
facet_wrap(~V2, scales = 'free') + theme_bw()+theme(strip.background = element_rect(fill = 'white'), strip.text.x = element_text(size = 10))

ggsave("img/Stats_Sort_reads.png", w= 20, h= 10.9)

## Bar plot 2; sort by 'reads length'
TB$V1 <- factor(TB$V1, levels=as.character(na.omit(TB$V1[TB$V2=='raw total sequences'][order(TB$V3[TB$V2=='average length'])])))
ggplot(TB, aes(V1,V3)) + geom_bar(stat = 'identity') +
facet_wrap(~V2, scales = 'free') + theme_bw()
ggsave("img/Stats_Sort_AL.png", w= 20, h= 10.9)

TB_w <- reshape(TB, timevar = "V1", idvar = "V2", direction = 'wide')
row.names(TB_w) <- TB_w[[1]]
TB_w <- TB_w[-1]
colnames(TB_w) <- str_replace(colnames(TB_w), "V3.", "")


# Corrolation Plot
List <- row.names(TB_w)[!is.na(data.frame(scale(t(TB_w)))[1,])]
TB_ww <- TB_w[row.names(TB_w)%in% List,]
ggpairs(as.data.frame(t(TB_ww)))
ggsave("123.png", w=30, h = 30)

TB_cor <- as.data.frame(cor(t(TB_ww)))
corPLOT <- function(COL, CUT=0.75){
Cor_tmp = data.frame(
X = row.names(TB_cor)[order(abs(TB_cor[COL]))],
Y = TB_cor[COL][order(abs(TB_cor[COL])),]
)
Cor_tmp$X = factor(Cor_tmp$X, levels = Cor_tmp$X)
ggplot(Cor_tmp, aes(Y, X, fill = Y)) + geom_bar( stat = 'identity') +
geom_vline( xintercept = c(CUT, - CUT ), linetype = 2, color= 'grey') +
geom_text(aes(x=CUT, y=1, label = CUT), hjust = -.1) + theme_bw() +
ggtitle(COL) + theme(plot.title=element_text(hjust = .5))
ggsave(paste('img/Cor_', str_replace_all(COL, " ", "_"), ".png", sep =""), w= 7, h= 7)
}


TB_w[row.names(TB_w) %in% Norm_list,] <- TB_w[row.names(TB_w) %in% Norm_list,]/c(TB_w[1,])


TB_wS <- as.data.frame(t(scale(t(TB_w))))
TB_wS <- TB_wS[!is.na(TB_wS[[1]]),]
TB_w <- TB_w[!is.na(TB_wS[[1]]),]

TB_w[TB_w > 1] <- round(TB_w[TB_w > 1],2)
TB_w[TB_w < 1] <- paste(round(TB_w[TB_w < 1]*100,2), "%")
#TB_w[row.names(TB_w) %in% Norm_list,] <- round(TB_w[row.names(TB_w) %in% Norm_list,]*100,2)

P <- pheatmap(as.data.frame(scale(t(TB_wS))), display_numbers = t(TB_w[row.names(TB_w) %in% row.names(TB_wS),]), fontsize_col= 15)
as.ggplot(P)
ggsave("img/Stat_Pheatmap.png", w= 20, h = 10.9)

Pheatmap

Align Depth Plot

Depth statistics are based on samtools: samtools depth -r UAS:0-10000 sorted.bam

mkdir Depth
for i in $(ls Bam/*.bam);
do SAMPLE=$(echo $i| sed 's=Bam/==;s/.sorted.bam//');
if [ -f Depth/$SAMPLE.csv ]; then
echo $SAMPLE is done
else
sed "s/Hi/$SAMPLE/" Model.sh > script/Depth_$SAMPLE.sh
echo samtools depth -r UAS:0-10000 $i \> Depth/$SAMPLE.csv >> script/Depth_$SAMPLE.sh
sbatch script/Depth_$SAMPLE.sh
fi

done

rm *.err *.out
grep . Depth/*.csv| sed 's/.csv//;s/:/\t/;s=Depth/==' > Depth.csv
library(ggplot2)
library(stringr)
library(reshape2)

TB <- read.table("Depth.csv")
TB$V1 <- as.character(TB$V1)


TB2 <- TB[TB$V3 %in% c(5300:5600),]

TB_w <- reshape(TB2[c(-2)], idvar = "V1", timevar = "V3", direction = 'wide')
TB_w[is.na(TB_w)] <- 0
row.names(TB_w)= TB_w[[1]]
TB_w = TB_w[-1]
#HCLUST = hclust(dist(TB_w))

TB$V1 <- factor(TB$V1, levels= row.names(TB_w)[order(TB_w$V4.5400)])

ggplot(TB, aes(V3, V4))+ geom_point() + facet_wrap(~V1) +
xlim(5300,5500) + ylim(0, 100)+theme_bw() +
theme(strip.text = element_text(size = 20), strip.background = element_rect(fill = 'white'))

ggsave('img/Depth_UAS.png', w= 20,, h = 10.9)

Model.sh

#!/bin/sh
#SBATCH --qos=normal # Quality of Service
#SBATCH --job-name=Hi # Job Name
#SBATCH --time=1-0:00:00 # WallTime
#SBATCH --mem=256000
#SBATCH --mail-user=123@qq.com
#SBATCH --nodes=1 # Number of Nodes
#SBATCH --ntasks-per-node=1 # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1 # Number of threads per task (OMP threads)
#SBATCH --output=Log/Hi.out ### File in which to store job output
#SBATCH --error=Log/Hi.err ### File in which to store job error messages

DNA Seq Pipeline and General Infor statistics

https://karobben.github.io/2022/08/26/Bioinfor/DNA-seq/

Author

Karobben

Posted on

2022-08-26

Updated on

2024-01-11

Licensed under

Comments