SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation

Install

GitHub: shenwei356/seqkit

wget https://github.com/shenwei356/seqkit/releases/download/v2.7.0/seqkit_linux_amd64.tar.gz
tar -zxvf seqkit_linux_amd64.tar.gz
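
After unpacking, the archive should leave a single seqkit binary in the current directory. A minimal sketch for putting it on PATH and checking that it runs; the install location ~/.local/bin is an assumption, not something the release requires:

```shell
# Assumption: ~/.local/bin is a reasonable per-user install location.
install_dir="${HOME}/.local/bin"
mkdir -p "$install_dir"

# Move the unpacked binary only if it is present in the current directory.
if [ -f ./seqkit ]; then
    mv ./seqkit "$install_dir/"
fi

# Make the directory visible to this shell session.
export PATH="${install_dir}:${PATH}"

# Confirm the binary is reachable; `seqkit version` prints the installed version.
if command -v seqkit >/dev/null 2>&1; then
    seqkit version
else
    echo "seqkit not found on PATH" >&2
fi
```

To keep the change across sessions, the export line would need to go into the shell startup file (for example ~/.bashrc).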

Convert FASTQ to FASTA

seqkit fq2fa output_directory/output_prefix.extendedFrags.fastq -o output_directory/output_prefix.merged.fasta
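
A quick sanity check after the conversion is to confirm that no records were lost. The sketch below does this with plain shell tools on two tiny made-up files; the helper names count_fastq_records and count_fasta_records are hypothetical, and the FASTQ helper assumes an uncompressed file with exactly four lines per record:

```shell
# Count records in an uncompressed FASTQ file (assumes exactly 4 lines per record).
count_fastq_records() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Count records in a FASTA file by counting header lines.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Tiny made-up files standing in for the real input and output.
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > "$demo_dir/demo.fastq"
printf '>r1\nACGT\n>r2\nGGCC\n' > "$demo_dir/demo.fasta"

n_fq=$(count_fastq_records "$demo_dir/demo.fastq")
n_fa=$(count_fasta_records "$demo_dir/demo.fasta")
if [ "$n_fq" -eq "$n_fa" ]; then
    echo "record counts match: $n_fq"
else
    echo "record counts differ: FASTQ=$n_fq FASTA=$n_fa" >&2
fi
```

On real data, point the two helpers at the .extendedFrags.fastq input and the .merged.fasta output instead of the demo files.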

Remove Duplicated Sequences

seqkit rmdup -s sequences.fasta -o unique_sequences.fasta -D counts.tsv
  • -s: Identify duplicates by sequence content rather than by sequence ID (the default).
  • sequences.fasta: The input FASTA or FASTQ file. Replace it with the path to your own file.
  • -o unique_sequences.fasta: The output file that receives the sequences remaining after duplicate removal.
  • -D counts.tsv: Write the copy number and the list of IDs for each duplicated sequence to this file. To save the removed sequences themselves, use -d with a separate file.
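
The -D file can be summarized with standard tools. The sketch below assumes the layout shown in the SeqKit documentation (one line per duplicated sequence: copy count, a tab, then comma-separated IDs) and assumes the count includes the copy that is kept. The file content is made up for illustration:

```shell
# Made-up stand-in for counts.tsv: <count><TAB><comma-separated IDs>.
demo_dir=$(mktemp -d)
printf '2\tseqA, seqB\n5\tseqC, seqD, seqE, seqF, seqG\n3\tseqH, seqI, seqJ\n' > "$demo_dir/counts.tsv"

# Largest duplicate groups first.
top_groups=$(sort -t "$(printf '\t')" -k1,1nr "$demo_dir/counts.tsv" | head -n 2)
echo "$top_groups"

# Sequences removed in total: each group of n copies loses n - 1 of them
# (assumption: the count includes the retained copy).
removed=$(awk -F'\t' '{ total += $1 - 1 } END { print total + 0 }' "$demo_dir/counts.tsv")
echo "removed duplicates: $removed"
```

If the assumption about the count holds, the removed total should equal the difference in num_seqs reported by seqkit stats for the input and output files.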

Sequence Statistics

seqkit stats your_R1.fastq.gz
file                 format  type   num_seqs      sum_len  min_len  avg_len  max_len
your_R1.fastq.gz     FASTQ   DNA    15,800,000  2,370,000,000     150      150    150
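# Count reads per length; output columns are count and length (illustrative values)
seqkit fx2tab -n -l -i your_R1.fastq.gz | awk -F'\t' '{print $NF}' | sort -n | uniq -c
2000 149
5000 150
8000 151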
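
The length distribution can also be cross-checked without SeqKit. A minimal sketch using awk on a tiny made-up file; the helper name length_histogram is hypothetical, and it assumes an uncompressed FASTQ with four lines per record:

```shell
# Print "<count> <length>" for each read length in an uncompressed FASTQ
# file with 4 lines per record, sorted by length.
length_histogram() {
    awk 'NR % 4 == 2 { n[length($0)]++ } END { for (len in n) print n[len], len }' "$1" |
        sort -k2,2n
}

# Tiny made-up file: two reads of length 4 and one of length 3.
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIII\n@r3\nTTGA\n+\nIIII\n' > "$demo_dir/demo.fastq"

hist=$(length_histogram "$demo_dir/demo.fastq")
echo "$hist"
```

For a gzipped file, decompress it first (for example with gzip -dc into a temporary file) before passing it to the helper.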

Extract R1 reads of length 28 for an scRNA-seq library

seqkit seq -m 28 -M 28 input_R1.fastq.gz -o R1_len28.fastq.gz
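# -n -i prints only the read ID (the header text before the first space), which is what seqkit grep matches by default
seqkit seq -m 28 -M 28 -n -i input_R1.fastq.gz > readnames_len28.txt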
seqkit grep -f readnames_len28.txt input_R2.fastq.gz | gzip > R2_len28.fastq.gz
seqkit grep -f readnames_len28.txt input_I1.fastq.gz | gzip > I1_len28.fastq.gz
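
Downstream scRNA-seq tools generally expect R1, R2, and I1 to list the same reads in the same order, so it is worth checking the filtered files before using them. A minimal sketch with plain shell tools on made-up demo files; the helper names fastq_ids and same_read_ids are hypothetical, and the check assumes gzipped FASTQ with four lines per record:

```shell
# Print read IDs (header text before the first whitespace, without '@')
# from a gzipped FASTQ file with 4 lines per record.
fastq_ids() {
    gzip -dc "$1" | awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }'
}

# Succeed only if both files list the same IDs in the same order.
same_read_ids() {
    ids_a=$(mktemp)
    ids_b=$(mktemp)
    fastq_ids "$1" > "$ids_a"
    fastq_ids "$2" > "$ids_b"
    status=0
    cmp -s "$ids_a" "$ids_b" || status=1
    rm -f "$ids_a" "$ids_b"
    return "$status"
}

# Made-up demo files standing in for R1_len28.fastq.gz and R2_len28.fastq.gz.
demo_dir=$(mktemp -d)
printf '@r1 1:N:0:ACGT\nACGT\n+\nIIII\n@r3 1:N:0:ACGT\nTTGG\n+\nIIII\n' | gzip > "$demo_dir/R1.fastq.gz"
printf '@r1 2:N:0:ACGT\nCCAA\n+\nIIII\n@r3 2:N:0:ACGT\nGGTT\n+\nIIII\n' | gzip > "$demo_dir/R2.fastq.gz"

if same_read_ids "$demo_dir/R1.fastq.gz" "$demo_dir/R2.fastq.gz"; then
    echo "R1 and R2 are in sync"
else
    echo "R1 and R2 differ" >&2
fi
```

On real data, the same call would compare R1_len28.fastq.gz against R2_len28.fastq.gz and I1_len28.fastq.gz in turn.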

https://karobben.github.io/2024/02/05/Bioinfor/seqkit/

Author

Karobben

Posted on

2024-02-05

Updated on

2025-11-12
