HDF5 Data Format Introduction

HDF5 (Hierarchical Data Format version 5) is a file format designed for efficiently storing and organizing large, complex datasets. It uses a hierarchical structure of **groups** (like directories) and **datasets** (like files) to store data, supporting multidimensional arrays, metadata, and a wide variety of data types. Key advantages include **compression**, **cross-platform compatibility**, and the ability to handle large datasets that don’t fit in memory. It’s widely used in fields like scientific computing, machine learning, and bioinformatics due to its efficiency and flexibility.
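
As a minimal sketch of the group/dataset hierarchy described above, using the h5py library (the file name, group, and dataset names are illustrative):

```python
import h5py
import numpy as np

# Create a file with a group (like a directory), a compressed
# dataset (like a file), and metadata stored as attributes
with h5py.File("example.h5", "w") as f:
    grp = f.create_group("experiment_1")
    dset = grp.create_dataset("signals",
                              data=np.random.rand(1000, 256),
                              compression="gzip")
    dset.attrs["units"] = "mV"

# Read it back; slicing reads only the requested part from disk,
# which is how HDF5 handles datasets larger than memory
with h5py.File("example.h5", "r") as f:
    head = f["experiment_1/signals"][:10]
    print(f["experiment_1/signals"].attrs["units"], head.shape)
```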
Read more

NCBI Data Submit with FTP/ASCP

ASCP (Aspera Secure Copy Protocol) is a fast, reliable protocol for transferring large files, particularly over long distances or under network latency and packet loss. It uses a technology called FASP (Fast, Adaptive, and Secure Protocol) to maximize available bandwidth, making transfers faster than traditional methods like FTP.
For uploading data to NCBI, ASCP is particularly useful because it efficiently handles large datasets, such as genomic sequences or omics data. Its ability to resume interrupted transfers ensures that if a connection fails during an upload, the transfer continues from where it left off, saving time and bandwidth. ASCP also provides strong encryption, ensuring data security during the upload process.
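
As an illustrative sketch (not from the original post), an upload can be scripted from Python with subprocess. The key file, local directory, and uploads subfolder below are placeholders that NCBI assigns per submission account; check NCBI's current submission instructions for the exact command:

```python
import subprocess

# Flags as commonly shown in NCBI's posted ascp command:
#   -i   Aspera key file supplied by NCBI
#   -QT  fair-share transfer policy, encryption disabled in transit
#   -l   cap the transfer rate (here 100 Mbps)
#   -k1  resume interrupted transfers instead of restarting
#   -d   create the destination directory if needed
cmd = [
    "ascp", "-i", "/path/to/aspera.openssh",
    "-QT", "-l100m", "-k1", "-d",
    "/path/to/local_data_dir",
    "subasp@upload.ncbi.nlm.nih.gov:uploads/your_assigned_folder/",
]
subprocess.run(cmd, check=True)  # raises if the transfer fails
```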
Read more

Softmax

Softmax is a mathematical function commonly used in machine learning, particularly in the context of classification problems. It transforms a vector of raw scores, often called logits, from a model into a vector of probabilities that sum to one. The probabilities generated by the softmax function represent the likelihood of each class being the correct classification. $$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$$
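
A minimal NumPy implementation of the formula above (not from the original post); subtracting the maximum logit leaves the result unchanged but avoids overflow in the exponential:

```python
import numpy as np

def softmax(z):
    # exp(z - max(z)) equals exp(z) up to a common factor,
    # which cancels in the normalization
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # ~[0.659 0.242 0.099]
print(probs.sum())  # 1.0
```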
Read more

Support Vector Machine

Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression. It finds the hyperplane that separates the data into different classes with the largest possible margin. SVM works well with high-dimensional data and can use different kernel functions to transform data for better separation when it is not linearly separable. $$f(x) = \operatorname{sign}(\mathbf{w}^T x + b)$$
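
A minimal sketch using scikit-learn's SVC on a synthetic dataset that is not linearly separable (the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# The RBF kernel implicitly maps points into a higher-dimensional
# space where a separating hyperplane exists
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.score(X, y))               # training accuracy
print(len(clf.support_vectors_))     # points defining the margin
```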
Read more

Random Forest

Random Forest is an ensemble machine learning algorithm that builds multiple decision trees during training and merges their outputs to improve accuracy and reduce overfitting. It is commonly used for both classification and regression tasks. By averaging the predictions of several decision trees, Random Forest reduces the variance and increases model robustness, making it less prone to errors from noisy data. $$\text{Entropy}_{\text{after}} = \frac{|S_l|}{|S|}\text{Entropy}(S_l) + \frac{|S_r|}{|S|}\text{Entropy}(S_r)$$
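
A minimal scikit-learn sketch (dataset and hyperparameters are illustrative); `criterion="entropy"` makes each tree choose splits by the weighted-entropy criterion above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each tree is trained on a bootstrap sample with random feature
# subsets; the forest predicts by majority vote over the trees
rf = RandomForestClassifier(n_estimators=100, criterion="entropy",
                            random_state=0).fit(X_tr, y_tr)
print(rf.score(X_te, y_te))  # held-out accuracy
```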
Read more

DLVO theory: Atom Interaction

DLVO theory is named after Derjaguin, Landau, Verwey, and Overbeek, who developed it in the 1940s. It describes the forces between charged surfaces interacting through a liquid medium. The theory combines two main types of forces: attractive van der Waals forces and repulsive electrostatic double-layer forces.
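
In schematic form, the net interaction energy at surface separation $D$ is the sum of the two contributions (the exact expressions depend on the geometry and electrolyte conditions): $$V_{\text{total}}(D) = V_{\text{vdW}}(D) + V_{\text{EDL}}(D)$$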
Read more

Kernel Density Estimation (KDE)

Kernel Density Estimation (KDE) is a non-parametric method to estimate the probability density function (PDF) of a random variable based on a finite set of data points. Unlike parametric methods, which assume that the underlying data follows a specific distribution (like normal, exponential, etc.), KDE makes no such assumptions and can model more complex data distributions. $$ \hat{f}(x) = \frac{1}{n \cdot h} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) $$
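
A direct NumPy translation of the estimator above with a Gaussian kernel (a minimal sketch, not from the original post; the bandwidth h is chosen arbitrarily here):

```python
import numpy as np

def kde(x_grid, samples, h):
    # f_hat(x) = 1/(n*h) * sum_i K((x - x_i)/h), K = standard normal pdf
    u = (x_grid[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 200),
                          rng.normal(1, 1.0, 300)])   # bimodal sample
x = np.linspace(-5, 5, 201)
density = kde(x, samples, h=0.3)
print(density.sum() * (x[1] - x[0]))  # ~1.0: integrates like a PDF
```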
Read more

Understanding the Taylor Series and Its Applications in Machine Learning

The Taylor Series is a mathematical tool that approximates complex functions with polynomials, playing a crucial role in machine learning optimization. It enhances gradient descent by incorporating second-order information, leading to faster and more stable convergence. Additionally, it aids in linearizing non-linear models and informs regularization techniques. This post explores the significance of the Taylor Series in improving model training efficiency and understanding model behavior. $$\cos(x) = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n)!} x^{2n}$$
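
As a quick numeric check (a minimal sketch, not from the original post), truncating the series above after a few terms already matches `math.cos` closely near x = 1:

```python
import math

def cos_taylor(x, n_terms):
    # Partial sum of sum_{n=0}^{inf} (-1)^n x^(2n) / (2n)!
    return sum((-1)**n * x**(2 * n) / math.factorial(2 * n)
               for n in range(n_terms))

x = 1.0
for n_terms in (2, 4, 6):
    approx = cos_taylor(x, n_terms)
    print(n_terms, approx, abs(approx - math.cos(x)))  # error shrinks fast
```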
Read more

FoldX

The FoldX Suite builds on the strong foundation of advanced protein design features already implemented in the successful FoldX3, and exploits the power of fragment libraries by integrating in silico digested backbone protein fragments of different lengths. This fragment-based strategy enables new, powerful capabilities: loop reconstruction, implemented in LoopX, and peptide docking, implemented in PepX. The Suite also features improved usability thanks to a new Boost-based command-line interface.
Read more

Juicer: a One-Click System for Analyzing Loop-Resolution Hi-C Experiments

Hi-C experiments explore the 3D structure of the genome, generating terabases of data to create high-resolution contact maps. Here, we introduce Juicer, an open-source tool for analyzing terabase-scale Hi-C datasets. Juicer allows users without a computational background to transform raw sequence data into normalized contact maps with one click. Juicer produces a .hic file containing compressed contact matrices at many resolutions, facilitating visualization and analysis at multiple scales. Structural features, such as loops and domains, are automatically annotated.
Read more

NextDenovo: an efficient error correction and accurate assembly tool for noisy long reads

NextDenovo is a string graph-based de novo assembler for long reads (CLR, HiFi and ONT). It uses a "correct-then-assemble" strategy similar to canu (there is no correction step for PacBio HiFi reads), but requires significantly fewer computing resources and less storage. After assembly, the per-base accuracy is about 98–99.8%; to further improve single-base accuracy, try NextPolish.
Read more

IgCaller

IgCaller is a Python program designed to fully characterize immunoglobulin gene rearrangements and oncogenic translocations in lymphoid neoplasms. It was originally developed for WGS data but has been extended to work with WES and high-coverage, capture-based NGS data.
Read more