esm, Evolutionary Scale Modeling
Basic Use
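A minimal sketch of the basic workflow, following the pattern from the esm README (the checkpoint `esm2_t33_650M_UR50D` and the example sequences are arbitrary choices; any ESM-2 model works the same way):

```python
import torch
import esm

# Load a pretrained ESM-2 model and its alphabet (tokenizer)
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disable dropout for deterministic outputs

# Each entry is a (label, sequence) pair; the sequences here are arbitrary examples
data = [
    ("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
    ("protein2", "KALTARQQEVFDLIRD"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Forward pass; request the final-layer representations (layer 33 for this checkpoint)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=False)
token_representations = results["representations"][33]  # (batch, longest_len + 2, embed_dim)
```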
Get the Embedding for Each Residue
ESM (like many transformer-based models) uses “special tokens” plus padding so that all sequences in a batch have the same length. Specifically:
- Start and End Tokens: For any single sequence of length $n$, the ESM model prepends a start token and appends an end token. That gives you $n + 2$ positions for a single sequence.
- Batch Processing Requires Padding: When you process multiple sequences in a single batch, they all get padded (on the right) to match the length of the longest sequence in the batch. So if the longest sequence has $n$ residues, all sequences become length $n + 2$ (including the special tokens), and shorter sequences get padding tokens to fill in the gap.
Hence, whether a sequence originally has $k$ or $m$ residues, in a batch whose longest sequence is $n$ residues, every sequence ends up tokenized to length $n + 2$. This ensures the entire input tensor for the batch has a uniform shape.
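A quick way to see this, reusing `batch_converter`, `alphabet`, and `data` from the sketch above:

```python
# Tokenize two sequences of different lengths in a single batch
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# The token tensor is padded to (batch_size, n + 2), where n is the longest sequence length
print(batch_tokens.shape)

# Counting non-padding tokens recovers each sequence's true length plus the two special tokens
batch_lens = (batch_tokens != alphabet.padding_idx).sum(1)
print(batch_lens)  # tensor([len(seq_1) + 2, len(seq_2) + 2])
```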
Here is an example of extracting the embeddings by following the code above:
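A sketch of that extraction, continuing from the variables defined above (the slicing follows the pattern used in the esm README: position 0 is the start token, so residue $i$ sits at position $i + 1$):

```python
# token_representations: (batch, longest_len + 2, embed_dim)
per_residue_embeddings = []  # one (seq_len, embed_dim) tensor per sequence
sequence_embeddings = []     # one (embed_dim,) tensor per sequence (mean over residues)

for i, (_, seq) in enumerate(data):
    # Positions 1 .. len(seq) hold the actual residues; position 0 is the start token,
    # len(seq) + 1 is the end token, and anything beyond that is padding.
    residue_repr = token_representations[i, 1 : len(seq) + 1]
    per_residue_embeddings.append(residue_repr)
    sequence_embeddings.append(residue_repr.mean(dim=0))
```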
The embedding results from batched and single-sequence inference are generally the same, but slightly different. If you embed the sequences one by one and compute the difference against the batched results (see the sketch after this list), you'll find small discrepancies. According to ChatGPT, this could be caused by:
- Position Embeddings
  - ESM (like most Transformer models) uses positional embeddings. If the model sees a “longer” padded batch, the position indices for each token can differ from the single-sequence scenario, so the sequence’s tokens may be mapped to slightly different (learned) position embeddings.
- Attention Masking and Context
  - In a batched setting, the model creates a larger attention mask (covering all tokens up to the longest sequence in the batch). Although it’s not supposed to mix information across sequences, the internal computations (e.g., how attention is batched or chunked) can differ from the single-sequence forward pass, leading to small numeric discrepancies.
- Dropout or Other Stochastic Layers
  - If your model isn’t in `eval()` mode (or if dropout is enabled for any reason), you’ll get random differences on each pass. Always ensure `model.eval()` and (ideally) a fixed random seed for more reproducible outputs.
- Floating-Point Rounding
  - GPU parallelization can cause minor floating-point differences, especially between batched and single-inference calls. These are typically very small numerical deviations.
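If you want to quantify the discrepancy yourself, here is a small sketch, reusing `model`, `batch_converter`, `data`, and `token_representations` from the snippets above, with the model already in `eval()` mode:

```python
# Embed the first sequence on its own
_, _, single_tokens = batch_converter([data[0]])
with torch.no_grad():
    single_repr = model(single_tokens, repr_layers=[33])["representations"][33]

# Compare its residues against the same sequence taken from the batched forward pass
seq_len = len(data[0][1])
batched_residues = token_representations[0, 1 : seq_len + 1]
single_residues = single_repr[0, 1 : seq_len + 1]

max_abs_diff = (batched_residues - single_residues).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")  # typically a tiny floating-point-level deviation
```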