How we handled pod kills (CrashLoopBackOff) due to memory spikes while running heavy scripts in K8s pods

ShahNilay · Oct 26, 2024

🤔 Problem Statement

We use Kubernetes (K8s) for seamless service deployments. For each deployment, we define one-time deployment scripts that accommodate the new code changes: migrations and revert migrations (to handle backward compatibility/rollbacks).

Migrations are used to perform database schema changes and handle column updates according to new code changes.

Here is where the problem appears: when we run migrations, we see sudden memory spikes in the K8s pods because the amount of data being updated is huge. Below is some information that will give you a detailed picture of the problem.

⚙️ K8s pod configuration:

MinPodCount: 2
MaxPodCount: 5

Pod auto-scaling and downscaling are already handled through K8s configuration.

💥 Code snippet:

from pymongo import UpdateOne


def migrate():
    data = Model.objects.filter(**{"some filter logic here": "some filter value here"})

    updated_data = []
    for instance in data:
        # some business logic to update the field based on its existing value
        # do something
        updated_data.append(
            UpdateOne(
                {"_id": instance.id},
                {"$set": {"updating field": "updating value"}},
            )
        )

    if updated_data:
        Model._get_collection().bulk_write(
            updated_data, ordered=False
        )


if __name__ == "__main__":
    migrate()

💭 Code explanation:

It’s a simple Python script that fetches data from the database using the ORM. The data is iterated with a Python for loop, and on each iteration the new value is calculated based on business logic. After the loop, the accumulated updates are pushed back to the database in a single bulk write.

💣 Problem:

The issue appeared whenever we ran a migration: we could see big spikes in memory consumption, because the entire result set was fetched into memory for the Python for loop, processed in memory, and only then pushed back to the database.

Also, about K8s pod configuration:

Request: The minimum amount of resources guaranteed to a pod when it is assigned to a service.

Limit: The maximum amount of resources a pod is allowed to consume. Once consumption crosses the limit, the auto-scaling logic comes into the picture (which we are not going to discuss in this article).

The very interesting part is that when a new pod is assigned to a service, K8s does not guarantee that the full {limit} amount of resources will actually be available. The pod is always guaranteed the minimum {request} amount of resources, but there is no guarantee it can grow all the way to the upper bound defined by the {limit} parameter.

So even though we defined a generous limit (in our case we extended it to 10 GiB), we did not get the expected result and the pod kept getting killed. The image below shows the memory-consumption chart of a pod executing the migration script over a period of time.

🆘 Error message:

Task {task_name} with id {request_id} raised exception: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL) Job: 4.')
Task was called with args: [***] kwargs: {***}. The contents of the full traceback was:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
    raise WorkerLostError(
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL) Job: 4.

🛠️ Solution:

A. Memory Profiling: We introduced memory profiling and found that the accumulation inside the for loop was consuming a large amount of memory in Python.

from memory_profiler import profile

@profile
def migrate(): ...
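
To get a single peak-memory number without reading the whole line-by-line report, memory_profiler's memory_usage helper can wrap the call. A minimal sketch (the 0.1-second sampling interval is just an illustrative choice):

from memory_profiler import memory_usage

# Sample the process memory every 0.1 s while migrate() runs and report the peak in MiB
peak = max(memory_usage((migrate, (), {}), interval=0.1))
print(f"Peak memory while running migrate(): {peak:.1f} MiB")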

B. Selective Field Retrieval & Utilize values_list for Even Lighter Data Fetching: Use the .values_list() or .only() operators to fetch only the necessary fields from the database.

model_qs = Model.objects.filter(
    **{"some filter logic here": "some filter value here"}
).values_list(*["fetch list of required field only"])
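
For comparison, a .only() variant could look like the sketch below; it still returns model instances, but only the listed fields are loaded per document (the field names are hypothetical placeholders, assuming a MongoEngine/Django-style ORM):

model_qs = Model.objects.filter(
    **{"some filter logic here": "some filter value here"}
).only("id", "field_to_update")  # hypothetical field names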

C. Use Generators and Iterator Protocols Extensively: As we identified, the for loop was the evil element here, so we defined generators in Python.

def get_instances():
    for instance in iter(model_qs):
        yield instance
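
Consuming the generator keeps only one instance in memory at a time instead of materializing the whole queryset into a list. A small usage sketch (process() is a placeholder for the business logic):

for instance in get_instances():
    process(instance)  # placeholder: handle one instance, then let it go out of scope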

D. Batch Processing & Optimal Chunk Size Determination.

def chunked(iterable, size):
    chunk = []
    for item in iterable:
        # some business logic to update the field based on its existing value
        # do something
        chunk.append(item)
        if len(chunk) >= size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
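
For example, feeding a small range through chunked with a batch size of 4 yields two full batches and one smaller remainder batch:

for batch in chunked(range(10), 4):
    print(batch)
# [0, 1, 2, 3]
# [4, 5, 6, 7]
# [8, 9]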

E. Invoke the garbage collector explicitly after each processed batch.

Final Code Changes:

import gc

import bson

from memory_profiler import profile
from pymongo import UpdateOne

from commons.utils.loggers import app_logger


def get_instances():
    # Yield tuples with only the necessary fields
    model_qs = Model.objects.filter(
        **{"some filter logic here": "some filter value here"}
    ).values_list(*["fetch list of required field only"])

    for instance in iter(model_qs):
        yield instance


def chunked(iterable, size):
    """Create batches to perform batch updates in the db."""
    chunk = []
    for item in iterable:
        # some business logic to update the field based on its existing value
        # do something
        chunk.append(item)
        if len(chunk) >= size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


@profile
def migrate():
    count = 0
    CHUNK_SIZE = 5000  # Adjust based on profiling

    # use generator & chunks to iterate over the large dataset in an efficient manner
    for chunk in chunked(get_instances(), CHUNK_SIZE):
        count += len(chunk)
        update_operations = [
            UpdateOne(
                {"_id": bson.ObjectId(instance_id)},
                {"$set": {"updating field": updating_value}},
            )
            for instance_id, updating_value in chunk
        ]
        Model._get_collection().bulk_write(update_operations)
        del update_operations  # Free memory
        gc.collect()
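
The entry point stays the same as in the original script, e.g.:

if __name__ == "__main__":
    migrate()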

Further improvements:

  1. Implement Asynchronous Processing (up to a certain extent)
  2. Optimize Garbage Collection Calls (call the gc.collect() method only if memory consumption exceeds a certain level; see the sketch below)
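
A minimal sketch of the second idea, assuming psutil is available to read the process RSS (the 1 GiB threshold is purely illustrative):

import gc

import psutil

RSS_THRESHOLD_BYTES = 1 * 1024 ** 3  # hypothetical threshold: 1 GiB


def collect_if_needed():
    # Only pay the cost of a full collection when the process has actually grown large
    if psutil.Process().memory_info().rss > RSS_THRESHOLD_BYTES:
        gc.collect()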

Note: Size of the data being updated: ~50k entries

📈 Initial Result:

Line #     Mem usage      Increment   Occurrences   Line Contents
=================================================================
    10     185.6 MiB      185.6 MiB             1   @profile
    11                                               def upgrade(pymongo_db) -> None:
    17     359.0 MiB        0.0 MiB             6       data = Model.objects.filter(
    18     359.0 MiB        0.0 MiB             3           **{"some filter logic here": "some filter value here"}
    19                                                   )
    20
    21     359.0 MiB        0.0 MiB             3       update_operations = []
    22     390.0 MiB   -86425.2 MiB          6121       for instance in data:
    31     390.0 MiB  -173168.7 MiB         12236           update_operations.append(
    32     390.0 MiB  -173168.0 MiB         12236               UpdateOne(
    33     390.0 MiB   -86584.4 MiB          6118                   {"_id": instance.id},
    34     390.0 MiB   -86583.7 MiB          6118                   {"$set": {"updating field": updating_val}},
    35                                                           )
    36                                                       )
    37
    38     359.0 MiB      -72.9 MiB             3       if update_operations:
    39     359.0 MiB      -20.3 MiB             6           Model._get_collection().bulk_write(
    40     359.0 MiB      -10.8 MiB             3               update_operations, ordered=False
    41                                                       )

📊 Final Results:

Line #     Mem usage      Increment   Occurrences   Line Contents
=================================================================
    35     186.5 MiB      186.5 MiB             1   @profile
    36                                               def upgrade(pymongo_db) -> None:
    42     189.6 MiB       -0.7 MiB             3       count = 0
    43     189.6 MiB       -0.7 MiB             3       CHUNK_SIZE = 500  # Adjust based on profiling
    44
    45     190.1 MiB       -3.7 MiB            17       for chunk in chunked(get_instances(), CHUNK_SIZE):
    46     190.1 MiB    -2376.7 MiB          6132           for instance_id, updating_value in chunk:
    48     190.1 MiB    -2371.3 MiB          6116               count += 1
    49     190.1 MiB    -2371.3 MiB          6116               update_operations = [
    50     190.1 MiB    -2371.3 MiB          6116                   UpdateOne({'_id': bson.ObjectId(instance_id)}, {'$set': {'updating field': updating_value}})
    51                                                           ]
    52     190.1 MiB       -5.5 MiB            14           bulk_write_operations(update_operations)
    53     190.1 MiB       -5.5 MiB            14           del update_operations  # Free memory
    54     190.1 MiB       -6.7 MiB            14           gc.collect()

As we can see from the charts and the profiler output, memory consumption decreased dramatically while executing the script. Yayy!!

😃 Conclusion:

By implementing the above strategies, the modified script can achieve significantly lower memory consumption.

The key principles involve:

  • Selective Data Loading: Fetch only what is necessary.
  • Efficient Data Processing: Use generators and chunked processing to handle data in manageable portions.
  • Minimal Memory Footprint: Avoid unnecessary data structures and manage garbage collection judiciously.
  • Continuous Profiling: Regularly monitor memory usage to identify and address new bottlenecks.

These optimizations not only reduce memory usage but can also lead to performance improvements, making the script more robust and scalable for large datasets.
