One of the main advantages of using Python as a programming language is the wide variety of code packages that you can download and add to your codebase. As of July 2021, the package repository PyPI hosts just over 300,000 packages. While these packages are in various states of complexity, documentation, and quality, you can grab packages like scikit-learn or pandas for machine learning and data science, follow their documentation, and begin doing analysis right away. Along with the amazing advances in data science over the past several years, another big technological advance has been the massive improvement in cloud computing. On large platforms like Amazon Web Services (AWS), computing has moved far beyond essentially providing a virtual computer. You can now do things like create new databases, websites, and yes, even deploy machine learning models with a few button presses.
Even with all these services that you can create, sometimes you still need or want to write some code. For example, if a new entry is put into a database by some other service, I might want to add a line to a page on a website. To run this code in, say, Python, you'd need a computer to run it on, so you'd probably have a virtual instance already running that could handle the request. If this job is an infrequent occurrence, you now have an AWS option in the realm of what is called serverless computing: AWS Lambda. What Lambda, or any other serverless computing option, gives you is a fast way to run your code without having to manage a computer, allocate resources, or deal with any other machine configuration options. And the best part is that you are only charged by the millisecond of computing time you use to run your code, along with how much memory you allocate for the code to run. That's it. With regard to running Python code, Lambda functions (not to be confused with standard Python lambda expressions) come allocated with their own Python runtime, which includes many standard modules. AWS also includes in this runtime a module called boto3, which allows you to interact with many other AWS services via APIs. You can use the boto3 APIs (or any of the other included modules) the same way as any other Python package: import boto3.
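As a quick illustration, a minimal sketch of a Lambda handler using boto3 might look like the following (the S3 bucket listing is purely illustrative, and it assumes the function's execution role allows s3:ListAllMyBuckets):

import boto3

def lambda_handler(event, context):
    # boto3 comes preinstalled in the Lambda Python runtime, so no extra packaging is needed
    s3 = boto3.client('s3')
    # List the S3 buckets in the account and return their names
    response = s3.list_buckets()
    bucket_names = [bucket['Name'] for bucket in response['Buckets']]
    return {'StatusCode': 200, 'Payload': bucket_names}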
Quick terminology note: I interpret this naming convention as a "module" being a set of code file(s) that are imported into Python using a single import statement like import boto3. A "package" (see next paragraph) would then be a set of multiple modules, where you can import one or more modules from a package via a statement like from sklearn import svm. This article does a good job of explaining the terminology and delineation better than I can, but just know that both modules and packages contain external Python code that can be imported and used in your code. You'll likely see the terms used interchangeably in the wild, including in this article.
With the ability to import and run the same standard modules on AWS Lambda that we would on our own local machine, we might also like to use some of the other packages on PyPI, like pandas and scikit-learn, with our Lambda code. So, how do we get packages from the online PyPI repository into our AWS Lambda function? One of the main ways to add extra code from an outside source to a Lambda function is by uploading a deployment package zip file. You can create the code you want to run with the module you want to import and name it something like RunLambda.py. If this function uses the scikit-learn module, you can go to its download page and download the wheel file (.whl) associated with the runtime and module version you want to use. You can then unzip this wheel file just like any other zip file with your file unzip utility. Once you have the unzipped module contents, you can follow along with these directions to create your Lambda function along with the additional module code, all zipped together in a .zip file. You now have the capability to run scikit-learn functions inside your Lambda code.
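As a rough sketch of that packaging step (the file and folder names here are hypothetical, and it assumes you have already extracted the wheel contents into a local folder):

import zipfile
from pathlib import Path

# Hypothetical locations: your handler script and the folder holding the extracted wheel contents
handler_file = Path("RunLambda.py")
extracted_wheel_dir = Path("sklearn_wheel_contents")

with zipfile.ZipFile("deployment_package.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    # The handler goes at the root of the zip so Lambda can find it
    zf.write(handler_file, handler_file.name)
    # Add the extracted package contents alongside it, preserving relative paths
    for path in extracted_wheel_dir.rglob("*"):
        if path.is_file():
            zf.write(path, str(path.relative_to(extracted_wheel_dir)))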
While adding Python module code to your function gives you additional capabilities, there are a few drawbacks to this method. First, when you add your Python code as a .py file in a zip file, you lose the ability to edit your Python code on the fly in the Lambda console GUI, making debugging more difficult, since every time you want to run the code again, you have to re-zip your .py file with the other module folders and re-upload the zip file. This can be cumbersome when making the final tweaks to your code. Fortunately, AWS introduced Lambda Layers as a way to add extra code to your function. With these, it is possible to load the scikit-learn code as a zip file to a standalone layer. We can then manage our Python code in a Lambda function and add the scikit-learn layer to this function, easily done in the add layer drop-down box in the Lambda console GUI. I then wanted to use these layers as a way to store and manage Python modules, such that if I needed a specific module's code, I could just add the layer and use its code via the import statement. It turns out that AWS has already implemented this idea and preloaded the popular NumPy and SciPy numeric and scientific packages together in a layer. This layer should be available in the Lambda GUI in the add layer drop-down, named something like "AWSLambda-Python38-SciPy1x". If you want to use the functionality in those packages, AWS has already done the work for you!
After finding a modular way to add a Python package to a Lambda layer using an uploaded .zip file, I also wanted to see if I could improve on this by reducing or eliminating the additional computing resources I needed. Packaging and loading a script and any needed packages as a zip file works, but it requires an additional computing resource outside of Lambda itself. To start with, in my case, it was my own PC running Microsoft Windows 10. When I make my zip files locally, I have to ensure that I download the Linux version, as opposed to the Windows version, of certain packages. This makes a difference, for example, when running Python code that uses the NumPy package, which includes pre-compiled code in the form of shared libraries. On Windows, these files have a .dll extension, while on Linux (and usually Mac) these files have a .so extension. Running the Windows package on a Linux machine will likely throw an error, and vice versa.
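If you do build your zip files on Windows or Mac, one way I'm aware of to make sure you get the Linux wheel is to ask pip for it explicitly. A hedged sketch of the kind of command I mean, with platform and version tags that are examples for a Python 3.8 Lambda runtime:

pip download scikit-learn --only-binary=:all: --platform manylinux2014_x86_64 --python-version 3.8 --dest ./linux_wheels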
If you search the internet, many of the examples you'll find out there that address these issues will have various additional computing resource requirements, such as:
You can launch an EC2 instance on AWS, start it up, do your packaging as described above, then shut it down. You are already on AWS, so you don't have to go out and find an additional resource; you just select what you want and pay for the compute time you use. You can also keep track of the module code you have already installed this way, and that code will persist even while the machine is shut down. The big drawback to EC2, if you are on a tight budget, is the issue of managing EC2 instances, aka, "Oh no, I forgot to shut that down 4 days/weeks/months ago."
AWS suggests using Cloud9 in the link above to get the package contents into a layer. I wanted to like Cloud9, and while the coding interface was fine once I was up and running, getting it set up was sometimes clumsy and poorly documented. AWS also advertises Cloud9 as free, which is true for the service itself, but you are paying AWS for the EC2 instance it spins up to run Cloud9. See the EC2 issues above.
Some examples for loading packages into Lambda layers involve installing Docker on your local machine so that, regardless of operating system, it can run a virtual Amazon Linux environment. This might work for some people; however, instead of just managing computing resources on my local machine, I now have to manage computing resources inside a virtual environment...which is itself using computing resources on my machine. I've added an extra layer of complexity to my computer just so I can pretend that because the computing problems are now virtual, they won't come back to haunt me. And I'm still using my own machine, and I have to install an additional program to get this workflow running.
See Docker above
Based on the above options that I found on the internet, the EC2 option was probably the best. However, I went even further: I wanted to be able to retrieve module code from the internet via PyPI, rearrange the contents, and save the contents as a new Lambda layer...in Lambda itself. I only wanted to use the tool I was already using. This would give an added benefit in that I could launch my layer creation tool as a just-in-time function call, if needed. I could also encapsulate this function such that it could be run as an external API, called via API Gateway, for instance.
I started down the road of navigating PyPI programmatically. Since Lambda functions have internet access, you can connect to the simple interface for PyPI, which gives you text-only access to all packages/modules. If I navigate to the simple page for scikit-learn, I can see all the possible wheel files that can be downloaded. This presents a problem - which one to download? I'd ideally like the latest one. I tried the PyPI JSON interface to find the latest release version to download, but then how do I parse the file names? The Python Enhancement Proposal (PEP) PEP 425 helped me decipher how operating system info, Python versions, and Application Binary Interfaces (ABIs) are designated. Then I started reading PEP 427 for help with the wheel file naming convention. From there, I realized that navigating PyPI and its package contents required fairly sophisticated logic, and I could keep spiraling down that rabbit hole...or I could turn to a tool that already has this logic built in.
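To give a flavor of what that parsing involves, here's a rough sketch of splitting a wheel filename into the tags PEP 427 defines (the filename below is just an illustrative example):

# PEP 427 wheel filenames look like:
# {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl
filename = "scikit_learn-0.24.2-cp38-cp38-manylinux2010_x86_64.whl"

parts = filename[:-len(".whl")].split("-")
distribution, version = parts[0], parts[1]
python_tag, abi_tag, platform_tag = parts[-3], parts[-2], parts[-1]

print(distribution, version, python_tag, abi_tag, platform_tag)
# scikit_learn 0.24.2 cp38 cp38 manylinux2010_x86_64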
The Python package installer, also known as pip, is the recommended interface for retrieving and managing Python packages from PyPI. pip now comes preinstalled with most Python distributions, works on Windows, Mac, or Linux, and retrieving a package like scikit-learn from PyPI is as easy as typing pip install scikit-learn at the command prompt or terminal. There have been other tools for managing Python packages, but pip has largely won out and become the standard, and you can read a brief history about it here. There are other package managers, such as conda, which is often included with the Anaconda Python distribution. The Anaconda Python distribution is an alternative to the default "CPython" distribution found at python.org, and is geared toward data science applications. You can also use conda with CPython, and some people swear by it, which is great - if you like that tool more, use it. However, you may also run into a few people who just can't resist telling you that "pip is terrible, (insert package manager name) is way better and will take care of all your problems". I'm here to caution against heeding those words, and to help you understand why, let's take a short detour and look at an example of what a Python package manager attempts to do:
Let's say you want to install a hypothetical Python package that we will call "A". That's pretty straightforward: we go to the command prompt and type pip install A, and pip will go ahead and install package A. The code in package A is now available for use, but it turns out that the package A code requires code from two other packages, B and C, to be installed. Now, pip usually handles this scenario and will often automatically install packages B and C for you, depending on how you used it to get package A. In turn, packages B and C may rely on code from packages D, E, F, and so on... As time goes along, packages A-F may all be updated at different times, and package B may require an older version of package E; if package E gets updated and its newest version happens to break compatibility with B, your code may break, because you were dependent on the behavior of this stack of code packages NOT changing. There are ways to manage this situation and fix or "pin" the dependencies of your packages, but sometimes you may want updates to happen automatically and sometimes not. Any package manager you use doesn't automatically know what you want to happen; it needs you to configure it. Likewise, the maintainers of package B may specify exactly what version(s) of E are compatible, but the maintainers of C may not.
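As a concrete illustration of pinning, a requirements file for the hypothetical packages above might look like this (the package names and versions are made up for the example):

# requirements.txt - exact pins so the package stack doesn't shift underneath you
A==1.2.0
B==2.0.1
C==0.9.4
E==3.1.0   # held back on purpose because B is not yet compatible with E 4.x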
Trying to keep track of complex software where all of these additional code packages utilize other code packages means you are now in what is often called Dependency Hell, which can be a challenge no matter what software languages or packages you use. With regard to Python, you may have multiple interpreter versions installed (3.6, 3.7, 3.8), so you may switch between different interpreter versions for different projects and find that you have different versions of some packages installed via pip. Now your code isn't working because you are calling a newer function from an older version of the package that doesn't have it. To help with this confusion, many people organize their Python code into virtual environments using tools such as venv. This way you can keep your code projects separated virtually, with the interpreter and the installed packages set up only for that particular project. If you are managing 20+ different Python projects on your computer that all use a set of code packages, you can keep a separate environment for each project. The benefit is that if you somehow get into package dependency issues that leave your Python codebase in a horribly broken mess, you confine the damage to that particular project. If you have a package dependency mess at the level of the Python interpreter that you use to run all of your projects, well, you've now broken ALL of your code, and you likely have to stop coding and spend hours digging through nested dependencies. There is also a manager called Pipenv to help sort out these package dependency issues. A major strength of using these virtual environments is confining the damage to as small a space as possible, similar to how ships are divided into compartments with doors that can seal off a single compartment in case of a leak.
I added this little aside on code/software/package dependencies to help you understand the complexity around this little project, but also to help out those of you who are newer to coding and software development. If you are working on a team with multiple people who are checking code in and out of the codebase, it's easy to run into issues where someone checks in a change that breaks the code at either build time or runtime. If you start importing new libraries or external code packages, the complexity of your codebase increases dramatically, and you can run into real issues that can take a team of multiple developers hours, even days, to track down. There is a reason there are all kinds of solutions out there to help you manage things like code versioning, testing, integration, etc., and companies spend BILLIONS (with a capital B) on these tools to keep things running smoothly. Of course, many of these solutions are often marketed as the one new thing that will fix all of your problems. A new package manager will profess to fix all of your package problems, but its real usefulness is in making your problems visible and identifiable. It's not a magic button that you can press to set up each package exactly how you want it. You can wantonly install a bunch of Python packages into different virtual environments, but what happens if the main interpreter installation gets messed up? "I know, we'll keep one Python interpreter installation with our standard OS configuration in a Docker container!" OK, but what happens when that gets messed up? "We'll keep that container in one availability zone, and use another zone..." It's always good to keep backups for everything you do, especially in a different virtual/physical location, but you can see that this logic can keep going on infinitely - it's turtles all the way down. To mangle a great quote from the movie The Princess Bride, "Software is complexity - anyone who says differently is selling something". If a product's sales page or salesperson is promising you a simple solution to all of your problems, always keep thinking, "Where is the complexity hiding?". Too often you'll realize that they are trying to hand the complexity off to you.
Now back to code...if we'd like to use pip inside our Lambda function to retrieve a package from PyPI, we first need a way to invoke it. We're now presented with a dilemma - the code inside our Lambda function is already running inside a custom AWS Python interpreter, yet we typically run pip either at the command line, e.g. pip install scikit-learn, or as a module, e.g. python -m pip install scikit-learn. How would we then run pip from inside the Python interpreter at runtime? Fortunately, the pip developers address this particular scenario in their user guide. In the section about using pip from your program, the pip team explain why it should not be used as an imported module, but instead be run as a subprocess from within Python. The documentation there provides an example for running pip within another Python interpreter instance:
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'my_package'])
This function call is essentially the same as our example above, python -m pip install scikit-learn, where sys.executable is a path to the Python interpreter and my_package would be the name of the Python package we want installed (like scikit-learn). To discuss my implementation of pip within an AWS Lambda function, it's probably best to give you the link to my Python code file named python-add-module-as-layer.py, so you can follow along. About halfway down, I run a variant of the above pip function call in my code:
output = subprocess.run([sys.executable, "-m", "pip", "download", ModuleName, "--no-deps", \
"--dest", TmpModuleFolderName, "--cache-dir", TmpPipCacheFolderName], capture_output=True)
In essence, this function calls pip and tells it to find whatever module name the user wants to import from PyPI (ModuleName). I'll use this function call to explain what the rest of the function does at a higher level; I will also use its GitHub page for a more technical explanation and will try to keep that updated as best I can. First, I use pip's download command to download the package file instead of the standard install command. When you install a Python package, pip extracts the contents of the wheel file and typically places it in the /Lib/site-packages folder in the interpreter directory. For this particular application, I don't want that. I want pip to download the wheel file for me so I can save its contents as a zip file to put inside an AWS Lambda layer, such that when a Lambda function uses this layer, the contents of the zip file are extracted at runtime. Because I am attempting to compartmentalize each Python package as a separate layer, I don't want pip to get any needed dependencies at this time, so I set the --no-deps flag. For future functionality, I could have this call the same Lambda function for all of the initial package's dependencies. I will address this issue when I talk about duplicate layers below.
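Since capture_output=True is set, the subprocess.run call returns a CompletedProcess object, so a sketch of how one might check whether the download actually succeeded (my actual error handling may differ) looks like:

if output.returncode != 0:
    # pip writes its error messages to stderr; surface them in the Lambda logs
    print(output.stderr.decode())
    raise RuntimeError("pip download failed for " + ModuleName)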
After telling pip to download the appropriate wheel file from PyPI for the specified module, I need to specify the download location. If no destination is specified, pip will attempt to download the package to its default location, which is not writable in the Lambda environment and will cause the AWS Lambda runtime to throw a permission error. Instead, the /tmp directory is the appropriate place to store intermediate files when running your Lambda function, with a limit of 512 MB. Hence, I create a new module folder in /tmp, which is set earlier in the function in the TmpModuleFolderName variable, and point to its location with the --dest flag. I also do the same for the cache folder (TmpPipCacheFolderName) in case pip needs to cache anything as part of its operation.
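For reference, a sketch of the kind of folder setup I mean earlier in the function (the exact names and paths in the actual code file may differ):

import os

TmpModuleFolderName = "/tmp/module_download"   # where pip drops the downloaded wheel file
TmpPipCacheFolderName = "/tmp/pip_cache"       # pip's cache, kept inside the writable /tmp area
os.makedirs(TmpModuleFolderName, exist_ok=True)
os.makedirs(TmpPipCacheFolderName, exist_ok=True)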
After the function downloads the wheel file, it is theoretically possible to use it directly as the layer archive, since it's equivalent to a zip file. However, the folders within that wheel file are usually set with the zip file as the base directory, which pip would normally copy directly into the site-packages folder during an install. This causes issues when the Lambda runtime extracts these folders, because it puts them in a different location (/opt) than the standard site-packages location that the Python interpreter looks in. To fix this, I decided to create a new .zip file and copy each file from the wheel file into it, prepending the correct folder location (python/lib/python3.8/site-packages/) to each file. Then, when the .zip file is extracted at runtime, all of its contents will be correctly loaded into a site-packages folder the interpreter can find. After a few tweaks to the zip file contents in Python, I create a new Lambda layer with this zip file as its archive.
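A condensed sketch of those two steps - repacking the wheel contents under the site-packages prefix and then publishing the layer with boto3 - might look like the following (the variable names and layer name are illustrative; see the actual code file for the full logic):

import glob
import os
import zipfile
import boto3

SitePackagesPrefix = "python/lib/python3.8/site-packages/"
LayerZipPath = "/tmp/layer.zip"

# Assumes the downloaded wheel is the only .whl file in the temporary module folder
wheel_path = glob.glob(os.path.join(TmpModuleFolderName, "*.whl"))[0]

# Copy every entry from the wheel into a new zip, prepending the site-packages prefix
with zipfile.ZipFile(wheel_path) as wheel_zip, \
     zipfile.ZipFile(LayerZipPath, "w", zipfile.ZIP_DEFLATED) as layer_zip:
    for name in wheel_zip.namelist():
        layer_zip.writestr(SitePackagesPrefix + name, wheel_zip.read(name))

# Publish the repacked zip as a new Lambda layer version
lambda_client = boto3.client("lambda")
with open(LayerZipPath, "rb") as f:
    lambda_client.publish_layer_version(
        LayerName="scikit-learn_0242_py38",   # illustrative name matching the example response later on
        Content={"ZipFile": f.read()},
        CompatibleRuntimes=["python3.8"],
    )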
To add this python-add-module-as-layer.py function to your AWS Lambda setup, create a new AWS Lambda function from the Lambda homepage (usually something like https://us-west-2.console.aws.amazon.com/lambda/home, depending on the AWS region you are using) by selecting the "Create function" button in the top right of the screen. Call the function what you like, select the Python runtime, and set up the Lambda function permissions you wish to use. This is usually done by setting up a new execution role (see the docs here) and adding the permissions listed below (a sketch of a matching policy document follows the list):
lambda:PublishLayerVersion
logs:CreateLogGroup
logs:CreateLogStream
logs:PutLogEvents
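For reference, a minimal IAM policy document granting those actions might look like this (just a sketch; in practice you would scope the Resource entries more narrowly than the wildcard shown here):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lambda:PublishLayerVersion",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}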
Once that is done and saved, copy the function contents from the GitHub page into the function's "Code source" window. To run the function, you can use the test button. First, click the down arrow next to the "Test" button at the top of the "Code source" window. Select "Configure test event", and in the new window that opens, select "Create new test event", name it what you'd like, and in the code window add the following:
{
  "ModuleName": "scikit-learn"
}
I used scikit-learn here for the example I'm about to give, but you can use whatever module on PyPI you like. If the function runs successfully, you should see something like this in the "Response" window:
{
  "StatusCode": 200,
  "Payload": "New layer was successfully created with the name scikit-learn_0242_py38"
}
Note, I return errors this way to mimic how you would return an error if you were linking this AWS Lambda function to an API Gateway call. "200" indicates a successful HTTP request, with the error codes I chose following this format. There will also be some data in the "Function Logs" window, below the Response window, to help with debugging in the case of an error. After the function completes successfully for the module you chose, you can go to the Lambda -> Layers page (e.g. https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/layers) and your layer should now be listed.
You can now test out your new scikit-learn layer with a new AWS Lambda function. Create a new Lambda function with a similar execution role to the layer publishing function above. In the code source window, copy in the following code, modified from the scikit-learn documentation:
# Import needed Python math and fitting modules
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np
import os

print('Run SciPy Curve Fitting Lambda Layer Test')

def lambda_handler(event, context):
    model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                      ('linear', LinearRegression(fit_intercept=False))])
    # fit to an order-3 polynomial data
    x = np.arange(5)
    y = 3 - 2 * x + x ** 2 - x ** 3
    model = model.fit(x[:, np.newaxis], y)
    print(model.named_steps['linear'].coef_)
    return
Before running this function, you'll need to add your scikit-learn layer to it. To do this, scroll down to the bottom of your Lambda function page, where you should see a "Layers" heading. Click the "Add a layer" button on the right, and in the new window that pops up, select the "Custom layers" option button; your scikit-learn layer should then be visible. Select it along with its version, which should be "1", and click "Add". Back on your function page, you should see the layer name with a "Merge order" value of 1. Before you run this function, I will tell you that there are still three package dependencies needed (see here for the full scikit-learn dependency list). One is the module joblib, and the others are the scientific libraries numpy and scipy. To get the joblib dependency, run the same module layer creation function for joblib as you did for scikit-learn above. Once the joblib layer is created, add it to your function. The numpy and scipy packages are available in the standard AWS-created layer that I mentioned earlier in this article. To add it, after clicking "Add a layer", select the "AWS layers" option; again there should be a layer named something like "AWSLambda-Python38-SciPy1x", and pick whatever version is available.
One last thing before you run your function: the order in which your layers are merged into your code matters, so if package B depends on package A, you need to load package A before package B. In this case, I would put your numpy/scipy layer at position 1, joblib at 2, and scikit-learn at 3. Finally, you can run your code by clicking the "Test" button, using the default test configuration. After it runs successfully, in the Function Logs section you should see a line showing [ 3. -2. 1. -1.] between the START and END lines. That's it! You should be able to run other examples from scikit-learn as a test. After that, you can try doing some on-the-fly data manipulation using entries in DynamoDB or another data source.
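If you prefer to set the layers and their merge order programmatically rather than through the console, a hedged sketch using boto3 (the function name and layer version ARNs below are placeholders, not real values) would be:

import boto3

lambda_client = boto3.client("lambda")

# The order of this list is the merge order: numpy/scipy first, then joblib, then scikit-learn
lambda_client.update_function_configuration(
    FunctionName="my-sklearn-test-function",   # placeholder function name
    Layers=[
        "arn:aws:lambda:us-west-2:111122223333:layer:AWSLambda-Python38-SciPy1x:2",  # placeholder ARN for the AWS layer
        "arn:aws:lambda:us-west-2:111122223333:layer:joblib_layer:1",                # placeholder ARN
        "arn:aws:lambda:us-west-2:111122223333:layer:scikit-learn_0242_py38:1",      # placeholder ARN
    ],
)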
Currently, AWS Lambda only lets you use five layers per function, so if your layer dependency stack goes deeper than that, you'll run into problems. Additionally, the maximum size of your /tmp directory is 512 MB, so this approach won't work for modules on PyPI whose wheel files are larger than 512 MB, for example torch (aka PyTorch) or tensorflow. Finally, the total unzipped size of your deployment package, including all added layers, is limited to 250 MB, which further limits your options. However, this workflow is meant to be a lightweight way of expanding your options when coding AWS Lambda functions with Python, so keep that in mind.
In the layer merge order description above, notice I gave an example where package A must be put first since package B depends on A. What happens if I have a situation where package C depends on B, package B depends on package A, but package A depends on package C? This is what is known as a circular dependency, and it can cause all kinds of fun issues. This dependency example is the reason why I set up my function to only import one Python package at a time, exactly as saved on PyPI. If I wanted my layer creation function to automatically call itself recursively for any dependencies that pip finds, I would run into a new problem - versioning. What happens if I am creating a new layer for package B (version 1) and it is dependent on A (version 4)? It will create two new layers, for Bv1 and Av4. If I then create a layer for package C (version 1), which relies on package A (version 3), I now need to keep two separate versions of package A; otherwise, package B will import an older version of package A. So, do I automatically create two versions? What happens if I then import package D, which also relies on Av4? How do I compare the contents of the old Av4 layer vs the new one? A hash value of the wheel file contents? I'm just illustrating these examples, as well as the dependency discussion earlier, to show you how a "simple, lightweight" tool like this function can easily get bogged down in complexity. But, if you think you've mastered all of the complexities of pip and Python package dependencies, congratulations! You are now ready to master git code versioning complexities. Best of luck and happy coding!