In this post, I’ll walk through our experiences using Large Language Models (LLMs) for user intent extraction. By this, I mean taking plain English queries and interpreting their semantic meaning in terms of cloud architecture. I’ll describe what led us to explore LLMs as an interface for cloud architectures, why our approach is different from other LLM cloud architecture products, and what lessons we learned along the way.
Natural Language as an Interface
We’re building InfraCopilot, an IaC editor that combines natural language with a graphical User Interface (UI). Under its hood is the Klotho Engine, responsible for architecture auto-completion, optimization, and IaC generation.
When we decided to create a natural language interface, LLMs and ChatGPT were were the natural choice. However instead of fully relying on the LLM for understanding the cloud architecture as well as generating the IaC, we instead used it to only extract user intent. This intent then feeds into a format the Klotho Engine can process.
This approach helps us handle the correctness, upgradability, explainability and unpredictability of LLMs while still utilizing their ability to interpret language and intent. To better understand this a comparison to other LLM architecture products is necessary.
Other LLM-based Cloud Architecture Tools
The most notable alternatives to our approach are LLM based solutions like Pulumi AI or using GPT4 directly. Both of these approaches offer some advantages, they’re leveraging the creativity of LLMs to fill in the details. Pulumi AI is likely using tuning or very well crafted prompts on top of GPT3.5/GPT4 in order to support their custom experience. The downside to both approaches is that the results may be incorrect and incomplete, requiring a human expert to verify, correct or potentially redo the entire thing. Some examples of incorrect information are things like over permissive security groups or strange CIDR ranges, or missing certain required resources like when asked to connect a lambda to an RDS placing both in separate VPCs, there is no indication that you need to do VPC peering. Here is an example output of a Pulumi AI query to connect a lambda to an RDS instance:
Another example, using Pulumi AI again, is to add an RDS proxy between a lambda and RDS instance. Pulumi AI chose not to add a secret manager, but still relied on one being present. When asked why one wasn’t present, it decided that it should add it:
These tools provide value in bootstrapping, learning new IaC, or asking questions about existing IaC. They don’t guarantee correctness, nor are they fully usable by individuals who aren’t experienced in cloud architectures and IaC already. These LLM-based products will always return a result, even if it’s incorrect whereas InfraCopilot will only return a result if it is certain it can satisfy the requirements.
When working with a new technology like LLMs, it can feel like everyone is scrambling to find the best and “proper” ways to do things. We tried a variety of approaches and learned a few key lessons:
Use the GPT Chat API for now
We explored different libraries for interacting with LLMs, including LangChain, but they introduced problems and complexity beyond what we needed. We ultimately found that using the chat completions API directly from OpenAI provides is the fastest and easiest to work with. It’s compatible with Azure OpenAI and makes it relatively simple to maintain history during a conversation.
We also noticed significant speed differences between using the OpenAI API vs a deployed Azure API seeing a 3X round-trip latency improvement even in complex cases. In real terms, this meant that GPT3.5-turbo prompts that were taking upwards of 10 seconds using OpenAI were taking us 3 seconds or less consistently with Azure. This extra time translated directly to time spent by users in our chat interface waiting for a response. This can be frustrating if the Klotho engine cannot satisfy their query as they must wait even longer to iterate. Additionally having multiple private deployments allows you to load balance requests and avoid overloading a single API.
Beyond the performance difference, these APIs produced identical results and we could switch between them with a simple config change as required. In terms of total cost the Azure and OpenAI APIs were negligible relative to our overall architecture, and for startups especially, Azure credits can lower this cost to free for a period of time. Azure also gives very explicit data privacy guarantees.
Defining an output format
We wanted to create a format that would work well with both GPT4 and GPT3.5 for cost reasons.
We initially started by defining a JSON format, but found that the extra JSON characters ate into our token count and slowed down response time. For a complex architecture, repeated keys, curly braces, and punctuation can quickly become hundreds of tokens, and tokens means speed. We instead opted for a dense CSV format, gaining a 30% performance increase. We ultimately settled on this CSV format:
create a lambda hello_lambda and a second lambda
create,create a lambda,RESOURCE,hello_lambda,AWS_LAMBDA_FUNCTION
create,a second lambda,RESOURCE,lambda_01,AWS_LAMBDA_FUNCTION
What this really illustrates though is that shrinking the response payload can create fairly substantial differences. It’s especially noticeable when using GPT4 as responses are already slower.
We found that this format was sufficient to communicate intent. The schema for these create events looks like:
Most notably, reference_name is the part of the text that ChatGPT used to identify the intent. We saw anecdotal evidence that forcing the LLM to reflect its source of truth led to more accurate results, but it was primarily for our debugging purposes. Object classification can either be RESOURCE if our LLM can identify it as a specific cloud resource, or ABSTRACT if not. We want it to set object_name based on the resource type or use the explicit name provided. Lastly, type represents what it believes is the actual cloud type, in this case AWS_LAMBDA_FUNCTION.
As we added more and more capabilities we expanded the schema, adding more types of events, and tuned the results to better identify ABSTRACT vs RESOURCES. ABSTRACT resources types allow us to classify higher level constructs and later remap them to support custom terminology within an organization.
When creating an output format, drive it with examples
Initially we tried to define this format using a series of rules, this worked well until we added new capabilities and would suddenly regress on things that were previously working. Adding examples improved things, but sometimes the LLM would directly ignore rules, even with explicit examples.
After a lot of iteration we found that the most reliable way to do this is with a very small set of rules and a LOT of examples. Our final set of ‘rules’ looks like this:
You are a language intent parser system familiar with software infrastructure and cloud services. You translate English prompts into actions and attributes representing cloud resources (RESOURCE) or general concepts (ABSTRACT).
The following represents actions you can take:
As well as ending the prompt with:
only respond with CSV lines with the same format as the examples. Group together actions based on type. Do not include any explanations, notes, warnings, logs, or other information, reasons, or human readable text. Only the CSV lines to the best of your understanding.
Everything else is examples using the following format:
query: "create a lambda hello_lambda and a second lambda"
create,create a lambda,RESOURCE,hello_lambda,AWS_LAMBDA_FUNCTION
create,a second lambda,RESOURCE,lambda_01,AWS_LAMBDA_FUNCTION
query: "add a high latency dynamodb that is used by a new lambda called my_lambda_03"
create,add a high latency dynamodb,RESOURCE,dynamodb_01,AWS_DYNAMODB
create,used by a new lambda,RESOURCE,my_lambda_03,AWS_LAMBDA_FUNCTION
query: "create a highly available serverless API that uses dynamodb"
In general, as expected, GPT4 can work more accurately with fewer examples or even using the rules-based version, but for cost and performance reasons we needed to optimize for GPT3.5-turbo.
At the time of this exploration fine tuning was not yet available for GPT3.5 turbo, but is an area we’d explore. For providing these examples as a prompt, the number of examples can become quite large which also uses a large number of tokens. We explored an interesting optimization we called dynamic examples
The primary reason to introduce a new example is when the LLM is incorrectly identifying technology as ABSTRACT vs not, missing connections, or even missing resources entirely. However, we don’t have unlimited tokens for every example we might need. An optimization we came up with here is Dynamic Examples using pre-filtering. Instead of providing examples that would generalize to everything, we focus the examples on what we can guess is in the queries.
We extract a list of technologies using word lists, which is easier than extracting their intents, and if we don’t find many matches, we assume that more ABSTRACT resources are present. Once extracted, we can then create a custom prompt by selecting specific examples to the technologies mentioned in the query and a set of bedrock examples including baseline rules for the different actions that expand language understanding.
Github Copilot as an Example Copilot
When writing our prompt, because the file was a large list of examples, we found that Github Copilot would begin to understand the format and in many cases provide a valid example to complete a query. With minimal hand-holding, we could quickly generate new examples using different technologies or wording to better improve the GPT3.5 results.
This meant that Github Copilot became an invaluable tool for modifying our prompt as it could produce new examples with relative ease only requiring minor tweaks and it became better and better as more examples were added.
GPT4 as an Inconsistency Scanner
We found GPT3.5 to be highly sensitive to inconsistent examples, and given our many-examples approach, maintaining correctness of the parser became harder as the set grew. As a validation and debugging step, we used GPT4 with the 32k model to pass in as many of the examples and ask GPT4 to group inconsistent examples. Manually reviewing and fixing those inconsistencies improved the GPT3.5 parser every time.
Testing and Iteration
We knew that manually scaling testing would be unrealistic given the possible combinations and variations in language, supported services and actions. To automate testing worked backwards to create tests, recognizing that generating a user-like query from an answer is easier than parsing an answer from a user-query.
The engine creates an architecture based on what it supports , encodes it into the CSV format and then we use GPT4 to generate a query based on the CSV representations.
The way we teach GPT4 to generate the queries is by flipping the same parser examples – what used to be a (Query,Parsed CSV) pair becomes (Parsed CSV, Query) pair.
The end result is that we’re able to start with an answer, generate queries, parse them with the GPT3.5 parser, then validate the whole system end-to-end.
Some generated example queries include:
- Create an AWS API Deployment called “api_deployment_01” and an AWS ECR Image called “ecr_image_01”. Connect api_deployment_01 -> ecr_image_01.
- Create an AWS Lambda function called “lambda_function_01” and an AWS DynamoDB Table called “dynamodb_table_01”. Connect the lambda function to the DynamoDB Table.
Here is an example intent test yaml file that is ready and processed by our test runner:
- Create an AWS API Deployment called "api_deployment_01" and an AWS ElastiCache Cluster called "elasticache_cluster_01". Connect api_deployment_01 -> elasticache_cluster_01.
- - create,api_deployment,RESOURCE,api_deployment_01,AWS_API_DEPLOYMENT
We found that running with these intents with temperature=0 for as little variability in answers as possible as well as 3 test iterations for each test was sufficient for correctness in most cases. Cost is a serious consideration here, running hundreds of tests three times each for each iteration of the prompt can quickly add up in cost. Individual tests would take fractions of a second to 7+ seconds for larger inputs primarily driven by longer results that had more tokens.
In our case we had 475 tests, Our prompt is ~7000 tokens, this means to run the full suite of tests:
475 (number of tests) * 3 (iterations) * ~7000 (prompt+response size) = 9,975,000 tokens. Based on pricing for GPT 3.5-turbo with 16k context, our total for that would be (9,975,000/1000) * $0.003 = $29.93 to run our full set of tests.
We found that additional normalization helped for places where we relied on LLM intelligence. For example, sometimes the LLM would use AWS_API_GATEWAY, other times it might use AWS_APIGATEWAY. We had to create dynamic maps on the backend to ensure these would be mapped to the same type before sending it to our engine.
This was even more necessary for ABSTRACT resources where the LLM had to make a best guess based on whatever was typed. For example ABSTRACT_DATABASE, ABSTRACT_DATASTORE, etc.
LLMs have proven to be a powerful interface to the Klotho engine. We believe that chat is an excellent complimentary interface to classic GUIs and CLIs, but doesn’t have to replace them entirely. GPT4 is amazingly powerful at extracting user intent with minimal examples, and GPT3.5 is both fast and accurate when given comprehensive examples and a diligent approach to testing/validation as you modify your prompts.
Our approach differs from other LLM-based cloud architecture tools in that we don’t want to rely on the LLM intelligence to solve architectural problems, only translate user intent. The biggest downside to this approach is that you’ll simply fail when you can’t generate an answer and since users can ask literally anything.
We’re excited about the interweaving of GUIs with chat-based interfaces, as multiple interaction modalities lend themselves differently depending on the task at hand.