Multi-LoRA: Personalize AI at scale and deliver the best experience for each customer and use case, with 100x cost-efficiency
By Fireworks AI Team|9/18/2024
DeepSeek V3, a state-of-the-art open model, is now available. Try it now!
By Fireworks AI Team|9/18/2024
At Fireworks, we know how important it is for you to deliver the best product experiences to your users. Today, we’re excited to spotlight Multi-LoRA, a FireOptimizer capability that customers have used to personalize AI at scale and deliver the best experience for each customer and use case.
Why it matters: Personalized experiences are critical to driving greater usage, retention and customer satisfaction for your product. Before Multi-LoRA, if you had many users, user segments or use cases to personalize for, deploying hundreds of fine-tuned models on separate GPUs would be prohibitively expensive. With Multi-LoRA, you can now deliver personalized experiences across thousands of users and use cases, without scaling your costs! 🚀
Multi-LoRA benefits:
Multi-LoRA is part of FireOptimizer, our adaptation engine designed to customize and enhance AI model performance for your unique use cases and workload. FireOptimizer capabilities include Adaptive Speculative Execution, that enables up to 3x latency improvements, Customizable Quantization, to precisely balance speed and quality, and LoRA Fine-Tuning to customize and improve model performance.
Companies often serve thousands of customers across multiple use cases, and want to deliver personalized experiences for each customer. Low-Rank Adaptation (LoRA) is a popular fine-tuning technique to customize and improve model performance, by updating only a small subset of model parameters.
Multi-LoRA takes LoRA to the next level, by allowing you to serve hundreds of fine-tuned LoRAs on a single base model, simultaneously, at the same inference cost as a single base model.For instance, you can user Fireworks’ Multi-LoRA to serve 100 LoRA fine-tunes of Llama3.1 8B and run inference for 1M tokens for just $0.2 on Fireworks Serverless, the same price as the Llama3.1 8B base model.
Cresta, a leader in AI-powered solutions for contact centers, uses Multi-LoRA to personalize their Knowledge Assist feature for each individual customer. This enabled their customers’ contact center agents to resolve queries more quickly and accurately. They deployed 1000s of LoRAs on top of a single Mixtral/Mistral-based Ocean model cluster, at 100x lower cost compared with the previous GPT-4 setup (at the time of deployment in December 2023). Read more about Cresta’s innovations in personalizing AI with Multi-LoRA here.
"Fireworks' Multi-LoRA capabilities align with Cresta's strategy to deploy custom AI through fine-tuning cutting-edge base models. It helps unleash the potential of AI on private enterprise data." - Tim Shi, Co-Founder and CTO of Cresta
Companies often experiment with 100s of fine-tuned models in order to determine the best user experience, with multiple engineers and engineering teams experimenting simultaneously on different use cases. The most promising variants are then deployed into production for A/B testing.
Coordinating AI model experimentation across teams is challenging - teams need to be able to work in parallel, while still being efficient with compute resources as well as managing changing models and fluctuating traffic patterns.
Multi-LoRA offers an elegant solution: teams can create LoRA fine-tunes for each experiment, benefiting from the compute efficiency of utilizing the same base model, increased velocity in training and deploying LoRA fine-tunes, and easier onboarding and offboarding of experimental fine-tuned models. Companies can periodically ‘merge’ the most successful experiments by combining datasets and hyperparameter findings.
Inference prices are the same as a single base model, even when serving hundreds of fine-tuned models
Multi-LoRA makes it feasible to serve hundreds of fine-tuned models at the same cost as base model inference. Serving 100 LoRA fine-tunes of Llama3.1 8B fine-tuned costs only $0.2/1M tokens on Fireworks Serverless, the same price as the Llama3.1 8B base model and 2.5x more cost efficient than GPT4o-mini fine-tuned (assuming a 3:1 input:output token ratio).
We believe this speed and cost efficiency unlocks personalization at scale. If you have many users, user segments or use cases to personalize for, deploying each fine-tuned model on a separate GPU would be prohibitively expensive. With Multi-LoRA, you can personalize AI at scale, without scaling your costs.
Multi-LoRA is also competitive in throughput and speed: you can serve hundreds of fine-tuned models, at inference speeds and throughput that approach 90% of base model performance.
Scott Kramer, former Head of ML at Amplitude, helps businesses leverage their proprietary data to fine-tune and deploy models using Fireworks. Scott's clients find it easier and more cost-effective to deploy their tuned models on Fireworks Serverless, compared to other platforms like AWS. Scott also teaches a fine-tuning course for developers, leveraging Fireworks as a key platform for learning.
“Using Fireworks, clients with limited AI expertise can successfully maintain and improve the solutions I provide. Additionally, students in my course are able to complete real-world fine-tuning projects, dedicating just a few hours per week to the process.” - Scott Kramer, CEO of Brainiac Labs, ex-Head of ML at Amplitude.
At Fireworks, we offer our own LoRA tuning service via Firectl CLI, with a REST API coming soon. However, you can also tune LoRAs using any tool or platform of your choice, and upload and deploy your LoRAs on Fireworks. Follow our documentation to fine-tune as well as upload and deploy models on Fireworks.
Once tuned, there are three deployment options on Fireworks:
And for even faster inference at high volumes, you can merge a LoRA into the base model and deploy the merged models on our on-demand or enterprise reserved deployments.
At Fireworks, we’ve built first class support for LoRA serving in our proprietary FireAttention inference stack. We optimized Multi-LoRA serving through Cross-Model Continuous Batching, which enables processing requests from multiple LoRAs simultaneously on the same base model and dynamically adjusts batch sizes to maximize GPU utilization. We also use Dynamic Loading of LoRA adapters with caching, to ensure that the number of supported LoRAs is not constrained by GPU memory and new models can be added or taken down within seconds.
Ready to experience the power of Multi-LoRA and FireOptimizer? Here's how to get started:
We can’t wait to see what you build!