Inference Endpoints Changelog ๐
Week Christmas and New Years, Dec 15 - Jan 05
I'll do one larger with several weeks of updates combined with the Holidays we had. Since last time we have some nice improvements on:
- added llama.cpp supported models to the catalog ๐ฅ check them out here
- fixes bugs on the
/new
page - bug fix related to updating passwords in container
- a lot of nitty gritty work in the background
- and recharged for the new year ๐ช
Week 50, Dec 09 - Dec 15
The big update from this week is getting TGI v3 out ๐ฅ You can read all about the update here but a short tl;dr is:
- zero configuration
- increased performance
We also:
- improved the messaging in the UI when you reach your quota
- did minor bug fixes
Week 49, Dec 02 - Dec 08
This week we have a lot of nice updates ๐
- New and improved UI for the
/new
page ๐ our aim was to make the configuration cleaner and remove outdated fields, there are more updates coming but we think this is already a nice improvement.
- You can now configure the hardware utilization threshold for autoscaling.
- A bunch of models are now supported on the inf2 accelerator.
- Mixtral-8x7B is now supported on TPUs.
Week 48, Nov 25 - Dec 01
This week we finally got back to shipping after the off-site and flu ๐ฅ
Updates:
- If you autoscale based on pending requests, you can manually set the threshold to meet your specific requirements
- You can now view logs further back in history. Up to the last 50 replicas for a particular deployment.
- New models added to the catalogue, like Qwen2-VL-7B-Instruct and Qwen2.5-Coder-32B-Instruct.
- Updated default TGI version to 2.4.1
- Added CPU as an alternative for the llama.cpp container type (shoutout to @ngxson)
- Fixed an issue with the revision link and default hardware configurations for catalog models.
- The default scale-to-zero timeout is now 15min. Previously it was never scale to zero.
Week 47, Nov 18 - Nov 24
Unfortunately, a wave of flu has hit our team, and we needed some time to recover ๐ค No updates this week, but stay tuned for next weekโwe have a lot of exciting things coming up! ๐ฅ
Week 46, Nov 11 - Nov 17
No changes this week as the team was on an off-site in Martinique! But a lot of ideas and energy cooked up for the coming week ๐
Week 45, Nov 04 - Nov 10
This week, we have some awesome updates that are finally out ๐
- Scaling replicas based on pending requests is now in beta ๐ฅ Since it's in beta, things might change, but you can try it out and read more about it here
- Improved analytics with a graph of the replica history
- Updates to the widgets
- Fixed bug in streaming
- Conversations can now be cleared
- Submit message with cmd+enter
Week 44, Oct 28 - Nov 03
Probably the biggest update this week was a revamp to the Inference Catalogue ๐ฅ You can now with a one-click-deploy find a model based on:
- license
- price range
- inference server
- accelerator
- and the previously existing task and search filters
Additionally:
- we fixed the config for
MoritzLaurer/deberta-v3-large-zeroshot-v2.0
so that you can run it on CPU as well - and also thanks to @ngxson for fixing a bug in the llama.cpp snippet
Week 43, Oct 21-27
This week you'll get a sneak peak of the upcoming autoscaling, in the form of analytics ๐
We have:
- Added pending http requests to the analytics
- Support for Image-Text-To-Text, aka language vision models ๐ฅ (llama vision has some good jokes ๐ )
- Improved the log pagination and added some nice visual touches
- Fixed a bug related to total request count in the analytics
Week 42, Oct 14-20
This week was unfortunately slower on the user-facing updates.
Behind the scenes, we:
- fixed several recommendation values for LLaMA and Qwen 2,
- improved our internal analytics,
- debugged issues related to weights downloading and getting 429s,
- and hopefully squashed the last bugs so we can soon release the new autoscaling ๐ฅ
Week 41, Oct 7-13
This week we had a lot of nice UI/UX improvements:
Additionally:
- deprecated the "text2text-generation" tasks, it's been deprecated on the Hub and in the Inference API as well
- you can now pass the "seed" parameter in the widget for diffuser models
- small bug fixes on llama.cpp containers
- you can directly play in the widget with openAI API parameters
- Shoutout to Alvaro for making the NVLM-D-72B model compatible on endpoints ๐
On the backend we're also making improvements to the autoscaling. This might not immediately have noticeable impact for user but soon it'll ripple to the front end as well. Stay tuned ๐