
Local deepseek-r1:14b working but taking a very long time #195


Open
harst21 opened this issue Jan 29, 2025 · 13 comments

Comments

@harst21

harst21 commented Jan 29, 2025

Hi there,

I'm using Ollama with a local deepseek-r1:14b model. It works with the default prompt, but each step takes a very long time, more than 4 minutes each. It finally got to the first page of search results,

but it took more than 30 minutes. Lol

Any idea why this is happening? Is there any way to speed up the process?

Thanks in advance!

@vvincent1234
Contributor

Maybe your Ollama is not using the GPU. Try these two things to fix it: 1. Run ollama run deepseek-r1:14b first to check whether the model itself is fast. 2. Try this: #185
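
For what it's worth, a quick way to see whether Ollama is splitting a loaded model between CPU and GPU is the ollama ps command, which prints a PROCESSOR column such as "100% GPU" or "64%/36% CPU/GPU". A minimal Python sketch that just shells out to the CLI (it assumes ollama is on the PATH; the CPU check at the end is a rough heuristic, not an official API):

    import subprocess

    # List the models Ollama currently has loaded; the PROCESSOR column shows
    # how each one is placed, e.g. "100% GPU" or "64%/36% CPU/GPU".
    result = subprocess.run(["ollama", "ps"], capture_output=True, text=True, check=True)
    print(result.stdout)

    # Rough heuristic: flag any loaded model that reports a CPU share.
    for line in result.stdout.splitlines()[1:]:
        if "CPU" in line:
            print("Partially on CPU:", line.strip())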

@harst21
Author

harst21 commented Jan 29, 2025

> Maybe your Ollama is not using the GPU. Try these two things to fix it: 1. Run ollama run deepseek-r1:14b first to check whether the model itself is fast. 2. Try this: #185

It runs fast enough with ollama run deepseek-r1:14b; I have no issue prompting it directly. I already tried option 2 as well, but no luck. It is still super slow: it works and completes step 1, but it crawls until it reaches step 2, and so on.

@techstartupexplorer

deepseek-reasoner doesn't work with 1.3

@warmshao
Collaborator

> deepseek-reasoner doesn't work with 1.3

Please don't use 1.3; the code is under active development. Keep up with the latest code.

@NoraNemet

I have the same issue on my laptop with a 13th-gen i9 and an RTX 4080 (Ollama splits the load 50/50 between CPU and GPU). I think it may be related to the large context length (the DeepSeek API was giving me errors about this as well).
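
If the large context length is indeed what pushes the model off the GPU, one low-effort experiment is to construct the Ollama chat model with a smaller num_ctx and see whether the CPU/GPU split goes away. A minimal sketch with langchain_ollama (the model tag and the 8192 value are illustrative assumptions, not settings taken from this project):

    from langchain_ollama import ChatOllama

    # A smaller context window means a smaller KV cache, so more of the model
    # fits in VRAM and Ollama is less likely to offload layers to the CPU.
    # The trade-off is that the agent sees less history per step.
    llm = ChatOllama(
        model="deepseek-r1:14b",  # assumed model tag
        num_ctx=8192,             # illustrative value; Ollama's own default is 2048
        temperature=0.0,
    )

    print(llm.invoke("Reply with the single word: ready").content)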

@maximus1127

I followed the link to #185 mentioned above and downloaded and installed the newest version of Ollama, which (according to that post) allows the CPU and GPU to share the workload. However, running Mistral results in nearly a 15-minute execution time for "go to google.com and type 'OpenAI' click search and give me the first url". Running even DeepSeek 7B does pretty much nothing by the time I get tired of waiting, ten minutes later. I'm running an AMD Ryzen 9 7940HX with an RTX 4070 GPU. As others have mentioned, direct prompting in the terminal is very fast, but the browser manipulation is dreadfully slow.

@harst21
Author

harst21 commented Feb 1, 2025

> I followed the link to #185 mentioned above and downloaded and installed the newest version of Ollama, which (according to that post) allows the CPU and GPU to share the workload. However, running Mistral results in nearly a 15-minute execution time for "go to google.com and type 'OpenAI' click search and give me the first url". Running even DeepSeek 7B does pretty much nothing by the time I get tired of waiting, ten minutes later. I'm running an AMD Ryzen 9 7940HX with an RTX 4070 GPU. As others have mentioned, direct prompting in the terminal is very fast, but the browser manipulation is dreadfully slow.

This is exactly what happened to me as well; the CPU/GPU split is roughly 50/50 and results take a very long time.

@Batman313v

Related to #111.
For some reason ChatOllama defaults to splitting the load between CPU and GPU. I assume num_thread is being set somewhere, but after a quick look that turned up nothing obvious, I opted for the quicker solution: commenting it out in
.venv\Lib\site-packages\langchain_ollama\chat_models.py

options_dict = kwargs.pop(
            "options",
            {
                "mirostat": self.mirostat,
                "mirostat_eta": self.mirostat_eta,
                "mirostat_tau": self.mirostat_tau,
                "num_ctx": self.num_ctx,
                "num_gpu": self.num_gpu,
                # "num_thread": self.num_thread,  ############## Commented out
                "num_predict": self.num_predict,
                "repeat_last_n": self.repeat_last_n,
                "repeat_penalty": self.repeat_penalty,
                "temperature": self.temperature,
                "seed": self.seed,
                "stop": self.stop if stop is None else stop,
                "tfs_z": self.tfs_z,
                "top_k": self.top_k,
                "top_p": self.top_p,
            },
        )

This resolves the issue, as Ollama will set this value automatically while loading the model.
This is obviously not a fix recommended for this project; it is simply meant to document a workaround for anyone else hitting this issue. I don't personally use LangChain, so I'm not sure what an actual fix would look like.
This workaround results in a large increase in TPS and makes the project usable on a mid-level GPU. Even R1:14b runs at a reasonable speed now.


Stats
Before:
qwen2.5:7b 845dbda0ea48 9.1 GB 64%/36% CPU/GPU
deepseek-r1:14b ea35dfe18182 19 GB 73%/27% CPU/GPU
After:
qwen2.5:7b 845dbda0ea48 9.3 GB 14%/86% CPU/GPU
deepseek-r1:14b ea35dfe18182 19 GB 60%/40% CPU/GPU

Other things to note:
I did notice while debugging that although ollama run qwen2.5:7b results in 100% GPU usage, it doesn't here. This seems to be related to num_ctx being set to 32000 versus Ollama's default of 2048. Qwen supports a maximum context length of 32768, so this isn't a problem as far as the model goes; however, it does use more than 10 GB of VRAM, so Ollama starts to offload to the CPU (on mid-range GPUs). For Qwen this results in a TPS of 17.15, which isn't bad. I played around with setting the context to different lengths: you can get faster inference and still get results from the WebUI, but it does start to have a harder time completing larger tasks.
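
If you would rather not edit files inside site-packages, an untested alternative (only a sketch, not something verified against this repo) is to pass explicit values for these options when the model is constructed, so that num_thread is never sent as None and num_ctx stays small enough to fit in VRAM:

    import os

    from langchain_ollama import ChatOllama

    # Instead of commenting out "num_thread" inside langchain_ollama, give it a
    # concrete value so None is never forwarded to Ollama. Whether this avoids
    # the CPU/GPU split is untested; it is only an alternative to patching
    # site-packages.
    llm = ChatOllama(
        model="deepseek-r1:14b",    # assumed model tag
        num_thread=os.cpu_count(),  # pin an explicit thread count (logical CPUs)
        num_ctx=8192,               # lower than 32000 to keep the KV cache in VRAM
    )

The num_ctx trade-off mentioned above still applies: lower values speed up inference but make it harder for the agent to complete larger tasks.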

@harst21
Author

harst21 commented Feb 8, 2025

> Related to #111. For some reason ChatOllama defaults to splitting the load between CPU and GPU. [...] Even R1:14b runs at a reasonable speed now.

I tried commenting it out as well, but nothing changed. It's still 51/49 CPU/GPU.

No idea why.

@Batman313v

@harst21 You have to restart the WebUI and Ollama when you do, or Ollama caches the model and keeps the old config loaded.
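
If restarting the whole Ollama service is inconvenient, unloading just the model should also force it to be reloaded with the new options on the next request. A one-line sketch (it assumes a recent Ollama release that has the stop subcommand, and the model tag is illustrative):

    import subprocess

    # Unload the cached model so the next request loads it again with fresh options.
    # Older Ollama builds without "ollama stop" need a full service restart instead.
    subprocess.run(["ollama", "stop", "deepseek-r1:14b"], check=True)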

@jasonstetson

What should I do?

@maximus1127

I've also commented out the line mentioned above, and it got me no results. The default task description from the WebUI still takes nearly 30 minutes to run using Mistral, which is a lighter model. It takes almost 5 minutes just for the Google home page to load in the browser for this command:

go to google.com and type 'OpenAI' click search and give me the first url

@beetrandahiya

Any updates? Did anyone get it working on the GPU?
