Hello, I want to train a Dflash draft model for a target model with a relatively large number of parameters. The target model, for example DeepSeek-V2, has 236B parameters and supports multiple dialogue formats such as think/no-think and function/tool calling.
Based on your experience, how much training data is appropriate for a target model of this size?