docs/source/quick_start.md (2 additions, 2 deletions)
@@ -125,10 +125,10 @@ You can pass input prompts in single string but separate with pipe (|) symbol".
 python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 3 --prompt_len 32 --ctx_len 128 --num_cores 16 --device_group [0] --prompt "My name is|The flat earth theory is the belief that|The sun rises from" --mxfp6 --mos 1 --aic_enable_depth_first
 ```
-You can also pass path of txt file with input prompts when you want to run inference on lot of prompts, Example below, sample txt file(prompts.txt) is present in examples folder.
+You can also pass the path of a txt file with input prompts when you want to run inference on a large number of prompts; see the example below. A sample txt file (prompts.txt) is present in the examples/sample_prompts folder.
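For context, the batch-inference flow that the CLI example above drives can also be sketched through the Python API. The snippet below is a minimal, illustrative sketch only: `QEFFAutoModelForCausalLM` is the high-level class QEfficient exposes for causal LMs, but the exact `compile()`/`generate()` keyword names used here are assumptions and may differ between releases.

```python
# Illustrative sketch: batch inference over prompts read from a txt file,
# mirroring the CLI example above. The compile()/generate() keyword names
# below are assumptions and may differ across QEfficient releases.
from transformers import AutoTokenizer

from QEfficient import QEFFAutoModelForCausalLM

# One prompt per line, as in the sample prompts.txt shipped with the examples.
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")

# Compile once for the AI 100 device; kwargs mirror the CLI flags above.
model.compile(prefill_seq_len=32, ctx_len=128, num_cores=16, mxfp6_matmul=True)

# Generate completions for every prompt in the file.
model.generate(tokenizer=tokenizer, prompts=prompts)
```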
docs/source/supported_features.rst (11 additions, 11 deletions)
@@ -7,43 +7,43 @@ Supported Features
 * - Feature
   - Impact
 * - Sentence embedding, Flexible Pooling configuration and compilation with multiple sequence lengths
-  - Supports standard/custom pooling with AI 100 acceleration and sentence embedding. Enables efficient sentence embeddings via Efficient-Transformers. Compile with one or multiple seq_len; optimal graph auto-selected at runtime. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/embedding_model.py>`_ for more **details**.
+  - Supports standard/custom pooling with AI 100 acceleration and sentence embedding. Enables efficient sentence embeddings via Efficient-Transformers. Compile with one or multiple seq_len; optimal graph auto-selected at runtime. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/embeddings/sentence_embeddings.py>`_ for more **details**.
-  - Implemented post-attention hidden size projections to speculate tokens ahead of the base model. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/multiprojs_spd_inference.py>`_ for more **details**.
+  - Implemented post-attention hidden size projections to speculate tokens ahead of the base model. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/performance/speculative_decoding/multi_projection.py>`_ for more **details**.
 * - `QNN Compilation support <https://github.com/quic/efficient-transformers/pull/374>`_
   - Enabled QNN compilation capabilities for AutoModel classes, including multi-models, embedding models and causal models.
     It supports separate prefill and decode compilation for encoder (vision) and language models.
 * - `GGUF model execution <https://github.com/quic/efficient-transformers/pull/368>`_
-  - Supported GGUF model execution (without quantized weights). Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/basic_gguf_models.py>`_ for more **details**.
+  - Supported GGUF model execution (without quantized weights). Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/text_generation/gguf_models.py>`_ for more **details**.
 * - Replication of KV
   - Enabled FP8 model support on `replicate_kv_heads script <https://github.com/quic/efficient-transformers/tree/main/scripts/replicate_kv_head>`_.
   - Supports gradient checkpointing in the finetuning script
 * - Swift KV `Snowflake/Llama-3.1-SwiftKV-8B-Instruct <https://huggingface.co/Snowflake/Llama-3.1-SwiftKV-8B-Instruct>`_
   - Reduces computational overhead during inference by optimizing key-value pair processing, leading to improved throughput. Support for both `continuous and non-continuous batching execution <https://github.com/quic/efficient-transformers/pull/367>`_ in SwiftKV
 * - :ref:`Vision Language Model <QEFFAutoModelForImageTextToText>`
-  - Provides support for the AutoModelForImageTextToText class from the transformers library, enabling advanced vision-language tasks. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/image_text_to_text_inference.py>`_ for more **details**.
+  - Provides support for the AutoModelForImageTextToText class from the transformers library, enabling advanced vision-language tasks. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/image_text_to_text/basic_vlm_inference.py>`_ for more **details**.
 * - :ref:`Speech Sequence to Sequence Model <QEFFAutoModelForSpeechSeq2Seq>`
-  - Provides support for the QEFFAutoModelForSpeechSeq2Seq Facilitates speech-to-text sequence models. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/speech_to_text/run_whisper_speech_to_text.py>`_ for more **details**.
+  - Provides support for the QEFFAutoModelForSpeechSeq2Seq class, which facilitates speech-to-text sequence models. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/audio/speech_to_text.py>`_ for more **details**.
 * - Support for FP8 Execution
   - Enables execution with FP8 precision, significantly improving performance and reducing memory usage for computational tasks.
 * - Prefill caching
   - Enhances inference speed by caching key-value pairs for shared prefixes, reducing redundant computations and improving efficiency.
 * - Prompt-Lookup Decoding
-  - Speeds up text generation by using overlapping parts of the input prompt and the generated text, making the process faster without losing quality. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/pld_spd_inference.py>`_ for more **details**.
+  - Speeds up text generation by using overlapping parts of the input prompt and the generated text, making the process faster without losing quality. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/performance/speculative_decoding/prompt_lookup.py>`_ for more **details**.
 * - :ref:`PEFT LoRA support <QEffAutoPeftModelForCausalLM>`
-  - Enables parameter-efficient fine-tuning using low-rank adaptation techniques, reducing the computational and memory requirements for fine-tuning large models. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/peft_models.py>`_ for more **details**.
+  - Enables parameter-efficient fine-tuning using low-rank adaptation techniques, reducing the computational and memory requirements for fine-tuning large models. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/peft/single_adapter.py>`_ for more **details**.
 * - :ref:`QNN support <id-qnn-compilation-via-python-api>`
   - Enables compilation using QNN SDK, making Qeff adaptable for various backends in the future.
 * - :ref:`Embedding model support <QEFFAutoModel>`
   - Facilitates the generation of vector embeddings for retrieval tasks.
-  - Accelerates text generation by using a draft model to generate preliminary predictions, which are then verified by the target model, reducing latency and improving efficiency. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/draft_spd_inference.py>`_ for more **details**.
+  - Accelerates text generation by using a draft model to generate preliminary predictions, which are then verified by the target model, reducing latency and improving efficiency. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/performance/speculative_decoding/draft_based.py>`_ for more **details**.
-  - Users can activate multiple LoRA adapters and compile them with the base model. At runtime, they can specify which prompt should use which adapter, enabling mixed adapter usage within the same batch. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/lora_models.py>`_ for more **details**.
+  - Users can activate multiple LoRA adapters and compile them with the base model. At runtime, they can specify which prompt should use which adapter, enabling mixed adapter usage within the same batch. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/peft/multi_adapter.py>`_ for more **details**.
 * - Python and CPP Inferencing API support
-  - Provides flexibility while running inference with Qeff and enabling integration with various applications and improving accessibility for developers. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/cpp_execution/text_inference_using_cpp.py>`_ for more **details**.
+  - Provides flexibility while running inference with Qeff, enabling integration with various applications and improving accessibility for developers. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/performance/cpp_execution/text_inference_cpp.py>`_ for more **details**.
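The sentence-embedding row above describes flexible pooling and compilation with multiple sequence lengths; a rough sketch of how that flow might look through `QEFFAutoModel` follows. The `pooling` and `seq_len` keyword arguments are assumptions inferred from the feature description, not verified signatures; the linked `examples/embeddings/sentence_embeddings.py` script is the authoritative reference.

```python
# Rough sketch of the sentence-embedding flow from the feature table.
# The pooling= and seq_len= kwargs are assumptions inferred from the row's
# description; see examples/embeddings/sentence_embeddings.py for the real API.
from transformers import AutoTokenizer

from QEfficient import QEFFAutoModel

model_card = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_card)

# "standard/custom pooling" from the table, here stood in by mean pooling.
model = QEFFAutoModel.from_pretrained(model_card, pooling="mean")

# Compile for several sequence lengths; per the table, the optimal graph is
# auto-selected at runtime based on the input length.
model.compile(seq_len=[32, 64, 128], num_cores=16)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
embeddings = model.generate(inputs)  # assumed entry point for embedding output
```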
0 commit comments