Together-LLM

English | 中文

Cross-Machine Inference LLM Framework

Quick Start

  1. Install dependencies
  • On macOS (Apple silicon): pip install -U -e ".[mlx]"
  • On other platforms (NVIDIA): pip install -e ".[torch]"

Run a quick local test: python3 ./run_engine.py --model_path mlx-community/Llama-3.2-1B-Instruct-4bit

  2. Start the HTTP service
  • Single machine: tllm.server --model_path mlx-community/Llama-3.2-1B-Instruct-4bit

  • Multi-machine:

    • Start the server in one terminal: tllm.server --model_path mlx-community/Llama-3.2-1B-Instruct-4bit --hostname $YOUR_IP
    • Start a client in another terminal: tllm.client --hostname http://$YOUR_IP:8022

  3. Test the HTTP service
  • python3 benchmarks/run_async_requests.py
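
You can also send a request by hand. Below is a minimal sketch, assuming the server exposes an OpenAI-compatible /v1/chat/completions endpoint on the HTTP port — the endpoint path and payload shape are assumptions; see benchmarks/run_async_requests.py for the exact request format the project uses:

```python
# Minimal manual test of the HTTP service.
# Assumption: an OpenAI-compatible /v1/chat/completions endpoint on port 8022.
import requests

resp = requests.post(
    "http://localhost:8022/v1/chat/completions",
    json={
        "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```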

Supported Models

  • Llama
  • Qwen
  • Janus Pro: currently only supported on macOS
    • Text to Text: PYTHONPATH="./" python3 run_janus_pro.py --model_path wnma3mz/Janus-Pro-1B-4bit --message_type llm
    • Image to Text: PYTHONPATH="./" python3 run_janus_pro.py --model_path wnma3mz/Janus-Pro-1B-4bit --message_type mllm
    • Text to Image: PYTHONPATH="./" python3 run_janus_pro.py --model_path wnma3mz/Janus-Pro-1B-4bit --message_type image
  • Qwen-VL: on macOS, an additional install is required: pip install mlx-vlm==0.1.12.
  • flux: currently only supported on macOS; an additional install is required: pip install mflux==0.4.1.

Advanced

Multi-machine deployment uses a set of default ports. If these defaults conflict with your environment, you can change them in the configuration file examples/config.json:

```json
{
    "server": {
        "grpc_port": 25001,
        "http_port": 8022,
        "hostname": "mac-mini"
    },
    "client": [
        {
            "grpc_port": 25002,
            "hostname": "m3pro"
        },
        {
            "grpc_port": 25003,
            "hostname": "m3"
        }
    ]
}
```

  • The number of clients determines how many parts the model is split into.
  • server.grpc_port: the server's gRPC port; each client sends status data to it, and the last client sends back the computed result.
  • server.http_port: the server's HTTP port, which serves the API as well as the WebSocket service.
  • server.hostname: the server's hostname; an IP such as 192.168.1.10 also works. Make sure the clients can reach it.
  • client.grpc_port: the client's gRPC port.
  • client.hostname: the client's hostname; make sure the server and the other clients can reach it.
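
For example, a two-client deployment addressed by IP rather than hostname might look like the following (the IPs are placeholders; adjust them to your network). Because this config lists two clients, the model is split into two parts:

```json
{
    "server": {
        "grpc_port": 25001,
        "http_port": 8022,
        "hostname": "192.168.1.10"
    },
    "client": [
        {
            "grpc_port": 25002,
            "hostname": "192.168.1.11"
        },
        {
            "grpc_port": 25003,
            "hostname": "192.168.1.12"
        }
    ]
}
```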

Features

  • Supports multiple concurrent requests
  • Engine
    • mlx
    • torch
  • Communication
    • gRPC
    • Automatic node discovery
      • Simple IP lookup
      • Ping test
  • Attention
    • xformers
    • flash-attn
    • Prefill-Cache (Token-Level)
    • PagedAttention

Performance

On a Mac Mini M4:

| Device | mlx-community/Llama-3.2-1B-Instruct-4bit | mlx-community/Llama-3.2-1B-Instruct | mlx-community/Meta-Llama-3.1-8B-Instruct-4bit | mlx-community/Meta-Llama-3.1-8B-Instruct-bf16 |
| --- | --- | --- | --- | --- |
| Mac Mini M4 (16G) (local) | 45.36 tok/s | 23.60 tok/s | 15.80 tok/s | Out of memory |
| Mac Mini M4 (16G) + M3 Pro (18G) | 16.33 tok/s | 11.06 tok/s | 5.64 tok/s | |