
Training error. Parameters follow an online video tutorial and the toolchain is complete — please help me figure out the problem, thanks #6733

Closed
1 task done
sgg-comeon opened this issue Jan 22, 2025 · 6 comments
Labels
solved This problem has been already solved

Comments

@sgg-comeon

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Below are the parameters configured following the video:

[Screenshots: LLaMA-Factory WebUI training parameter settings]
Error message: training failed.

[WARNING|2025-01-21 11:43:17] logging.py:162 >> We recommend enable upcast_layernorm in quantized training.

[INFO|2025-01-21 11:43:17] parser.py:355 >> Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.bfloat16

CUDA is installed:

[Screenshot: CUDA installation check]

Reproduction

Put your message here.

Others

No response

@sgg-comeon sgg-comeon added bug Something isn't working pending This problem is yet to be addressed labels Jan 22, 2025
@hiyouga
Owner

hiyouga commented Jan 22, 2025

Please share the information displayed in your terminal

@sgg-comeon
Author

Please share the information displayed in your terminal

Error message:
/usr/local/lib/python3.8/dist-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: pytorch/vision#6753, and you can also check out pytorch/vision#7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
/usr/local/lib/python3.8/dist-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: pytorch/vision#6753, and you can also check out pytorch/vision#7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
[WARNING|2025-01-22 02:32:17] llamafactory.hparams.parser:162 >> We recommend enable upcast_layernorm in quantized training.
[INFO|2025-01-22 02:32:17] llamafactory.hparams.parser:355 >> Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.bfloat16
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 199, in _new_conn
sock = connection.create_connection(
File "/usr/local/lib/python3.8/dist-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "/usr/local/lib/python3.8/dist-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
socket.timeout: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 789, in urlopen
response = self._make_request(
File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 490, in _make_request
raise new_e
File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 466, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 693, in connect
self.sock = sock = self._new_conn()
File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 208, in _new_conn
raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x...>, 'Connection to huggingface.co timed out. (connect timeout=10)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 667, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 843, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.8/dist-packages/urllib3/util/retry.py", line 519, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /Qwen/Qwen2.5-7B-Instruct/resolve/main/config.json (Caused by ConnectTimeoutError(...))

OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like Qwen/Qwen2.5-7B-Instruct is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

LLaMA-Factory is deployed on a server. Even with a proxy enabled on my local machine, it still cannot reach Hugging Face.
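Since the server cannot reach huggingface.co at all, one workaround is the offline mode that the error message itself points to — a sketch, assuming the model files were downloaded to the local cache beforehand (the directory path below is a hypothetical placeholder, not from this issue):

```shell
# Force the Hugging Face libraries to read only from the local cache,
# making no network requests at all. Assumes the model was downloaded earlier.
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1

# Alternatively, pass a local directory instead of the hub ID
# (placeholder path -- substitute wherever you actually stored the weights):
# llamafactory-cli train --model_name_or_path /data/models/Qwen2.5-7B-Instruct ...
```

Either way, transformers then resolves `Qwen/Qwen2.5-7B-Instruct` without touching the network, so the connect timeout never occurs.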

@pie2cookie

Before connecting to Hugging Face, exec
export HF_ENDPOINT="https://hf-mirror.com"

@hiyouga
Owner

hiyouga commented Jan 22, 2025

Try hf-mirror, or ModelScope via USE_MODELSCOPE_HUB=1
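Both suggestions above boil down to environment variables exported before launching training (values are exactly those given in the comments; whether each takes effect depends on the LLaMA-Factory and huggingface_hub versions in use):

```shell
# Option 1: keep using the Hugging Face hub, but route requests through
# the hf-mirror proxy, which is reachable without a VPN.
export HF_ENDPOINT="https://hf-mirror.com"

# Option 2: download models from ModelScope instead of Hugging Face.
export USE_MODELSCOPE_HUB=1
```

Note that the variables must be set in the same shell session (or service environment) that launches `llamafactory-cli`, otherwise the training process never sees them.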

@hiyouga hiyouga closed this as completed Jan 22, 2025
@hiyouga hiyouga added solved This problem has been already solved and removed bug Something isn't working pending This problem is yet to be addressed labels Jan 22, 2025
@sgg-comeon
Author

try hf-mirror or modelscope USE_MODELSCOPE_HUB=1
My GPU server: Ubuntu 20.04, CUDA installed from cuda_12.2.0_535.54.03_linux.run. The error says the libcusparse.so.11 library is missing, although libcusparse.so.12 can be found. I have uninstalled and reinstalled CUDA, but the error persists. Any guidance appreciated, thanks.
===================================BUG REPORT===================================
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

warn(msg)

The following directories listed in your path were found to be non-existent: {PosixPath('https'), PosixPath('//hf-mirror.com')}
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
DEBUG: Possible options found for libcudart.so: {PosixPath('/usr/local/cuda/lib64/libcudart.so')}
CUDA SETUP: PyTorch settings found: CUDA_VERSION=118, Highest Compute Capability: 8.6.
CUDA SETUP: To manually override the PyTorch CUDA version please see:https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
libcusparse.so.11: cannot open shared object file: No such file or directory
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=118 make cuda11x
python setup.py install
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 185, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/usr/lib/python3.8/runpy.py", line 144, in _get_module_details
return _get_module_details(pkg_main_name, error)
File "/usr/lib/python3.8/runpy.py", line 111, in _get_module_details
__import__(pkg_name)
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/__init__.py", line 6, in <module>
from . import cuda_setup, utils, research
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/__init__.py", line 1, in <module>
from . import nn
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
from .modules import LinearFP8Mixed, LinearFP8Global
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
from bitsandbytes.optim import GlobalOptimManager
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
from bitsandbytes.cextension import COMPILED_WITH_CUDA
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py", line 20, in <module>
raise RuntimeError('''
RuntimeError:
CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

    Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
    to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
    and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

@cehao628

Your CUDA installation has a problem.
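More precisely, the log above shows a version mismatch rather than a broken install: the loaded binary libbitsandbytes_cuda118.so was built against CUDA 11.x and dlopens libcusparse.so.11, while a CUDA 12.2 install only ships libcusparse.so.12. A quick way to confirm this before reinstalling CUDA yet again — a sketch, where the paths are typical defaults and not guaranteed on every system:

```shell
# List every libcusparse the dynamic loader knows about. With only CUDA 12.2
# installed, expect to see libcusparse.so.12 and no libcusparse.so.11.
ldconfig -p | grep cusparse || true

# Make sure the CUDA libraries are on the loader path at all
# (/usr/local/cuda is the default symlink created by the runfile installer).
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"

# Rather than downgrading to CUDA 11.x, upgrading bitsandbytes to a release
# that ships CUDA 12 binaries is usually the simpler fix (commented out here):
# pip install -U bitsandbytes
```

If the upgrade path is not viable, the compile-from-source instructions printed in the log (with CUDA_VERSION matching the installed toolkit) are the other route.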

4 participants