
Training error. Parameters follow an online video tutorial and the toolchain is complete — please help me figure out the problem, thanks #6733

Closed
1 task done
sgg-comeon opened this issue Jan 22, 2025 · 6 comments
Labels
solved This problem has been already solved

Comments

@sgg-comeon

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Below are the parameters configured following the video:

[Screenshots: LLaMA-Factory WebUI training parameter settings]
Error message: training failed.

[WARNING|2025-01-21 11:43:17] logging.py:162 >> We recommend enable upcast_layernorm in quantized training.

[INFO|2025-01-21 11:43:17] parser.py:355 >> Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.bfloat16

CUDA is installed:

[Screenshot: CUDA installation check]

Reproduction

Put your message here.

Others

No response

@sgg-comeon sgg-comeon added bug Something isn't working pending This problem is yet to be addressed labels Jan 22, 2025
@hiyouga
Owner

hiyouga commented Jan 22, 2025

Please share the information displayed in your terminal

@sgg-comeon
Author

Please share the information displayed in your terminal

Error message:
/usr/local/lib/python3.8/dist-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: pytorch/vision#6753, and you can also check out pytorch/vision#7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
/usr/local/lib/python3.8/dist-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: pytorch/vision#6753, and you can also check out pytorch/vision#7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
warnings.warn(_BETA_TRANSFORMS_WARNING)
[WARNING|2025-01-22 02:32:17] llamafactory.hparams.parser:162 >> We recommend enable upcast_layernorm in quantized training.
[INFO|2025-01-22 02:32:17] llamafactory.hparams.parser:355 >> Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.bfloat16
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 199, in _new_conn
sock = connection.create_connection(
File "/usr/local/lib/python3.8/dist-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "/usr/local/lib/python3.8/dist-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
socket.timeout: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 789, in urlopen
response = self._make_request(
File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 490, in _make_request
raise new_e
File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 466, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 693, in connect
self.sock = sock = self._new_conn()
File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 208, in _new_conn
raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x...>, 'Connection to huggingface.co timed out. (connect timeout=10)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 667, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 843, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.8/dist-packages/urllib3/util/retry.py", line 519, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /Qwen/Qwen2.5-7B-Instruct/resolve/main/config.json (Caused by ConnectTimeoutError(...))

OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like Qwen/Qwen2.5-7B-Instruct is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

LLaMA-Factory is deployed on a server. Even with a proxy enabled on my local machine, it still cannot reach Hugging Face.
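Since the server cannot reach huggingface.co at all, one workaround is the offline mode that the error message itself points to — a sketch, assuming the model files were downloaded to the local cache beforehand (the directory path below is a hypothetical placeholder, not from this issue):

```shell
# Force the Hugging Face libraries to read only from the local cache,
# making no network requests at all. Assumes the model was downloaded earlier.
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1

# Alternatively, pass a local directory instead of the hub ID
# (placeholder path -- substitute wherever you actually stored the weights):
# llamafactory-cli train --model_name_or_path /data/models/Qwen2.5-7B-Instruct ...
```

Either way, transformers then resolves `Qwen/Qwen2.5-7B-Instruct` without touching the network, so the connect timeout never occurs.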

@pie2cookie

Before connecting to Hugging Face, exec
export HF_ENDPOINT="https://hf-mirror.com"

@hiyouga
Owner

hiyouga commented Jan 22, 2025

Try hf-mirror, or ModelScope via USE_MODELSCOPE_HUB=1
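Both suggestions above boil down to environment variables exported before launching training (values are exactly those given in the comments; whether each takes effect depends on the LLaMA-Factory and huggingface_hub versions in use):

```shell
# Option 1: keep using the Hugging Face hub, but route requests through
# the hf-mirror proxy, which is reachable without a VPN.
export HF_ENDPOINT="https://hf-mirror.com"

# Option 2: download models from ModelScope instead of Hugging Face.
export USE_MODELSCOPE_HUB=1
```

Note that the variables must be set in the same shell session (or service environment) that launches `llamafactory-cli`, otherwise the training process never sees them.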

@hiyouga hiyouga closed this as completed Jan 22, 2025
@hiyouga hiyouga added solved This problem has been already solved and removed bug Something isn't working pending This problem is yet to be addressed labels Jan 22, 2025
@sgg-comeon
Author

try hf-mirror or modelscope USE_MODELSCOPE_HUB=1
My GPU server: Ubuntu 20.04, CUDA installed from cuda_12.2.0_535.54.03_linux.run. The error says the libcusparse.so.11 library is missing, although libcusparse.so.12 can be found. I have uninstalled and reinstalled CUDA, but the error persists. Any guidance appreciated, thanks.
===================================BUG REPORT===================================
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

warn(msg)

The following directories listed in your path were found to be non-existent: {PosixPath('https'), PosixPath('//hf-mirror.com')}
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
DEBUG: Possible options found for libcudart.so: {PosixPath('/usr/local/cuda/lib64/libcudart.so')}
CUDA SETUP: PyTorch settings found: CUDA_VERSION=118, Highest Compute Capability: 8.6.
CUDA SETUP: To manually override the PyTorch CUDA version please see:https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
libcusparse.so.11: cannot open shared object file: No such file or directory
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=118 make cuda11x
python setup.py install
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 185, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/usr/lib/python3.8/runpy.py", line 144, in _get_module_details
return _get_module_details(pkg_main_name, error)
File "/usr/lib/python3.8/runpy.py", line 111, in _get_module_details
__import__(pkg_name)
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/__init__.py", line 6, in <module>
from . import cuda_setup, utils, research
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/__init__.py", line 1, in <module>
from . import nn
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
from .modules import LinearFP8Mixed, LinearFP8Global
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
from bitsandbytes.optim import GlobalOptimManager
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
from bitsandbytes.cextension import COMPILED_WITH_CUDA
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py", line 20, in <module>
raise RuntimeError('''
RuntimeError:
CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

    Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
    to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
    and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

@cehao628

Your CUDA installation has a problem.
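More precisely, the log above shows a version mismatch rather than a broken install: the loaded binary libbitsandbytes_cuda118.so was built against CUDA 11.x and dlopens libcusparse.so.11, while a CUDA 12.2 install only ships libcusparse.so.12. A quick way to confirm this before reinstalling CUDA yet again — a sketch, where the paths are typical defaults and not guaranteed on every system:

```shell
# List every libcusparse the dynamic loader knows about. With only CUDA 12.2
# installed, expect to see libcusparse.so.12 and no libcusparse.so.11.
ldconfig -p | grep cusparse || true

# Make sure the CUDA libraries are on the loader path at all
# (/usr/local/cuda is the default symlink created by the runfile installer).
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"

# Rather than downgrading to CUDA 11.x, upgrading bitsandbytes to a release
# that ships CUDA 12 binaries is usually the simpler fix (commented out here):
# pip install -U bitsandbytes
```

If the upgrade path is not viable, the compile-from-source instructions printed in the log (with CUDA_VERSION matching the installed toolkit) are the other route.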

4 participants