
Can we train or test on single GPU in detection sections? #24

Open
nestor0003 opened this issue Nov 19, 2021 · 1 comment


nestor0003 commented Nov 19, 2021

To test the detection task on a single GPU, can we just use the shell script, e.g. 'bash dist_test.sh configs/retinanet_alt_gvt_s_fpn_1x_coco_pvt_setting.py checkpoint_file 1 --eval mAP'?

Or do we need to change the lr and the number of workers?
I'm a beginner with the mmdet framework, please help...
These are the error lines:

/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from os.environ('LOCAL_RANK') instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : ./test.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_o5bp99y9/none_u2fqutod
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_o5bp99y9/none_u2fqutod/attempt_0/0/error.json
loading annotations into memory...
Done (t=0.52s)
creating index...
index created!
Traceback (most recent call last):
  File "./test.py", line 213, in <module>
    main()
  File "./test.py", line 166, in main
    model = build_detector(cfg.model, train_cfg=None, test_cfg=cfg.test_cfg)
  File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmdet/models/builder.py", line 67, in build_detector
    return build(cfg, DETECTORS, dict(train_cfg=train_cfg, test_cfg=test_cfg))
  File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmdet/models/builder.py", line 32, in build
    return build_from_cfg(cfg, registry, default_args)
  File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmcv/utils/registry.py", line 171, in build_from_cfg
    return obj_cls(**args)
  File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmdet/models/detectors/retinanet.py", line 16, in __init__
    super(RetinaNet, self).__init__(backbone, neck, bbox_head, train_cfg,
  File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmdet/models/detectors/single_stage.py", line 25, in __init__
    self.backbone = build_backbone(backbone)
  File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmdet/models/builder.py", line 37, in build_backbone
    return build(cfg, BACKBONES)
  File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmdet/models/builder.py", line 32, in build
    return build_from_cfg(cfg, registry, default_args)
  File "/home/user/miniconda3/envs/twins/lib/python3.8/site-packages/mmcv/utils/registry.py", line 171, in build_from_cfg
    return obj_cls(**args)
  File "/home/user/project/Twins/detection/gvt.py", line 482, in __init__
    super(alt_gvt_small, self).__init__(
  File "/home/user/project/Twins/detection/gvt.py", line 419, in __init__
    super(ALTGVT, self).__init__(img_size, patch_size, in_chans, num_classes, embed_dims, num_heads,
  File "/home/user/project/Twins/detection/gvt.py", line 408, in __init__
    super(PCPVT, self).__init__(img_size, patch_size, in_chans, num_classes, embed_dims, num_heads,
  File "/home/user/project/Twins/detection/gvt.py", line 343, in __init__
    super(CPVTV2, self).__init__(img_size, patch_size, in_chans, num_classes, embed_dims, num_heads, mlp_ratios,
  File "/home/user/project/Twins/detection/gvt.py", line 234, in __init__
    _block = nn.ModuleList([block_cls(
  File "/home/user/project/Twins/detection/gvt.py", line 234, in <listcomp>
    _block = nn.ModuleList([block_cls(
  File "/home/user/project/Twins/detection/gvt.py", line 164, in __init__
    super(GroupBlock, self).__init__(dim, num_heads, mlp_ratio, qkv_bias, qk_scale, drop, attn_drop,
TypeError: __init__() takes from 3 to 10 positional arguments but 11 were given
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11449) of binary: /home/user/miniconda3/envs/twins/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_o5bp99y9/none_u2fqutod/attempt_1/0/error.json
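The log suggests the single-GPU setting is not what failed: the launcher started one worker (nproc_per_node : 1, global_world_sizes=[1]) and the crash happens while constructing the backbone. One plausible cause, not confirmed in this thread, is a dependency version mismatch. If `GroupBlock` in gvt.py subclasses a `Block` class from an installed library (possibly timm; this is an assumption) whose newer release dropped a positional parameter such as `qk_scale`, then forwarding every argument positionally overflows the parent signature. The pure-Python sketch below reproduces the same `TypeError` with hypothetical stand-in classes (`Block`, `GroupBlockPositional`, `GroupBlockKeyword`); it is not the repository's code.

```python
class Block:
    """Hypothetical stand-in for a library base class whose newer release
    no longer accepts `qk_scale` (an assumption for illustration)."""

    def __init__(self, dim, num_heads, mlp_ratio=4.0, qkv_bias=False,
                 drop=0.0, attn_drop=0.0, drop_path=0.0,
                 act_layer=None, norm_layer=None):
        self.dim = dim
        self.num_heads = num_heads


class GroupBlockPositional(Block):
    """Subclass written against an older signature that still had
    `qk_scale`: it forwards 10 positional arguments (11 counting self)."""

    def __init__(self, dim, num_heads, mlp_ratio=4.0, qkv_bias=False,
                 qk_scale=None, drop=0.0, attn_drop=0.0, drop_path=0.0,
                 act_layer=None, norm_layer=None):
        super().__init__(dim, num_heads, mlp_ratio, qkv_bias, qk_scale,
                         drop, attn_drop, drop_path, act_layer, norm_layer)


class GroupBlockKeyword(Block):
    """Same subclass, but forwards by keyword and drops the parameter the
    hypothetical base class no longer accepts."""

    def __init__(self, dim, num_heads, mlp_ratio=4.0, qkv_bias=False,
                 qk_scale=None, drop=0.0, attn_drop=0.0, drop_path=0.0,
                 act_layer=None, norm_layer=None):
        super().__init__(dim, num_heads, mlp_ratio=mlp_ratio,
                         qkv_bias=qkv_bias, drop=drop, attn_drop=attn_drop,
                         drop_path=drop_path, act_layer=act_layer,
                         norm_layer=norm_layer)


def build(cls):
    """Return 'ok' if construction succeeds, else the TypeError message."""
    try:
        cls(64, 2)
        return "ok"
    except TypeError as exc:
        return str(exc)


if __name__ == "__main__":
    print(build(GroupBlockPositional))  # reproduces the positional-count error
    print(build(GroupBlockKeyword))     # constructs successfully
```

If this is the cause, the usual remedies are installing the dependency version the repository was developed against, or forwarding arguments by keyword as in `GroupBlockKeyword`. Separately, in mmdet 2.x the COCO dataset evaluation accepts metrics such as `bbox` and `segm`, so `--eval mAP` may be rejected once the model builds; that is worth checking too.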

nestor0003 changed the title from "Can we train or test on 1 GPUs in detection sections?" to "Can we train or test on single GPU in detection sections?" on Nov 19, 2021
cxxgtxy (Collaborator) commented Nov 19, 2021

We suggest using at least 4 GPUs to train.

We still have some intern offers. If you are interested, please send your CV to chuxiangxiang@meituan.com. Thanks
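On the learning-rate part of the question: for testing alone the learning rate is not used, so only training is affected by the GPU count. When training on fewer GPUs than a config was tuned for, a common heuristic (the linear scaling rule, not something the maintainers state here) is to scale the learning rate in proportion to the total batch size, i.e. GPUs times images per GPU. In mmdet 2.x configs the per-GPU batch is usually `data.samples_per_gpu`, while `data.workers_per_gpu` only sets the number of DataLoader worker processes and does not change the batch size. The sketch below uses placeholder reference values (lr 1e-4 for 8 GPUs x 2 images), which are assumptions rather than the Twins config.

```python
def scale_lr(base_lr, base_gpus, base_imgs_per_gpu, gpus, imgs_per_gpu):
    """Linear scaling heuristic: keep lr / total_batch_size constant."""
    if min(base_gpus, base_imgs_per_gpu, gpus, imgs_per_gpu) <= 0:
        raise ValueError("GPU counts and images per GPU must be positive")
    base_batch = base_gpus * base_imgs_per_gpu  # batch the lr was tuned for
    new_batch = gpus * imgs_per_gpu             # batch you will actually use
    return base_lr * new_batch / base_batch


if __name__ == "__main__":
    # Placeholder reference setting (assumption): lr 1e-4 tuned for 8 GPUs x 2 images.
    BASE_LR, BASE_GPUS, IMGS_PER_GPU = 1e-4, 8, 2
    for gpus in (1, 4, 8):
        lr = scale_lr(BASE_LR, BASE_GPUS, IMGS_PER_GPU, gpus, IMGS_PER_GPU)
        print(f"{gpus} GPU(s) x {IMGS_PER_GPU} imgs -> lr = {lr:.2e}")
```

Linear scaling is only a starting point; with a very small total batch (one GPU), results may still fall short of the multi-GPU numbers, which is consistent with the suggestion above to use at least 4 GPUs.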
