Sunday, November 12, 2017

14.04 - Error when loading NVIDIA driver and CUDA after reinstallation

I have a computer with:

  • System: Ubuntu 14.04
  • GPU: NVIDIA GTX 1080 Ti






Around one year ago I installed the system and then installed CUDA 8.0 with the NVIDIA drivers on this computer. The GPU and CUDA had been working correctly until today, when I tried to install a newer version of CUDA.




For various reasons, I tried to replace the installed CUDA 8.0 with CUDA 10.0. First I uninstalled the old drivers using nvidia-uninstall, and then uninstalled the old CUDA using /usr/local/cuda-8.0/bin/uninstall_cuda_8.0.pl. After that I installed CUDA 10.0 along with the new driver, using the runfile installer downloaded from this page. However, the installation failed. After several unsuccessful debugging attempts, I gave up, uninstalled the new drivers and the new CUDA, and reinstalled CUDA 8.0 with the runfile installer downloaded from this page. The installation was successful, but I can no longer run anything that uses CUDA, including pycuda, pyopencl and tensorflow. All of these packages report that they cannot find a GPU device.
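Because each runfile installer bundles its own driver, one thing worth confirming after this kind of reinstall is which driver the kernel has actually loaded, and whether it meets the minimum for the toolkit. The following is a minimal sketch rather than a definitive check. It assumes the driver exposes /proc/driver/nvidia/version, the helper names are hypothetical, and the minimum versions in MIN_DRIVER are assumptions that should be verified against the CUDA release notes.

```python
import re

# Assumed minimum Linux driver versions per toolkit; verify these against
# the CUDA release notes before relying on them.
MIN_DRIVER = {"8.0": (375, 26), "10.0": (410, 48)}

PROC_VERSION = "/proc/driver/nvidia/version"


def parse_driver_version(text):
    """Return the driver version as a tuple of ints, or None if not found."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def loaded_driver_version(path=PROC_VERSION):
    """Read the version of the currently loaded NVIDIA kernel module."""
    try:
        with open(path) as handle:
            return parse_driver_version(handle.read())
    except IOError:  # the file is absent when the module is not loaded
        return None


def driver_supports(toolkit, driver):
    """True if `driver` meets the assumed minimum for `toolkit`."""
    return driver is not None and driver >= MIN_DRIVER[toolkit]


if __name__ == "__main__":
    version = loaded_driver_version()
    print("loaded driver: %s" % (version,))
    for toolkit in sorted(MIN_DRIVER):
        print("CUDA %s ok: %s" % (toolkit, driver_supports(toolkit, version)))
```

This only compares the driver against the toolkit. Whether a given driver release also supports a particular GPU model is a separate question that the sketch does not answer.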






Update:



I have tried to uninstall all the NVIDIA components with sudo apt-get --purge remove nvidia-*, as well as nvidia-uninstall and uninstall_cuda_8.0.pl, but the problem remains. The error reports and the system logs have changed, however. Here are the current system logs:



In the Python CLI, pycuda fails:



Python 2.7.6 (default, Nov 23 2017, 15:49:48) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda.driver as cuda
>>> import pycuda.autoinit

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
>>>
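To rule out pycuda itself, the same driver call can be made directly through ctypes. This is a minimal sketch, assuming the driver library is installed as libcuda.so.1. The helper names are hypothetical, and the table lists only a few CUresult codes from cuda.h.

```python
import ctypes

# A small subset of the CUresult codes defined in cuda.h.
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    3: "CUDA_ERROR_NOT_INITIALIZED",
    100: "CUDA_ERROR_NO_DEVICE",
    101: "CUDA_ERROR_INVALID_DEVICE",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe_cu_result(code):
    """Map a numeric CUresult to its name, falling back to the raw number."""
    return CU_RESULT_NAMES.get(code, "CUresult(%d)" % code)


def probe_driver(library="libcuda.so.1"):
    """Call cuInit and cuDeviceGetCount without going through pycuda."""
    try:
        libcuda = ctypes.CDLL(library)
    except OSError as exc:
        return "cannot load %s: %s" % (library, exc)
    code = libcuda.cuInit(0)  # the flags argument must be 0
    if code != 0:
        return "cuInit failed: " + describe_cu_result(code)
    count = ctypes.c_int(0)
    code = libcuda.cuDeviceGetCount(ctypes.byref(count))
    if code != 0:
        return "cuDeviceGetCount failed: " + describe_cu_result(code)
    return "driver API sees %d device(s)" % count.value


if __name__ == "__main__":
    print(probe_driver())
```

If this also reports CUDA_ERROR_NO_DEVICE, the failure is in the driver layer rather than in the Python packages.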


nvidia-smi reports:




+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  ERR!                Off  | 0000:01:00.0      On |                  N/A |
| 28%   52C    P8    15W / 300W |     43MiB / 11168MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1868    G   /usr/lib/xorg/Xorg                              40MiB |
+-----------------------------------------------------------------------------+


dmesg | grep nvidia reports:




[    2.370841] nvidia: loading out-of-tree module taints kernel.
[    2.370844] nvidia: module license 'NVIDIA' taints kernel.
[    2.374116] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    2.380809] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[    2.383631] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 375.26 Thu Dec 8 18:04:14 PST 2016
[    2.385803] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    2.717844] init: nvidia-prime main process (1094) terminated with status 127
[    7.447032] nvidia-modeset: Allocated GPU:0 (GPU-3727ccd9-f1fc-78c9-f908-5e1edf205194) @ PCI:0000:01:00.0
[   72.737634] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 241
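The two details in this log that seem most relevant are the driver version the kernel actually loaded and the init job that exited with a non-zero status. A minimal sketch that pulls both out of dmesg-style lines is shown here. The function name and regular expressions are hypothetical and are matched only against the message formats visible in the log above.

```python
import re

# "[   72.737634] message" -> timestamp and message
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
# Version announced by the nvidia-modeset module when it loads
VERSION_RE = re.compile(
    r"NVIDIA Kernel Mode Setting Driver for UNIX platforms\s+(\d+(?:\.\d+)+)"
)
# Upstart-style report of a job that exited
EXIT_RE = re.compile(r"init: (\S+) main process \(\d+\) terminated with status (\d+)")


def parse_dmesg(lines):
    """Return (driver_version, [(timestamp, job, status), ...]) for failed jobs."""
    version = None
    failed_jobs = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a timestamped kernel log line
        timestamp, message = float(match.group(1)), match.group(2)
        found = VERSION_RE.search(message)
        if found:
            version = found.group(1)
        exited = EXIT_RE.search(message)
        if exited and int(exited.group(2)) != 0:
            failed_jobs.append((timestamp, exited.group(1), int(exited.group(2))))
    return version, failed_jobs


if __name__ == "__main__":
    import sys
    # Usage: dmesg | python this_script.py
    print(parse_dmesg(sys.stdin))
```

Applied to the lines above, this reports driver 375.26 and one failed job, nvidia-prime, with status 127. In a shell, exit status 127 conventionally means a command was not found.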



nvidia-smi -a reports the following (note that the Product Name field reads Unknown Error):



==============NVSMI LOG==============

Timestamp : Thu Sep 27 10:16:41 2018
Driver Version : 375.26

Attached GPUs : 1
GPU 0000:01:00.0
    Product Name : Unknown Error
    Product Brand : GeForce
    Display Mode : Enabled
    Display Active : Enabled
    Persistence Mode : Disabled
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 1920
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : GPU-3727ccd9-f1fc-78c9-f908-5e1edf205194
    Minor Number : 0
    VBIOS Version : 86.02.40.00.2E
    MultiGPU Board : No
    Board ID : 0x100
    GPU Part Number : N/A
    Inforom Version
        Image Version : G001.0000.01.04
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU Virtualization Mode
        Virtualization mode : None
    PCI
        Bus : 0x01
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1B0610DE
        Bus Id : 0000:01:00.0
        Sub System Id : 0x11117377
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 1
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays since reset : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : 0 %
    Performance State : P8
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        Sync Boost : Not Active
        Unknown : Not Active
    FB Memory Usage
        Total : 11168 MiB
        Used : 43 MiB
        Free : 11125 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 5 MiB
        Free : 251 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 2 %
        Encoder : 0 %
        Decoder : 0 %
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                Total : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending : N/A
    Temperature
        GPU Current Temp : 43 C
        GPU Shutdown Temp : 96 C
        GPU Slowdown Temp : 93 C
    Power Readings
        Power Management : Supported
        Power Draw : 14.68 W
        Power Limit : 300.00 W
        Default Power Limit : 300.00 W
        Enforced Power Limit : 300.00 W
        Min Power Limit : 125.00 W
        Max Power Limit : 330.00 W
    Clocks
        Graphics : 240 MHz
        SM : 240 MHz
        Memory : 405 MHz
        Video : 544 MHz
    Applications Clocks
        Graphics : N/A
        Memory : N/A
    Default Applications Clocks
        Graphics : N/A
        Memory : N/A
    Max Clocks
        Graphics : 1999 MHz
        SM : 1999 MHz
        Memory : 5505 MHz
        Video : 1708 MHz
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Processes
        Process ID : 1868
            Type : G
            Name : /usr/lib/xorg/Xorg
            Used GPU Memory : 40 MiB
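Since the report above is long, the fields that carry an error marker can be picked out mechanically. This is a minimal sketch with a hypothetical helper name. It assumes every field is printed as "key : value" on one line, as in the output above, and it treats the strings "Unknown Error" and "ERR!" as the markers of interest because those are the ones nvidia-smi printed here.

```python
import subprocess


def find_error_fields(report, markers=("Unknown Error", "ERR!")):
    """Return (key, value) pairs whose value contains one of the markers."""
    hits = []
    for raw in report.splitlines():
        key, separator, value = raw.partition(" : ")
        if not separator:
            continue  # section header or blank line, no value to inspect
        if any(marker in value for marker in markers):
            hits.append((key.strip(), value.strip()))
    return hits


if __name__ == "__main__":
    try:
        output = subprocess.check_output(["nvidia-smi", "-a"])
    except (OSError, subprocess.CalledProcessError) as exc:
        print("could not run nvidia-smi: %s" % exc)
    else:
        for key, value in find_error_fields(output.decode("utf-8", "replace")):
            print("%s -> %s" % (key, value))
```

On the report above, this yields a single hit: Product Name -> Unknown Error.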






I can't figure out what is wrong or how to fix it. Could anyone help me?
