Linux Server Management Guide

Reset Linux

Download UltraISO and the Ubuntu release you need.

UltraISO download link: UltraISO - The Ultimate ISO CD/DVD Image Utility

Ubuntu download link: Download Ubuntu Desktop | Download | Ubuntu

Then burn the ISO image to a CD/DVD or USB drive.
Next, shut down the Linux server and reboot it.
While it reboots, press F2/F12 to enter the BIOS and choose the USB drive as the boot device.
Finally, start installing Ubuntu.
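Before burning the image, it may be worth checking that the ISO downloaded intact. This check is not part of the original steps. The sketch below uses a hypothetical helper, `verify_iso`, that compares a file's SHA-256 digest with an expected value, which you would copy from the SHA256SUMS file that Ubuntu publishes next to each image.

```shell
# Hypothetical helper (not from the original guide): compare a file's SHA-256
# digest with the digest you expect.
verify_iso() {
    iso_path=$1
    expected=$2
    # sha256sum prints "<digest>  <file>"; keep only the digest
    actual=$(sha256sum "$iso_path" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK"
    else
        echo "MISMATCH"
        return 1
    fi
}
```

Call it as `verify_iso path/to/ubuntu.iso <digest from SHA256SUMS>`; a non-zero exit status means the file should be downloaded again.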

Install ssh

sudo apt-get install openssh-server

Reset cuda&cudnn

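You will need to do this yourself. The guide below is one I found; you can follow it, but it may not work on your setup.

Ubuntu20.04下配置Nvidia显卡驱动+CUDA+CUDNN (Configuring the Nvidia GPU driver, CUDA and cuDNN on Ubuntu 20.04)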

How to save your time when you are a server administrator

Add new user

adduser user_name

Become a sudoer

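You must already be a sudoer (or root) to run the commands below, which make user_name a sudoer.

# add the user to the sudo group
sudo usermod -aG sudo user_name
# or edit /etc/sudoers as root (visudo checks the syntax before saving)
su
visudo
# add the line below
user_name ALL=(ALL:ALL) ALL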
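To confirm that the group change took effect, one option is to look at the user's group list. The sketch below uses a hypothetical helper, `is_sudoer`, and assumes sudo rights come from membership in the `sudo` group; it does not detect a per-user line added to /etc/sudoers.

```shell
# Hypothetical helper: succeed if the given user belongs to the sudo group.
is_sudoer() {
    # id -nG prints the user's group names separated by spaces;
    # put one name per line and look for an exact "sudo" entry
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qx 'sudo'
}
```

For example, `is_sudoer user_name && echo "user_name can use sudo"`. A user who is already logged in needs to log out and back in before the new group applies to their session.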

Open/Shut graphic interface

Shutting down the graphical interface frees the GPU memory it uses. (= =)

sudo init 3 # shut down the graphical interface
sudo init 5 # start the graphical interface

Mount disk

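I am not entirely sure about these commands because I last did this a year and a half ago, so here are some references.

ref: linux查看系统未被挂载的磁盘空间的方法 (How to view unmounted disk space on a Linux system)

ref: Linux检查未挂载磁盘和为磁盘分区、挂载 (Checking for unmounted disks on Linux, then partitioning and mounting them)

ref: Linux的硬盘使用情况、挂载、SSD挂载 (Disk usage, mounting, and SSD mounting on Linux)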

# 1. list the mounted filesystems
df -h
    Filesystem      Size  Used Avail Use% Mounted on
    tmpfs           6.3G  2.2M  6.3G   1% /run
    /dev/sda2       1.8T   38G  1.7T   3% /
    tmpfs            32G     0   32G   0% /dev/shm
    tmpfs           5.0M     0  5.0M   0% /run/lock
    /dev/sda1       511M  5.3M  506M   2% /boot/efi
    tmpfs           6.3G   80K  6.3G   1% /run/user/127
    tmpfs           6.3G   68K  6.3G   1% /run/user/1001
    tmpfs           6.3G   68K  6.3G   1% /run/user/1000
    /dev/sdc1       1.8T   24K  1.7T   1% /mnt/sdc
# 2. list all disks, including unmounted ones
lsblk
    sda      8:0    0   1.8T  0 disk
    ├─sda1   8:1    0   512M  0 part /boot/efi
    └─sda2   8:2    0   1.8T  0 part /
    sdb      8:16   0 223.6G  0 disk # unmounted disk
    ├─sdb1   8:17   0   3.8G  0 part # unmounted part
    ├─sdb2   8:18   0     1K  0 part # unmounted part
    ├─sdb5   8:21   0  30.5G  0 part # unmounted part
    ├─sdb6   8:22   0 132.5G  0 part # unmounted part
    └─sdb7   8:23   0  56.8G  0 part # unmounted part
    sdc      8:32   0   1.8T  0 disk # unmounted disk
    └─sdc1   8:33   0   1.8T  0 part # unmounted part
# 3. show the partition details
sudo fdisk -l | grep "sd[bc]"
    Disk /dev/sdb: 223.57 GiB, 240057409536 bytes, 468862128 sectors
    /dev/sdb1  *         2048   7999487   7997440   3.8G 83 Linux
    /dev/sdb2         8001534 468860927 460859394 219.8G  5 Extended
    /dev/sdb5         8001536  71999487  63997952  30.5G 82 Linux swap / Solaris
    /dev/sdb6        72001536 349802495 277800960 132.5G 83 Linux
    /dev/sdb7       349804544 468860927 119056384  56.8G 83 Linux
    Partition 2 does not start on physical sector boundary.
    Disk /dev/sdc: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
    /dev/sdc1        2048 3907028991 3907026944  1.8T 83 Linux
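# 4. mount (use the same mount point as in /etc/fstab below)
sudo mkdir -p /mnt/sdc
sudo mount /dev/sdc1 /mnt/sdc
# 5. mount automatically at boot
sudo vim /etc/fstab
# and add the lines below (each mount point must already exist)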
/dev/sdc1 /mnt/sdc auto defaults,nofail,comment=cloudconfig 0 2
/dev/sdb6 /mnt/sdb6 auto defaults,nofail,comment=cloudconfig 0 2
/dev/sdb7 /mnt/sdb7 auto defaults,nofail,comment=cloudconfig 0 2
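# 6. check that the fstab entries work
sudo mount -a
# 7. unmount a partition
sudo umount /dev/sdb1
# 8. create an ext4 filesystem on the whole disk (this erases everything on /dev/sdb)
sudo mkfs.ext4 /dev/sdb
    mke2fs 1.46.5 (30-Dec-2021)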
    Found a dos partition table in /dev/sdb
    Proceed anyway? (y,N) y
    Discarding device blocks: done
    Creating filesystem with 58607766 4k blocks and 14655488 inodes
    Filesystem UUID: 1e51f436-77d6-4dac-8807-937e9d566173
    Superblock backups stored on blocks:
            32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
            4096000, 7962624, 11239424, 20480000, 23887872

    Allocating group tables: done
    Writing inode tables: done
    Creating journal (262144 blocks): done
    Writing superblocks and filesystem accounting information: done
# 9. mount the new filesystem
sudo mkdir -p /mnt/sdb
sudo mount /dev/sdb /mnt/sdb
# 10. mount automatically at boot
sudo vim /etc/fstab
# and add the line below
/dev/sdb /mnt/sdb auto defaults,nofail,comment=cloudconfig 0 2
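Device names such as /dev/sdb can change between boots when disks are added or removed, so fstab entries are often written with `UUID=` instead. The sketch below is an alternative to the entries above, not the author's method. `fstab_entry` is a hypothetical helper that only prints a line, and the `defaults,nofail` options are an assumption carried over from the entries above.

```shell
# Hypothetical helper: print an /etc/fstab line that identifies the
# filesystem by UUID instead of by device name.
fstab_entry() {
    uuid=$1
    mount_point=$2
    fs_type=${3:-auto}   # defaults to the "auto" type used in the entries above
    printf 'UUID=%s %s %s defaults,nofail 0 2\n' "$uuid" "$mount_point" "$fs_type"
}

# Example with the UUID that mkfs.ext4 reported for /dev/sdb above
fstab_entry 1e51f436-77d6-4dac-8807-937e9d566173 /mnt/sdb ext4
```

On a live system the UUID can be read with `sudo blkid -s UUID -o value /dev/sdb`. After pasting the printed line into /etc/fstab, run `sudo mount -a` to check it.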

Check disk usage

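# disk usage of each user's home directory
sudo du -sh /home/*
# disk usage of the local filesystems
df -hl
# list all processes together with their owners
ps aux
# size of each item directly under the current directory
du -lh --max-depth=1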
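When a disk fills up, it is usually quicker to see the biggest directories first. The sketch below wraps `du` and `sort` for that purpose; `largest_entries` is a hypothetical helper name, and sizes are reported in KiB.

```shell
# Hypothetical helper: list the N largest entries directly under a directory,
# biggest first, with sizes in KiB.
largest_entries() {
    dir=${1:-.}
    count=${2:-10}
    # du -sk prints one "<KiB> <path>" line per entry; sort numerically, descending
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, `largest_entries /home 5` shows the five largest home directories. Run it from a root shell so that every user's files can be read.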

Some server-related trivia

3090ti*4

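This server has the most users (it outperforms the other servers). It is in the machine room on the 1st floor; you must apply to the machine-room staff before you can enter.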

1080ti*3

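This server is located in room 609.

When rebooting, you have to enter the BIOS manually; otherwise the system cannot reach the login interface and the screen shows garbled characters.

Tips: the boot disk is the Samsung disk; the other boot disks are probably wrong.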

3080ti*4

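This server is located in room 609.

After a reboot, someone has to press a key on the keyboard before the server can be used, so wait patiently for 5-10 minutes after rebooting and then press a key.

Tips: if you have time, try to fix this yourself so that the server boots into the system automatically after a reboot.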

3080ti*2

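This server has been repaired 3 times. It is located in room 609.

Its service provider is in Shanghai, so for repairs you need to contact the maintenance staff in Shanghai.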

titan*2

I reinstalled the system on this server myself. It is located in room 407; entering that lab requires a username and password.

2070*2

This server is in a faulty state and cannot connect to the network. It is located in room 609.

Location pictures

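Below is the location of the servers in N609.

Below is the location of the 3090 server.

1. It is in room N105 of the Information Engineering Building (信工楼), the machine room just around the right-hand corner after the entrance.
2. How to get in: find the teacher with glasses in N608 and ask him to open the door.
3. Exact location of the 3090, as shown in the picture below: go in through the door, and the first row on your right is where the servers are; then find network cabinet 25, which holds our server.
4. If you sometimes cannot connect, the server may have gone to sleep. Press the power button once (no need to hold it) to wake it up.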

Some useful commands to fix bugs

Could not ssh, maybe network problem

# 0. check ping and ssh to make sure the network works properly
ping server_ip_addr # if this fails, fix the network first
ssh username@server_ip_addr # if this fails, check ssh and ufw
# 1. check the firewall
sudo ufw status
# 2. allow ssh through the firewall
sudo ufw allow ssh
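The firewall check above can also be scripted. The sketch below reads `ufw status` output and reports whether ssh looks allowed. `ssh_allowed` is a hypothetical helper, and it assumes the plain (non-verbose) `ufw status` table in which an ssh rule appears as `22`, `22/tcp` or `OpenSSH`.

```shell
# Hypothetical helper: read `ufw status` output on stdin and succeed when the
# firewall is inactive or has an ALLOW rule for ssh.
ssh_allowed() {
    awk '
        /^Status: inactive/ { ok = 1 }
        ($1 == "22" || $1 == "22/tcp" || $1 == "OpenSSH") && $2 == "ALLOW" { ok = 1 }
        END { exit ok ? 0 : 1 }
    '
}
```

For example, `sudo ufw status | ssh_allowed || echo "ssh is blocked, run: sudo ufw allow ssh"`.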

How to fix GPU fan ERR!

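This problem has far too many possible causes. The advice is to reboot first; if that does not help, try other solutions or send the card back for repair.

# just one solution
sudo reboot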

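You are required to change your password immediately (password aged), You must change your password now and login again!

The cause of this problem is unclear. The fix is as follows (if you are not a server administrator, ask one to do this for you):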

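# 0. log in to your own account and change to a strong password; the page below can generate one
# you may need to try several times; use a password length of 10-15
https://tool.ip138.com/random/
# 1. once logged in, switch to the root account
su
# 2. set the maximum password age to 99999 days so the password no longer expires
chage -M 99999 username
# 3. change the user's strong password to another password
passwd username
# 4. switch back to your own account
su username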
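To check that the new maximum age was applied, `chage -l username` lists the account's aging settings. The sketch below picks out the expiry field from that listing. `password_expiry` is a hypothetical helper, and it assumes the usual English `Password expires : <value>` line.

```shell
# Hypothetical helper: read `chage -l username` output on stdin and print the
# value of the "Password expires" field.
password_expiry() {
    # fields are laid out as "<label><tabs>: <value>", so split on ": "
    awk -F': ' '/^Password expires/ { print $2 }'
}
```

For example, `sudo chage -l username | password_expiry`. After `chage -M 99999` the value would normally read `never` rather than a date.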

Failed to initialize NVML: Driver/library version mismatch

This is a version mismatch caused by the server updating itself automatically. The fix is as follows:

# check the version of the loaded NVIDIA kernel module
cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.129.03  Thu Oct 19 18:56:32 UTC 2023
GCC version:  gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.3)
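# the output above shows 535.129.03 = driver_version
modinfo nvidia
# the version line here usually reports a different driver version, which is what causes the problem;
# even if it is the same, there is no harm in removing everything and reinstalling
# remove the existing driver
sudo /usr/bin/nvidia-uninstall
sudo apt-get --purge remove 'nvidia-*'
sudo apt-get purge 'nvidia*'
sudo apt-get purge 'libnvidia*'
# check that no packages are left
dpkg --list | grep nvidia
# the fix is to find the matching driver on the official site
https://www.nvidia.cn/geforce/drivers/results/213200/
# after downloading, copy it to the server and make it executable
sudo chmod a+x NVIDIA-Linux-x86_64-driver_version.run
# reinstall, pressing Enter at every prompt
sudo ./NVIDIA-Linux-x86_64-driver_version.run
# the driver should then come back to life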
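The comparison between the loaded module and the installed one can be scripted before removing anything. The sketch below uses two hypothetical helpers, `loaded_nvidia_version` and `compare_versions`. The first assumes the `NVRM version: ... Kernel Module  <version>  <date>` layout shown in the output above and may need adjusting for other driver builds.

```shell
# Hypothetical helper: read /proc/driver/nvidia/version on stdin and print the
# version of the loaded kernel module.
loaded_nvidia_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Hypothetical helper: report whether two version strings are identical.
compare_versions() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: loaded=$1 installed=$2"
        return 1
    fi
}

# Intended use on the server (left commented out because it needs the driver):
# loaded=$(loaded_nvidia_version < /proc/driver/nvidia/version)
# installed=$(modinfo -F version nvidia)
# compare_versions "$loaded" "$installed"
```

A `mismatch` line shows that the module in memory and the module on disk differ. That is the situation the reinstall above is meant to clear; a plain reboot sometimes resolves it too.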

Update disable

# set every option in the file below to 0
sudo vim /etc/apt/apt.conf.d/10periodic
# also set every option in the file below to 0
sudo vim /etc/apt/apt.conf.d/20auto-upgrades
# then reboot
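The same edit can be made without opening an editor. The sketch below rewrites every `APT::Periodic::...` option in a file to `"0"`. `disable_apt_periodic` is a hypothetical helper, and it assumes one option per line in the form `APT::Periodic::Name "value";` and GNU sed.

```shell
# Hypothetical helper: set every APT::Periodic::* option in a config file to "0".
disable_apt_periodic() {
    conf_file=$1
    # rewrite lines such as: APT::Periodic::Unattended-Upgrade "1";
    sed -i -E 's/^(APT::Periodic::[A-Za-z-]+)[[:space:]]+"[^"]*";/\1 "0";/' "$conf_file"
}
```

From a root shell, call it once per file, for example `disable_apt_periodic /etc/apt/apt.conf.d/20auto-upgrades`. Keep a copy of the original file first in case you want automatic updates back.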

Unable to determine the device handle for GPU 0000:4C:00.0: Unknown Error

The cause of this problem is hard to pin down:

# check the GPU status
nvidia-debugdump --list
# the output is shown below
Found 2 NVIDIA devices
        Device ID:              0
        Device name:            NVIDIA TITAN Xp   (*PrimaryCard)
        GPU internal ID:        0323917060372

Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x1): Unknown Error

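Clearly GPU 1 has failed. The simplest approach is to disable the GPU whose Bus_Id is 0000:4C:00.0:

# drain puts a GPU out of service, -p specifies the GPU's Bus_Id, -m 1 turns drain mode on
sudo nvidia-smi drain -p 0000:4C:00.0 -m 1

This is only a stopgap, however, and does not fix the underlying problem.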
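The Bus_Id passed to `-p` comes from the error message itself. The sketch below pulls it out of the error text so it does not have to be copied by hand. `extract_bus_ids` is a hypothetical helper, and it assumes IDs in the `domain:bus:device.function` form seen in the error above.

```shell
# Hypothetical helper: print every PCI bus ID (domain:bus:device.function)
# found in the text read from stdin, one per line.
extract_bus_ids() {
    grep -oE '[0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f]'
}
```

For example, `nvidia-smi 2>&1 | extract_bus_ids` prints the ID of the GPU named in the error, which can then be passed to `nvidia-smi drain -p`.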
