Linux Server Management Guide

Reinstall Linux

  1. Download UltraISO and the Ubuntu release you need.

    UltraISO download link: UltraISO - The Ultimate ISO CD/DVD Image Utility

    Ubuntu download link: Download Ubuntu Desktop | Download | Ubuntu

  2. Burn the ISO image to a CD/DVD or USB drive.

  3. Shut down the Linux server and power it back on.

  4. While the server boots, press F2 or F12 to enter the BIOS (the exact key depends on the motherboard).

  5. Choose the USB drive as the boot device.

  6. Start installing Ubuntu.

Install ssh

sudo apt-get install openssh-server

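Reinstall CUDA & cuDNN

There is no tested procedure for this here, so you will have to work through it yourself. The guide below is a starting point, but it may not work for your setup.

Ubuntu20.04下配置Nvidia显卡驱动+CUDA+CUDNN (Configuring the Nvidia GPU driver, CUDA, and cuDNN on Ubuntu 20.04)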

How to save your time when you are a server administrator

Add new user

adduser user_name

Become a sudoer

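You must already have sudo or root privileges to grant them to another user.

# Option 1: add the user to the sudo group (sufficient on a default Ubuntu install)
sudo usermod -aG sudo user_name
# Option 2: add an explicit rule; visudo checks the syntax before saving
sudo visudo
# add the line below
user_name ALL=(ALL:ALL) ALL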
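
If you prefer not to edit /etc/sudoers by hand, the same rule can go into a drop-in file under /etc/sudoers.d. The sketch below is only one way to do it: the function names `sudoers_rule` and `install_sudoers_rule` are hypothetical, and it assumes the default Ubuntu layout in which /etc/sudoers includes /etc/sudoers.d.

```shell
# Print a sudoers rule for one user; reject anything that is not a plain login name.
sudoers_rule() {
    case "$1" in
        ''|*[!a-z0-9_-]*)
            echo "invalid user name: $1" >&2
            return 1
            ;;
    esac
    printf '%s ALL=(ALL:ALL) ALL\n' "$1"
}

# Hypothetical helper: syntax-check the rule, then install it as a drop-in file.
install_sudoers_rule() {
    tmp=$(mktemp) || return 1
    # visudo -c -f only checks the given file; it does not install anything
    if sudoers_rule "$1" > "$tmp" && sudo visudo -c -f "$tmp"; then
        sudo install -m 0440 -o root -g root "$tmp" "/etc/sudoers.d/$1"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

Calling `install_sudoers_rule user_name` would write /etc/sudoers.d/user_name only if the syntax check passes, which avoids locking yourself out with a broken sudoers file.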

Open/Shut graphic interface

Stopping the graphical interface frees the GPU memory it occupies.

sudo init 3 # stop the graphical interface
sudo init 5 # start the graphical interface
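
On a systemd-based Ubuntu, runlevels 3 and 5 correspond to multi-user.target and graphical.target. The sketch below shows the equivalent `systemctl isolate` call; the function names `runlevel_target` and `switch_runlevel` are made up for illustration.

```shell
# Map the two runlevels used above to their systemd targets.
runlevel_target() {
    case "$1" in
        3) echo multi-user.target ;;   # text mode, no graphical interface
        5) echo graphical.target ;;    # graphical interface
        *) echo "unsupported runlevel: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: switch to the target that matches the runlevel.
switch_runlevel() {
    target=$(runlevel_target "$1") || return 1
    sudo systemctl isolate "$target"
}
```

`switch_runlevel 3` would then stop the graphical interface, and `switch_runlevel 5` would start it again.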

Mount disk

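I last did this about a year and a half ago, so I am not fully sure the commands below are exact. The references below may help.

ref: linux查看系统未被挂载的磁盘空间的方法 (How to view unmounted disk space on Linux)

ref: Linux检查未挂载磁盘和为磁盘分区、挂载 (Checking unmounted disks, partitioning, and mounting on Linux)

ref: Linux的硬盘使用情况、挂载、SSD挂载 (Disk usage, mounting, and SSD mounting on Linux)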

# 1. List the mounted filesystems
df -h
    Filesystem      Size  Used Avail Use% Mounted on
    tmpfs           6.3G  2.2M  6.3G   1% /run
    /dev/sda2       1.8T   38G  1.7T   3% /
    tmpfs            32G     0   32G   0% /dev/shm
    tmpfs           5.0M     0  5.0M   0% /run/lock
    /dev/sda1       511M  5.3M  506M   2% /boot/efi
    tmpfs           6.3G   80K  6.3G   1% /run/user/127
    tmpfs           6.3G   68K  6.3G   1% /run/user/1001
    tmpfs           6.3G   68K  6.3G   1% /run/user/1000
    /dev/sdc1       1.8T   24K  1.7T   1% /mnt/sdc
# 2. List all disks, including unmounted ones
lsblk
    sda      8:0    0   1.8T  0 disk
    ├─sda1   8:1    0   512M  0 part /boot/efi
    └─sda2   8:2    0   1.8T  0 part /
    sdb      8:16   0 223.6G  0 disk # unmounted disk
    ├─sdb1   8:17   0   3.8G  0 part # unmounted part
    ├─sdb2   8:18   0     1K  0 part # unmounted part
    ├─sdb5   8:21   0  30.5G  0 part # unmounted part
    ├─sdb6   8:22   0 132.5G  0 part # unmounted part
    └─sdb7   8:23   0  56.8G  0 part # unmounted part
    sdc      8:32   0   1.8T  0 disk # unmounted disk
    └─sdc1   8:33   0   1.8T  0 part # unmounted part
# 3. Show the exact partition details
sudo fdisk -l | grep "sd[bc]"
    Disk /dev/sdb: 223.57 GiB, 240057409536 bytes, 468862128 sectors
    /dev/sdb1  *         2048   7999487   7997440   3.8G 83 Linux
    /dev/sdb2         8001534 468860927 460859394 219.8G  5 Extended
    /dev/sdb5         8001536  71999487  63997952  30.5G 82 Linux swap / Solaris
    /dev/sdb6        72001536 349802495 277800960 132.5G 83 Linux
    /dev/sdb7       349804544 468860927 119056384  56.8G 83 Linux
    Partition 2 does not start on physical sector boundary.
    Disk /dev/sdc: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
    /dev/sdc1        2048 3907028991 3907026944  1.8T 83 Linux
# 4. Mount the partition
sudo mkdir -p /mnt/sdc
sudo mount /dev/sdc1 /mnt/sdc
# 5. Mount automatically at boot
sudo vim /etc/fstab
# and add the lines below
/dev/sdc1 /mnt/sdc auto defaults,nofail,comment=cloudconfig 0 2
/dev/sdb6 /mnt/sdb6 auto defaults,nofail,comment=cloudconfig 0 2
/dev/sdb7 /mnt/sdb7 auto defaults,nofail,comment=cloudconfig 0 2
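# 6. Check that the fstab entries work
sudo mount -a
# 7. Unmount a partition
sudo umount /dev/sdb1
# 8. Create a filesystem (this erases everything on /dev/sdb, including its partitions)
sudo mkfs.ext4 /dev/sdb
    mke2fs 1.46.5 (30-Dec-2021)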
    Found a dos partition table in /dev/sdb
    Proceed anyway? (y,N) y
    Discarding device blocks: done
    Creating filesystem with 58607766 4k blocks and 14655488 inodes
    Filesystem UUID: 1e51f436-77d6-4dac-8807-937e9d566173
    Superblock backups stored on blocks:
            32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
            4096000, 7962624, 11239424, 20480000, 23887872

    Allocating group tables: done
    Writing inode tables: done
    Creating journal (262144 blocks): done
    Writing superblocks and filesystem accounting information: done
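# 9. Mount the new filesystem
sudo mkdir -p /mnt/sdb
sudo mount /dev/sdb /mnt/sdb
# 10. Mount automatically at boot
sudo vim /etc/fstab
# add the line below (the /dev/sdb6 and /dev/sdb7 partitions no longer exist after step 8)
/dev/sdb /mnt/sdb auto defaults,nofail,comment=cloudconfig 0 2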
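
Device names such as /dev/sdb can change between boots when disks are added or removed, so an fstab entry keyed on the filesystem UUID is usually more robust. The sketch below builds such a line; the function name `fstab_entry` is hypothetical, and the options are a simplified version of the ones used above.

```shell
# Build an fstab line that identifies the filesystem by UUID instead of /dev/sdX.
# Arguments: UUID, mount point, filesystem type.
fstab_entry() {
    printf 'UUID=%s %s %s defaults,nofail 0 2\n' "$1" "$2" "$3"
}

# Example (not run here): look up the UUID with blkid, then append the line.
#   uuid=$(sudo blkid -s UUID -o value /dev/sdb)
#   fstab_entry "$uuid" /mnt/sdb ext4 | sudo tee -a /etc/fstab
#   sudo mount -a
```

The UUID is also printed by mkfs.ext4 in step 8 ("Filesystem UUID: ..."), so it can be copied from there as well.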

Check disk usage

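# Show how much space each user occupies
sudo du -sh /home/*
# Show disk usage of local filesystems
df -hl
# Show the running processes of all users
ps aux
# Show the size of each item in the current folder
du -lh --max-depth=1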
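
When a disk fills up, it helps to see the largest directories first. The sketch below sorts the `du` output by size; the function name `top_dirs` is made up for illustration, and sizes are reported in KiB so that they sort numerically.

```shell
# List the largest entries directly under a directory, biggest first.
# Arguments: directory, number of entries to show (default 10).
top_dirs() {
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example (not run here): the five largest home directories.
#   sudo sh -c 'du -sk /home/* | sort -rn | head -n 5'
```

Run it as root (or through sudo, as in the commented example) when the directories belong to other users; otherwise unreadable folders are silently skipped.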

3090ti*4

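This is the most heavily used server because it outperforms the others. It is in the machine room on the 1st floor; you must apply to the machine-room staff before you can enter.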

1080ti*3

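This server is in room 609.

When rebooting, you must enter the BIOS manually; otherwise the system cannot reach the interface and the screen shows garbled text.

Tips: the boot disk is the Samsung one; other boot disks are probably the wrong choice.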

3080ti*4

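This server is in room 609.

After a reboot, someone has to press a key on the keyboard before the server becomes usable. Wait patiently for 5-10 minutes after rebooting, then press a key.

Tips: if you have time, fix this yourself so that the server boots into the system automatically after a reboot.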

3080ti*2

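This server has been repaired 3 times. It is in room 609.

The vendor is based in Shanghai, so any repair requires contacting the service staff in Shanghai.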

titan*2

I reinstalled the system on this server myself. It is in room 407; entering that lab requires a username and password.

2070*2

This server is in a faulty state and cannot connect to the network. It is in room 609.

Location pictures

The servers in N609 are located as shown below.

image-20230428131412923

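The 3090 server is located as described below.

1. It is in room N105 of the Information Engineering Building (信工楼); the machine room is right around the corner to the right of the entrance.
2. How to get in: find the teacher who wears glasses in N608 and ask him to open the door.
3. Exact location of the 3090 server, as shown in the figure below: after entering, the first row on your right-hand side holds the servers. Find network cabinet 25; that is where our server is.
4. If you sometimes cannot connect, the server may have gone to sleep. Press the power button once (no need to hold it) to wake it.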

Some useful commands to fix bugs

Cannot SSH in (possibly a network problem)

# 0. Use ping and ssh to check that the network works properly
ping server_ip_addr # if this fails, fix the network first
ssh username@server_ip_addr # if this fails, check ssh and ufw
# 1. Check the firewall
sudo ufw status
# 2. Allow SSH through the firewall
sudo ufw allow ssh
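
To tell quickly whether the firewall is the culprit, the output of `ufw status` can be scanned for an SSH rule. The sketch below is a rough check only: the function name `ufw_allows_ssh` is hypothetical, and it assumes SSH listens on the default port 22 and that rules appear as `22`, `22/tcp`, or `OpenSSH` in the first column.

```shell
# Read `ufw status` output on stdin.
# Succeeds if the firewall is inactive or has an ALLOW rule for SSH (port 22 assumed).
ufw_allows_ssh() {
    awk '
        /^Status: inactive/ { ok = 1 }   # firewall off: nothing is blocked
        ($1 == "22" || $1 == "22/tcp" || $1 == "OpenSSH") && $2 == "ALLOW" { ok = 1 }
        END { exit (ok ? 0 : 1) }
    '
}

# Example (not run here): open the port only when no rule exists yet.
#   sudo ufw status | ufw_allows_ssh || sudo ufw allow ssh
```

If the check passes and SSH still fails, the problem is more likely the network or the ssh service itself than ufw.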

How to fix GPU fan ERR!

This problem has too many possible causes. Reboot first; if that does not help, try other fixes or send the card back for repair.

# just one solution
sudo reboot

You are required to change your password immediately (password aged), You must change your password now and login again!

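The cause of this problem is unknown. The fix is as follows (if you do not administer the server, ask the administrator to do it):

# 0. Log in to your own account and change to a strong password; the page below generates strong passwords
# Try it a few times; use a password length of 10-15
https://tool.ip138.com/random/
# 1. Once logged in, switch to the root account
su
# 2. Set the user's maximum password age to 99999 days, which effectively disables expiry
chage -M 99999 username
# 3. Change the user's strong password to another password
passwd username
# 4. Switch back to your own account
su username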

Failed to initialize NVML: Driver/library version mismatch

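This is a version mismatch caused by the server's automatic updates. The fix is as follows:

# Check the version of the loaded NVIDIA kernel module
cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.129.03  Thu Oct 19 18:56:32 UTC 2023
GCC version:  gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.3)
# The output above shows 535.129.03 = driver_version
modinfo nvidia
# The version line usually reports a different driver, which is what causes the problem.
# Even if the versions match, it does no harm; just remove the driver anyway.
# Remove the existing installation
sudo /usr/bin/nvidia-uninstall
sudo apt-get --purge remove 'nvidia-*'
sudo apt-get purge 'nvidia*'
sudo apt-get purge 'libnvidia*'
# Check that nothing is left
dpkg --list | grep nvidia
# Find the matching driver on the official site
https://www.nvidia.cn/geforce/drivers/results/213200/
# After downloading, copy it to the server and make it executable
sudo chmod a+x NVIDIA-Linux-x86_64-driver_version.run
# Reinstall, pressing Enter at every prompt
sudo ./NVIDIA-Linux-x86_64-driver_version.run
# The driver should then work again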
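
The comparison between the loaded module and the module on disk can be scripted before deciding to reinstall. The sketch below is a minimal version of that check: the function names `nvrm_version` and `driver_versions_match` are hypothetical, and it assumes the version string is the first dotted number on the `NVRM version:` line, as in the output above.

```shell
# Extract the driver version from /proc/driver/nvidia/version style text.
# Reads the files given as arguments, or stdin when none are given.
nvrm_version() {
    awk '/^NVRM version:/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+(\.[0-9]+)+$/) { print $i; exit }
    }' "$@"
}

# Hypothetical helper: compare the loaded module with the one installed on disk.
driver_versions_match() {
    loaded=$(nvrm_version /proc/driver/nvidia/version)
    on_disk=$(modinfo -F version nvidia)
    echo "loaded: $loaded, on disk: $on_disk"
    [ -n "$loaded" ] && [ "$loaded" = "$on_disk" ]
}
```

If `driver_versions_match` fails right after an automatic update, a reboot alone may be enough to load the new module; the reinstall above is the heavier option.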

Disable automatic updates

# Set every option in the file below to 0
sudo vim /etc/apt/apt.conf.d/10periodic
# Set every option in this file to 0 as well
sudo vim /etc/apt/apt.conf.d/20auto-upgrades
# Reboot
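
The same edit can be scripted instead of done by hand in vim. The sketch below rewrites every `APT::Periodic::...` value to "0"; the function names `zero_apt_periodic` and `disable_auto_updates` are hypothetical, and it assumes each option sits on its own line in the usual `Name "value";` form.

```shell
# Filter: set the value of every APT::Periodic option to "0".
zero_apt_periodic() {
    sed -E 's/^([[:space:]]*APT::Periodic::[A-Za-z-]+)[[:space:]]+"[^"]*";/\1 "0";/'
}

# Hypothetical helper: apply the filter to both configuration files in place.
disable_auto_updates() {
    for f in /etc/apt/apt.conf.d/10periodic /etc/apt/apt.conf.d/20auto-upgrades; do
        [ -f "$f" ] || continue          # skip files that do not exist on this system
        tmp=$(mktemp) || return 1
        zero_apt_periodic < "$f" > "$tmp" && sudo cp "$tmp" "$f"
        rm -f "$tmp"
    done
}
```

For example, a line such as `APT::Periodic::Unattended-Upgrade "1";` becomes `APT::Periodic::Unattended-Upgrade "0";`, and lines that are not periodic options are left untouched.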

Unable to determine the device handle for GPU 0000:4C:00.0: Unknown Error

The cause of this problem is hard to pin down:

# Check the GPU status
nvidia-debugdump --list
# The output is shown below
Found 2 NVIDIA devices
        Device ID:              0
        Device name:            NVIDIA TITAN Xp   (*PrimaryCard)
        GPU internal ID:        0323917060372

Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x1): Unknown Error

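Clearly GPU 1 has failed. The simplest workaround is to disable the GPU with Bus ID 0000:4C:00.0:

# drain selects the drain feature, -p specifies the GPU's Bus ID, -m 1 enables drain mode
sudo nvidia-smi drain -p 0000:4C:00.0 -m 1

This is only a stopgap, though; it does not fix the underlying problem.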