Thursday, October 11, 2018

Troubleshooting - curtin version is incorrect on a MaaS region server

A few weeks ago I ran into a weird MaaS issue: whenever I tried to commission or deploy a node with the ga-18.04 kernel, the deployment cycle always stopped at the grub entry labeled "Commissioning".

After fighting with it for a few days, I stopped the process inside the ephemeral environment at the point where the customized image is dd'ed to the hard disk. I noticed that the well-functioning MaaS region server updates the kernel in the ephemeral environment, and the malfunctioning one doesn't. Using the new kernel is very important for deploying my customized images because I need the nls_iso8859-1.ko module to deal with my recovery partition. This code snippet shows how a recent curtin (18.1) updates the kernel:


ubuntu@breckenridge-dvt2-201802-26115:/curtin$ grep linux-image -r *
Binary file curtin/deps/__pycache__/__init__.cpython-36.pyc matches
curtin/deps/__init__.py: # linux-image package for this environment
curtin/deps/__init__.py: kernel_pkg = 'linux-image-%s' % os.uname()[2]

def check_kernel_modules(modules=None):
    if modules is None:
        modules = REQUIRED_KERNEL_MODULES

    # if we're missing any modules, install the full
    # linux-image package for this environment
    for kmod in modules:
        try:
            subp(['modinfo', '--filename', kmod], capture=True)
        except ProcessExecutionError:
            kernel_pkg = 'linux-image-%s' % os.uname()[2]
            return [MissingDeps('missing kernel module %s' % kmod, kernel_pkg)]

    return []
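To check by hand whether the running kernel is missing a module, the same idea can be reproduced outside curtin. The following is only a minimal sketch, not curtin's code: the helper names missing_modules and kernel_package are made up for illustration, and it assumes modinfo is on the PATH.

```python
import os
import subprocess


def missing_modules(modules, run=subprocess.run):
    """Return the modules that `modinfo --filename` cannot find.

    `run` is injectable so the logic can be exercised without modinfo.
    """
    missing = []
    for kmod in modules:
        try:
            result = run(['modinfo', '--filename', kmod],
                         stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
            found = result.returncode == 0
        except FileNotFoundError:
            # modinfo itself is not installed; treat the module as missing
            found = False
        if not found:
            missing.append(kmod)
    return missing


def kernel_package():
    """Name of the full kernel package for the running kernel release."""
    return 'linux-image-%s' % os.uname()[2]


if __name__ == '__main__':
    # nls_iso8859-1 is the module this post depends on
    for kmod in missing_modules(['nls_iso8859-1']):
        print('missing kernel module %s, would need %s'
              % (kmod, kernel_package()))
```

Running it inside the ephemeral environment of each region server should show quickly whether the module is available there.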


So I dug into curtin, which takes care of installing (dd'ing) images, and noticed that the curtin version differed between the two MaaS region servers, even though both had the same version of MaaS installed. The malfunctioning one used curtin 0.1.0, and the good server used 18.1. Updating curtin fixed the issue.

In conclusion, the curtin version on the MaaS region server matters. It seems that the region server's curtin is passed into the ephemeral environment and used there. Interesting!
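A quick way to spot this kind of drift is to record package versions on each region server and compare them. The sketch below is an assumption-laden illustration: the package names python3-curtin and curtin-common are my guess at the relevant Debian packages, the helper names are hypothetical, and the version strings in the example are shortened to the ones mentioned in this post.

```python
import subprocess


def installed_version(package):
    """Ask dpkg-query for the installed version; None if unavailable."""
    try:
        result = subprocess.run(
            ['dpkg-query', '-W', '-f=${Version}', package],
            stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
            universal_newlines=True)
    except FileNotFoundError:
        # not a dpkg-based system
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def version_mismatches(good, bad):
    """Return {package: (good_version, bad_version)} where they differ."""
    diffs = {}
    for package in sorted(set(good) | set(bad)):
        pair = (good.get(package), bad.get(package))
        if pair[0] != pair[1]:
            diffs[package] = pair
    return diffs


if __name__ == '__main__':
    # Run on each server, save the output, then feed both into
    # version_mismatches(). Package names are assumptions.
    for name in ('python3-curtin', 'curtin-common'):
        print(name, installed_version(name))
```

For example, version_mismatches({'python3-curtin': '18.1'}, {'python3-curtin': '0.1.0'}) reports python3-curtin as the only difference, which is the situation described above.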


Summary of the Debugging Tips




  • Summary of the debugging flow of this case
    • stopped at the grub entry
    • checked the previous stage and found errors in the curtin stage
    • compared how curtin runs in the good and bad (ephemeral) environments
    • identified the root cause: the nls_iso8859-1.ko module was missing
    • noticed that the good environment updates its kernel
    • figured out that the curtin source differed
    • found that the curtin version differed
  • The curtin log is valuable. Read it carefully and look for the very first error it reports.
  • Effective Debugging: 66 Specific Ways to Debug Software and Systems by Diomidis Spinellis suggests that comparing the buggy system with a well-functioning one may help. So true!
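Finding the very first error in a long log is easy to automate. This is a rough sketch only: the pattern is a generic guess rather than curtin's actual log format, the function name first_error is made up, and the sample lines are invented for the demo.

```python
import re

# Generic guess at what an error line looks like; adjust for the real log.
ERROR_PATTERN = re.compile(r'error|traceback|failed', re.IGNORECASE)


def first_error(lines, pattern=ERROR_PATTERN):
    """Return (line_number, line) of the first matching line, or None."""
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            return number, line.rstrip('\n')
    return None


if __name__ == '__main__':
    # Invented sample lines, not real curtin output.
    sample = [
        'starting install\n',
        'command failed\n',
        'Traceback (most recent call last):\n',
    ]
    print(first_error(sample))
```

To use it on a real log, pass the open file object instead of the sample list, since iterating over a file yields its lines.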





