badco's picture

Good morning from NZ!

I recently came across TKL in my quest to migrate from unRAID to Proxmox, and the work you guys do is fantastic!

However, I have run into a small problem. After initially spinning up the file-server container and then shutting down the server for the week, the container has now lost its networking. I cannot ping the host or the gateway, but I can ping other TKL containers.

I have spun up a Debian template to test, and it works fine with both a static IP (on the same subnet and VLAN) and DHCP. From it I can ping the host and the gateway, but not the TKL containers. Therefore, this seems to be a problem specific to TKL.

Does anyone have any advice for things to check? See below for additional information.

Here is an error when the container first boots:

Apr 10 22:32:47 file-server inithooks[97]:     code, output = self._perform(args, **kwargs)
Apr 10 22:32:47 file-server inithooks[97]:   File "/usr/lib/python3/dist-packages/dialog.py", line 1504, in _perform
Apr 10 22:32:47 file-server inithooks[97]:     args_file)
Apr 10 22:32:47 file-server inithooks[97]:   File "/usr/lib/python3/dist-packages/dialog.py", line 1469, in _handle_program_exit
Apr 10 22:32:47 file-server inithooks[97]:     child_output_rfd)
Apr 10 22:32:47 file-server inithooks[97]:   File "/usr/lib/python3/dist-packages/dialog.py", line 1421, in _wait_for_program_termination
Apr 10 22:32:47 file-server inithooks[97]:     child_output.strip()))
Apr 10 22:32:47 file-server inithooks[97]: dialog.DialogError: dialog-like terminated due to an error:
   the dialog-like program exited with status 3 (which was passed to it as the DIALOG_ERROR environment variable).
   Sometimes, the reason is simply that dialog was given a height or width parameter that is too big for the terminal in use.
   Its output, with leading and trailing whitespace stripped, was:

Apr 10 22:32:47 file-server inithooks: Confconsole completed, now exiting
INFO: Confconsole completed, now exiting

Here is the MOTD with traceback error:

file-server login: root
Password:
Last login: Sat Apr 10 22:33:03 UTC 2021 on tty1
Traceback (most recent call last):
  File "/usr/bin/turnkey-sysinfo", line 106, in <module>
    main()
  File "/usr/bin/turnkey-sysinfo", line 92, in main
    print(tpl.format(row[0], row[1], col=max_col))
IndexError: list index out of range
Welcome to File-server, TurnKey GNU/Linux 16.0 (Debian 10/Buster)

  System information for Sat Apr 10 22:58:38 2021 (UTC+0000)

    System load:  0.14             Memory usage:  10.7%
    Processes:    31               Swap usage:    0.0%
    Usage of /:   0.0% of 3.51TB   IP address for eth0: 10.10.10.100

  TKLBAM (Backup and Migration):  NOT INITIALIZED

    To initialize TKLBAM, run the "tklbam-init" command to link this
    system to your TurnKey Hub account. For details see the man page or
    go to:

        https://www.turnkeylinux.org/tklbam

    For Advanced commandline config run:    confconsole

  For more info see: https://www.turnkeylinux.org/docs/confconsole
 
Linux file-server 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 11:08:47 +0100) x86_64
root@file-server ~#

Here is the output of "ip a":

root@file-server ~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0@if13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 0e:a9:b3:ab:12:01 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.10.10.100/24 brd 10.10.10.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ca9:b3ff:feab:1201/64 scope link
       valid_lft forever preferred_lft forever

Here is the /etc/pve/lxc/100.conf:

arch: amd64
cores: 1
hostname: file-server
memory: 512
net0: name=eth0,bridge=vmbr1,gw=10.10.10.1,hwaddr=0E:A9:B3:AB:12:01,ip=10.10.10.100/24,tag=10,type=veth
onboot: 1
ostype: debian
rootfs: data1:subvol-100-disk-0,size=3630G
swap: 512
unprivileged: 1
Jeremy Davis's picture

Thanks for your kind words and welcome to TurnKey! :)

Re the messy output: the first error/stacktrace (when the container first boots) can happen when a terminal window is resized to a size that is invalid for the dialog program (which is what we use for the UI of the firstboot scripts, aka Inithooks, and our Confconsole config tool). Although TBH, unless you actually resized the window while it was being drawn, it would be a bit strange for that to occur... My gut feeling is that it's unrelated, but I can't 100% rule it out.

At face value, the second issue did seem like it may be related, but I'm not so sure after a closer look?! At a glance, I initially just thought that it was an issue related to not finding the NIC (a bug which I have since fixed). But looking a bit closer, it doesn't actually appear to be that same issue at all. I can see the IP address noted in the MOTD output (if it was related to the fixed bug, there would be no IP listed). So I'm not really sure why that is occurring?! It seems like you may have discovered a new bug?! It seems too coincidental to be unrelated, but it isn't obviously related either...

TBH I'm feeling a little stumped by this one...

Looking at the info you've posted, for starters the network does seem a bit non-standard to my eye. I'm no networking guru, but in my experience, usually a 24 bit subnet will use 192.168.1.x (or sometimes 192.168.0.x, etc) IP addresses. And when using 10.x.x.x IPs, an 8 (or sometimes 16) bit subnet is usually used. Having said that, it shouldn't really matter (so long as it's actually how your network is configured). Also a quick google suggests that in more recent times, some router vendors do default to a 24 bit subnet with 10.x.x.x IPs, so you learn something new every day...! :) Regardless, so long as it's definitely on the same subnet as everything else, then it seems completely valid.
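
FWIW, one quick way to sanity check that an address and its gateway really are on the same subnet is Python's standard `ipaddress` module. This is only a minimal sketch using the addresses from the container config posted above; the helper name `same_subnet` is made up for illustration.

```python
import ipaddress


def same_subnet(interface_cidr: str, gateway: str) -> bool:
    """Return True if the gateway address lies inside the interface's network."""
    iface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(gateway) in iface.network


# Values taken from the net0 line of /etc/pve/lxc/100.conf
print(same_subnet("10.10.10.100/24", "10.10.10.1"))
# The same gateway is also inside the (much larger) 8 bit network
print(same_subnet("10.10.10.100/8", "10.10.10.1"))
# A hypothetical gateway on a different /24 would not be directly reachable
print(same_subnet("10.10.10.100/24", "10.10.20.1"))
```

So a 24 bit prefix on a 10.x.x.x address is perfectly valid as long as the gateway falls inside it, which it does here.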

The only other thing that I can suggest off the top of my head is checking out the interfaces file (yes we still use that for configuring networks by default). I.e.:

cat /etc/network/interfaces

TBH, I'm quite stumped. The issue has never been reported before (at least not to us) and I've certainly never experienced it myself (and I use Proxmox lots). Considering that many others also use our containers on Proxmox and this is the first time I've heard of this issue, I can only guess that this is a fairly edge case type scenario. You mention that a vanilla Debian appears to work ok, but have you stopped it and restarted it (as you noted appeared to be related to the issues with the fileserver)? Have you tried another fileserver instance to see if you can reproduce it again (in a new container)? I'd be super interested to hear if you can.

I would like to understand what the stacktraces are about (and ideally fix them) but like I said, it's not even clear that they're related to the issue you're having... Regardless, if you do try to reproduce this in a new Fileserver instance, before you try to stop it, please update the following 3 packages: turnkey-sysinfo, confconsole & inithooks. I.e.:

apt update
apt install -y turnkey-sysinfo confconsole inithooks

That way you should have all the fixes to date. It will ask about updating the Confconsole config during install, ideally let it do that (although it shouldn't matter if you don't).

You seem to know your way around Linux a bit, so perhaps it might also help if I share some info from a fileserver that I have running on Proxmox (which is running fine)? Maybe something will jump out at you that I'm overlooking?

'ip a':

root@dafileserver ~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
12: eth0@if13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether d6:4d:ad:7e:b2:49 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.1.108/24 brd 192.168.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::d44d:adff:fe7e:b249/64 scope link 
       valid_lft forever preferred_lft forever

And for good measure; 'ifconfig':

root@dafileserver ~# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.108  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::d44d:adff:fe7e:b249  prefixlen 64  scopeid 0x20<link>
        ether d6:4d:ad:7e:b2:49  txqueuelen 1000  (Ethernet)
        RX packets 110638647  bytes 69256292869 (64.4 GiB)
        RX errors 0  dropped 3748  overruns 0  frame 0
        TX packets 126102968  bytes 80490966158 (74.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

And the container conf:

root@pve ~# cat /etc/pve/lxc/108.conf
arch: amd64
cores: 4
hostname: dafileserver
memory: 2048
mp0: /media/pve2-data/storage,mp=/srv/storage
mp1: /media/pve3-data/storage,mp=/srv/storage2
net0: name=eth0,bridge=vmbr0,gw=192.168.1.1,hwaddr=D6:4D:AD:7E:B2:49,ip=192.168.1.108/24,type=veth
onboot: 1
ostype: debian
protection: 1
rootfs: local-lvm:vm-108-disk-1,size=8G
swap: 1024

And the interface file (from within the fileserver):

# UNCONFIGURED INTERFACES
# remove the above line if you edit this file

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
	address 192.168.1.108
	netmask 255.255.255.0
	gateway 192.168.1.1

#auto eth1
#iface eth1 inet dhcp

So to recap, my server is configured slightly differently, but the differences all seem irrelevant to the issues you're having (i.e. more system resources and mounting an external fs, neither of which would cause this IMO).

Sorry that I don't have anything clear for you. Please post back with anything else you've got cause I'd love to understand what is going on and hopefully help you fix it.

badco's picture

Re the messy output: the first error/stacktrace (when the container first boots) can happen when a terminal window is resized to a size that is invalid for the dialog program (which is what we use for the UI of the firstboot scripts, aka Inithooks, and our Confconsole config tool). Although TBH, unless you actually resized the window while it was being drawn, it would be a bit strange for that to occur... My gut feeling is that it's unrelated, but I can't 100% rule it out.

Yes, I have not been resizing windows :)

At face value, the second issue did seem like it may be related, but I'm not so sure after a closer look?! At a glance, I initially just thought that it was an issue related to not finding the NIC (a bug which I have since fixed). But looking a bit closer, it doesn't actually appear to be that same issue at all. I can see the IP address noted in the MOTD output (if it was related to the fixed bug, there would be no IP listed). So I'm not really sure why that is occurring?! It seems like you may have discovered a new bug?! It seems too coincidental to be unrelated, but it isn't obviously related either...

Yeah I came across that bug report, but the interface is here and present.

Looking at the info you've posted, for starters the network does seem a bit non-standard to my eye. I'm no networking guru, but in my experience, usually a 24 bit subnet will use 192.168.1.x (or sometimes 192.168.0.x, etc) IP addresses. And when using 10.x.x.x IPs, an 8 (or sometimes 16) bit subnet is usually used. Having said that, it shouldn't really matter (so long as it's actually how your network is configured). Also a quick google suggests that in more recent times, some router vendors do default to a 24 bit subnet with 10.x.x.x IPs, so you learn something new every day...! :) Regardless, so long as it's definitely on the same subnet as everything else, then it seems completely valid.

I am using pfsense as my gateway and this network setup for over a year with unRAID as well.

The only other thing that I can suggest off the top of my head is checking out the interfaces file (yes we still use that for configuring networks by default). I.e.:

See below, the only difference I noticed between ours was the netmask, but this is configured through proxmox and you are still on v5:

root@file-server ~# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.10.10.100  netmask 255.255.255.0  broadcast 10.10.10.255
        inet6 fe80::ca9:b3ff:feab:1201  prefixlen 64  scopeid 0x20<link>
        ether 0e:a9:b3:ab:12:01  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 69  bytes 8547 (8.3 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

You mention that a vanilla Debian appears to work ok, but have you stopped it and restarted it (as you noted appeared to be related to the issues with the fileserver)? Have you tried another fileserver instance to see if you can reproduce it again (in a new container)? I'd be super interested to hear if you can.

The TKL file-server was just an example :) I have this with all TKL containers I spin up. I have also tried core and nginx.

ifconfig eth0 output:

root@file-server ~# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.10.10.100  netmask 255.255.255.0  broadcast 10.10.10.255
        inet6 fe80::ca9:b3ff:feab:1201  prefixlen 64  scopeid 0x20<link>
        ether 0e:a9:b3:ab:12:01  txqueuelen 1000  (Ethernet)
        RX packets 1  bytes 70 (70.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 88  bytes 12869 (12.5 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Other than the version differences, the only other difference is the VLAN. I might try without a VLAN and just DHCP.

Thanks for your replies so far :)

 

Jeremy Davis's picture

I was just having a little more of a look at my container, then I recalled that it was actually an earlier version of TurnKey, so that might not be exactly relevant...

I'll try setting up a new one and see how that goes...

Also my version of Proxmox needs updating, I'm still running v5 (I know...). So that too may be different. I'll make a point of updating that this week hopefully. So I might wait until I've upgraded the PVE host before I try creating a new fileserver to test.

I'm guessing there isn't anything else you've done that may be relevant (as you probably would have already shared) but just thought I'd mention it, just in case it triggers some memory...

And one final thought (for now at least): if you haven't already, it may be worth posting on the Proxmox forums too? They may have some ideas on what might be going wrong (even if it is related to a TurnKey bug). Also maybe someone else has experienced a similar issue?

badco's picture

Right so I removed the VLAN tag and that fixed the networking.

I guess because I have the ports on the switch tagged already I don't need to apply the tag on the interface?

I am still getting my head around configuring VLANs on switches...
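
For anyone following along, here is a rough Python sketch of the change, using the net0 line from my container config above. The helper name `strip_vlan_tag` is made up, and it only manipulates the text of the line; on a real host the change would be made in the Proxmox GUI or in the container config itself. It also rests on my assumption that the switch port already puts untagged traffic into VLAN 10, which is why tagging again on the container interface is not needed in my case.

```python
def strip_vlan_tag(net_line: str) -> str:
    """Remove the tag=<id> option from a Proxmox-style netN config line."""
    key, _, value = net_line.partition(": ")
    # Keep every comma-separated option except the VLAN tag
    options = [opt for opt in value.split(",") if not opt.startswith("tag=")]
    return f"{key}: {','.join(options)}"


original = (
    "net0: name=eth0,bridge=vmbr1,gw=10.10.10.1,"
    "hwaddr=0E:A9:B3:AB:12:01,ip=10.10.10.100/24,tag=10,type=veth"
)
print(strip_vlan_tag(original))
```

If the host only accepts tagged traffic on that bridge, the tag would have to stay, so treat this as specific to my setup.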

 

Jeremy Davis's picture

TBH VLAN tagging is not something I've ever engaged in (I think I mentioned networking is not really my strong suit?!). So I'm not going to be much help here...

With that in mind, this may be a dumb question! Did you also configure VLAN tagging on the vanilla Debian container you tested (which was working OK)? In other words, does this seem like a TurnKey shortcoming (or bug)? Or is TurnKey behaving as expected (i.e. same as vanilla Debian)?

SkepticNerdGuy's picture

Apologies for responding to a relatively old post. I just had the same issue, where none of the TKL containers I spun up could reach the internet. I too have my TKL server on a VLAN.

By default I allow no untagged traffic from my Proxmox install, so removing the VLAN tag would not fix the issue for me.

By default, Proxmox tells the containers to use "Host DNS Settings" during install. That's fine. What I find strange is that when I install a standard Debian template to a container, it is assigned the correct DNS server, i.e. that interface's gateway IP. For security reasons, I have any web-facing containers on a DMZ segment of my network, on a VLAN with the range 10.0.50.1/24. It has firewall rules that allow those servers to connect to the internet but nothing else. All traffic to other VLANs originating from the servers on that VLAN is blocked.


My Proxmox install is on a VLAN with the range 10.0.40.1/24, with the default gateway and DNS server pointing at 10.0.40.1 (this is my management VLAN and all my "important stuff" is on it).

When I install a fresh Debian standard template and run cat /etc/resolv.conf, it points to that VLAN's/subnet's DNS server:

# --- BEGIN PVE ---
search name.server (redacted)
nameserver 10.0.50.1
# --- END PVE ---

When I install a TKL template, it takes the literal defaults from the host and transcribes them:

# --- BEGIN PVE ---
search name.server (redacted)
nameserver 10.0.40.1
# --- END PVE ---

Now, with my DMZ firewall configuration, traffic to 10.0.40.1 will never be allowed, so nothing resolves, which means no internet.

My fix was to correct the DNS config (/etc/resolv.conf), which can be done in the container's DNS tab in Proxmox. It worked instantly: internet restored and problem solved.
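
In case it helps with checking other containers, here is a minimal Python sketch that flags nameservers in a resolv.conf that sit outside the container's own subnet. The function name `unreachable_nameservers` is hypothetical, and "outside the subnet means blocked" is an assumption that matches my DMZ firewall rules, not a general truth.

```python
import ipaddress


def unreachable_nameservers(resolv_conf: str, allowed_cidr: str) -> list:
    """List nameservers in resolv.conf text that fall outside allowed_cidr."""
    # strict=False accepts a host address such as 10.0.50.1/24
    network = ipaddress.ip_network(allowed_cidr, strict=False)
    flagged = []
    for line in resolv_conf.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            if ipaddress.ip_address(parts[1]) not in network:
                flagged.append(parts[1])
    return flagged


tkl_conf = """# --- BEGIN PVE ---
search name.server
nameserver 10.0.40.1
# --- END PVE ---
"""
debian_conf = tkl_conf.replace("10.0.40.1", "10.0.50.1")

print(unreachable_nameservers(tkl_conf, "10.0.50.1/24"))
print(unreachable_nameservers(debian_conf, "10.0.50.1/24"))
```

An empty list for the Debian container and a flagged 10.0.40.1 for the TKL one is exactly the difference I saw.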

Hope this helps someone.

Jeremy Davis's picture

I do not understand why the vanilla Debian container works as expected, but not the TurnKey one. If I understand what you've noted correctly, in both cases, this information should simply be handed from Proxmox to the new container, so it shouldn't make a difference?! So either I misunderstand the process, or there is some weird interaction between Proxmox and TurnKey that means Proxmox gives it the wrong DNS details.

FWIW I use Proxmox myself and haven't ever had that experience. Having said that, I have a different setup, with a single reverse proxy in the DMZ which links to the "real" servers on my LAN (using the same network as other machines/VMs). So as only my reverse proxy sits in the DMZ (it has 2 interfaces, one in the DMZ and one in the normal LAN), it's no real surprise that I don't hit that with any new servers.
