20.04 LTS server in Hyper-V VM frequent segfaults

Ubuntu Version: Ubuntu Server 20.04.6 LTS

Desktop Environment (if applicable): byobu (a tmux wrapper), nothing else

Problem Description: Applications have been segfaulting all over the place, and I’m unable to find any rhyme or reason for it.

Relevant System Information:

  • Hardware
    • HP Compaq Microtower dc5800
    • Core 2 Duo E8400 CPU, 2 cores @3.0GHz (no tweaking, stock)
    • BIOS v1.60, Oct. 26 2015
    • 6GB system RAM
  • Host: Windows Server 2012 R2 & Hyper-V (v. 6.3.9600)
    • Ubuntu guest is a Gen. 2, Version 5.0 virtual machine
    • $ uname -a
      Linux ubuntush 5.15.0-1089-azure #98~20.04.1-Ubuntu SMP
    • Docker is installed and running some applications (postgres, mariadb)

Screenshots or Error Messages:
A brief excerpt of journalctl -r | grep -e "traps:" -e "segfault"

May 24 12:01:46 ubuntush kernel: apport[386957]: segfault at 0 ip 00000000005a9075 sp 00007ffe01ca7e00 error 6 in python3.8[423000+295000]
May 24 10:34:04 ubuntush kernel: postgres[371158]: segfault at 56546f8320ae ip 00007fc2d2174c00 sp 00007fff5bd4fb68 error 4 in libc-2.31.so[7fc2d205b000+15a000]
May 24 01:34:21 ubuntush kernel: php[338919]: segfault at 5622cb98339a ip 00007fc8352544dd sp 00007fff99b0fba8 error 4 in dom.so[7fc83524a000+1a000]
May 23 10:04:07 ubuntush kernel: php[138526]: segfault at 0 ip 00007f199c43da87 sp 00007ffe68553120 error 4 in libc.so.6[7f199c3cb000+155000]
May 22 16:15:50 ubuntush kernel: tmux: server[933]: segfault at 555f8518a6e0 ip 0000555f8518a6e0 sp 00007fff5ef6bd50 error 15
May 21 14:01:21 ubuntush kernel: mariadbd[284037]: segfault at 14d7817fa8 ip 00000014d7817fa8 sp 00007fffd7817cb0 error 14 in mariadbd[559b03e98000+640000]
May 21 13:36:59 ubuntush kernel: traps: php[37760] general protection fault ip:7f8ff57bc06f sp:7ffe73f2e430 error:0 in libc.so.6[7f8ff574d000+155000]
May 21 13:30:58 ubuntush kernel: .NET TP Worker[25102]: segfault at 0 ip 00007f816b9f2fa5 sp 00007f40413fbd00 error 4 in libcoreclr.so[7f816b5d2000+4b7000]
May 21 13:27:09 ubuntush kernel: traps: pgrep[10378] general protection fault ip:5639d8e12763 sp:7ffe92648b80 error:0 in pgrep[5639d8e11000+3000]
May 21 13:17:09 ubuntush kernel: apport-retrace[298701]: segfault at 7f535d64ed ip 00000000005f62c0 sp 00007ffd99c3c880 error 6 in python3.8[423000+295000]
May 21 13:16:36 ubuntush kernel: apport-retrace[298637]: segfault at 18 ip 000000000050a722 sp 00007ffd2d4dde20 error 4 in python3.8[423000+295000]
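
As an aside, the “error N” value in these kernel lines is the x86 page-fault error code, which the kernel prints in hex. A minimal sketch to decode the flag bits, assuming a POSIX-compatible shell (the decode_segv helper is my own, not a standard tool):

```shell
# Decode the hex "error" code from a kernel segfault line.
# x86 page-fault flag bits: 1 = protection violation (vs. unmapped page),
# 2 = write (vs. read), 4 = user mode, 16 = instruction fetch.
decode_segv() {
  err=$(($1))                       # accepts a 0x-prefixed hex value
  [ $((err & 1))  -ne 0 ] && cause="protection violation" || cause="no page found"
  [ $((err & 2))  -ne 0 ] && access="write" || access="read"
  [ $((err & 16)) -ne 0 ] && access="instruction fetch"
  [ $((err & 4))  -ne 0 ] && mode="user" || mode="kernel"
  echo "$1: $mode-mode $access, $cause"
}
decode_segv 0x4    # the php/libc faults: read of an unmapped page
decode_segv 0x6    # the apport faults: write to an unmapped page
decode_segv 0x14   # the mariadbd fault: jump to an unmapped address (note ip == fault address)
```

Read this way, the faults above are a mix of user-mode reads, writes, and even jumps into unmapped memory across unrelated programs, which looks more like generic memory corruption than one buggy application.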

Message:

This has been a fun little problem I’ve been dealing with for quite some time now. It’s gotten to the point (finally) where data is actually being corrupted, so I’m hoping for some direction in getting to the bottom of this and putting a bandage on the issues before I blow this VM away and move to a different platform.

It used to be that the init process would segfault/die and be unrecoverable, preventing me from shutting down or rebooting the machine. Somehow that stopped at some point, although I’m still bombarded with a nearly constant stream of segfaults as you can see from my paste above.

Things I’ve tried:

  • Memtest86+ runs without issue in the same virtual machine, which makes me think the CPU and memory aren’t the issue.
  • Various fsck runs on the file systems come back clean, and I’m not seeing any log indications of storage issues.
  • The debsums command comes back clean for non-configuration files, which hopefully speaks to system file integrity.
  • Ubuntu is configured according to Microsoft’s guidelines; specifically, I’m running the linux-azure kernel image, which is supposed to provide the best possible Hyper-V compatibility.
    • Although Hyper-V still complains about an outdated guest communication protocol, which seems to be the responsibility of hv-kvp-daemon; I believe I have the latest version of it.
    • The Ubuntu guide for Hyper-V is less detailed, but I believe I’m in compliance nonetheless.
  • apport is installed and operating (most of the time, as you can see from its own crashes in the logs above), and whoopsie is also running in the hope that it may provide some useful information. They seem to miss a chunk of the segfaults, however. I find myself getting in over my head as I follow instructions for opening the .crash files and trying to read/understand the backtraces.

Perhaps you should check the memory and storage in the host system, too. In addition, you may want to take a look at the pagefile (host) and swap (guest).


Dear @g-schick ,

Thank you for the ideas. Today I shut down the host system completely, booted into the Memtest86+ v7 image, and let it complete a full pass with no errors reported. I also took the opportunity to boot another LiveCD and ran additional diagnostics on as many layers of hardware as I could think of to verify there were no issues. Checks on both HDDs connected to the host showed old age, but no sudden increase in errors to correlate with what’s going on in the Ubuntu VM.

I’m not quite sure how to investigate the Windows pagefile or Linux swapfile for errors, if that’s what you meant by “take a look at”; are there any specific tools or places I could check for more info? So far I haven’t seen any errors directly implicating either of those two things. It’s weird to me that the host itself doesn’t seem to experience any issues, only the Ubuntu VM, which makes me think there’s something wrong with Hyper-V, although I haven’t yet seen any other cases on the internet much like mine.

You may want to investigate the pagefile configuration (size, dynamic or fixed, …), RAM usage, and pagefile usage on the host. Are there other programs and/or virtual machines running on this host?

How much RAM is assigned to the virtual machine(s)? What are the memory settings in Hyper-V?

Does the above-mentioned virtual machine have swap? And how about RAM usage and swap usage on the guest?


Neither your host nor your guest is new any more, so I’d guess these segmentation faults started after some update or something like that?

And are there any other errors in the logs at about the same time the segmentation faults occur?


Did you install the guest additions for the VM? That might help.


https://community.veeam.com/blogs-and-podcasts-57/how-to-install-hyper-v-integration-services-in-the-ubuntu-linux-vm-6353
Something like that


A lot more explanation is definitely in order here!

As you might’ve guessed from the light specs I gave in the OP, it’s a fairly desperate situation over here. Everything from the ground up is either past or borderline EOL; such is life with a hobby, I guess. I apologize for all the possible variables contributing here beyond just Ubuntu; I feel this is the most informative place to start troubleshooting in the near term. In the long term, I want to migrate my services and important data to slightly less EOL, more dependable foundations, which will hopefully eliminate a lot of the potential variables in the mix.

To update everyone on one step I tried: I took a snapshot of the VM, then did a do-release-upgrade, which took me to 22.04 without much issue. In fact, I think the crashing subsided once I booted into the upgraded 6.8.x azure kernel image. It’s been a day since I started writing this draft, and so far the /var/crash directory remains empty. Problem solved for now? I’m not sure what changed; there were a few hv-related changes in the kernel, but nothing that really seemed to resonate with my experiences.

On the host: overall memory usage is hovering around 75% of the 6GB RAM capacity. Virtual memory/pagefile was system-managed and set to just under 2GB, bringing the total commit limit to 7.8GB, with the current commit hovering around 5.5GB and a peak of 7.1GB (!!). This is with the single Ubuntu VM running, with 2048MB statically allocated. Not to mention a data backup application running directly on the host (an unsupported configuration according to Veeam, and competition for system resources), as well as the SQL Server instance supporting Veeam.

It’s a mess.

I’m expanding the pagefile to 4096MB on the host to at least give the system a little more breathing room.

On the VM side, it has 2048MB allocated (no fancy settings like dynamic memory are enabled) and a 2GB swapfile, with ~500MB consumed at the moment. 1.1GB of the VM’s physical memory is in use, with nearly all of the rest used for cache. Feels like I’m scraping by here as well.
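
A minimal way to snapshot this pressure from inside the guest, for anyone following along; it reads only /proc/meminfo, so it assumes no extra packages:

```shell
# Rough memory-pressure snapshot from /proc/meminfo.
# MemAvailable is the kernel's estimate of memory usable without swapping.
mem_snapshot() {
  awk '/^(MemTotal|MemAvailable|SwapTotal|SwapFree):/ {m[$1]=$2}
       END {
         printf "ram available: %d%% of %d MiB\n",
                100*m["MemAvailable:"]/m["MemTotal:"], m["MemTotal:"]/1024
         if (m["SwapTotal:"] > 0)
           printf "swap used:     %d%% of %d MiB\n",
                  100*(m["SwapTotal:"]-m["SwapFree:"])/m["SwapTotal:"], m["SwapTotal:"]/1024
         else
           print "swap: none configured"
       }' /proc/meminfo
}
mem_snapshot
```

Watching the si/so columns of vmstat 5 alongside this will also show whether the guest is actively swapping around the times the segfaults occur.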

I have to imagine you’re right; unfortunately my records only go back as far as the issues do (that’s when I set up my current record-keeping system), so I can’t pinpoint any specific start. That said, I’ve always been suspicious of how Hyper-V, and to an extent Veeam, interact with Ubuntu as the VM guest.

I did at one point notice a correlation where, around the time the VM backup job was running, I’d see several storage-device errors when Volume Shadow Copy began and finished processing. Later on (hours, days), systemd (the init process) would eventually enter some kind of unrecoverable state, and the machine would have to be force-stopped and restarted to recover. It seemed to correlate with backups, although I was never able to find a specific reason why Veeam or Hyper-V would be causing the errors I was seeing. I ultimately disabled as much of the advanced processing/shadow-copy handling in the backup job as I could, which seemed to alleviate the immediate issues. I’m still concerned I have a misconfiguration, especially with respect to how running databases are handled.

Since I resolved the worst of the backup issues a while ago, I haven’t been able to see much rhyme or reason to the segfaults, especially when comparing logs. They happen to my userspace processes, to root processes, and even to some of the Docker containers running on the system. apport in particular seemed to run into a lot of issues, whether running automatically in the background or when I invoked its CLI manually. It’s just become a difficult issue to pinpoint.
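
One thing that did help me see the spread: tallying segfaults per faulting binary. A small sketch (the segv_tally helper is my own, not a standard tool; pipe journalctl output into it):

```shell
# Count segfaults per binary named in the "in <binary>[base+size]" clause.
# Usage: journalctl -k | segv_tally
segv_tally() { sed -n 's/.* in \([^[ ]*\)\[.*/\1/p' | sort | uniq -c | sort -rn; }

# Demo on lines like the excerpt above:
segv_tally <<'EOF'
kernel: apport[386957]: segfault at 0 ip 00000000005a9075 sp 00007ffe01ca7e00 error 6 in python3.8[423000+295000]
kernel: php[138526]: segfault at 0 ip 00007f199c43da87 sp 00007ffe68553120 error 4 in libc.so.6[7f199c3cb000+155000]
kernel: apport-retrace[298637]: segfault at 18 ip 000000000050a722 sp 00007ffd2d4dde20 error 4 in python3.8[423000+295000]
EOF
```

The spread across unrelated binaries (libc, python3.8, dom.so, mariadbd, libcoreclr.so) points away from any single bad package and toward something underneath them all: memory, the hypervisor, or the kernel.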

I’ve been struggling for a while to understand what the “correct” configuration is as far as guest additions/integrations are concerned. It seems like integration services are now built into the (azure?) kernel. Here’s Microsoft’s word on their compatibility page:

Built in - Linux Integration Services (LIS) is included as part of this Linux distribution. The Microsoft-provided LIS download package doesn’t work for this distribution, so don’t install it. The kernel module version numbers for the built in LIS (as shown by lsmod, for example) are different from the version number on the Microsoft-provided LIS download package. A mismatch doesn’t indicate that the built in LIS is out of date.

This is what I see indicated for all listed distributions on the page, including both versions I’ve run so far. As far as I can tell, there’s no need to install any explicit guest additions. The linux-azure package of kernel image and tools provides the Hyper-V (hv) related daemons that run in the background, and they put up mostly green lights on the Hyper-V host side. There is one complaint that the communication protocol isn’t as new as Hyper-V expected, so that one’s a bit of a mystery. Specifically, the message says:

Hyper-V Data Exchange connected to virtual machine ‘Ubuntush’, but the version does not match the version expected by Hyper-V (Virtual machine ID {id}). Framework version: Negotiated (3.0) - Expected (3.0); Message version: Negotiated (4.0) - Expected (5.0). This is an unsupported configuration. This means that technical support will not be provided until this problem is resolved. To fix this problem, upgrade the integration services. To upgrade, connect to the virtual machine and select Insert Integration Services Setup Disk from the Action menu.

Here’s the output from the Get-VMIntegrationService command, from the link you gave me:

Enabled OperationalStatus PrimaryOperationalStatus SecondaryOperationalStatus StatusDescription Name VMName
True Ok Ok OK Time Synchronization Ubuntush - Private
True Ok Ok OK Heartbeat Ubuntush - Private
True {Ok, ProtocolMismatch} Ok ProtocolMismatch {OK, The protocol version of the component installed in the virtual machine does not match the version expected by the hosting system} Key-Value Pair Exchange Ubuntush - Private
True Ok Ok OK Shutdown Ubuntush - Private
True Ok Ok OK VSS Ubuntush - Private
True Ok Ok OK Guest Service Interface Ubuntush - Private

Research didn’t give me much of an answer. People seemed to put up with it, it didn’t appear to be causing the issues I was experiencing, and the upstream repository for the hv-kvp daemon didn’t look very active, so I didn’t put much further thought into it. I’d be interested in any further thoughts you have on it, though.

And thank you both for your contributions thus far. I know my situation is quite a mess, but I very much appreciate your thoughts on the matter. With about two days of operation since I upgraded to the 22.04 release, things seem relatively stable for now. Perhaps that means case closed for now… but I hope I can call on your help if the problem returns.
