quick and dirty memory checker

Created by retep. Last edited by retep, 2 years and 146 days ago. Viewed 3,156 times. #4

Unstable System? It Could Be the Memory

Do you have a server that 'crashes' irregularly?

Some possible causes:

  • A buggy kernel. You may be running a bad kernel; try the latest stable Linux kernel, or upgrade to your distro vendor's default kernel.
  • A bad CPU. We have not hit this problem yet, so I don't know how its symptoms would show.
  • A bad power supply. Typically a bad power supply results in a server that just plain will not power up, rather than a machine that crashes intermittently.
  • Bad memory.
Bad memory is probably the biggest culprit behind those odd, random server restarts.

Symptoms include random crashes and inexplicable program segfaults. And (if you support production systems and carry the emergency pager) you will probably have had a few nights of interrupted sleep.

Checking Memory

The gold standard for checking memory is the memtest86 program. Grab the ISO, burn it, reboot your system with it and wait 6 - 60 hours for a result.

If you need a quick idea of whether memory is at fault before taking a server offline for that long, try this:

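# Determine the amount of memory in your system. /proc/meminfo
# reports it in kB, e.g.:
#   MemTotal:      2068052 kB
memkb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
# dd writes bs*count bytes. With bs set to the kB figure (taken as
# bytes), a count of 1024 would match the actual memory size; 1050
# makes the created file a bit bigger than the available memory.
# iflag=fullblock (GNU dd) keeps short reads from /dev/urandom from
# producing a smaller file than expected.
dd if=/dev/urandom of=large bs="$memkb" count=1050 iflag=fullblock
md5sum large; md5sum large; md5sum large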

dd will create an output file (called large) filled with random bytes from /dev/urandom. It will be bs*count bytes big. The md5sum commands then output a checksum for that file three times.
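To make the bs*count arithmetic concrete, here is a small sketch using the sample MemTotal value from the comments above. The helper names test_file_bytes and ram_bytes are made up for this illustration.

```shell
# Hypothetical helper: bytes dd will write when bs is the MemTotal
# figure (a kB number used as a byte count) and count defaults to 1050.
test_file_bytes() {
    echo $(( $1 * ${2:-1050} ))
}

# Hypothetical helper: actual memory size in bytes for a MemTotal in kB.
ram_bytes() {
    echo $(( $1 * 1024 ))
}

test_file_bytes 2068052   # prints 2171454600
ram_bytes 2068052         # prints 2117685248
```

So on the example machine the file comes out roughly 2.5% larger than memory (1050/1024), which is the "a bit bigger" the comments aim for. Make sure the filesystem has that much free space before you start.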

To output the checksum, Linux has to read the file, and it will cache as much of the data as it can in memory. Presuming you have less than 10GB of memory, the server will use up all its memory during the file read. Repeating the md5sum checks that the same bits can be saved to and read from memory consistently.
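If you want to watch the cache fill up while md5sum reads the file, one option is to poll the Cached figure in /proc/meminfo. This is only a sketch: the helper name meminfo_kb and its optional second argument are my own additions, not part of the recipe above.

```shell
# Hypothetical helper: print the value (in kB) of one /proc/meminfo field.
# An optional second argument points it at a saved copy of the file.
meminfo_kb() {
    awk -v key="$1:" '$1 == key {print $2}' "${2:-/proc/meminfo}"
}
```

Running meminfo_kb MemTotal and meminfo_kb Cached in a second terminal during the md5sum runs should show Cached climbing towards MemTotal.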

If the checksums do not match, you have faulty hardware, and memory is by far the most likely suspect (a failing disk or controller can also corrupt reads).

If the checksums match, try running the md5sum commands a couple more times. If the checksums are consistently the same then you may or may not have faulty memory. Run memtest86 to be sure.
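Rather than retyping md5sum and comparing the output by eye, the repeated runs can be wrapped in a loop. A minimal sketch, assuming md5sum and awk are installed; the function name checksums_consistent and its default of 5 runs are arbitrary choices of mine.

```shell
# Hypothetical helper: checksum FILE several times and report whether
# every run produced the same digest.
# Returns 0 if all digests match, 1 on a mismatch, 2 if the file
# could not be checksummed.
checksums_consistent() {
    file=$1
    runs=${2:-5}
    first=""
    i=1
    while [ "$i" -le "$runs" ]; do
        # md5sum prints "<digest>  <name>"; keep only the digest
        sum=$(md5sum "$file" | awk '{print $1}')
        if [ -z "$sum" ]; then
            echo "could not checksum $file" >&2
            return 2
        fi
        echo "run $i: $sum"
        if [ -z "$first" ]; then
            first=$sum
        elif [ "$sum" != "$first" ]; then
            echo "MISMATCH on run $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $runs checksums match"
    return 0
}
```

Call it as checksums_consistent large 5 once dd has finished. A mismatch is the interesting result; a clean run still leaves memtest86 as the final word.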

The test can return matching checksums even when memory is bad, for two reasons. First, not all memory is addressable by the kernel. Second, the server will be running software that won't budge from its current position in memory (e.g. the kernel and probably most applications), so that memory won't be tested. Running the test on a server with minimal applications running is therefore a good way to improve its accuracy.
