10

I'm running the latest Arch Linux on my ThinkPad T420 laptop and have an intermittent heat problem: the temperature rises from the typical ~45°C to ~75-90°C and stays there until I reboot.

I've run several diagnostic tools, including the obvious htop and others that googling suggested, such as powertop and iostat, and surely others I don't recall. So far I've found no obvious issues, and no apparent differences in the readouts between the normal state and the hot state.

I've also killed most suspicious processes one by one (databases and other daemons), hoping to find the hidden culprit, to no avail.

Beyond these attempts, I don't know where to begin troubleshooting. I'm hoping someone could point me in the right direction to begin looking for the deeper issue.

To be precise, my question is not how to cool my machine, but rather: what could cause a consistent 30+ degree temperature change in a system where (reportedly) CPU usage and load are normal? And what tools/practices could I use to diagnose it?

Some notes that might be helpful:

  • I can cool the system (e.g. by disengaging the fan control completely), but it immediately heats up again if left alone. This seems to suggest that heat retention is not an issue, but rather something is continually generating heat.
  • CPU usage and load are reported by htop as normal after entering this hot state. This includes kernel threads. According to htop, the system is essentially idle (1-2% system-wide CPU usage, a load of 0.10).
  • My machine uses Intel HD integrated graphics, and has no other graphics card. An nVidia card was an option for this model, as noted by @braiam; I did not opt for it.
  • @terdon brought up the CPU governor settings. My CPUs are set to powersave.
  • My specific processor is a 2.7GHz Core i7-2620M.

Edit: At the time of writing this question, my fancontrol was not functioning properly and the fan ran continually in a middle range (3900 RPM), even at high temps. At the suggestion of @Alex and @JustDanyul, this has been fixed. The underlying problem, however, still remains.

numbers1311407
  • As slm implies, if the CPU is idle but the core temperature is too high, the only possible explanation is that excess heat was not dissipated after some event -- the only thing in there that can produce significant heat is the CPU. Think of turning a stove burner on full to boil some water, then *putting a lid on it* and turning the burner down to low: the water remains boiling perpetually because the heat cannot dissipate as fast as the low burner replaces it *as long as the lid is on*. Take it apart and clean it if it is that bad, just blowing air probably won't help much at this point. – goldilocks Aug 07 '13 at 15:46
  • I'll definitely take it apart and clean it as I do suspect I'm having some air flow problems, but as it is the CPU will maintain a constant high temp indefinitely after whatever the event is that causes the extreme temperature change. Even if you leave the lid on the pot, the water temperature will go down at some point. My temp hits a number and simply stays there, consistently at that number, forever until I reboot. There's no gradual building of heat that would suggest it is having dissipation issues. I've been running under a full workload all morning and am sitting at 41 degrees. – numbers1311407 Aug 07 '13 at 16:09
  • To put it another way, my temperature will regulate itself normally under expected stress. If I'm doing some hard compiling it might heat up to 60, 70 degrees, but when it's done it will cool down. – numbers1311407 Aug 07 '13 at 16:14
  • Besides the fan, you may want to reapply the thermal paste. Cheaper thermal paste will dissipate over time. – BlueRaja - Danny Pflughoeft Aug 07 '13 at 18:24
  • numbers1311407, in that case slm is probably right and you have a rogue process running somewhere. If the machine is capable of bringing the temperature down, it is not a problem of the heat paste as @BlueRaja-DannyPflughoeft suggested (though he does indeed raise a valid point). So, if it suddenly shoots up for no reason, it really sounds like there is something using 100% of your CPU or that you are using a governor that doesn't allow scaling down. – terdon Aug 08 '13 at 01:57
  • Maybe this is a stupid question, but did you by any chance disable ACPI? I had similar trouble some time ago with an old PC that wouldn't boot with this option enabled. After disabling it, the PC finally booted, but it also overheated. – Alko Aug 08 '13 at 14:43
  • Nope, acpi is working, with the `thinkpad_acpi` module loaded as well. – numbers1311407 Aug 08 '13 at 14:49
  • @numbers1311407 if you shut down the X server (by stopping the display manager, etc.), does the temperature go down? It could still be a bad graphics driver misusing the CPU. – Braiam Aug 10 '13 at 00:34
  • @braiam I will try this next time it acts up. Today it's been running at normal temps so I can't test it until it occurs again. I use [i3](http://i3wm.org/), which I have tried stopping. I don't recall specifically stopping X, but I feel like I probably did (as it's the next logical step). I'll let you know. – numbers1311407 Aug 10 '13 at 00:42
  • @braiam nope, killing the xserver didn't solve it – numbers1311407 Aug 10 '13 at 21:50
  • Which kernel version do you use? Have you tried upgrading or downgrading? Before trying another kernel, though, you may want to try appending `pcie_aspm=force` to the kernel line via `/etc/default/grub` (if you're using grub2) – xx4h Aug 13 '13 at 05:24
  • @xx4h 3.9.9-1-ARCH and no I have not tried to up/downgrade, however, this issue has been a problem for me through several kernel updates. I'll look into pcie_aspm. My problem with such advice, as my issue is intermittent and not reproducible, is that it's impossible for me to really tell if a fix is working without waiting indefinitely to see if the problem comes back. It would be great if there was some place to start looking for irregularities in log files, etc. – numbers1311407 Aug 13 '13 at 16:32
  • @numbers1311407 I don't think anyone has mentioned it, but have you tried a non-Linux OS? If switching OS solves the problem, we could rule out hardware failure. Maybe a Windows or BSD liveCD? I understand that this is a pretty annoying suggestion to implement if it's not already dual booting, because you'd still have to wait around for something to not happen. But if it's a hardware issue you'd have a tough time diagnosing it. – jmathew Aug 14 '13 at 21:15
  • @jmathew that's not a bad idea. I think it's reasonable to assume it might be a hardware issue. It already dual boots to windows, actually. The trick will be, as you say, waiting around for it to happen. The problem won't arise for days sometimes (it hasn't happened in 2 days now, on linux). As this is my work machine, it'll take some doing to get things rearranged to work somewhere else and monitor this. If nothing ends up coming from this bounty then that's what I'll have to do. – numbers1311407 Aug 15 '13 at 01:53

5 Answers

6

The fan

Mine does this too, running Fedora 14. Try getting a can of compressed air and blowing out the vents on the back and side of the case.

Also, you'll periodically want to remove the keyboard and blow compressed air directly onto the fan's blades. They get caked with dust, which weighs the fan down and reduces its effectiveness.

The best thing about ThinkPads is the service manuals! They show you how to tear down your laptop and put it back together.

Bad process

The other thing I've noticed is that I'll occasionally have a process that's gone awry and is consuming 100% of one of the cores. Killing this process usually brings the temperature back to normal.

You can use htop or top to see which process this is and either kill it from there or from a terminal using its PID.

What else?
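A minimal sketch of the same check from a plain terminal, in case htop isn't at hand. `top_cpu` is a hypothetical helper name, and the sketch assumes the procps `ps` that Arch ships (BusyBox `ps` does not understand `--sort`).

```shell
#!/bin/sh
# top_cpu [N]: list the N processes with the highest CPU share, busiest first.
# Assumes procps ps; N defaults to 5. The extra line is the column header.
top_cpu() {
    n="${1:-5}"
    ps -eo pid,pcpu,stat,comm --sort=-pcpu | head -n "$((n + 1))"
}
```

Run `top_cpu 5`, then `kill <PID>` for the offender, trying the default SIGTERM before resorting to `kill -9`. One caveat: the `%CPU` column in `ps` is averaged over the lifetime of each process, so a long-running daemon that only recently went rogue can look harmless here while htop shows it pegged.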

See my answer to this U&L Q&A for more tips on how to get temperature reads for the various components of your laptop. The Q&A is titled: How to get core temperature of haswell i7 cores in i3status.
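As a rough illustration of reading temperatures without extra tools, the kernel exposes thermal zones under `/sys/class/thermal`, with values in millidegrees Celsius. This is only a sketch: `read_zones` is a hypothetical name, and which zone corresponds to which component varies by machine, so treat the labels as hints.

```shell
#!/bin/sh
# read_zones [ROOT]: print "<zone type> <temperature in °C>" for every thermal zone.
# ROOT defaults to the real sysfs path; it is a parameter only so the function
# can be tried against a scratch directory.
read_zones() {
    root="${1:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        ztype=$(cat "$zone/type" 2>/dev/null || echo unknown)
        milli=$(cat "$zone/temp")
        # The kernel reports millidegrees; print degrees with one decimal place.
        awk -v t="$ztype" -v m="$milli" 'BEGIN { printf "%s %.1f\n", t, m / 1000 }'
    done
}
```

Calling `read_zones` with no arguments reads the live values. The `sensors` command from lm_sensors gives a fuller picture (including fan RPM on ThinkPads), so this is mainly useful inside scripts.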

slm
  • This is definitely something I should do, but the odd thing to me is why it only goes into the hot state *sometimes* and persists until reboot, which immediately fixes the problem. It seems like *something* must be happening which is triggering the persistent change in heat. I have run htop to no avail. CPU usage is normal. – numbers1311407 Aug 07 '13 at 14:11
  • Maybe something is causing frequent kernel wakeups? – Braiam Aug 07 '13 at 14:18
  • You might have better luck in seeing what it is by making Kernel threads visible in `htop`. It's usually Shift+K. – slm Aug 07 '13 at 14:21
  • @braiam I don't believe so, but I can post that information from powertop next time the machine kicks into the hot state. It seemed fairly normal when I checked, unless I was looking at the wrong information. – numbers1311407 Aug 07 '13 at 14:21
  • @slm thanks, I have not tried that. I will do so next time it starts running hot and report the results. – numbers1311407 Aug 07 '13 at 14:22
  • @numbers1311407: Could be the reason rebooting solves the problem quickly is that this turns the CPU off for a few seconds allowing the temperature to drop below the cutoff point. The CPU is *always* hotter than ambient, whatever "ambient" is. – goldilocks Aug 07 '13 at 15:53
  • @goldilocks I'm completely ignorant of the notion of a cutoff point, but all I can tell you is that I've been running at 42-43 degrees consistently all morning under normal workload. When whatever the problem is kicks in, it'll skyrocket up 30+ degrees and stay there, consistently, until I reboot. – numbers1311407 Aug 07 '13 at 16:06
  • @numbers1311407 : Could be a sensor or sensor driver issue too I guess. Does it actually feel hotter to you when that happens? If it's truly that warm, you should be able to notice it very obviously with your hand. If the heat is real and the CPU is idle, it *must* be trapped heat. Even something as innocuous as a browser flash ad can run up the processor long enough; once the heat is there if it can't get out it won't go down -- that's what I meant by the CPU is *always more than ambient temp* even when idle. – goldilocks Aug 07 '13 at 16:35
  • It definitely feels hot. Could it be heat trapped somewhere else besides the CPU that's keeping the ambient temp up? As I noted in the main comments above, there's no problem dissipating heat caused by CPU workload. I can compile software for 10 minutes, hit 75 degrees, then it will dissipate normally when completed. – numbers1311407 Aug 07 '13 at 16:41
  • :( Maybe the GPU? Hard drives generate heat too but I don't think that much. The behavior you describe is certainly bizarre! – goldilocks Aug 07 '13 at 16:48
  • The GPU was my uneducated guess too, now I just wonder how to diagnose it :-) Thanks for your patience – numbers1311407 Aug 07 '13 at 16:54
  • @numbers1311407 - see my update. You can read the temperature using the methods outlined in that other Q&A: http://unix.stackexchange.com/a/85503/7453 – slm Aug 07 '13 at 17:00
  • Please note that in the `top` command you can press "1" to see the load of individual cores. – Christian Stewart Aug 15 '13 at 02:18
3

This is more of a long comment, but you should have a look at thinkwiki.org; it is the resource for Linux on ThinkPads. As for the temperature, I had similar problems with my t4500 and sorted them out by playing with:

  1. The CPU governor which controls CPU frequency scaling. Your choices are:

    • Performance keeps the CPU at the highest possible frequency
    • Powersave keeps the CPU at the lowest possible frequency
    • Userspace exports the available frequency information to the user level (through the /sys file system) and permits user-space control of the CPU frequency
    • Ondemand scales the CPU frequencies according to CPU usage (as userspace frequency-scaling daemons do, but in the kernel)
    • Conservative acts like ondemand but increases the frequency step by step

    With ondemand, your CPU will only run at its highest speed when necessary. Ideally, this will be completely transparent to you; your machine will simply work as fast as necessary for the current tasks. To activate it, run:

    echo ondemand | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    
  2. Fan control. There is a very nice utility called "Simple ThinkPad Fan Control" which allows you to fine tune the trigger temperatures that change the fan's speed. Also have a look through the information here.
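For the governor step in item 1, here is a small sketch that covers every core rather than only `cpu0`. `show_governors` and `set_governor` are hypothetical helper names. It is also worth checking `scaling_available_governors` in the same directory first: on kernels using the `intel_pstate` driver, typically only `performance` and `powersave` are offered, so `ondemand` may not be accepted.

```shell
#!/bin/sh
# show_governors [ROOT]: print each core's current scaling governor.
# ROOT defaults to the real sysfs CPU directory; it is a parameter only so the
# functions can be exercised against a scratch directory.
show_governors() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue
        cpu=${f#"$root"/}                 # e.g. cpu0/cpufreq/scaling_governor
        printf '%s %s\n' "${cpu%%/*}" "$(cat "$f")"
    done
}

# set_governor NAME [ROOT]: write NAME to every core's scaling_governor.
# Needs root privileges when used on the real sysfs tree.
set_governor() {
    gov="$1"
    root="${2:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$f" ] || continue
        printf '%s\n' "$gov" > "$f"
    done
}
```

Run `show_governors` as a normal user to see the current state, and call `set_governor ondemand` from a root shell to change it. The setting does not survive a reboot, so a persistent change needs a service or boot-time script on top of this.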

terdon
1

I think there is a problem with what you perceive as "hot". For the ThinkPad T420, that is (according to forums) about 80-85°C, and let's not forget the nVidia card, which might raise the temperature too (in fact, the Optimus configuration might not work well, forcing your CPU to do GPU work). That said, your CPU's maximum tolerated temperature is 100°C (if it gets there, the system will shut down), while the ambient operating range is about 10.0°C to 35.0°C. So if your laptop is within that range, all is OK (except for your battery's and laptop's life span).

Now, you wanted possible suspects pointed out. Here I would say that nVidia's poor Linux implementation may push work onto your CPU that won't show in htop (or anywhere), due to their infamous on-demand system, Optimus. That seems to fit your current predicament (it works fine up to a point, then just starts heating uncontrollably). You should update your installation until you have Bumblebee fully configured. You can use bbswitch to disable the nVidia card at will and see how it goes.

If you haven't installed Bumblebee yet, you can find the package for Arch in the repository.

Braiam
  • I didn't opt for the nVidia card on this machine. Sorry, should have been clearer on that in the question. And while 80-85 isn't *too hot*, it's still hotter than mid 40s, which is what my computer normally runs at on a typical light workload. My issue is trying to determine what phantom condition "kicks in" and causes my consistent 45 to be a consistent 80-85. – numbers1311407 Aug 09 '13 at 20:58
1

sensors shows my fan buzzing along at ~3900 RPM

Even at temperatures of ~75-90°C?

as manually ramping up the fan will cool the machine temporarily

So one problem is just that the fan speed isn't working automatically?

Forget about the auto; you can read the temperature correctly and you can control the fan speed manually, right? If so, all you need to do is find a working fan control script or roll your own (poll the temperature and set the speed according to a table temp[i]=speed[i]; when you set a higher speed, keep it for a while even if the temperature goes down; when you need to lower the speed, do it slowly, step by step).
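The table-plus-slow-descent idea can be sketched roughly as follows. Everything specific here is an assumption for illustration: the threshold values are made up, the function names are hypothetical, `thermal_zone0` is assumed to be the CPU sensor, and "keep it for a while" is approximated with a temperature margin (hysteresis) rather than a timer. The lowest level used is 1, so the fan never stops.

```shell
#!/bin/sh
# Hypothetical table: the i-th value (counting from 1) is the temperature in °C
# at which fan level i is requested. thinkpad_acpi accepts levels 0-7; level 0
# (fan off) is deliberately never used.
THRESHOLDS="0 50 55 60 65 70 80"
HYSTERESIS=5    # degrees the temperature must fall below a threshold before stepping down

# target_level TEMP: highest level whose threshold TEMP has reached.
target_level() {
    temp="$1"; level=1; i=1
    for th in $THRESHOLDS; do
        if [ "$temp" -ge "$th" ]; then level=$i; fi
        i=$((i + 1))
    done
    echo "$level"
}

# next_level CURRENT TEMP: jump up immediately; step down one level at a time,
# and only once TEMP is HYSTERESIS degrees below what CURRENT requires.
next_level() {
    current="$1"; temp="$2"
    want=$(target_level "$temp")
    if [ "$want" -gt "$current" ]; then
        echo "$want"
    elif [ "$(target_level "$((temp + HYSTERESIS))")" -lt "$current" ]; then
        echo "$((current - 1))"
    else
        echo "$current"
    fi
}

# control_loop: poll every 5 seconds and write the level. Defined but not called
# here; it needs root and thinkpad_acpi loaded with fan_control=1.
control_loop() {
    level=7    # start high so a cold start errs on the side of cooling
    while :; do
        milli=$(cat /sys/class/thermal/thermal_zone0/temp) || break
        level=$(next_level "$level" "$((milli / 1000))")
        echo "level $level" > /proc/acpi/ibm/fan || break
        sleep 5
    done
    echo "level auto" > /proc/acpi/ibm/fan    # hand control back on any failure
}
```

For example, `next_level 1 72` jumps straight to level 6, while `next_level 6 40` only drops to level 5, so a cool-down takes several polls. A maintained tool such as thinkfan handles the edge cases better; this only shows the shape of the logic.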

As for polling, the best setup would be a temperature monitor daemon that triggers thermal change events, with the fan control script listening for those events. I once thought (maybe wrongly) that this was acpid's job, but nowadays I don't know.

In both cases (your own or an existing script), keep an eye on the temperature and RPM at all times until you are confident in the solution; the fan must not stop.

Solve this auto problem first, and if the overheating persists you can focus on the cause.

edit

You may want to try a tool like lttng to collect stats for the whole system over time, but it may not be easy to set up, and it could be expensive in terms of storage if you need to collect for a long time.
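A much lighter alternative to lttng (not lttng itself, just a periodic sampler) might be enough to catch the moment the machine flips into the hot state. The sketch below assumes `thermal_zone0` is the CPU sensor and that `cpu0`'s frequency is representative; `log_sample` is a hypothetical name.

```shell
#!/bin/sh
# log_sample [TEMP_FILE] [LOAD_FILE] [FREQ_FILE]: print one CSV row
#   epoch_seconds,temp_C,load_1min,cpu0_freq_kHz
# The file arguments default to the usual sysfs/procfs paths and exist only so
# the function can be tried against sample files. Unreadable values become NA.
log_sample() {
    temp_file="${1:-/sys/class/thermal/thermal_zone0/temp}"
    load_file="${2:-/proc/loadavg}"
    freq_file="${3:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq}"
    if milli=$(cat "$temp_file" 2>/dev/null); then
        temp=$((milli / 1000))            # millidegrees -> whole degrees
    else
        temp=NA
    fi
    load=$(cut -d' ' -f1 "$load_file" 2>/dev/null) || load=NA
    freq=$(cat "$freq_file" 2>/dev/null) || freq=NA
    printf '%s,%s,%s,%s\n' "$(date +%s)" "$temp" "$load" "$freq"
}
```

Left running as `while :; do log_sample >> heat.csv; sleep 10; done`, this produces a few hundred kilobytes per day. When the hot state appears, the timestamp of the jump can be matched against the system journal, and the load and frequency columns show whether anything measurable changed at that moment.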

Alex
  • You're right that my fancontrol is working poorly, if at all. I'll look into getting it to respond properly, but fixing it isn't going to solve the underlying cause. I could let the thing loose at max 6400ish RPM all day and cool down a bit, but it's the cause that I'm trying to address. – numbers1311407 Aug 11 '13 at 23:32
1

Since manually ramping up the fan solves the problem, this would be an excellent place to start troubleshooting, as it suggests that the automatic fan control isn't working.

Now, you run Arch Linux, which is a brilliant distro (yes, I run it too) with a terrific wiki. So, I have to ask, did you RTFM? ;p

https://wiki.archlinux.org/index.php/Lenovo_ThinkPad_T420#Fans

As far as I can see, you need to:

  1. enable the thinkpad_acpi kernel module
  2. install and configure the thinkfan application from the AUR
  3. enable the thinkfan system service

Has this all been done?
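One way to answer that from a terminal is to check the module list and the fan interface that `thinkpad_acpi` exposes. This is a sketch with hypothetical helper names; the file arguments exist only so the functions can be tried against sample files.

```shell
#!/bin/sh
# module_loaded NAME [FILE]: succeed if NAME appears in the loaded-module list.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}

# fan_field FIELD [FILE]: print one field (status, speed or level) from the
# thinkpad_acpi fan interface, which has lines such as "speed:   3900".
fan_field() {
    field="$1"
    file="${2:-/proc/acpi/ibm/fan}"
    awk -v f="$field:" '$1 == f { print $2; exit }' "$file"
}
```

On a working setup, `module_loaded thinkpad_acpi` succeeds and `fan_field level` prints something other than a fixed number once thinkfan is in charge. If writes to the fan interface are refused, the module probably needs the `fan_control=1` option, for instance via a line `options thinkpad_acpi fan_control=1` in a file under `/etc/modprobe.d/`, followed by `sudo systemctl enable --now thinkfan` (assuming the package installs a unit by that name).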

UPDATE I'm glad to hear that your fan is now working correctly, rather than just spinning at a happy medium. (I take it this solved the problem of your machine running at temperatures over 80 degrees?)

As for what the underlying cause of a 30-degree temperature change is, well, I'm tempted to ask: couldn't it be that your fan wasn't working correctly?

Let's postulate that:

  • the processor is not working harder than normal
  • the temperature fluctuations are now replaced with fan speed fluctuations

Wouldn't it be safe to assume that there might not be a problem at all, and that your fan was simply running at a level where it was just barely coping, so that small changes in ambient temperature etc. pushed it past the limit of its capabilities?

For example, I had an Acer laptop, and on warm days the fan was "constantly" spinning up and down. I bet if my fan hadn't been able to adjust itself, I would have seen quite large temperature fluctuations as well :)

JustDanyul
  • I just installed thinkfan. Thanks for the heads up on that. Yeah the detailed arch wiki is a great perk of the distro. I've been there a thousand times but I never stumbled across or thought to look for a page dedicated to my exact model, and had never seen thinkfan mentioned. Nice little script. **That being said,** this doesn't solve my actual problem: *how do I diagnose the cause of my unusually high temp?* Even if I disengaged the fancontrol and cut the thing loose at max RPM, it's only putting a bandaid on the real issue of the mystery heat. – numbers1311407 Aug 12 '13 at 15:48
  • No, the problem still exists unchanged. I'm sitting here right now at 42 degrees. When my heat condition occurs I'll be running the same processes, the ambient will be equivalent, nothing will apparently have changed, but my temp will rise from a consistent 42 to a consistent 75+. I **can** cool it down: put it on a laptop cooler, disengage the fancontrol allowing it to spin at 6500 RPM, etc. But if I stop, it'll heat right back up to 75+. The fan at max speed is only slightly faster than it was going anyway. Fixing it was a good thing, but not a solution to the underlying problem. – numbers1311407 Aug 13 '13 at 16:24