The problem solving chat
Submitted by
reedstrm.
on 2008-06-09 16:28.
Here be the logs for the collaborative problem debugging from June 5th
Conversation with pdfkibbitz@conference.cnx.rice.edu at Thu 05 Jun 2008 01:46:03 PM CDT on reedstrm@jabber.cnx.rice.edu/Gaim (jabber)
(01:46:03 PM) reedstrm entered the room.(01:46:03 PM) pdfkibbitz
(01:46:23 PM) cbearden: Logging in RhaptosPDFLatexTool would be helpful.
(01:46:40 PM) cbearden: So how does this problem manifest itself to the user?
(01:46:42 PM) reedstrm: I can't seem to set the topic :-(
(01:46:43 PM) cbearden: Hung front-ends?
(01:46:54 PM) cbearden has set the topic to: pdf latex module problems
(01:47:01 PM) cbearden: Maybe because I created teh chat?
(01:47:07 PM) cbearden: I'm shown as the owner or whatever.
(01:47:23 PM) reedstrm: We don't know: I think hung frontends eventually giving either a 0 response or an actuall error
(01:48:08 PM) cbearden: Ross have you killed any of the pdflatex processes on naginata?
(01:48:12 PM) jenn: Do we know if the PDFLatexErrors in the event log are the same things that are hanging front ends, or if they're orthogonal?
(01:48:17 PM) cbearden: There is only one module pdf process now.
(01:48:32 PM) reedstrm: don't know
(01:48:32 PM) cbearden: I don't think we know yet.
(01:48:41 PM) jenn: Well, I've got a time series of PDFLatexErrors since midnight if that would help.
(01:48:50 PM) reedstrm: it might.
(01:49:10 PM) reedstrm: I bet the make processes reparented to init correlate w/ your errors.
(01:49:10 PM) jenn: ~jenn/PDFLatexErrors.dat
(01:49:11 PM) cbearden: I could try to correlate them to left-over /tmp files
(01:49:19 PM) reedstrm: Might help.
(01:49:47 PM) reedstrm: I've got a suspicion re: large-memory-restarted frontends abandoning print-make jobs.
(01:49:51 PM) reedstrm: Which then just spin ...
(01:50:40 PM) reedstrm: nagios is showing me FE timeouts (> 15 sec) but they've actually been resolving themselves.
(01:51:59 PM) jenn: Whoops; I actually included all the PDFLatexErrors starting on 6/1, but that's handy too -- you can see that there was a significant uptick starting yesterday.
(01:52:16 PM) jenn: Except...I didn't roll out until about 6 PM. That's odd.
(01:52:24 PM) cbearden: One thing to do is to sort out those relating to modules that won't build from the others.
(01:52:46 PM) cbearden: Cameron is in now.
(01:52:59 PM) jccooper entered the room.
(01:53:00 PM) cbearden: Just invited him
(01:53:04 PM) cbearden: and he just entered.
(01:53:16 PM) jenn: I see several runs of the same module dying several times; could be retries hitting different front ends.
(01:53:33 PM) reedstrm: likely, actually.
(01:55:05 PM) reedstrm: Cameron, you see the history in the chat?
(01:55:41 PM) cbearden: Cameron, we are seeing module PDF generation processes that hang on live.
(01:55:44 PM) reedstrm: We're getting hung module print processes, hung frontends, actual errors, and orphaned print processes.
(01:56:06 PM) jenn: This wasn't a complex rollout, so rollback looks like it would be relatively easy; just backing down the products.
(01:56:37 PM) jccooper: I'll take a look at the errors.
(01:58:05 PM) reedstrm: I'm looking at nagios trend reports
(01:58:39 PM) jenn: I'm moving my log-file investigations to a subdirectory: ~jenn/pdfkibbitz
(01:59:42 PM) jccooper: The problematic ones are like this?<br xmlns='http://www.w3.org/1999/xhtml'/>2008-06-01T08:29:53 ERROR Zope.SiteErrorLog http://cnx.org/content/m12440/1.1/index_html<br xmlns='http://www.w3.org/1999/xhtml'/>Traceback (innermost last):<br xmlns='http://www.w3.org/1999/xhtml'/> Module ZPublisher.Publish, line 115, in publish<br xmlns='http://www.w3.org/1999/xhtml'/> Module ZPublisher.mapply, line 88, in mapply<br xmlns='http://www.w3.org/1999/xhtml'/> Module ZPublisher.Publish, line 41, in call_object<br xmlns='http://www.w3.org/1999/xhtml'/> Module Products.RhaptosModuleStorage.ModuleView, line 98, in index_html<br xmlns='http://www.w3.org/1999/xhtml'/> Module Products.RhaptosModuleStorage.ModuleView, line 305, in default<br xmlns='http://www.w3.org/1999/xhtml'/> Module Products.RhaptosModuleStorage.ModuleView, line 330, in downloadPDF<br xmlns='http://www.w3.org/1999/xhtml'/> Module Products.RhaptosPDFLatexTool.PDFLatexTool, line 97, in convertFSDirToPDF<br xmlns='http://www.w3.org/1999/xhtml'/>PDFLatexError: No output file<br xmlns='http://www.w3.org/1999/xhtml'/>
(02:00:01 PM) jenn: Don't know; that's a common error that's showing up, but we don't know if they're 1:1 with the hangs.
(02:00:01 PM) reedstrm: Perhaps.
(02:00:02 PM) cbearden: We don't know if these correlate to the hung processes yet.
(02:01:10 PM) jenn: A time series of those errors, at one per line, is at ~jenn/pdfkibbitz/PDFLatexErrors.dat (don't know if you got the whole chat history).
(02:01:16 PM) reedstrm: 'k here's a real data point: Only the cnx instances (not cnx1 or cnx2) on each server show any history of actually hanging.
(02:01:33 PM) jenn: That wouldn't surprise me at all; it would mean a config problem.
(02:01:53 PM) reedstrm: nagios definition of hanging: 4 socket timeouts in a row > 15 sec.
(02:02:13 PM) jenn: Could happen if, say, all printing is being dumped on the cnx instance.
(02:02:31 PM) reedstrm: right
(02:02:41 PM) reedstrm: or image retrieval ...
(02:02:51 PM) reedstrm: I'm suspicious of imagefix, right now.
(02:02:53 PM) cbearden: Does the PDF make use cnx.org?
(02:02:55 PM) cbearden: Ah.
(02:03:12 PM) jenn: I sent that email with the config as rolled out...let me find it....
(02:03:15 PM) cbearden: Are the processes hanging in pdflatex as well?
(02:05:11 PM) jenn: Erm...I'm not actually finding that email in my outbox. Has anyone else seen it? I announced the rollout , mentioned some post-rollout testing plans, and listed the ZMI config parameters we were running with.
(02:05:55 PM) cbearden: I'm checking my printing mailbox
(02:06:06 PM) cbearden: Man, ece mail is slow
(02:06:17 PM) jenn: Precisely because I wanted to make sure I had guessed right on everything. There were some "maybes" in the rollout "procedure".
(02:06:29 PM) reedstrm: Hmm.
(02:06:34 PM) reedstrm: I see no such email.
(02:06:49 PM) jccooper: The PDFLatexError, "No output file" errors are not entirely unexpected; they will happen when no PDF is generated, and we know that happens sometimes.
(02:06:52 PM) cbearden: I'm not nfinding one
(02:06:57 PM) jccooper: Nor I.
(02:06:57 PM) reedstrm: (blushes, 'cause he hadn't actually read email yet today ...)
(02:06:59 PM) jenn: Well, glargh.
(02:07:16 PM) jenn: Well, okay, here are the parameters that are live:
(02:07:16 PM) cbearden: Right, though there were some of those that yielded good PDFs on my dev instance running the same code.
(02:07:34 PM) jccooper: Yes; they may have failed in data fetching or something.
(02:07:36 PM) jenn: rhaptos_collection_print_config has servicehost cnx.org.
(02:07:36 PM) cbearden: Some of those errors I think don't relate to un-PDFableness
(02:07:38 PM) reedstrm: BTW, I think a panic rollback is not required: once I cleared the stuck pdflatex processes from midnight, the FE's 'unstuck'
(02:08:09 PM) reedstrm: I've just watched a number of FE's restart for memory size.
(02:08:17 PM) jenn: RCPrinter's parameters:<br xmlns='http://www.w3.org/1999/xhtml'/> Makefile path: /opt/instances/cnx/Products/RhaptosCollection/printing/Makefile<br xmlns='http://www.w3.org/1999/xhtml'/> Portal path: /plone<br xmlns='http://www.w3.org/1999/xhtml'/> Download host: cnx.org
(02:09:19 PM) cbearden: That looks good to mel
(02:09:24 PM) jccooper: The module printing sets its Makefile host param as the host from the RCPrinter.
(02:09:27 PM) jenn: Who's Mel, and what does he knw?
(02:09:29 PM) cbearden: I think I've captured the tmpfile of a stuck build process.
(02:10:33 PM) reedstrm: jenn: haha
(02:11:05 PM) cbearden: process 24777 on naginata, witth child process 9294 (pdflatex)
(02:11:29 PM) cbearden: I can't tell which instance it may have run on.
(02:11:42 PM) cbearden: module.log ends in the middle.
(02:11:45 PM) reedstrm: Yup, that's the lasty stuck one. It was launched just past midnight.
(02:12:42 PM) cbearden: It's a PDF that does successfully build, I believe.
(02:15:14 PM) reedstrm: There were a bunch from right around midnight.
(02:16:18 PM) cbearden: Re: the stuck process I'm investigating: I built the same module on naginata, using HOST=cnx.org and the module makefile from the cnx instance, and the LaTeX files of the two differ in unexpected ways.
(02:16:26 PM) cbearden: They should be very nearly the same.
(02:17:22 PM) cbearden: Oh, wait.
(02:17:54 PM) cbearden: Never mind; someone was viewing an older version of the module.
(02:17:57 PM) cbearden: That was what failed.
(02:18:00 PM) cbearden: I wonder if that could be a problem.
(02:18:06 PM) cbearden: Not sure why, though.
(02:18:25 PM) jccooper: The module printing should run on every host; it's synchronous.
(02:18:44 PM) jccooper: Dunno why we'd only see differences on the first instance.
(02:18:55 PM) jccooper: Unless there's a software difference.
(02:19:58 PM) cbearden: I don't think so.
(02:20:13 PM) cbearden: Well, but I'm only checking RhaptosCollection, sorry
(02:24:13 PM) cbearden: Interesting.
(02:24:32 PM) cbearden: I started a manual build of m11196 version 1.1 on naginata, and it seems to have stalled in pdflatex
(02:24:59 PM) cbearden: PID 31073
(02:25:54 PM) cbearden: pdflatex has written nothing to its log file.
(02:26:51 PM) cbearden: Since I'm using the same version of content this time as was used in the earlier stuck process, the LaTeX files are identical
(02:27:03 PM) cbearden: I need to try building this one on my dev instance.
(02:27:48 PM) reedstrm: chuck: I'm stracing the stuck pdflatex on naginata
(02:28:08 PM) reedstrm: It's not doing any system calls, at least.
(02:28:38 PM) cbearden: I'll be curious to find out what you learn, but oddly, on my unloaded dev instance, the process seems to be stuck after clicking on the link.
(02:28:50 PM) cbearden: So maybe for somereason the PDF generation doesn't like earlier versions.
(02:29:04 PM) cbearden: I don't know why that would be.
(02:29:49 PM) reedstrm: well, it just wrote a bit, then read some more.
(02:30:16 PM) cbearden: When I killed the manual build on naginata, it left me at an interactive pdflatex prompt.
(02:30:42 PM) jenn: First parsed that as "unattractive pdflatex prompt".
(02:30:51 PM) cbearden: It is that, too.
(02:30:53 PM) reedstrm: ltrace says it's received 'SIGSTOP'
(02:30:58 PM) cbearden: entering extended mode<br xmlns='http://www.w3.org/1999/xhtml'/>! Interruption.<br xmlns='http://www.w3.org/1999/xhtml'/><to be read again> <br xmlns='http://www.w3.org/1999/xhtml'/> \unvbox <br xmlns='http://www.w3.org/1999/xhtml'/>l.459 \tabularnewline\hline<br xmlns='http://www.w3.org/1999/xhtml'/> <br xmlns='http://www.w3.org/1999/xhtml'/>? <br xmlns='http://www.w3.org/1999/xhtml'/>
(02:31:04 PM) jenn: How...unattractive.
(02:31:08 PM) reedstrm: THat's you?
(02:31:18 PM) reedstrm: Oh, you interrupted the one I was tracing!
(02:31:34 PM) cbearden: Oh, sorry
(02:31:39 PM) cbearden: That was one I started.
(02:31:41 PM) cbearden: Let me restart it.
(02:31:44 PM) cbearden: I assumed it was mine :)
(02:31:57 PM) reedstrm: Right, which is why I warned you above that I Was tracing it.
(02:32:14 PM) cbearden: I thought you were tracing one that had been started by zope
(02:32:19 PM) cbearden: I can hit that module again for you.
(02:32:26 PM) reedstrm: no need.
(02:32:35 PM) cbearden: Let me know if you want me to fire one off for you.
(02:32:43 PM) cbearden: I need to be able to kill mine at will.
(02:33:06 PM) jccooper: It seems to be doing the same thing on boole:8281
(02:33:15 PM) jccooper: At least it's repeatable.
(02:33:21 PM) cbearden: Same module and version, right?
(02:33:25 PM) jccooper: Yes.
(02:33:42 PM) cbearden: So now I'm going to see at what point in the revisions of this module the problem stops.
(02:33:47 PM) cbearden: Because the latest rev is fine.
(02:33:59 PM) jccooper: I'll check historical version of other modules.
(02:34:13 PM) reedstrm: yup, the stuck on from midnight also claims to have received SIGSTOP
(02:34:21 PM) jccooper: The symptom is a spinning pdflatex, correct?
(02:34:26 PM) cbearden: It could be an "any older version" problem, or it could be some feature of this older version.
(02:34:27 PM) cbearden: Yes.
(02:34:28 PM) reedstrm: sleeping.
(02:34:31 PM) cbearden: Well, not spinning.
(02:34:33 PM) cbearden: Right sleeping.
(02:34:34 PM) reedstrm: Stopped, actually.
(02:34:45 PM) cbearden: 'E's not dead, 'e's restin'.
(02:34:52 PM) jenn: The general case should've been taken care of by Scott printing all the official test modules; there's an old version in there.
(02:34:57 PM) cbearden: pinin' for the fiords
(02:35:01 PM) jccooper: 27804 jccooper 25 0 29212 7188 1792 R 98 0.3 2:13.62 pdflatex
(02:35:20 PM) jccooper: That 98 is %CPU. It
(02:35:30 PM) jccooper: 's been going on for far too long.
(02:35:32 PM) cbearden: m11196 version 1.3 works.
(02:35:53 PM) cbearden: I'm thinking that there's some feature of the earlier version that is not liked.
(02:36:22 PM) reedstrm: Cameron: is that pid 6364 on plantinga? ltrace also says that's gotten a SIGSTOP
(02:36:48 PM) reedstrm: Sorry, you said boole.
(02:37:25 PM) reedstrm: Chuck, pid 6364 on plantiga, which also shows 94% CPU, anmd a runtime of over 9 min, is at SIGSTOP
(02:38:01 PM) cbearden: ok
(02:38:02 PM) cbearden: let me see
(02:38:30 PM) jccooper: 27804 on boole
(02:38:33 PM) cbearden: That's probably the one I started through the instance on 8180.
(02:39:30 PM) jenn: Need to restart to placate vmware; back in a bit.
Conversation with pdfkibbitz@conference.cnx.rice.edu at Thu 05 Jun 2008 02:43:29 PM CDT on reedstrm@jabber.cnx.rice.edu/Gaim (jabber)
(02:43:30 PM) reedstrm entered the room.(02:43:30 PM) pdfkibbitz
(02:35:08 PM) cbearden: Well, not spinning.
(02:35:10 PM) cbearden: Right sleeping.
(02:35:11 PM) reedstrm: Stopped, actually.
(02:35:22 PM) cbearden: 'E's not dead, 'e's restin'.
(02:35:29 PM) jenn: The general case should've been taken care of by Scott printing all the official test modules; there's an old version in there.
(02:35:34 PM) cbearden: pinin' for the fiords
(02:35:38 PM) jccooper: 27804 jccooper 25 0 29212 7188 1792 R 98 0.3 2:13.62 pdflatex
(02:35:57 PM) jccooper: That 98 is %CPU. It
(02:36:07 PM) jccooper: 's been going on for far too long.
(02:36:09 PM) cbearden: m11196 version 1.3 works.
(02:36:30 PM) cbearden: I'm thinking that there's some feature of the earlier version that is not liked.
(02:36:59 PM) reedstrm: Cameron: is that pid 6364 on plantinga? ltrace also says that's gotten a SIGSTOP
(02:37:25 PM) reedstrm: Sorry, you said boole.
(02:38:02 PM) reedstrm: Chuck, pid 6364 on plantiga, which also shows 94% CPU, anmd a runtime of over 9 min, is at SIGSTOP
(02:38:38 PM) cbearden: ok
(02:38:39 PM) cbearden: let me see
(02:39:07 PM) jccooper: 27804 on boole
(02:39:10 PM) cbearden: That's probably the one I started through the instance on 8180.
(02:40:07 PM) jenn: Need to restart to placate vmware; back in a bit.
(02:40:10 PM) jccooper: <br xmlns='http://www.w3.org/1999/xhtml'/>jccooper@boole:/tmp/col10532$ ltrace -p 27804<br xmlns='http://www.w3.org/1999/xhtml'/>--- SIGSTOP (Stopped (signal)) ---<br xmlns='http://www.w3.org/1999/xhtml'/>--- SIGSTOP (Stopped (signal)) ---
(02:43:30 PM) cbearden has set the topic to: pdf latex module problems
(02:43:41 PM) reedstrm: Ah there I am
(02:43:48 PM) jenn: Heh.
(02:43:51 PM) reedstrm: Note: bad idea to ltrace your x server.
(02:44:04 PM) ***jenn laughs!
(02:44:15 PM) reedstrm: Just to see if that SIGSTOP thing is what ltrace does.
(02:44:18 PM) reedstrm: Answer: nope!
(02:44:39 PM) reedstrm: my server spends a lot of time deciding how long the string 'nearest' is.
(02:45:01 PM) jenn: cooker cant cownt so gud
(02:45:22 PM) jccooper: <br xmlns='http://www.w3.org/1999/xhtml'/>...<br xmlns='http://www.w3.org/1999/xhtml'/>xsltproc -o module.bib /opt/instances/aod/Products/RhaptosCollection/printing/latex/bibtexml2bibtex.xsl module.tmp1<br xmlns='http://www.w3.org/1999/xhtml'/>cp module.bib index.bib<br xmlns='http://www.w3.org/1999/xhtml'/>pdflatex --interaction batchmode -shell-escape module.tex<br xmlns='http://www.w3.org/1999/xhtml'/>This is pdfeTeX, Version 3.141592-1.21a-2.2 (Web2C 7.5.4)<br xmlns='http://www.w3.org/1999/xhtml'/> \write18 enabled.<br xmlns='http://www.w3.org/1999/xhtml'/>entering extended mode<br xmlns='http://www.w3.org/1999/xhtml'/>! Interruption.<br xmlns='http://www.w3.org/1999/xhtml'/>\@endpbox ...\ST@dimen \fi \ST@dimen =\z@ \unvbox<br xmlns='http://www.w3.org/1999/xhtml'/> \ST@pbox \egroup \hfil<br xmlns='http://www.w3.org/1999/xhtml'/>l.460 \tabularnewline\hline<br xmlns='http://www.w3.org/1999/xhtml'/><br xmlns='http://www.w3.org/1999/xhtml'/>? 2008-06-05 14:45:16 INFO Signals Caught signal SIGINT<br xmlns='http://www.w3.org/1999/xhtml'/>2008-06-05 14:45:16 INFO Z2 Shutting down<br xmlns='http://www.w3.org/1999/xhtml'/>
(02:45:37 PM) jenn: Ouch.
(02:45:39 PM) cbearden: Someone is building a PDF on plantinga, right?
(02:45:56 PM) reedstrm: !I
(02:46:01 PM) jccooper: <br xmlns='http://www.w3.org/1999/xhtml'/>jccooper@boole:/opt/instances/aod$ Type <return> to proceed, S to scroll future error messages,<br xmlns='http://www.w3.org/1999/xhtml'/>R to run without stopping, Q to run quietly,<br xmlns='http://www.w3.org/1999/xhtml'/>I to insert something, E to edit your file,<br xmlns='http://www.w3.org/1999/xhtml'/>H for help, X to quit.<br xmlns='http://www.w3.org/1999/xhtml'/>?<br xmlns='http://www.w3.org/1999/xhtml'/>! Emergency stop.<br xmlns='http://www.w3.org/1999/xhtml'/>\@endpbox ...\ST@dimen \fi \ST@dimen =\z@ \unvbox<br xmlns='http://www.w3.org/1999/xhtml'/> \ST@pbox \egroup \hfil<br xmlns='http://www.w3.org/1999/xhtml'/>l.460 \tabularnewline\hline<br xmlns='http://www.w3.org/1999/xhtml'/><br xmlns='http://www.w3.org/1999/xhtml'/>No pages of output.<br xmlns='http://www.w3.org/1999/xhtml'/>Transcript written on module.log.<br xmlns='http://www.w3.org/1999/xhtml'/>make: [module.pdf] Error 1 (ignored)<br xmlns='http://www.w3.org/1999/xhtml'/>
(02:46:03 PM) cbearden: which module is that?
(02:46:15 PM) cbearden: I'm suspecting table stuff that pdflatex can't chew.
(02:46:20 PM) jccooper: m11196/1.1
(02:46:22 PM) reedstrm: --interaction batchmode!!!
(02:46:37 PM) reedstrm: Does'nt that mean "don't try to interact with me"!?
(02:46:40 PM) cbearden: That is needed I think so that it can run without input from the user.
(02:46:48 PM) cbearden: Right.
(02:47:03 PM) reedstrm: seems to not be working.
(02:47:03 PM) cbearden: So perhaps pdflatex is returning to interactive mode
(02:47:06 PM) ***jenn considers putting --interaction batchmode on her door sometimes
(02:47:10 PM) cbearden: ahd hanging
(02:47:18 PM) cbearden: ls
(02:47:23 PM) cbearden: oops
(02:47:28 PM) cbearden: meant for my temr
(02:47:30 PM) cbearden: term
(02:48:18 PM) Ed entered the room.
(02:48:34 PM) cbearden: I should have thought to invite you, Ed.
(02:48:45 PM) Ed: pdf problems?
(02:48:48 PM) cbearden: I think v 1.2 of m11196 is hanging as well.
(02:48:52 PM) cbearden: It also has the table.
(02:48:54 PM) cbearden: Yes.
(02:49:00 PM) reedstrm: you got the history of the chat?
(02:49:10 PM) jenn: Looks like maybe certain very old versions of modules aren't building right.
(02:49:11 PM) Ed: yes
(02:50:10 PM) cbearden: I have confirmed that versions 1.1 and 1.2 of m11196 do hang, even from the command line.
(02:50:23 PM) reedstrm: -interaction mode
Sets the interaction mode. The mode can be either batchmode,
nonstopmode, scrollmode, and errorstopmode. The meaning of
these modes is the same as that of the corresponding \commands.
(02:50:29 PM) cbearden: Versions 1.3 and 1.4 don't.
(02:50:31 PM) reedstrm: So, where are \commands defined?
(02:50:42 PM) cbearden: probably in pdflatex docs.
(02:51:05 PM) cbearden: I haven't really pursued them, since this hasn't given us problems ( to my knowledge) before
(02:51:14 PM) jenn: My canonical old-version module, btw, contains figures and math but no tables.
(02:51:25 PM) cbearden: Well, tables in general shouldn't be a problem.
(02:51:36 PM) cbearden: I'm going to look more closely at the tables.
(02:51:48 PM) cbearden: One thing to do would be to try to idenfity the target modules for processes that hang.
(02:51:57 PM) cbearden: Then we can look for commonalities in the LaTeX files.
(02:52:14 PM) reedstrm: random makefile samples I'm finding on the net seem to use 'nonstopmode'
(02:52:24 PM) cbearden: Well, we could try that.
(02:52:30 PM) cbearden: But not without testing. ;)
(02:52:37 PM) Ed: Any idea why a user is generating a pdf on a old version of the module?
(02:52:52 PM) reedstrm: I'm guessing bots that don't honor robots.txt
(02:52:59 PM) jccooper: cbearden: I confirm your working/non-working versions.
(02:53:00 PM) jenn: A user probably isn't; it's almost certainly a bot.
(02:53:10 PM) Ed: ok
(02:53:10 PM) jenn: Which isn't obeying our PDF exclusion.
(02:54:20 PM) reedstrm: Though it's a valid thing to do, if you like the older version for some reason.
(02:54:30 PM) cbearden: Though I could imagine a user really wanting to see version 1.1 of a module, and clicking on it several times to try to get the PDF.
(02:54:47 PM) jenn: A list of the ~30 modules that have errored, from rollout yesterday to about 1 PM, is in ~jenn/pdfkibbitz/error-modules.txt
(02:54:56 PM) cbearden: I think we don't know how extensive the problem is until we find out how many modules are represented by the hangs.
(02:55:07 PM) cbearden: Do we know that these processes yield errors?
(02:55:15 PM) jenn: No clue.
(02:55:28 PM) cbearden: Let me see if the ones I triggerdd in my dev instnace did.
(02:56:25 PM) jenn: Can we look at the recent logs to see if the hangs we've generated manually have logged errors in event.log?
(02:56:36 PM) cbearden: That's what I'm doing on my dev instance.
(02:56:54 PM) reedstrm: someone's ran wget on that particular module, BTW.
(02:56:54 PM) jenn: Ah.
(02:57:05 PM) cbearden: We could also collect the ids of the ones that do leave errors, and try to build them offline to see if they actually yield PDFs.
(02:57:18 PM) reedstrm: Jenn: no version means latest?
(02:57:23 PM) jenn: Ross: right.
(02:57:56 PM) cbearden: I don't think the ones that hang that I've invoked on plantinga:8180 are leaving messages in the log file, unless they come quite a while after the attempt.
(02:58:18 PM) jenn: depot:8080 is a clean clone of production as of last night's backup, if anyone wants to run tests there.
(02:59:45 PM) cbearden: The one that I confirmed to hang (m11196/1.1) on plantinga:8180 was started at 14:28, and it still hasn't left an error message in Z2.log
(03:00:09 PM) jenn: So..."my" PDFLatexErrors may be unrelated, or secondary.
(03:00:30 PM) reedstrm: As for user-visiblity of all this, the errors Jenn reported might show up, and we seem to have a bit of a spike in '0 0' entries in squid access.log (which are 'nothing to return' errors: FE timeout)
(03:01:02 PM) reedstrm: Otherwise, the load balancing seems to be handling this.
(03:02:02 PM) jenn: Ah. m11196 is a Java applet module.
(03:02:30 PM) reedstrm: is imagefix trying to hard to render a javaapplet?
(03:02:51 PM) cbearden: Yes, but I think it's in latest as well.
(03:03:05 PM) jenn: The applet is in all the versions, but only 1.1 and 1.2 have the table layout.
(03:03:06 PM) cbearden: I don't think imagefix does anything with javav, and in any case the hang in in pdflatex which comes after imagefix
(03:03:13 PM) cbearden: Right.
(03:03:17 PM) reedstrm: 'k
(03:03:21 PM) cbearden: And the table uses an entrytbl
(03:03:27 PM) cbearden: That's what I'm suspecting at present.
(03:04:50 PM) cbearden: Now I've got two spinning pdflatex on plantinga.
(03:04:55 PM) cbearden: Ross, are you stracing them?
(03:05:07 PM) reedstrm: nope
(03:06:37 PM) reedstrm: I find no ready doco on what batchmode vs. nonstopmode means.
(03:07:01 PM) reedstrm: Just lots of copies of the man page.
(03:07:03 PM) cbearden: Not toally surprised.
(03:07:17 PM) cbearden: The package may be known also as pdfetex
(03:07:26 PM) cbearden: I may have a manual around here somewhere.
(03:07:29 PM) cbearden: as in a PDF.
(03:08:10 PM) cbearden: Well, it appears not to have batch or nonstop in it.
(03:08:51 PM) cbearden: Ah, okay, I finally did get PDFLatexErrors in my log file
(03:09:04 PM) cbearden: But only when I killed the pdflatex processes.
(03:09:22 PM) reedstrm: I'm going to kill the last hanging make on naginata.
(03:09:26 PM) cbearden: ok
(03:09:37 PM) cbearden: Is that the one I rescued?
(03:09:49 PM) reedstrm: I've just watched claymore restart two FE, and we have no additional 'victims'
(03:10:09 PM) reedstrm: gone
(03:10:18 PM) jenn: So do we think all the unpleasantness stemmed from just a couple of print attempts?
(03:10:22 PM) reedstrm: Chuck, how did you chose batchmode in the dirsplace?
(03:10:28 PM) reedstrm: firstplace?
(03:10:29 PM) cbearden: I didn't.
(03:10:36 PM) cbearden: I accepted the choice made before me.
(03:10:48 PM) reedstrm: Oh. So, anyone know how batchmode got chosen in the first place?
(03:10:50 PM) reedstrm: :-)
(03:11:06 PM) jenn: !I. :)
(03:11:14 PM) cbearden: 7965 brentmh %.pdf: %.tex %.bib<br xmlns='http://www.w3.org/1999/xhtml'/> 7987 brentmh -pdflatex --interaction batchmode -shell-escape $<<br xmlns='http://www.w3.org/1999/xhtml'/> 7987 brentmh -bibtex $*<br xmlns='http://www.w3.org/1999/xhtml'/> 7987 brentmh -pdflatex --interaction batchmode -shell-escape $<<br xmlns='http://www.w3.org/1999/xhtml'/> 7965 brentmh pdflatex --interaction batchmode -shell-escape $<<br xmlns='http://www.w3.org/1999/xhtml'/> 7965 brentmh <br xmlns='http://www.w3.org/1999/xhtml'/>
(03:11:20 PM) cbearden: That's svn blame
(03:11:24 PM) cbearden: BLAME BRENT!
(03:11:29 PM) jenn: Badump-bump. :)
(03:12:00 PM) jenn: Too bad Ed wasn't here when Brent came through town to visit.
(03:12:11 PM) reedstrm: :-----------------------------------------------------------------------
r7987 | brentmh | 2005-08-02 17:29:29 -0500 (Tue, 02 Aug 2005) | 4 lines
- Set pdflatex and bibtex commands to not cause make to fail even if
they "error" out
------------------------------------------------------------------------
(03:12:19 PM) reedstrm: That's the log.
(03:12:27 PM) cbearden: Ahah
(03:12:30 PM) jenn: Ooo. Documentation.
(03:12:42 PM) cbearden: Wow, maybe I'm just interested in blame.
(03:12:44 PM) cbearden: ;)
(03:16:05 PM) cbearden: I'm experimenting to see if by modifying the table I can make m11196 1.1 build.
(03:16:25 PM) cbearden: Hmm, appears still to hang.
(03:16:27 PM) reedstrm: could you try nonstopmode and see if it at least blows past the error?
(03:16:32 PM) cbearden: sure
(03:17:29 PM) cbearden: It doesn't
(03:17:36 PM) jenn: Does it die instead?
(03:17:37 PM) reedstrm: doesn't what?
(03:18:06 PM) cbearden: Wait ...
(03:18:30 PM) cbearden: Let me try this again.
(03:19:22 PM) cbearden: It doesn't blow past.
(03:19:35 PM) cbearden: It just spits out all the stuff that would go to the log file to stderr or stdout.
(03:19:37 PM) cbearden: Didn
(03:19:39 PM) cbearden: 't check which.
(03:19:57 PM) cbearden: This is the only case in which I've seen pdflatex hang like this.
(03:21:43 PM) cbearden: Here's one idea: collection all the module IDs represented in PDFLatexError log messages since rollout, and evaluate them to see which ones die normally, and which ones hang.
(03:22:14 PM) jenn: Or if they're all just fine, and the problem was with load somehow.
(03:22:27 PM) cbearden: Well, some of them were this module and perhaps one or two like it.
(03:22:34 PM) jenn: I can do that test easily on depot; I have scripts.
(03:22:38 PM) cbearden: Both versions 1.1 and 1.2 of this module create the same effect.
(03:23:04 PM) cbearden: Okay. So you will extract the module IDs and versions of those in the PDFLatexError messages, and build those?
(03:23:04 PM) jenn: Which module; m11196?
(03:23:09 PM) cbearden: That would be helpful.
(03:23:11 PM) cbearden: Yes, m11196.
(03:23:15 PM) reedstrm: So:
@d batch_mode=0 {omits all stops and omits terminal output}
@d nonstop_mode=1 {omits all stops}
@d scroll_mode=2 {omits error stops}
@d error_stop_mode=3 {stops at every opportunity to interact}
@d print_err(#)==begin if interaction=error_stop_mode then wake_up_terminal;
print_nl("! "); print(#);
end
(03:23:16 PM) jenn: But that one's not on the list of ones that PDFLatexErrored.
(03:23:37 PM) cbearden: No log is written to zope log until the pdflatex process is killed, if my experience is any guide.
(03:24:00 PM) jenn: I've test-built a few from the error list, and they're fine.
(03:24:03 PM) reedstrm: Looks like batchmode is correct.
(03:24:30 PM) jenn: I'll build them all and see what I get.
(03:24:33 PM) cbearden: So I think the hang phenomenon is distinct, and related to things that pdflatex chokes on in a very unusual fashion.
(03:24:40 PM) reedstrm: Chuck, can you get the pdflatex run to hang from direct commandline execuation, or only from make?
(03:25:05 PM) cbearden: If we think the PDFLatexError modules are different cases, then let's try to identify and save /tmp filefs for hung processes.
(03:25:10 PM) cbearden: Let me try.
(03:25:42 PM) cbearden: It hangs on the command line.
(03:25:54 PM) reedstrm: /aside I've never Used The Source! before on web sources.
(03:26:12 PM) cbearden: ?
(03:26:14 PM) reedstrm: cool. We can try a direct ltrace or debug run, then.
(03:26:30 PM) cbearden: It's still hung on naginata, if you want to try it.
(03:26:38 PM) reedstrm: TeX is written in a Knuth specific thing called Web.
(03:26:39 PM) cbearden: ps auxw | grep cbearden | grep make or pdflatex
(03:26:44 PM) cbearden: Ah
(03:26:47 PM) cbearden: Right.
(03:26:50 PM) cbearden: I sort of knew that.
(03:27:12 PM) reedstrm: What are you doing testing on naginata? Jenn offered depot 8080 as a production clone!
(03:27:25 PM) cbearden: I was running on the command line.
(03:27:35 PM) cbearden: Should I not be?
(03:27:39 PM) cbearden: I'll stop, if you like.
(03:28:05 PM) reedstrm: In general, stay off the production machines unless we can't replicate it elsewhere. Justa best-practice sort of thing.
(03:28:09 PM) cbearden: ok
(03:28:14 PM) cbearden: shall I re-run on depot?
(03:28:21 PM) reedstrm: Once you said you'r replicated on plantinga, I figured ...
(03:28:21 PM) jenn: I'm running on depot at the moment, though.
(03:28:31 PM) reedstrm: Let me: I want to actually capture the whole run.
(03:28:37 PM) cbearden: ok
(03:28:54 PM) jenn: They're all taking a while, but building.
(03:28:56 PM) reedstrm: So, got a command for me?
(03:29:12 PM) cbearden: pdflatex --interaction batchmode -shell-escape m11196.tex<br xmlns='http://www.w3.org/1999/xhtml'/>But you have to have a tex file already.
(03:29:15 PM) cbearden: and all the images.
(03:29:20 PM) jenn: May I suggest somewhere besides depot while I'm running the error modules?
(03:29:27 PM) cbearden: make a copy of <br xmlns='http://www.w3.org/1999/xhtml'/>/home/cbearden/printing/tmp/stalls/m11196
(03:29:31 PM) reedstrm: No, the load may actually help. :-)
(03:29:32 PM) cbearden: Sure, use plantinga
(03:29:42 PM) reedstrm: 'k I'll try plantinga first. :-)
(03:29:56 PM) cbearden: but please make a copy; don't use my dir
(03:30:14 PM) jenn: Oh, no wonder it's taking a while; this is the copy of the script I modified to actually download and keep the PDFs.
(03:30:17 PM) reedstrm: already doing so.
(03:32:17 PM) reedstrm: Chuck, it's running in /tmp/m11196
(03:32:21 PM) cbearden: ok
(03:32:38 PM) reedstrm: end of m11196.log indicates something about tables:
I can't figure out why you would want to use a tab mark
or \cr or \span just now. If something like a right brace
up above has ended a previous alignment prematurely,
you're probably due for more error messages, and you
might try typing `S' now just to see what is salva
(03:32:45 PM) cbearden: I'm going to check my as-yet unsynced, unupdated dev instance on 8280 to see if this is a regression or a new bug.
(03:32:48 PM) jenn: How polite.
(03:32:57 PM) cbearden: Wow, load is really high on my system.
(03:33:16 PM) cbearden: Just let me know when you're done so I can work again.
(03:33:37 PM) reedstrm: As, actual lag causing load?
(03:33:46 PM) reedstrm:
(03:33:49 PM) cbearden: Yes.
(03:34:00 PM) cbearden: I'm trying to retrieve a module from 8280
(03:34:13 PM) reedstrm: There I kilt it.
(03:34:18 PM) cbearden: Maybe my firefix went south
(03:34:26 PM) cbearden: Still have one CPU pegged.
(03:34:41 PM) reedstrm: that's be firefox, according to top
(03:34:42 PM) cbearden: Perhaps that;'s my firefox
(03:34:45 PM) cbearden: ruight
(03:34:55 PM) cbearden: bye bye
(03:35:04 PM) reedstrm: Want m,e ti kill it?
(03:35:08 PM) cbearden: done die
(03:35:09 PM) cbearden: done did
(03:35:13 PM) cbearden: (sorry)
(03:35:26 PM) reedstrm: oh, I thought you meant you were going to log off, or something.
(03:35:35 PM) cbearden: