ADS to BibDesk: Command Line & PDF Ingest

October 31, 2012

In the last few weeks I’ve been rolling out improvements to the venerable ADS to BibDesk service. Today I’m announcing version 3.0.6. What’s new?

  1. A full-fledged command line edition, installable with pip,
  2. A PDF ingest mode, great for getting your legacy folder of PDFs into BibDesk, and
  3. Lots of bug fixes to make ADS to BibDesk more robust against the peculiarities of some papers.

The Command Line Edition

It is now possible to run ADS to BibDesk from the command line. This opens up new possibilities for hacking your own workflows: from automatic scripts to integration with Mac OS X launchers like Alfred. To get started, you can pip-install the latest release (you may need to run this with sudo):

pip install adsbibdesk

Then check out the help:

adsbibdesk --help

The command line edition takes the very same tokens as the Service edition: an ADS or arXiv URL, an ADS bibcode, an arXiv pre-print ID, or a DOI. For example:

adsbibdesk 1998ApJ...500..525S
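As a rough idea of what an automatic script could look like, here is a minimal Python sketch that feeds a list of tokens to the command line edition one at a time. The function name import_tokens is my own illustration, and the sketch assumes that adsbibdesk is on your PATH and exits with a non-zero status when an import fails (an assumption worth checking against your install).

```python
import subprocess


def import_tokens(tokens, runner=subprocess.call):
    """Run `adsbibdesk <token>` for each token and return the tokens that failed.

    Assumption: adsbibdesk exits with a non-zero status when an import fails.
    `runner` is injectable so the loop can be exercised without BibDesk.
    """
    failed = []
    for token in tokens:
        # Pass arguments as a list so no shell quoting is needed.
        if runner(["adsbibdesk", token]) != 0:
            failed.append(token)
    return failed
```

Calling import_tokens(["1998ApJ...500..525S"]) would import the paper from the example above and return an empty list if all went well.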

Ingesting a Folder of PDFs

BibDesk is becoming more popular with astronomers. One request I’ve received from new users is an easier way to add folders full of papers downloaded from ADS and arXiv into BibDesk (along with the matching BibTeX and abstract data). ADS to BibDesk is good at downloading papers, BibTeX and abstracts; the challenge here is reliably identifying a paper given its PDF.

The approach I’ve taken is borrowed from an older script by Dr Lucy Kim. The first step is to extract text from a PDF; the second is to extract a DOI string from that text. ADS to BibDesk can then act on that DOI as usual.

To extract text from a PDF, I’ve opted for the pdf2json program.1 It can easily be installed with Homebrew on your Mac. Before you try the PDF ingest mode, go ahead and install pdf2json.

Next, we need to extract a DOI from the paper’s text: a perfect job for regular expressions. The solution is written by Alix Axel in this excellent StackOverflow post, and the Python implementation is:

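import re

# Alix Axel's DOI pattern, written as a raw string so the backslashes reach the regex engine intact
regStr = r'\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+)\b'
pattern = re.compile(regStr)
# paperTxtData holds the text extracted from the PDF
doiMatches = pattern.findall(paperTxtData)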

Reading through that StackOverflow post, it appears that DOI is a tricky format to parse. Fortunately, this regular expression seems to work with the astronomical literature.
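To see the pattern at work, here is a small self-contained sketch that wraps it in a helper and runs it on a line of text. The helper name find_first_doi and the sample text (including its DOI) are made up for illustration; they are not taken from ADS to BibDesk or from a real paper.

```python
import re

# Same pattern as above, compiled once and reused.
DOI_PATTERN = re.compile(
    r'\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+)\b')


def find_first_doi(text):
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    return match.group(1) if match else None


# Made-up first-page text with a made-up DOI, for illustration only.
sample = "Some Journal, 12:34, 2012 doi:10.1234/example.2012.001."
print(find_first_doi(sample))  # → 10.1234/example.2012.001
print(find_first_doi("A scanned paper with no DOI on its first page"))  # → None
```

Note how the trailing word boundary drops the sentence-ending period from the match, which is handy when a DOI sits at the end of a line of running text.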

You can give this PDF ingest workflow a try via:

adsbibdesk -p my_pdf_dir/

where my_pdf_dir/ is a directory containing PDFs that you want to ingest into BibDesk.

Note that DOIs are not present in all papers, particularly ones more than a few years old. In newer papers you can usually find the DOI text on the first page.

Bug Fixes

Personally, I’m most excited about some of the bugs we’ve been able to fix (mostly with the prodding of Issues posted on GitHub).

First, we’ve fixed a lot of problems caused by Unicode characters and LaTeX markup in BibTeX data. The point of failure was how this data was escaped and passed via pipes between the Python scraper code and the AppleScript interface script to BibDesk. The solution was simple: don’t try to escape characters passed on the command line; just pass the data through a temporary file.
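The temporary-file hand-off can be sketched roughly as follows. This is an illustration of the idea, not the actual ADS to BibDesk code: the function name pass_via_temp_file is hypothetical, and the receiving command stands in for whatever program reads the data (in ADS to BibDesk, the AppleScript side).

```python
import os
import subprocess
import tempfile


def pass_via_temp_file(data, command):
    """Write data to a temporary file and run command with the file path appended.

    Only the path travels on the command line, so quotes, backslashes and
    non-ASCII characters in the data never need to be escaped.
    """
    fd, path = tempfile.mkstemp(suffix=".bib")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(data)
        # command is a list of arguments; the receiver reads the file itself.
        return subprocess.check_output(command + [path])
    finally:
        os.remove(path)
```

The receiving program only has to open the path it is given and decode the contents as UTF-8.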

The second bug was harder to identify. Some papers would work fine with the command line edition, but crash the Service edition. Thanks to a bug report we determined that the problem is triggered by papers with quotation marks in the paper title, such as The “True” Column Density Distribution in Star-Forming Molecular Clouds. It turns out the problem was ultimately with the HTML served by ADS. Abstract pages are laden with helpful metadata, but these metadata fields are not escaped! Thus in the header of the aforementioned paper’s HTML page you’ll find the line:

<meta name="dc.title" content="The "True" Column Density Distribution in Star-Forming Molecular Clouds" />

Those extra unescaped quotation marks break the HTMLParser module in Python—except not always. With the command line edition I run Python 2.7.3, whose HTMLParser is robust against this type of malformed HTML. But the Service edition uses the default Python provided by Apple (version 2.7.1 for Mountain Lion). In this version of Python, HTMLParser is stopped cold by such HTML errors. To make HTMLParser happy, ADS to BibDesk pre-processes the ADS HTML to remove these metadata lines.
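A minimal sketch of that kind of pre-processing is below. It assumes each meta tag sits on its own line, as in the example above; the function name strip_meta_lines and the regular expression are illustrative rather than the exact code in ADS to BibDesk.

```python
import re

# Matches a whole line that starts with a <meta ...> tag (leading spaces allowed).
META_LINE = re.compile(r'^[ \t]*<meta\b[^\n]*\n?', re.IGNORECASE | re.MULTILINE)


def strip_meta_lines(html):
    """Return html with every line that begins with a <meta> tag removed."""
    return META_LINE.sub('', html)
```

The cleaned string can then be fed to the HTML parser, since the malformed attributes are gone before parsing starts. Anything the scraper needs from those meta fields would have to be read some other way.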

Roadmap

My wish-list for future updates includes: integrating the arXiv-updater script into the command line interface, and being more careful when updating papers to not lose BibTeX data (e.g. the notes field). In the meantime, I have papers to write. But do tweet me, @jonathansick, or post an Issue to GitHub if you have problems or suggestions.


  1. I find that pdf2json loses word spaces in its output. If you know of a better text extraction program, I’m open to suggestions. Tweet @jonathansick. 

October 10, 2012

This weekend we’re commissioning a new 14” Celestron for the Queen’s Observatory.

Photo by Prof Stéphane Courteau.


Continuous LaTeX Compilation with a Python Watchdog

September 2, 2012

I recently came across the Watchdog python package that allows scripts to act on changes in the filesystem. An obvious application is continuous integration: running make whenever a source file changes.1 Even more pertinent for academics is continuous compilation of LaTeX documents.

Here’s the gist (borrowing ideas from the Watchdog example and this GITS Blog post):

To run, simply execute the script from the same directory as your LaTeX project. Whenever a file changes in the directory watched by the Observer instance, the on_any_event() method of the FileSystemEventHandler instance is called. If the event is due to a *.tex file, the subprocess module is used to call make. If you don’t use makefiles to manage your LaTeX compilation, perhaps a direct call to something like latexmk with

subprocess.call('latexmk -f -pdf -bibtex-cond paper.tex', shell=True)

would work.
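For reference, a script along the lines described above can be sketched as follows. This is a minimal reconstruction of the idea rather than the original gist: it assumes the Watchdog package is installed, and the names TexChangeHandler and watch are my own.

```python
import os
import subprocess
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class TexChangeHandler(FileSystemEventHandler):
    """Run a build command whenever a .tex file is touched."""

    def __init__(self, command=("make",), runner=subprocess.call):
        super().__init__()
        self.command = list(command)
        self.runner = runner  # injectable so the handler can be tried without make

    def on_any_event(self, event):
        # src_path may be str or bytes depending on the Watchdog version.
        path = os.fsdecode(event.src_path)
        if not event.is_directory and path.endswith(".tex"):
            self.runner(self.command)


def watch(path="."):
    """Watch path (recursively) until interrupted with Ctrl-C."""
    observer = Observer()
    observer.schedule(TexChangeHandler(), path, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```

Calling watch() from the project directory starts the loop. Passing command=["latexmk", "-f", "-pdf", "-bibtex-cond", "paper.tex"] to TexChangeHandler swaps make for the latexmk call shown above.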


  1. Other applications are numerous; a Dropbox-style uploader is also possible, for example. 

July 4, 2012
For many of us, the most shocking revelation to come out of CERN’s Higgs boson announcement today was quite unrelated to the science itself. Rather, we were blown away by the fact that a team made up of some of the most undoubtedly brilliant people in the world believe that Comic Sans is an appropriate font for such a historic occasion.

Sam Byford.

I concur (and hat tip to the Panda).

July 3, 2012

Herbig-Haro 110, seen by HST.

June 27, 2012

Why yes, those are roasted marshmallow and s’mores milkshakes.

Thanks Stand 4.

June 14, 2012

Jean-Eric Vergne launches his Toro Rosso. Reminds me of the opening scenes from Top Gun.

June 14, 2012

Narain Karthikeyan locks up his HRT into Turn 1.
