Monday, May 14, 2012

The Challenges of Sequencer Comparison

I've slipped back into a lack of posting again.  Some of it is residual from my recovery from surgery this spring (physical therapy can really be draining), and a lot is due to being really busy in the day job (finding highly rare substances like dilithium is not easy), but in the end those are incomplete explanations.  What should be really embarrassing is that I got access to Nick Loman and colleagues' comparison of benchtop sequencers several days before it was published, and here it is weeks later and I'm finally getting around to covering it.

Well, I do have a small excuse for not jumping on it right away: I was on vacation the week before it came out, attempting to minimize my professional activities.  Yes, I had my smartphone connected to the local WiFi, but there's no way I'm going to Swype a whole post!

The paper has been covered extensively in various places, so I'm going to comment a bit on the meta-issues it raises.  Serious comparisons like Nick's are really, really needed, as I suggested somewhat tongue-in-cheek in my April 1st post.  Deciding what sequencing platform to use is an important question, and deciding which to invest in is even harder.

But there is a series of challenges which make the perfect comparison impossible.  Even when Consumer Reports reviews dishwashers, the reviews are out of date not long after they are published; manufacturers are constantly changing their lines.  There are fewer sequencers, but they are changing at a more rapid pace.  Any analysis is a snapshot in time, and with the challenge of analyzing things well and then getting it through peer review, the timeliness is even more fleeting in the sequencing arena.  Indeed, Life Technologies has released a rebuttal to Loman et al, and one of the major complaints is that they used old kits.  Of course, at the time the work was done they were the current kits, but the Ion platform has undergone both chemistry and software improvements since then.  Life's complaint is at one level fair and at another level unfair.  As Lex Nederbragt points out, Loman's analysis was capturing a slice in time, and data from that time will continue to be published -- though papers rarely carefully specify the precise reagent and software versions used, so it's hard to tell which data belongs to which era.  On the other hand, if you are designing an experiment today, the data in Loman et al are indeed problematic, as so much has changed.  MiSeq is scheduled to have a major upgrade sometime this summer, with much longer read lengths, and a host of new benchtop sequencers are hoping to invade the market.  Trying to keep these analyses up-to-date would make the Red Queen thankful for her situation.

Another of the Ion team's complaints also gets down to the challenge of executing a comparison the way you'd like to, given practical realities.  While the Ion PGM and 454 Jr were in the authors' lab, MiSeq was just showing up in the world when they did the work, and that data was generated elsewhere.  So the MiSeq data was trimmed and filtered in a manner not entirely under the authors' control, and this is certainly an undesirable deviation.  But I understand it well.  When I was at Millennium I was trying to compare three specialized computing systems for running protein searches, and it wasn't really practical to get the three installed and run the tests myself (though all three makers would have been willing to do it).  So I tried to design a test protocol for each manufacturer to run on data I supplied.  It seemed like an airtight solution, but each manufacturer delivered a different set of deviations.  In some cases they were trying to bend the rules to their advantage, but most times it was a case of my protocol not mapping unambiguously or precisely to their system.

A more subtle issue with the Loman paper is that it turns out the same bacterial sample wasn't used for all runs.  There's been a fascinating thread on the MIRA assembler mailing list on how you can get coverage differences due to the growth phase of the bacteria.  It's sort of obvious why this should be (in growing cells, loci nearer the replication origin will be at higher copy number), but I certainly never thought of it and not many papers have paid attention to it.  Since coverage is something these sequencer comparisons look at, that could be a major issue.  As it turns out, it isn't obvious this was a problem in the Loman paper, but their Supplementary Figure 5 shows a much more spectacular effect -- a prophage apparently went lytic in the sample that went on MiSeq but not the other sequencers, leading to an enormous copy number for that one tiny region of the chromosome.
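To put a rough number on that origin effect, here's a minimal sketch (my own, not taken from the paper or the MIRA thread) of the expected origin-to-terminus coverage gradient under a Cooper-Helmstetter-style picture of exponential growth; the C period and doubling time below are illustrative assumptions, and a stationary-phase culture would flatten toward 1x everywhere.

# Minimal sketch, not from the paper: expected relative read depth as a
# function of distance from the replication origin for exponentially
# growing bacteria, using a Cooper-Helmstetter-style approximation.
# The parameter values are illustrative assumptions only.

def relative_copy_number(frac_dist_from_ori, c_period_min=40.0, doubling_min=30.0):
    """Copy number of a locus relative to the terminus.

    frac_dist_from_ori: 0.0 at oriC, 1.0 at the terminus.
    c_period_min: time to replicate the chromosome (C period), in minutes.
    doubling_min: culture doubling time (tau), in minutes.
    """
    return 2.0 ** (c_period_min * (1.0 - frac_dist_from_ori) / doubling_min)

for d in (0.0, 0.25, 0.5, 0.75, 1.0):
    print("%.2f of the ori->ter distance: %.2fx terminus coverage"
          % (d, relative_copy_number(d)))

With those assumed numbers the origin sits at roughly 2.5x the terminus coverage -- exactly the sort of gradient that could masquerade as platform-specific coverage bias if the samples were harvested at different growth stages.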

Another significant issue facing anyone wanting to put sequencers head-to-head is the choice of sample and analysis protocol.  Loman et al chose E. coli genome sequencing and assembly, which is certainly a legitimate experiment.  But is it really a good guidepost for sequencing other bacteria, such as the 70+% G+C world I play in?  Probably, but there are also quite likely some issues with G+C content that are going to be hard to find in E. coli data.  Going farther afield, suppose you are only interested in sequencing amplicons?  Illumina potentially has inherent challenges here if your amplicon set (or your barcodes) is insufficiently diverse, but certainly PGM can have issues too.
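As one concrete illustration of the diversity concern, here's a minimal sketch (my own, with a purely hypothetical barcode set and an arbitrary threshold) of the kind of sanity check you might run on an amplicon or barcode pool: tally the base composition at each cycle and flag positions where a single base dominates.

# Minimal sketch with a hypothetical barcode set: check per-cycle base
# diversity in an amplicon/barcode pool.  Cycles where one base dominates
# are the sort of thing that can trouble Illumina cluster identification;
# the 0.75 threshold is an arbitrary illustration.

from collections import Counter

def per_cycle_base_fractions(seqs):
    """Yield (cycle, {base: fraction}) across a set of sequences."""
    length = min(len(s) for s in seqs)
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts.values())
        yield i, {base: n / total for base, n in counts.items()}

barcodes = ["ACGTAC", "ACGGTT", "ACCTGA", "ACTTCG"]   # hypothetical, for illustration

for cycle, fracs in per_cycle_base_fractions(barcodes):
    worst = max(fracs.values())
    flag = "  <-- low diversity" if worst > 0.75 else ""
    print(cycle, {b: round(f, 2) for b, f in sorted(fracs.items())}, flag)

Here the first two cycles are all A then all C, which is precisely the pattern you'd want to catch before loading a run.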

Is there any way out of all this complexity?  I think the only way would be for some standard test DNAs to be defined (and continuously expanded), and for labs (perhaps via the ABRF) to commit to periodically sequencing them and posting the results publicly.  Those results wouldn't be of much value unless you could fully automate the sort of analysis that Nick and colleagues performed.  Such a resource would be very valuable, but clearly it's a complex undertaking and would not be without real costs.  In the meantime, papers such as Loman et al will give us snapshots in time which are better than throwing our hands up in despair, and as long as we pay heed to their limitations they will serve a vital purpose.
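To give a flavor of the automation that would be needed, here is a minimal sketch (my own guess at a first pass, with a hypothetical file name) that pulls basic yield and length statistics from a FASTQ; a real pipeline would go on to compute mapping, error, and coverage metrics of the sort Loman et al report.

# Minimal sketch of an automated per-run summary for a standard test DNA:
# read count, total bases, mean length, and read-length N50 from an
# uncompressed FASTQ.  The file name is a hypothetical placeholder.

def fastq_read_lengths(path):
    """Yield the length of each read in an uncompressed FASTQ file."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:          # sequence lines sit at offset 1 of each 4-line record
                yield len(line.rstrip("\n"))

def run_summary(path):
    lengths = sorted(fastq_read_lengths(path), reverse=True)
    total = sum(lengths)
    running, n50 = 0, 0
    for read_len in lengths:        # N50: half the bases are in reads at least this long
        running += read_len
        if running >= total / 2.0:
            n50 = read_len
            break
    return {"reads": len(lengths), "bases": total,
            "mean_length": (float(total) / len(lengths)) if lengths else 0.0,
            "n50": n50}

print(run_summary("standard_sample_run.fastq"))  # hypothetical file name

Automating the alignment- and assembly-based comparisons would of course be the harder part, but even simple summaries like this, run on a shared reference sample, would make cross-lab numbers comparable.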


1 comment:

aggp11 said...

Keith,

Nice post!!  I am a newbie in the NGS field.  I have read quite a few discussions comparing the different benchtop sequencers.  I really liked Nick's paper but did find some issues with it; you covered all of those.

I feel like with benchtop sequencers, people have to start somewhere, be it a PGM or MiSeq.  All the platforms have their own drawbacks and it is for the researcher to decide what they want.  Surely you can't wait for months in a queue to get onto a HiSeq.  We use a PGM in our lab and are well aware of its shortcomings (e.g. the homopolymer issues), but we think that more often than not the pros outweigh the cons, so we try to keep these in mind when we get to the analysis part of sequencing.  I think having a benchtop sequencer in the lab is pretty cool :) as long as one is able to weigh the pros and cons (based on the experiment).  Otherwise everybody will just keep waiting for that perfect sequencer to come.

Thanks