BOSC 2016: MultiQC, the next big thing, and ☀️😎
July 12, 2016
Orlando, FL. With temperatures spiking at 37°C and ACs dialed to max, 30-some developers gather to hack on future tools in medicine and research. The venue is FamiLAB (or "4am lab"), a hackerspace hidden in an industrial complex north of O-Town. The event is the annual Codefest, a pre-conference meet-up to kickstart the 2016 Bioinformatics Open Source Conference (BOSC).
Codefest: CWL, MultiQC, Spark/ADAM, and Nextflow
This is my second year attending and there are quite a few familiar faces from last year's Dublin edition. Workflow tools still dominate the agenda. The "Common Workflow Language" group worked hard to push out their 1.0 release. There was some ambitious work to integrate Spark and ADAM with Nextflow.
As for myself, together with Heather and Lorena, we worked on adding modules to MultiQC with remote help from (Tall) Phil. I felt rather productive the first day and a half, followed by an expected lull. I made improvements to how samtools stats output is visualized, added a draft GATK VariantEval module, and discussed the possibilities of distributing plugins on GitHub. It's been nice to finally get into learning the ins and outs of MultiQC, and the excellent documentation has made it a breeze! 💎
There are a few more female attendees (5/32) compared to last year and the laptop diversity is on the same level.
Last but not least, I want to give the organizers and specifically Brad Chapman a huge thanks for putting together a relaxed and inspiring meet-up. Thanks also to FamiLAB for hosting, Curoverse and PLOS for sponsoring with tasty food and the ever so essential coffee, and of course the other attendees for being very open and helpful! 👏
BOSC: Open data, CWL, and humidity
As usual, I refer you to Brad Chapman's comprehensive notes if this is what you are looking for. Here, I'll try to focus on the things that stuck with me personally.
Each day kicked off with a keynote on the benefits of open data and drawbacks of closed data. First up was Jennifer Gardy, who gave compelling arguments for the importance of sharing data on infectious diseases. She made it clear that it's a matter of life and death and that we shouldn't discard non-standard options like tracking Twitter trends 🐣
A memorable moment was when the second keynote speaker, Steven Salzberg, nailed GATK on the Closed Source Wall of Shame (👻 buuuu!). His talk touched on a lot of topics, like licensing, that I personally haven't dealt with much but understand are at the core of open source and open data. One of his main take-aways was that publicly funded projects should release their raw data early. Currently, many such efforts embargo the findings until the main institute has had time to publish their results. This isn't aligned with open data.
CWL had a big presence as expected. They are charging along with community engineer Michael R. Crusoe leading the way. First off, I want to congratulate them for all the hard work that went into publishing version 1.0! Many people are betting on CWL now (while the rest seem to use Nextflow). However, I still feel like it's slightly hard to figure out what CWL actually is. The talks generally don't mention concrete use cases. I hope we will soon see projects built on top of the specification that expose more to-the-point examples.
Trend spotting: Spark
Ladies and gentlemen: the dust has settled and Docker is still standing strong. Now it's time to sink your teeth into Spark + ADAM. I've heard about Apache Spark for a while and remember Roman and Johan excitedly hacking on it during last year's Codefest. To be honest, I've never really understood it or its supposed greatness. However, after a slew of impressive talks at BOSC and learning that GATK 4 is taking advantage of it, I'm starting to feel like it's time to jump on the train.
Highlight: MyVariant.info and MyGene.info
I enjoyed Chunlei Wu's talks on his BioThings APIs project. BioThings is a framework for building scalable bio-focused APIs and seemed like a great developer resource. However, I was most intrigued by the actual variant and gene APIs, which are free to use without limitations! You simply make queries to their endpoints and get back up-to-date information about a gene or variant. By combining them you can get even more creative. It looked really impressive and the completely free model makes jumping in super easy.
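To give a feel for what "making queries to their endpoints" could look like, here is a minimal sketch using only the Python standard library. The base URLs, API version segments (`v1`, `v3`), and the example identifiers are my assumptions and may not match what the services expose today, so check the official docs before relying on them. The helper names (`variant_url`, `gene_query_url`, `fetch_json`) are made up for this example.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumed base URLs and versions; verify against the current API docs.
MYVARIANT_BASE = "https://myvariant.info/v1"
MYGENE_BASE = "https://mygene.info/v3"


def variant_url(hgvs_id):
    """Build the URL for looking up a single variant by its HGVS-style id."""
    # Keep ':' and '>' readable; everything else unsafe gets percent-encoded.
    return f"{MYVARIANT_BASE}/variant/{quote(hgvs_id, safe=':>')}"


def gene_query_url(symbol, fields):
    """Build a gene search URL restricted to a symbol and a list of fields."""
    params = {"q": f"symbol:{symbol}", "fields": ",".join(fields)}
    return f"{MYGENE_BASE}/query?{urlencode(params)}"


def fetch_json(url, timeout=10):
    """Perform a GET request and decode the JSON response body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Illustrative identifiers only; this part needs network access.
    print(fetch_json(variant_url("chr7:g.140453134T>C")))
    print(fetch_json(gene_query_url("CDK2", ["symbol", "name"])))
```

Combining the two would then be a matter of pulling a gene identifier out of the variant response and feeding it into a gene lookup, though the exact response fields are something I'd look up rather than guess.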
Let's touch on this topic real quick. Orlando in July: good or bad for ISMB/BOSC 2016? I don't know if I heard a single person answer "good" to that question. The climate is simply exhausting. The Disney tax is prevalent. The cab ride from the airport, for example, was more expensive than going to Miami by bus! Ah well... Prague 2017 looks oh-so inviting post-Orlando 😉🇨🇿
- 2016-07-14: updated to mention PLOS as sponsor of Codefest
- 2016-07-26: updated with highlight - myvariant.info