Our objectives were to:
Identify an appropriate python development environment for data science
Select the right package manager for data science in python
Choose the right parallel computing library to scale our data science work
Set up a cloud infrastructure to deal with big data challenges
Identify tools to support storage, management and versioning of code and data
For each of the objectives above, we investigated and selected a list of candidate tools. We gathered information from reliable web sources (e.g. Google, Wikipedia and other knowledge bases) and from knowledgeable practitioners. We also ran multiple tests and experiments to concretely evaluate how well the relevant features of these tools fit our use cases. We then produced qualitative comparison tables to summarize this work. These tables are presented in this document and provide the comparative rationale for our decision making; the comparisons are based on criteria that capture our needs for each objective.
We present the list of tools selected based on the series of investigations and evaluations we conducted:
| Category | Selected tool |
|---|---|
| Python development environment | Jupyter Notebook |
| Package manager | Anaconda |
| Distributed computing library | Dask |
| Cloud infrastructure platform | AWS EC2 / ECS & S3 |
| Code and data management | Git (with Git LFS) |

Table 1 - The chosen infrastructure stack
Although this selection highlights our primary toolset, it is not strictly exclusive: some of the competing tools we evaluated were judged beneficial as complements to the above list in particular scenarios.
2. Python Development Environment
Our requirements were to find a tool that can provide:
Experiment support: A tool that supports research experiment workflows. Code is typically organized in a storyline fashion rather than in the traditional structures we find in python modules: the experiment code follows a purposeful ordering that facilitates exploring a hypothesis, making a point or communicating an insight. This is the fundamental difference.
Productivity: This is influenced by the overall development environment, how efficiently it is organized and how compatible it is with the objectives and abilities of the data scientist.
Interactivity: The ability for the data scientist to easily interact with their code and data, going back and forth between ideas in a way that is initially free from any strict order.
Lightweight: How complex or large the tool is to set up.
Reporting: The ability to minimally develop reports based on work done and to easily share them. This also means the ability to support, or help support, various useful reporting formats (e.g. PDF, HTML, plain text).
Portability: The ability to easily port the development environment to another OS or physical infrastructure.
Based on the presented requirements, the following tools were selected and a series of try-outs and qualitative evaluations were conducted against our criteria:
Table 2 - Comparison of Python Development Environments.
Legend: A: Excellent, B: Good, D: Limited, F: Nonexistent or Unacceptable
We selected Jupyter Notebook, with the option of using Spyder, PyCharm or any other Python editor whenever necessary. Jupyter's client-server architecture makes it extremely portable: we can access a notebook remotely from a web browser. This portability is crucial in our workflow and is less easily achievable with Spyder, PyCharm or similar IDEs. With JupyterHub, we can also support a collaborative workflow, which will help with our cloud requirements as we will see later on.
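For reference, a minimal configuration sketch for such remote access (the option names come from Jupyter's notebook server configuration; the values are illustrative, and a password plus TLS should always be added before actually exposing the server):

```python
# Fragment of ~/.jupyter/jupyter_notebook_config.py -- illustrative values.
c.NotebookApp.ip = '0.0.0.0'        # listen on all network interfaces
c.NotebookApp.open_browser = False  # headless server: no local browser launch
c.NotebookApp.port = 8888           # default notebook port
```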
3. Package Manager
We were looking for a solution that can deliver along the following criteria:
Quality of library selection: The richness of the package manager in terms of pre-installed libraries and available updates.
Ease-of-use: How easy it is to set up the package manager, install additional libraries, navigate libraries and alter configurations.
Documentation: How accessible and useful is the existing documentation.
Adoption and Maturity: The level of community and global adoption of the package manager, its maturity as a product, and the level of development activity and community involvement.
Python distribution: How well are python distributions and their versions supported? Are there any tweaks required to be performed by the user to get things to work? How good is the support for updates?
OS support: The level of support towards multiple operating systems (Linux, Mac, Windows).
We evaluated our requirements between Enthought Canopy and Anaconda:
| | Enthought Canopy | Anaconda |
|---|---|---|
| Quality of library selection | A | A |
| Adoption and maturity | A | A |
Table 3 - Comparison of Package Managers
We selected Anaconda. This is mostly due to its overall ease-of-use; both solutions are otherwise comparable in features and offerings. Canopy is much more oriented towards a hands-on, data-centric interactive workflow, whereas Anaconda with Spyder offers a more traditional programming-oriented interactive workflow. The latter fits our approach best, as we can focus on building learning algorithms. Anaconda also supports both Python 2 and 3, which Canopy does not. Again, using Jupyter is an option with both.
4. Distributed Computing
Our requirements were for the library to support the following features:
Arbitrary graph: The ability to compute the result of an arbitrary task graph.
MapReduce: The ability to compute the result of MapReduce operations.
Lightweight setup and ease-of-use: How straightforward is the setup of the library and how steep is the learning curve? How convenient is the library to use?
Multi-threading on single machine: The ability for the tool to support parallelization of computation across multiple threads within a given process.
Multi-process on single machine: The ability for the tool to support parallelization of computation across multiple processes on a given machine
Distributed setup (cluster): The ability for the tool to support parallelization of computation across multiple machines on a cluster over a network.
Machine Learning (ML) training support: The ability to support parallelization of the training phase of machine learning algorithms
Community adoption and maturity: The level of community adoption and the maturity of the library.
Scalability: How well is the library capable of scaling computations up to 1 GB, 10 GB, 100 GB, 1 TB or 1 PB of data and over 1, 10, 100, 1000 or 10000 machines?
Python integration: How well integrated with the python ecosystem is the library?
Performance: How much ROI is delivered in processing speed by the library, given a particular computation problem?
In this category, we have evaluated the following libraries: PySpark, Dask, iPyParallel, Joblib and Disco.
| | PySpark | Dask | iPyParallel | Joblib | Disco |
|---|---|---|---|---|---|
| Lightweight setup and ease-of-use | F | A | D | B | B |
| Adoption and maturity | A | D | D | D | D |
Table 4 - Comparison of Distributed Computing Libraries
We selected Dask. This is mainly because of its combination of lightweight setup and execution (it is a Python-native tool), ease-of-use of the API, and expressive power / versatility. Dask supports parallelization of numpy array and pandas dataframe operations by implementing a "mirror" API on top of them. This means there is virtually zero learning curve to using Dask arrays and Dask dataframes, while still getting the full benefits of parallel computing over numpy or pandas data.
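As a minimal sketch of this mirrored API (assuming dask is installed), the parallel version of a computation reads almost exactly like the serial numpy version:

```python
# Sketch: Dask's array API mirrors NumPy, so the parallel version of a
# computation looks nearly identical to the serial one.
import numpy as np
import dask.array as da

x = np.random.random((2000, 2000))         # plain NumPy array
dx = da.from_array(x, chunks=(500, 500))   # same data, split into chunks

# Identical expression to NumPy, but built lazily and executed in parallel:
col_means = (dx + dx.T).mean(axis=0)
print(col_means.compute().shape)  # (2000,)
```

Each chunk here becomes a task that Dask's scheduler can run on a separate thread, process or worker machine.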
One weakness and risk with Dask is that it is a fairly young library that is still growing its community within the python ecosystem. Certain operations are not well optimized (e.g. shuffling), some others are not supported at the moment, and it is developed by a very small team.
However, it is a promising library in the Python family, and it is lightweight enough that evolving our stack later will be easy if our needs demand it. For now, it gives us just what we need to get started and to iterate on the research work very quickly.
Note that other libraries such as dispy or pyMP were discarded because of their very limited community adoption. More investigation is also required to better understand our distributed ML training goals and options; for example, in the scikit-learn library, Dask can be plugged in as a joblib backend to cover multithreading requirements.
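Dask's delayed interface also covers the "arbitrary graph" criterion listed above; a minimal sketch (assuming dask is installed):

```python
# Sketch: building and executing an arbitrary task graph with dask.delayed.
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

# These calls only build the task graph; nothing runs until .compute().
total = add(inc(1), inc(2))
print(total.compute())  # 5
```

The same graph can be executed on threads, processes or a distributed cluster simply by changing the scheduler, which is what makes the criteria above composable.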
5. Cloud Infrastructure
Our requirements were as follows:
Compute resources: The availability of decent cloud compute offerings.
Storage resources: The availability of decent cloud storage offerings.
Python API: The level of API support for the python language.
Documentation and support: The accessibility of the documentation and general support.
Ecosystem: The size and richness of the ecosystem around the services.
Adoption and Maturity: The level of global adoption and maturity of the services.
Availability: The level of availability and resilience of the services.
Pricing: How affordable and flexible are the pricings?
For this qualitative evaluation, we selected the following services: Amazon Web Services (AWS), Google Cloud and Microsoft Azure.
| | AWS | Google Cloud | Microsoft Azure |
|---|---|---|---|
| Documentation and support | A | A | A |
Table 5 - Comparison of Cloud IaaS Solutions
We selected AWS, with EC2/ECS and S3. We chose the AWS ecosystem mainly because of its popularity, maturity and extensive third-party support. Additionally, we have used it on previous occasions and are already well familiar with its offerings and tools. Google Cloud and MS Azure are decent competitors, and all three excel in various areas of their offerings; switching, however, would mean relearning an entirely different IaaS API and toolset. Vendor lock-in is also a consideration. Finally, we found pricing for some Amazon services (e.g. S3, ECS) to be slightly cheaper and more flexible than the competition.
6. Code and Data Management
The required features for a code and data management tool were the following:
Code versioning: The extent to which maintaining multiple versions of the codebase is supported.
Data versioning: The extent to which maintaining multiple versions of the data is supported.
Collaboration: The support for collaborative workflow on the code and data.
Data lifecycle support: The ability to handle the entire lifecycle of the data. This means, for example, the ability to track moves, copies and modifications to the data.
Programmatic access: The support for an API for programmatic access.
Large file handling: The extent to which the features in this list scale to large data files (GB, TB).
Flexible file structure: How flexible is the way code and data can be organized?
File formats: The extent to which the tool supports many useful file formats.
Performance: The extent to which the tool is efficient at performing the various functions: moving or copying large files, transferring data to our cloud (AWS), etc.
The table below shows a comparison between the evaluated solutions: Git with Git LFS, Google Drive and Amazon S3.
| | Git + LFS | Google Drive | Amazon S3 |
|---|---|---|---|
| Data lifecycle support | A | A | A |
| Large file handling | A | A | A |
| Flexible file structure | A | D | B |
Table 6 - Comparison of Code and Data Management Services
We selected Git with Git LFS. Git clearly delivers better than Google Drive or S3 for our use cases. One issue we had was dealing with large files (e.g. multi-gigabyte datasets). Although we can now comfortably push large files to Git using Git LFS, solutions like Amazon S3 remain useful as temporary storage (or cache), especially in cloud scenarios where we want to run the same data against code on an EC2 instance.
Using primarily Git means we can have a centralized image of our repository. Experiments can be canonically represented by a folder containing all associated artifacts: the datasets, the algorithms, serialized trained models, expensive intermediate representations, the reports and transient artifacts (when necessary). This facilitates traceability of work, reproducibility of experiments (a crucial requirement for us), portability (delivered by adequate relative locations of the artifacts) and fast access.
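As an illustration, a minimal sketch of initializing such a canonical experiment folder (the sub-folder names here are hypothetical, not a prescribed convention):

```python
# Sketch: create the canonical experiment folder layout described above.
# Sub-folder names are illustrative assumptions.
from pathlib import Path

SUBFOLDERS = ("datasets", "algorithms", "models", "intermediate", "reports")

def init_experiment(root: str) -> Path:
    """Create the standard sub-folders for a new experiment."""
    exp = Path(root)
    for sub in SUBFOLDERS:
        (exp / sub).mkdir(parents=True, exist_ok=True)
    return exp

exp = init_experiment("experiments/example-experiment")
print(sorted(p.name for p in exp.iterdir()))
```

Keeping every artifact at a relative location under one such folder is what delivers the portability and reproducibility mentioned above.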
Centralizing our code and data in Git with Git LFS also adds security risks that need to be considered and addressed. These can be mitigated through robust security practices within our organization.
7. Conclusion
We have spent a considerable amount of resources setting up an infrastructure stack to support our research goals.
We spent these resources to make sure the team does not get slowed down by issues outside the core data science work. Our infrastructure choices primarily aim at maximizing research productivity and output. Some of the decisions are also influenced by our particular circumstances: a very small team, the necessity for rapid iterations, the necessity for flexibility or low commitment to structure (minimize technical debt), the opportunity for growth and limited availability of resources.
It is important to remember that the qualitative evaluations presented throughout this document were made with respect to our specific needs: an infrastructure that supports our data science and machine learning research work, which involves big data manipulation with a heavy focus on natural language processing (NLP), text processing, information extraction and language understanding tasks. These evaluations are reported at quite a high level because providing detailed rationales for every single decision would have made the document heavy to read. We hope to dive into specific aspects in the future.
We will report on specific points as we encounter inevitable issues and are forced to make adjustments. We also expect that, as the data science ecosystem matures, winners and losers will emerge among infrastructure providers, which will make some decisions for us.