How to import 200+ million rows into MongoDB in minutes
MongoDB is a great document-oriented NoSQL database. Interest in MongoDB keeps growing because it stores documents with dynamic schemas in a JSON-like format. It also handles big data well while providing high availability and high insertion rates.
This post focuses on how to import huge amounts of data into MongoDB rapidly. The data can be a MySQL database dump (.csv), collected data, processed data, or anything else. MongoDB's import tool, mongoimport, accepts files in the following formats: JSON, CSV, and TSV. The version of MongoDB used in this post is 3.0.7, and the experiment was conducted on my laptop, which has an i5 processor and 16GB of RAM.
The data
The data I will use in this post is text-based and will end up in TSV (tab-separated values) format. Each row has the following four fields: id, url, visits, and unique_visits. The total number of rows is around 240 million.
The data came as one giant file (~5GB) where each line represents a row and the field values are space-separated. The file doesn't contain a header row.
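For illustration, a raw line looks roughly like the following (the values here are made up, not taken from the actual data set):
17 http://example.com/some-page 1523 987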
As this format isn't one of the formats mongoimport accepts, we must convert it. I chose TSV because replacing every space with a tab is fast and easy, the values themselves contain no spaces, and we won't need to worry about escaping quotes as we would with CSV.
The conversion can be done in the terminal by executing the following command.
tr ' ' '\t' < data.txt > data.tsv
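A quick, optional sanity check is to confirm the line count is unchanged and that tabs actually replaced the spaces (cat -A is the GNU coreutils flag that prints tabs as ^I; on BSD/macOS, cat -et does roughly the same):
wc -l data.txt data.tsv
head -n 1 data.tsv | cat -A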
Now that our data is ready, let's start the import process.
My first, obvious, trial
Initially, I tried to import the data directly into MongoDB using the mongoimport tool. The command I executed was:
./mongoimport -d db_name -c collection_name --type tsv --file data.tsv -f id,url,visits,unique_visits
Reading the progress output of mongoimport, I realised that the insertion rate was really slow and it would take around 6-7 hours to import the entire data set. Hence, I cancelled the import and started the optimization phase, which I will explain in the next section.
Notes:
- The MongoDB daemon process (mongod) was running with its default configuration.
- The -f parameter specifies the field names of our data. The order matters and must be consistent throughout the data file.
Optimizing database import
disabling journalling
Journalling is enabled in MongoDB by default. It lets MongoDB keep track of the changes made to the database so it can re-apply them if the mongod process crashes before they are written to the data files. A good explanation of MongoDB's journalling can be found here.
As our main goal is a one-off bulk import, disabling journalling lets MongoDB spend its effort on inserting and writing the data directly; if something goes wrong, we can simply re-run the import. To disable journalling, append --nojournal when starting MongoDB's daemon, as follows:
./mongod --nojournal &
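If you prefer starting mongod from a configuration file instead of command-line flags, the equivalent setting can go there; a minimal sketch, assuming the YAML config format used since MongoDB 2.6 and a hypothetical mongod.conf file:
# mongod.conf (sketch) - disable journalling for the bulk import, re-enable it afterwards
storage:
  journal:
    enabled: false
./mongod --config mongod.conf &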
optimizing mongoimport parameters
After making the database daemon concentrate on inserting, having a single worker insert the entire data set is still not enough. Luckily for us, MongoDB has introduced a feature in the mongoimport tool called insertion workers. Basically, this feature makes mongoimport divide the insertion work among N workers (threads). Choose the value of N carefully: don't use a very high number, as the operating system would spend more time scheduling and switching between tasks, and don't use a very small number, as you won't be making the most of your machine. In my case, I chose N to be 8; a simple rule is to set N to the number of CPU cores.
The insertion workers feature is enabled by appending --numInsertionWorkers to the mongoimport command. Now, our command becomes:
./mongoimport -d db_name -c collection_name --type tsv --file data.tsv -f id,url,visits,unique_visits --numInsertionWorkers 8
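If you want to follow the cores rule without hard-coding the number, you can read the core count from the OS and pass it in (nproc is the Linux command; sysctl -n hw.ncpu is the macOS equivalent):
N=$(nproc)   # number of CPU cores on Linux; on macOS use: N=$(sysctl -n hw.ncpu)
./mongoimport -d db_name -c collection_name --type tsv --file data.tsv -f id,url,visits,unique_visits --numInsertionWorkers "$N"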
During the experiment, MongoDB kept running out of memory because of the per-process resource limits. Raising the limits on the resources the process can use solved the issue (see the sketch below); you can inspect the current limits with the following command:
ulimit -a
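The exact flags and values to raise are OS- and shell-dependent; as a rough sketch (an assumption on my part, loosely following MongoDB's recommended ulimit settings), raising them in the same shell before starting mongod looks like this:
ulimit -n 64000        # max open files
ulimit -u 64000        # max processes/threads
ulimit -v unlimited    # virtual memory
./mongod --nojournal &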
Indexing
Indexing speeds up querying documents, but since we won't execute any queries during the import, we shouldn't have any indexes in place before inserting; maintaining them would slow every insert down. Instead, we can build the indexes after the insertion process is complete.
Result
Using the simple optimizations above, the total insertion time was around 40 minutes. As for indexing time, I built a compound index on the id and visits fields, and it took ~20 minutes.
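For reference, a compound index like that can be created from the terminal once the import finishes; a minimal sketch, assuming the same db_name and collection_name used earlier:
./mongo db_name --eval 'db.collection_name.createIndex({ id: 1, visits: 1 })'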
Possible further optimizations (not tried)
- If you have access to a cluster or to more machines at home, splitting the data file into as many chunks as you have machines and letting each machine import its chunk into the server would increase the speed (see the sketch after this list).
- mongoimport has to parse the text data into BSON before inserting it into the database. Storing the data in a mongoimport-friendly JSON format while collecting it (or converting it afterwards, preferably in parallel) might increase the speed. Additionally, after the storing/converting phase, we can combine multiple documents into one JSON array and import it using the --jsonArray parameter; however, we must then specify the _id field and provide a unique value for it in each document. For more details on the combining approach, along with a simple tool for performing it, see this article.
- Host MongoDB on a managed AWS server. If you have trouble hosting it on AWS, you could take it as an opportunity to learn AWS; this post will help you pass the AWS Solutions Architect Associate 2018 exam.
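As a rough illustration of the first two ideas, the file could be split into chunks and each chunk converted to mongoimport-friendly JSON before importing. This is only a sketch, assuming the values never contain tabs or quotes; chunk size and file names are arbitrary:
split -l 60000000 data.tsv chunk_        # ~60M rows per chunk, one chunk per machine
awk -F'\t' '{printf "{\"id\":%s,\"url\":\"%s\",\"visits\":%s,\"unique_visits\":%s}\n", $1, $2, $3, $4}' chunk_aa > chunk_aa.json
./mongoimport -d db_name -c collection_name --file chunk_aa.json --numInsertionWorkers 8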
Overall, in my opinion, these results are good given my laptop's specifications. Now I have to wait far less time, and I hope you will too 🙂