
How To Save A JSON File Using GridFS

I have a huge dataset. I am using Mongoose schemas, and each data element looks like this: { field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1", field2: "

Solution 1:

It's very likely not worth storing the data in Mongo using GridFS.

Binary data never really belongs in a database, but if the data is small, the benefits of putting it in the database (the ability to query it) outweigh the drawbacks (extra server load and slower access).

In this case, it looks like you'd like to store document data (JSON) in GridFS. You can do this, storing it the way you would store any other binary data. The data, however, will be opaque: you cannot query the JSON contents of a file stored in GridFS, only the file metadata.

Querying big data

As you mentioned that you want to query the data, you should check the format of your data. If your data is in the format listed in the example, then it seems there is no need for complicated queries, only string matching. So there are several options.
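Whichever option applies, string matching on field1 needs nothing more than an exact-match or anchored-prefix filter. Here is a minimal sketch of building such a filter; the helper names escapeRegex and prefixFilter are made up for illustration, and the resulting object is what you would pass to a find() call on your own model.

```javascript
// Escape regex metacharacters so the identifier is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a MongoDB filter matching documents whose field1 starts with `prefix`.
// (Hypothetical helper; the field name comes from the question.)
function prefixFilter(prefix) {
  return { field1: { $regex: '^' + escapeRegex(prefix) } };
}

const filter = prefixFilter('>HWI-ST700660_96:2:');
console.log(filter);
```

An anchored prefix like this can use an index on field1, whereas an unanchored "contains" match generally has to scan.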

Case 1: Large Data, Few Points

If you have few data sets (pairs of field1 and field2) but the data for each one is large (field2 contains many bytes), store the large data elsewhere and keep only a reference to it in MongoDB. A simple solution would be to store the data (formerly field2) in a text file on Amazon S3 and then store the link, e.g.

{
  field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
  field2link: "https://my-bucket.s3.us-west-2.amazonaws.com/puppy.png"
}

Case 2: Small Data, Many Points

If each data set is small (less than 16 MB) but there are many data sets, store your data in MongoDB (without GridFS).
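To decide which case a given record falls into, you can estimate its size before inserting it. The sketch below is only a rough check: it measures the JSON form rather than the actual BSON encoding, and the function name fitsInDocument and the 10% headroom are assumptions for illustration.

```javascript
// MongoDB's maximum BSON document size: 16 MiB.
const MAX_BSON_BYTES = 16 * 1024 * 1024;

// Rough estimate: UTF-8 length of the JSON form of the record.
// BSON encodes values differently, so some headroom is left (10% assumed here).
function fitsInDocument(record, headroom = 0.9) {
  const bytes = Buffer.byteLength(JSON.stringify(record), 'utf8');
  return bytes < MAX_BSON_BYTES * headroom;
}

console.log(fitsInDocument({ field1: '>read-1', field2: 'GAATG' }));
```

Records that fail the check are candidates for Case 1 (store the payload elsewhere and keep a reference).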

Specifics

In your case, the data is quite large and storing it using GridFS is inadvisable.

This answer provides a benchmark towards the bottom. The benchmark seems to indicate that the retrieval time is more or less directly proportional to the file size. With the same setup, retrieving a document of this size from the database would take about 80 seconds.

Possible optimisations

The default chunk size in GridFS is 255 KiB. You may be able to reduce large file access times by increasing the chunk size to the maximum (16 MB). If the chunk size is the only bottleneck, then using the 16 MB chunk size would reduce the retrieval time from 80 seconds to about 1.3 seconds (80 / (16 MB / 255 KiB) ≈ 1.3). You can set the chunk size when initialising the GridFS bucket.

new GridFSBucket(db, {chunkSizeBytes: 16000000})
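The arithmetic behind that estimate can be checked with a few lines. This sketch assumes, as above, that retrieval time scales with the number of chunks; the 1 GiB file size is a made-up figure used only to show how the chunk count changes.

```javascript
const KIB = 1024;

// Number of GridFS chunks needed to hold a file of `fileBytes` bytes.
function chunkCount(fileBytes, chunkSizeBytes) {
  return Math.ceil(fileBytes / chunkSizeBytes);
}

const defaultChunk = 255 * KIB; // 261,120 bytes, the GridFS default
const largeChunk = 16000000;    // the chunkSizeBytes value used above

// How many times fewer chunks the larger size needs, and the scaled-down time.
const ratio = largeChunk / defaultChunk;
const estimateSeconds = 80 / ratio;

console.log(ratio.toFixed(1));           // 61.3
console.log(estimateSeconds.toFixed(1)); // 1.3

// Hypothetical 1 GiB file: chunk counts at each size.
console.log(chunkCount(1024 ** 3, defaultChunk), chunkCount(1024 ** 3, largeChunk));
```

The estimate is a best case: it ignores network transfer and disk throughput, which do not shrink with the chunk count.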

A better strategy would be to store only the file name in Mongo and retrieve the file from the filesystem instead.
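A minimal sketch of that strategy follows, using only Node's built-in modules. The helper names saveSequence and loadSequence, the field2file field, the temporary directory and the sample record are all assumptions for illustration; in practice the returned document is what you would insert into MongoDB.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Directory holding the sequence files (a temp directory is assumed here).
const dataDir = fs.mkdtempSync(path.join(os.tmpdir(), 'fasta-'));

// Write field2 to disk and return the small document to store in MongoDB.
function saveSequence(record, fileName) {
  fs.writeFileSync(path.join(dataDir, fileName), record.field2, 'utf8');
  return { field1: record.field1, field2file: fileName };
}

// Resolve the stored file name back to the sequence data.
function loadSequence(doc) {
  return fs.readFileSync(path.join(dataDir, doc.field2file), 'utf8');
}

// Placeholder record, not real data.
const doc = saveSequence({ field1: '>read-1', field2: 'GAATG' }, 'read-1.txt');
console.log(doc, loadSequence(doc));
```

For very large files, streaming with fs.createReadStream would avoid loading the whole file into memory.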

Other drawbacks

Another possible drawback of storing the binary data in Mongo comes from this site: "If the binary data is large, then loading the binary data into memory may cause frequently accessed text (structured data) documents to be pushed out of memory, or more generally, the working set might not fit into RAM. This can negatively impact the performance of the database." [1]

Example

Saving a file in GridFS, adapted from the Mongo GridFS tutorial

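const fs = require('fs');
const mongodb = require('mongodb');

const uri = 'mongodb://localhost:27017/test';

// Assumes driver 3.x or later, where connect() resolves to a MongoClient
// and client.db() returns the database named in the URI.
mongodb.MongoClient.connect(uri).then((client) => {
  const bucket = new mongodb.GridFSBucket(client.db());

  fs.createReadStream('./fasta-data.json')
    .pipe(bucket.openUploadStream('fasta-data.json'))
    .on('error', (error) => console.error(error))
    .on('finish', () => {
      console.log('done!');
      client.close();
    });
});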

Solution 2:

I have found a better way to solve this problem than the one I had implemented (the one described in the question). I just need to use Virtuals!

At first I thought that using forEach to add an extra element to each entry of the Fasta file would be slow. It is not; it is pretty fast!

I can do something like this for each Fasta file:

{
  parent: { type: mongoose.Schema.Types.ObjectId, ref: "Fasta" }, // new field: the id of the parent Fasta document
  field1: ">HWI-ST700660_96:2:1101:1455:2154#5@0/1",
  field2: "GAA…..GAATG"
}
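The forEach step mentioned above could look roughly like this. It is a sketch, not the code actually used: the function name tagWithParent is made up, and the field is called parent so that it matches the foreignField of the virtual.

```javascript
// Attach the parent Fasta document's id to every element parsed from one file.
function tagWithParent(elements, parentId) {
  elements.forEach((element) => {
    element.parent = parentId;
  });
  return elements;
}

// Placeholder elements; the id is the one used in the query example.
const elements = tagWithParent(
  [{ field1: '>read-1', field2: 'GAATG' }, { field1: '>read-2', field2: 'GAAGA' }],
  '5e93b9b504e75e5310a43f46'
);
console.log(elements);
```

The tagged array can then be inserted in one batch, for example with Mongoose's insertMany.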

Then something like this:

FastaSchema.virtual("healthy", {
  ref: "FastaElement",
  localField: "_id",
  foreignField: "parent",
  justOne: false,
});

Finally, populate:

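  // For "healthy" to appear in the JSON output, FastaSchema needs toJSON: { virtuals: true }.
  // `res` is assumed to be an Express response object.
  Fasta.find({ _id: ObjectId("5e93b9b504e75e5310a43f46") })
    .populate("healthy")
    .exec()
    .then((result) => res.json(result))
    .catch((error) => res.status(500).json({ error: error.message }));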

And the magic is done, with no subdocument overload! Populate applied to a virtual is pretty fast and causes no overload. I have not done it, but it would be interesting to compare this with conventional populate; in any case, this approach has the advantage that there is no need to create a hidden document to store the ids.
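To make the shape of the result concrete, here is an in-memory illustration of what populating the "healthy" virtual produces. It is not how Mongoose implements populate (Mongoose runs a second query against the referenced collection); the function populateVirtual and the sample documents are made up for illustration.

```javascript
// Attach to each parent the children whose foreignField equals the parent's localField.
function populateVirtual(parents, children, { name, localField, foreignField }) {
  return parents.map((parent) => ({
    ...parent,
    [name]: children.filter(
      (child) => String(child[foreignField]) === String(parent[localField])
    ),
  }));
}

// Placeholder documents standing in for the Fasta and FastaElement collections.
const fastas = [{ _id: 'a' }, { _id: 'b' }];
const fastaElements = [
  { parent: 'a', field1: '>read-1' },
  { parent: 'a', field1: '>read-2' },
  { parent: 'b', field1: '>read-3' },
];

const populated = populateVirtual(fastas, fastaElements, {
  name: 'healthy',
  localField: '_id',
  foreignField: 'parent',
});
console.log(JSON.stringify(populated, null, 2));
```

Because the link lives on the child documents, the parent never has to hold an array of ids, which is the advantage described above.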

I am speechless at how simple this solution is. It came to me while I was answering another question here!

Thanks to mongoose!
