I've set up MongoDB on an Ubuntu AWS instance. I have 920 files ranging in size from 5 MB to 2 GB or so.
Once each unzipped text file has been uniq'd, I run the following script to insert them into the DB:
    require 'mongo'
    require 'bson'

    Mongo::Logger.logger.level = ::Logger::FATAL

    puts "working..."
    db = Mongo::Client.new([ 'localhost:27017' ], :database => 'supers')
    coll = db[:hashes]

    # suppressors = File.open('_combined.txt')
    suppressors = Dir['./_uniqued_*.txt']
    count = suppressors.count

    puts "found #{count}"

    suppressors.each_with_index do |fileroute, i|
      suppressor = File.open(fileroute, 'r')

      percentage = ((i + 1) / count.to_f * 100).round(2)
      puts "working on `#{fileroute}` (#{i + 1}/#{count} - #{percentage})"

      c = 0
      # upsert one document per line; the _id is the hash itself
      suppressor.each_line do |hash|
        c += 1
        coll.update_one({ :_id => hash }, { :$inc => { :count => 1 } }, { upsert: true })
        puts "processed 50k records #{fileroute}" if c % 50_000 == 0
      end
    end
The idea is that if a record already exists, $inc will set its count to 2 or 3, so I'll be able to find the duplicates later by running a query against the DB.
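For example, this is roughly the query I have in mind for pulling out the duplicates afterwards (a sketch against the same coll handle as in the script above; count is the field the script $inc's):

    # Every _id whose count went above 1 was seen more than once across the files.
    coll.find('count' => { '$gt' => 1 }).each do |doc|
      puts "#{doc['_id']} appeared #{doc['count']} times"
    end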
I connected to the instance via Robomongo, and at first, every time I refreshed the following query:
    db.getCollection('hashes').count({})
I'd see it filling up the DB quickly. Since there are lots of files, I figured I'd leave it running overnight.
However, after a while the result got stuck at 3788104. I got worried that there was some hard size limit (df says I'm only using 35% of the HDD space).
Is there something in the config file that automatically limits the number of records that can be inserted, or something along those lines?
PS: Is it just me, or is either upsert or .each_line incredibly slow?
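For what it's worth, here is a rough sketch of what batching the upserts could look like, assuming the Ruby driver's bulk_write is available (the batch size and structure are just illustrative, not something I've benchmarked):

    # Collect upserts and send them in batches instead of one round trip per line.
    BATCH_SIZE = 10_000
    ops = []

    suppressor.each_line do |hash|
      ops << { update_one: { filter: { _id: hash },
                             update: { '$inc' => { count: 1 } },
                             upsert: true } }
      if ops.size >= BATCH_SIZE
        coll.bulk_write(ops, ordered: false)
        ops = []
      end
    end

    coll.bulk_write(ops, ordered: false) unless ops.empty?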
MongoDB's update model is based on write concerns, meaning that calling the function updateOne alone does not guarantee success.
If your version of MongoDB is at least 2.6, the function updateOne returns a document with information about any errors. If your version of MongoDB is older, an explicit call to the getLastError command returns a document with possible errors.
If the database does not contain the desired documents, the returned document will contain errors.
In both cases, the write concern can be adjusted to the desired level, i.e., it gives you control over how many mongo instances must have propagated the change for it to be considered a success.
(Note: I'm not familiar with the Ruby driver, so I'm assuming it behaves like the shell.)
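That said, here is a rough sketch of how checking each update's result might look with the Ruby driver, assuming its 2.x result object exposes write-result accessors comparable to the shell's WriteResult (the accessor names below are my assumption, not something I have verified):

    # Sketch: use an acknowledged write concern and inspect the result of each upsert.
    client = Mongo::Client.new(['localhost:27017'],
                               :database => 'supers',
                               :write    => { :w => 1 })   # acknowledged writes
    coll = client[:hashes]

    result = coll.update_one({ :_id => 'somehash' },
                             { '$inc' => { :count => 1 } },
                             { :upsert => true })

    puts result.acknowledged?   # false means no confirmation was requested/received
    puts result.matched_count   # 1 if an existing document was $inc'd
    puts result.upserted_count  # 1 if the upsert created a new document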