Database Mirroring

From FamiLAB Wiki

Our world is filled with magnificent, useful databases. Our cell phones connect to a distributed database of phone numbers, iPhone and Android apps usually connect to some sort of database, and our contact list is a mini-database of its own. Myspace, Facebook, and LinkedIn are social networking databases that improve social mobility for those who are willing to try. Nearly every website, blog, and forum is backed by a database, and search engines like Google, Yahoo, and Bing are just databases of database-driven websites.

Goal

For useful database-driven services:

  • Write initial database mirroring/dumping script
  • Write incremental-update scripts
  • Set up and manage a cron job to update hourly, daily, or every other day
  • Provide datasets and dataset diffs to the public for free
  • Provide datasets in an easily accessible, normalized database for member projects
  • Build MySQL accessory to do ML class prediction on datasets
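
The incremental-update and dataset-diff goals above could be sketched roughly as follows. This is a minimal illustration, not a finished design: it uses Python's built-in sqlite3 instead of MySQL so that it runs without a server, and the table name `mirror`, the functions `open_mirror` and `apply_update`, and the assumption that every record is a dict with an `id` key are all hypothetical. It also assumes each run sees the full current record set, so anything missing is treated as removed.

```python
import hashlib
import json
import sqlite3


def record_hash(record):
    """Stable fingerprint of a record, used to detect changes between runs."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def open_mirror(path=":memory:"):
    """Open (or create) the local mirror. The schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mirror ("
        " id TEXT PRIMARY KEY, body TEXT NOT NULL, digest TEXT NOT NULL)"
    )
    return conn


def apply_update(conn, records):
    """Merge freshly fetched records into the mirror and return the diff."""
    diff = {"added": [], "changed": [], "removed": []}
    existing = dict(conn.execute("SELECT id, digest FROM mirror"))
    seen = set()
    for rec in records:
        rid = str(rec["id"])
        seen.add(rid)
        digest = record_hash(rec)
        if rid not in existing:
            diff["added"].append(rid)
        elif existing[rid] != digest:
            diff["changed"].append(rid)
        else:
            continue  # unchanged record, nothing to write
        conn.execute(
            "INSERT OR REPLACE INTO mirror (id, body, digest) VALUES (?, ?, ?)",
            (rid, json.dumps(rec, sort_keys=True), digest),
        )
    # Assumption: the caller passed the full record set, so unseen ids are gone.
    for rid in sorted(set(existing) - seen):
        conn.execute("DELETE FROM mirror WHERE id = ?", (rid,))
        diff["removed"].append(rid)
    conn.commit()
    return diff
```

The returned diff is the piece that could be published alongside the full dataset. A crontab entry along the lines of `0 * * * * python3 mirror.py` (the script name is hypothetical) would run such an update hourly.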

Why

Structured Query Language (SQL) was originally designed to be an end-user language. The current trend toward oversimplification and big-red-button interfaces takes most of the power of these databases away from us: we can't run aggregate functions through a web frontend that serves one record at a time. Ripping an entire database record by record is costly, but sometimes it must be done.
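
To illustrate what a local mirror buys you, here is a small sketch of an aggregate query that a record-by-record frontend cannot answer. The `listings` table, its columns, and the sample rows are invented purely for the example, and sqlite3 stands in for whatever database the mirror would actually live in.

```python
import sqlite3

# Hypothetical mirrored table with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (id INTEGER PRIMARY KEY, city TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO listings (city, price) VALUES (?, ?)",
    [("Orlando", 100.0), ("Orlando", 300.0), ("Sanford", 50.0)],
)


def summarize(conn):
    """Count rows and average the price per city in a single query."""
    return conn.execute(
        "SELECT city, COUNT(*), AVG(price) FROM listings"
        " GROUP BY city ORDER BY city"
    ).fetchall()


for city, count, avg_price in summarize(conn):
    print(city, count, avg_price)
```

Answering the same question through a web frontend would mean fetching every record individually and doing the arithmetic by hand.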

Caution

Do at least a cursory search of the website for a database-dump download before you hammer its bandwidth. Wikimedia offers database downloads, and Web 2.0 apps are likely to have APIs that make ripping easier. Database dumps are the best option but are rarely available, and APIs are likely to be limited to discourage database ripping. You might need to resort to DOM parsing.
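
When neither a dump nor an API is available, pulling records out of the page markup might look something like the sketch below. It assumes the records are rendered as rows of an HTML table, which will not hold for every site. Python's standard-library `html.parser` is an event-driven parser rather than a full DOM, so treat it as a stand-in for the technique. The class name `TableRowParser` and the helper `parse_rows` are hypothetical.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by sloppy markup


def parse_rows(html):
    """Return the table rows found in an HTML string as lists of cell text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Each returned row is a plain list of strings, ready to be inserted into a local table. Real pages are usually messier than this, so expect to adapt the tag handling per site.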

Databases I'd Like To Mirror

  • Correctional Facility Inmate Databanks
    • John E. Polk
    • Local jails
  • Myspace
  • Facebook
  • Twitter
  • LinkedIn
  • Blogs
  • Government Public Records
    • traditional 'public records': loan info, marriage certificates
    • tax info
    • Circuit Court arrest/offense record
    • DBA registrations
    • Incorporations
  • License plate :: person database?
  • Local businesses: locations, owners, etc.
    • Can we somehow derive expected income or market share from census data or something?
  • Census data
  • SpringerLink or other academic paper repository
  • Datasheet pdf repository
  • Google Books
  • Consumer hardware information and compatibility
  • Cell phone models
  • Stock data
  • Market data
  • Spam databases
  • Exploit/bugtrack databases
  • Wikipedia
  • MusicBrainz audio track information
  • Lyrics database
  • Newspaper, magazine, periodical publications
  • Bookmarks database
  • Thoughts Database