fastlane insider

Thread: Need to Collect a HUGE amount of data

  1. #1
    adiakritos is offline
    Fastlane Driver
    Reputation Speed
    10 kph

    Joined
    May 2011
    Age
    22
    Posts
    177
    Blog Entries
    32
    adiakritos's Avatar

    Need to Collect a HUGE amount of data

    I'm building a web service that requires a HUGE database of information.

    The thing is that there are already lots of websites out there that provide this information. My goal is to have the same information and to display it in a specific way and be able to manipulate certain attributes.

    For example, there are TONS of websites out there that list addresses of people or businesses. There are lots of websites that list product attributes like price, weight, etc.

    Instead of writing up this database manually, how could I import the info and write a program to search and manipulate the data?

  2. #2
    adiakritos is offline

    The databases I can find seem to encrypt the data so that it can only be searched using the programs offered by the organization that provides the database.

    I guess that is because they put all the hard work into building the database, so it just wouldn't be fair to let others take it for free. Plus people need a way to search the databases.

    Even if I had one, I don't know what format to get it in. One sketchy site is offering to sell a database I'd like in MySQL, CSV, MS Access and other formats.
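
For the format question, CSV is usually the easiest export to load into a SQL database and then query. Below is a minimal sketch using only Python's standard library. It assumes SQLite as a stand-in for MySQL so it runs without a server; the sample data, the `items` table, and the function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a purchased or downloaded CSV export.
SAMPLE_CSV = """name,price,weight_kg
Widget,9.99,0.5
Gadget,24.50,1.2
Gizmo,4.25,0.1
"""

def load_csv(conn, csv_file):
    """Create an `items` table and fill it from a CSV file object."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (name TEXT, price REAL, weight_kg REAL)"
    )
    reader = csv.DictReader(csv_file)
    rows = [(r["name"], float(r["price"]), float(r["weight_kg"])) for r in reader]
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

def items_cheaper_than(conn, max_price):
    """Return item names priced below `max_price`, cheapest first."""
    cur = conn.execute(
        "SELECT name FROM items WHERE price < ? ORDER BY price", (max_price,)
    )
    return [row[0] for row in cur]

# In-memory database; pass a file path instead to keep the data on disk.
conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, io.StringIO(SAMPLE_CSV))
cheap = items_cheaper_than(conn, 10)
print(loaded, cheap)
```

With a real export, `io.StringIO(SAMPLE_CSV)` would be replaced by `open("export.csv", newline="")`, where the file name is again just a placeholder.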

  3. #3
    adiakritos is offline

    OK, I found an Excel database. =DDDD

    I know someone would have helped me if I hadn't found the right answer in time. Thanks anyway.

  4. #4
    Pat
    Pat is offline
    Fastlane Driver
    Reputation Speed
    20 kph

    Joined
    Jun 2011
    Locale
    World Traveler
    Age
    24
    Posts
    186

    You can always scrape the data if you can find it on the internet.

    Good to hear you found the db, makes things much easier
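
As a rough illustration of what scraping involves, here is a minimal sketch that pulls names and addresses out of listing markup with Python's built-in `html.parser`. The HTML, the class names, and `ListingParser` are invented for the example; a real page needs its own selectors and a step to download it first.

```python
from html.parser import HTMLParser

# Hypothetical markup; a real site will use different tags and class names.
SAMPLE_HTML = """
<ul>
  <li class="listing"><span class="name">Acme Hardware</span>
      <span class="address">12 Main St</span></li>
  <li class="listing"><span class="name">Best Bakery</span>
      <span class="address">34 Oak Ave</span></li>
</ul>
"""

class ListingParser(HTMLParser):
    """Collect the text of <span class="name"> and <span class="address">."""

    FIELDS = {"name", "address"}

    def __init__(self):
        super().__init__()
        self.listings = []   # one dict per <li class="listing">
        self._field = None   # field whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "listing":
            self.listings.append({})
        elif tag == "span" and css_class in self.FIELDS and self.listings:
            self._field = css_class

    def handle_data(self, data):
        # Text outside the wanted spans is ignored.
        if self._field:
            self.listings[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

parser = ListingParser()
parser.feed(SAMPLE_HTML)
listings = parser.listings
print(listings)
```

The hard part mentioned later in the thread is exactly the `handle_starttag` logic: deciding which elements hold the data you want and ignoring everything else.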

  5. #5
    adiakritos is offline

    Originally posted by Pat:
    "Good to hear you found the db, makes things much easier"

    Yup!

  6. #6
    healthstatus is offline
    Fastlane Veteran
    Reputation Speed
    125 kph

    Joined
    Apr 2011
    Locale
    Indianapolis, IN
    Posts
    931

    healthstatus's Avatar

    Many of the business listing databases are available from the government if you know where to look and have the patience to dig.

  7. #7
    awjt is offline
    Fastlane Rookie
    Reputation Speed
    5 kph

    Joined
    Dec 2011
    Posts
    63

    The problem isn't the db format. Every web hosting service has MySQL or some other type of SQL database built in. The problem you'll run into is designing the scraper to drill down and get only the data you want (hard). Then, if you ever do amass a huge amount of data, you'll find SQL to be sorely lacking in speed unless you go to Oracle. And if your data gets even more massive, you'll find that even Oracle becomes prohibitive and you'll have to develop your own data structure mechanism to serve it up efficiently (even harder).

    The people who make these kinds of things happen just dive in and learn everything they need to know on their own to scrape, shred and serve massive data. There is no ideal solution, because every situation is different. If you're talking health data, keep in mind that all the public data is already served fairly efficiently, but all the good stuff (Medicare/Medicaid/Hospital & Payer databases) is under strict lock and key. You'll have to pay a LOT to get it and jump through a lot of hoops, and you could still come up short if you aren't a PhD researcher representing an established research organization.

    Good luck. Don't let me discourage you. I'm just telling it to you like it is. If you have a snazzy solution to some of the bullshit that health data researchers face every day, then more power to you and I wish you success.

  8. #8
    adiakritos is offline

    Yeah, it's 'huge' relative to the amount of time it would take me to manually copy or collect the information myself, although it's only going to be around 10,000-20,000 items to search through.

    I found one database that has 7,500 items in it and I'm hoping that will sustain my effort long enough to buy a larger database.

  9. #9
    awjt is offline

    7,500? I'm confused. I regularly work with SQL dbs with millions of records, and other kinds of files with billions.

  10. #10
    adiakritos is offline

    Yeah, I'm not going to be searching through millions of bits of data, just several thousand.

    I mean... if you multiply that by the number of details I want to scrape from each listing then sure, it might scale to the millions.

  11. #11
    awjt is offline

    Btw, you don't need to buy anything. There are open source dbs like PostgreSQL and all kinds of other stuff. The time is spent figuring out how to use them.

  12. #12
    adiakritos is offline

    Originally posted by awjt:
    "Btw, you don't need to buy anything. There are open source dbs like PostgreSQL and all kinds of other stuff. The time is spent figuring out how to use them."

    Thanks, this will definitely save me a lot of time.

    At some point I'm going to want a search field that allows me to search the db. I want results to be listed immediately underneath what the person types so that if they see what they want to choose early on they can click the exact item.

    If anyone's used MLXchange, it does this. Google does it too: when you search for something, it tries to guess what you're about to write before you finish by displaying a list of search options underneath.

    I'm guessing I'll have to pick up some AJAX for that.
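
On the server side, a search-as-you-type box usually comes down to a prefix query that the page calls on each keystroke. Here is a minimal sketch of that query, assuming SQLite and a hypothetical `items` table and `suggest` function; the AJAX part would simply send the typed text to an endpoint that returns this list.

```python
import sqlite3

def suggest(conn, typed, limit=5):
    """Return up to `limit` item names that start with what the user typed."""
    if not typed:
        return []
    # Escape LIKE wildcards so the user's input is matched literally.
    escaped = typed.replace("!", "!!").replace("%", "!%").replace("_", "!_")
    cur = conn.execute(
        "SELECT name FROM items WHERE name LIKE ? ESCAPE '!' "
        "ORDER BY name LIMIT ?",
        (escaped + "%", limit),
    )
    return [row[0] for row in cur]

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?)",
    [
        ("Main Street Bakery",),
        ("Maple Street Diner",),
        ("Mario's Pizza",),
        ("North Side Books",),
    ],
)

print(suggest(conn, "ma"))
print(suggest(conn, "mar"))
```

SQLite's `LIKE` ignores case for ASCII letters by default, which is why a lowercase "ma" matches here; other databases may need an explicit case-insensitive comparison.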

  13. #13
    awjt is offline

  14. #14
    BeachBoy is offline
    Fastlane Driver
    Reputation Speed
    20 kph

    Joined
    Jul 2011
    Posts
    250

    MySQL is no problem, especially if you don't need transactions.

    You can have millions and millions of records.

    Of course it depends on your hardware speed/config and also on how you store and fetch the data (multiple joins, indexing, types of columns, etc.).

    I would not even blink at the numbers you're talking about.
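
To make the indexing point concrete, here is a small sketch that asks the database how it plans to run a lookup before and after an index is added. It assumes SQLite rather than MySQL so it runs without a server; the table, the index name, and the `query_plan` helper are made up for the example.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return SQLite's description of how it would execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each row.
    return " | ".join(row[-1] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    ((f"item-{i}", i * 0.5) for i in range(20000)),
)

lookup = "SELECT price FROM items WHERE name = ?"

# Without an index the database has to scan every row.
plan_before = query_plan(conn, lookup, ("item-12345",))

# With an index on the searched column it can jump straight to the match.
conn.execute("CREATE INDEX idx_items_name ON items (name)")
plan_after = query_plan(conn, lookup, ("item-12345",))

price = conn.execute(lookup, ("item-12345",)).fetchone()[0]
print(plan_before)
print(plan_after)
print(price)
```

The exact wording of the plan differs between SQLite versions, but the second plan should name `idx_items_name`. At 20,000 rows either version returns quickly, which fits the point that this data size is nothing to worry about.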
