
lxml is nice. I would, as suggested, parse and scrape in different threads so you can speed things up a bit, but it's not required per se. If you can't get the data you see on the website using lxml, there may be AJAX or other client-side code involved; to capture those streams/data, use a headless browser like PhantomJS or similar. The article looks good to me for 'simple' scraping and is a good base to start playing with the concepts.
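For instance, a minimal lxml sketch (the HTML, class names, and XPath expressions here are made up for illustration; in practice you'd fetch the page first):

```python
# Minimal lxml example: parse an HTML document and pull values out with XPath.
# The HTML is inlined for illustration; normally you'd download it first.
from lxml import html

page = """
<html><body>
  <h1 class="title">Example Product</h1>
  <span class="price">19.99</span>
</body></html>
"""

tree = html.fromstring(page)
title = tree.xpath('//h1[@class="title"]/text()')[0]
price = float(tree.xpath('//span[@class="price"]/text()')[0])
print(title, price)  # Example Product 19.99
```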

The nice thing about making a scraper from scratch like this is that you get to decide its behaviour and fingerprint, and you won't get blocked as some known scraper. That being said, most people would appreciate it if you parsed their robots.txt; depending on your geographical location this might be an 'extra' step which isn't legally required... (I'd advise doing it anyway if you are a friendly ;) and maybe set the request's User-Agent to something like 'i don't bite' to let people know you are benign...) If you get blocked while trying to scrape, you can try to fool the site into thinking you are a browser just by setting the User-Agent and other headers appropriately. If you don't know which headers those are, open `nc -nlvp 80` on your local machine and point wget or Firefox at it to see what headers a real client sends.
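Both ideas in a small Python sketch, using the standard library's robots.txt parser (the bot name and sample rules are placeholders):

```python
# Check robots.txt before fetching, and send a friendly User-Agent.
# urllib.robotparser is in the standard library; the bot name is made up.
from urllib.robotparser import RobotFileParser

HEADERS = {"User-Agent": "i-dont-bite-bot/1.0 (friendly scraper)"}

def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if robots.txt permits `agent` to fetch `path`."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, path)

robots = """User-agent: *
Disallow: /private/
"""

print(allowed(robots, "i-dont-bite-bot", "/public/page.html"))    # True
print(allowed(robots, "i-dont-bite-bot", "/private/secret.html")) # False
```

You'd then pass `HEADERS` along with every request you make (e.g. via `urllib.request.Request(url, headers=HEADERS)`).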

Deciding on good XPath expressions or 'markers' to scrape can be automated, but if you need accurate data from a single source, it's often a good idea to go through the HTML manually and pick out some reliable markers.

An alternate method of scraping is automating wget --recursive plus links -dump to render HTML pages to text output, then grepping (or whatever) through those for the data you need... Tons of methods can be devised; depending on your needs, some will be more practical and stable than others.
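A rough sketch of that pipeline driven from Python (this assumes the wget and links binaries are installed; the URL and price regex are placeholders for illustration):

```python
# Sketch of the wget + links + grep pipeline driven from Python.
# Assumes the `wget` and `links` binaries exist on PATH; the URL and
# the "price" regex below are placeholders.
import re

def mirror_command(url: str) -> list[str]:
    """wget invocation for a shallow recursive mirror."""
    return ["wget", "--recursive", "--level=2", "--no-parent", url]

def render_command(html_path: str) -> list[str]:
    """links invocation that renders a saved page to plain text."""
    return ["links", "-dump", html_path]

def grep_prices(text: str) -> list[str]:
    """Stand-in for grep: pull out anything that looks like a price."""
    return re.findall(r"\$\d+(?:\.\d{2})?", text)

# To actually run it (not executed here):
#   import subprocess
#   subprocess.run(mirror_command("https://example.com/"), check=True)
#   text = subprocess.run(render_command("example.com/index.html"),
#                         capture_output=True, text=True).stdout
#   print(grep_prices(text))
```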

Saving files is only useful if you need assurance of data quality, or if you want to be able to tweak the parsing without having to re-request the data from the server (just point the parser at a local data directory instead...). This way you can set up a harvester and separate parsers for that data.
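A small sketch of that harvest-then-parse split (the directory name and file-naming scheme are arbitrary choices, not anything from the article):

```python
# Harvester saves raw responses to disk; the parser reads from disk,
# so you can re-run and tweak parsing without re-requesting anything.
import hashlib
from pathlib import Path
from urllib.request import urlopen

DATA_DIR = Path("harvested")  # arbitrary local data directory

def cache_path(url: str) -> Path:
    """One file per URL, named by a hash of the URL."""
    return DATA_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")

def harvest(url: str) -> Path:
    """Fetch `url` once and store the raw bytes; skip if already cached."""
    DATA_DIR.mkdir(exist_ok=True)
    path = cache_path(url)
    if not path.exists():
        path.write_bytes(urlopen(url).read())
    return path

def parse(path: Path) -> str:
    """The parser works purely from the local copy."""
    return path.read_bytes().decode("utf-8", errors="replace")
```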

If you want to scrape or harvest LARGE data sets, consider a proxy network or something like a Tor-connection-juggling Docker setup to ensure rate limiting doesn't kill your harvesters...
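A toy sketch of rotating requests through a proxy pool (the proxy addresses are placeholders; a real setup would point these at your Tor/proxy instances):

```python
# Rotate outbound requests across a pool of proxies so no single exit
# IP trips the target's rate limit. The addresses below are placeholders.
import itertools

PROXIES = [
    "http://127.0.0.1:8118",  # e.g. proxy in front of Tor instance 1
    "http://127.0.0.1:8119",  # Tor instance 2
    "http://127.0.0.1:8120",  # Tor instance 3
]

_pool = itertools.cycle(PROXIES)

def next_proxy() -> str:
    """Hand each request the next proxy in round-robin order."""
    return next(_pool)

# With the `requests` library you would then do something like:
#   requests.get(url, proxies={"http": p, "https": p})
for _ in range(4):
    print(next_proxy())
```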

Good luck, have fun, and don't kill people's servers with your traffic spam; that's a dick move... (Throttle/humanise your scraping...)
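A minimal throttle sketch along those lines (the delay bounds are arbitrary):

```python
# Humanise request pacing: sleep a random interval between requests
# instead of hammering the server at full speed. Bounds are arbitrary.
import random
import time

def polite_delay(min_s: float = 1.0, max_s: float = 3.0) -> float:
    """Sleep for a random, human-ish interval and return it."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Call `polite_delay()` between each request in your fetch loop.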


