From: Eric P. <Eric@Peters.org> - 2010-01-07 21:53:14
|
You effectively need to enumerate the values for *Year - Term*, then *Department*, then *Course Location*, save all of those combinations into a request queue, then go through the queue one combination at a time to hit "search results", and loop over each results page with an XPath to save the data out. Any competent Java/Scala programmer should be able to do this, no problemo.

Here's some Scala code (it uses the HtmlUnit Java libraries) that logs in to Amazon Associates, uses the JavaScript navigation and drop-downs, and downloads an earnings report.

import java.io.InputStream
import java.text.SimpleDateFormat
import java.util.Date
import com.gargoylesoftware.htmlunit.{BrowserVersion, Page, WebClient}
import com.gargoylesoftware.htmlunit.html._

// logger, FlatFileReader and AmazonReader are my own helpers (not shown here).
def amazonData(startDate:Date, endDate:Date) = {
  // Initialize Date Range Information
  val yearFormat = new SimpleDateFormat("yyyy")
  val monthFormat = new SimpleDateFormat("M")
  val dayFormat = new SimpleDateFormat("d")

  val startYear:String = yearFormat.format(startDate)
  // Fix Month - the report form requires 0-11
  val startMonth:String = (monthFormat.format(startDate).toInt - 1).toString
  val startDay:String = dayFormat.format(startDate)

  // The end values must come from endDate (not startDate)
  val endYear:String = yearFormat.format(endDate)
  val endMonth:String = (monthFormat.format(endDate).toInt - 1).toString
  val endDay:String = dayFormat.format(endDate)

  // Log in
  val webClient = new WebClient(BrowserVersion.FIREFOX_2)
  val page = webClient.getPage[HtmlPage]("http://affiliate-program.amazon.com")
  logger.info("titleText (page1) => {}", page.getTitleText())

  val form:HtmlForm = page.getFormByName("sign_in")
  //val submit:HtmlSubmitInput = form.getInputByName("submitbutton");
  val userField:HtmlTextInput = form.getInputByName("email")
  userField.setValueAttribute("username")
  val passField:HtmlPasswordInput = form.getInputByName("password")
  passField.setValueAttribute("password")
  val button:HtmlImageInput = form.getInputByValue("Sign In")
  val associateHome:HtmlPage = button.click()
  logger.info("titleText (page2) => {}", associateHome.getTitleText())

  // Follow the mini-report link to the earnings report, i.e.
  //gp/associates/network/reports/report.html?ie=UTF8&reportType=earningsReport&periodType=preSelected&preSelectedPeriod=monthToDate
  val link:HtmlAnchor = associateHome.getFirstByXPath("//div[@id='mini-report']//a")
  var earningsReport:HtmlPage = link.click()
  logger.info("titleText (earnings report) => {}", earningsReport.getTitleText())

  // Switch to combined reports if that is not already the default
  val combinedForm:HtmlForm = earningsReport.getFormByName("idbox_combined_reports_form")
  val combinedCheck:HtmlCheckBoxInput = combinedForm.getInputByName("combinedReports")
  if(!combinedCheck.isDefaultChecked()) {
    logger.info("clicking combined")
    earningsReport = combinedCheck.click()
  }
  logger.info("titleText (earnings report) => {}", earningsReport.getTitleText())

  // Choose an exact date range in the drop-downs
  val reportForm:HtmlForm = earningsReport.getFormByName("htmlReport")
  val radioExact:HtmlRadioButtonInput = reportForm.getInputByName("periodType")
  radioExact.setValueAttribute("exact")

  val selectStartYear:HtmlSelect = reportForm.getSelectByName("startYear")
  val selectStartMonth:HtmlSelect = reportForm.getSelectByName("startMonth") // 0-11
  val selectStartDay:HtmlSelect = reportForm.getSelectByName("startDay")
  selectStartYear.setSelectedAttribute(startYear, true)
  selectStartMonth.setSelectedAttribute(startMonth, true)
  selectStartDay.setSelectedAttribute(startDay, true)

  val selectEndYear:HtmlSelect = reportForm.getSelectByName("endYear")
  val selectEndMonth:HtmlSelect = reportForm.getSelectByName("endMonth")
  val selectEndDay:HtmlSelect = reportForm.getSelectByName("endDay")
  selectEndYear.setSelectedAttribute(endYear, true)
  selectEndMonth.setSelectedAttribute(endMonth, true)
  selectEndDay.setSelectedAttribute(endDay, true)

  // Download the CSV and hand the stream to the flat-file reader
  val downloadCSVSubmit:HtmlImageInput = reportForm.getInputByName("submit.download_CSV")
  val in:InputStream = downloadCSVSubmit.click[Page]().getWebResponse().getContentAsStream()
  //var = p
  val reader = FlatFileReader(in)
  reader.hasHeaders = true
  reader.skipLines = 1

  // def foreach(f: FlatFileRowReader => Unit) = {
  // pass in a function that
  reader.foreach( AmazonReader ).toString
}

On Thu, Jan 7, 2010 at 1:13 PM, mark douglas <bad...@gm...> wrote:
> Hi Dave,
>
> My thoughts exactly. For someone who's familiar with java/htmlunit,
> this is probably extremely easy. However, I'm not familiar with either
> one! If I could talk/email with someone who's done this I could get a
> working test to see if this is really the right approach.
>
> By the way, I'm more than willing to do a complete writeup/tutorial
> for someone to post if this works... I've seen bits/pieces of articles
> that discuss parts of how to use htmlunit for fetching/parsing, but
> nothing that goes from start to end..
>
> If you have the skills for this kind of thing, get back in touch with
> me. I'd like to discuss where this is all going..
>
> thanks
>
> -bruce
>
>
> On Thu, Jan 7, 2010 at 12:02 PM, Gable, David
> <dav...@bo...> wrote:
> > From my experience (500+ HtmlUnit tests and a couple of similar site
> > search tools) this kind of thing would be almost trivial to implement in
> > HtmlUnit or one of the script-based variants. Where you write it would
> > depend on what you want to do with the data once it is extracted,
> > assuming you have a developer to work on it.
> >
> > Dave
> >
> >> -----Original Message-----
> >> From: mark douglas [mailto:bad...@gm...]
> >> Sent: Thursday, January 07, 2010 2:44 PM
> >> To: asa...@ya...; asa...@us...; Htmlunit-us...@li...
> >> Subject: [Htmlunit-user] Looking for Ahmed Asahour to discuss a project!
> >>
> >> Hi Ahmed,
> >>
> >> We're working on a project that deals with parsing sites that have
> >> dynamic content via javascript. As you already know, we basically need
> >> a headless browser, which seems to be the role of htmlunit.
> >>
> >> We'd like to talk to you about what we're doing, and whether HtmlUnit
> >> would be/should be the right tool for what we're dealing with.
> >>
> >> As an example, one of the sites we're looking at is the florida state
> >> (fsu) course site.
> >> The url is:
> >> http://apps.oti.fsu.edu/RegistrarCourseLookup/SearchForm
> >>
> >> Our goal is to be able to simulate fetching the departments from the
> >> dept select/option list, and then to be able to fetch the generated
> >> course list.
> >>
> >> Both the dept/course list are dynamically generated via javascript.
> >>
> >> We're also open to talk to anyone else who might reply to the HtmlUnit
> >> list, although we haven't seen a great deal of traffic on the mailing
> >> list!
> >>
> >> Thanks
> >>
> >> Tom..
> >>
> >> ------------------------------------------------------------------------------
> >> This SF.Net email is sponsored by the Verizon Developer Community
> >> Take advantage of Verizon's best-in-class app development support
> >> A streamlined, 14 day to market process makes app distribution fast and easy
> >> Join now and get one step closer to millions of Verizon customers
> >> http://p.sf.net/sfu/verizon-dev2dev
> >> _______________________________________________
> >> Htmlunit-user mailing list
> >> Htm...@li...
> >> https://lists.sourceforge.net/lists/listinfo/htmlunit-user
|
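The enumerate-then-queue approach described at the top of this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread: `SearchRequest`, `CourseCrawlSketch`, `buildQueue` and `drain` are hypothetical names, the drop-down values would really be read from the form's select lists, and `fetch` stands in for the HtmlUnit step that submits the search form and pulls the rows out with an XPath.

```scala
import scala.collection.mutable

// Hypothetical holder for one combination of the three search-form drop-downs.
final case class SearchRequest(term: String, department: String, location: String)

object CourseCrawlSketch {

  // Build every (term, department, location) combination, nested in that order,
  // and store the combinations in a FIFO request queue.
  def buildQueue(terms: Seq[String],
                 departments: Seq[String],
                 locations: Seq[String]): mutable.Queue[SearchRequest] = {
    val queue = mutable.Queue.empty[SearchRequest]
    for {
      t <- terms
      d <- departments
      l <- locations
    } queue.enqueue(SearchRequest(t, d, l))
    queue
  }

  // Drain the queue, calling `fetch` once per request and collecting every row.
  // `fetch` is an assumption: a placeholder for "submit the form, then XPath the results".
  def drain(queue: mutable.Queue[SearchRequest])
           (fetch: SearchRequest => Seq[String]): Seq[String] = {
    val rows = mutable.ArrayBuffer.empty[String]
    while (queue.nonEmpty) {
      rows ++= fetch(queue.dequeue())
    }
    rows.toSeq
  }
}
```

Keeping the queue separate from the fetch step means the list of combinations can be inspected or trimmed before any request is made, and a failed request can be put back on the queue and retried without rebuilding the other combinations.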