Today, using the site http://www.chegg.com/ as an example, we will learn how to parse information that is loaded by a JS script.
1) On this page, as described above, the content really is loaded by a JS script. But in the end the script also has to get its data from somewhere, which means it sends a request somewhere. We just need to find that request and then use it ourselves. For this we use Developer Tools (Ctrl + Shift + I, Network tab), which are available in the Chrome browser. Open the link above and examine it. We will see a lot of different requests and responses. By looking through the responses for the information that interests us (for example, you can search for the first title: "Cengage Advantage Books ..."), we find the request we need.
2) Check and analyze the request we found. To start, open it in a new browser tab and see whether it returns what we need.
As you can see, it does: there is the title of the book and the link to it (highlighted in green), as well as a lot of other information. If you look closely, you can see that the data is in JSON format. You can verify this by pasting the data into the Template tester in A-Parser and clicking Prettify JSON (highlighted with a green square).
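The same check can also be done outside A-Parser. Below is a minimal Python sketch, assuming the response body has been copied as text. The `sample` string is a made-up fragment for illustration, not the real Chegg response, and `prettify_json` is a hypothetical helper name.

```python
import json

def prettify_json(text):
    """Return the text as indented JSON; raises ValueError if it is not valid JSON."""
    return json.dumps(json.loads(text), indent=2, ensure_ascii=False)

# Hypothetical fragment; the real response contains many more fields.
sample = '{"title":"Good Books","pdp":"textbooks\\/good-books-1st-edition"}'
print(prettify_json(sample))
```

If the text is not JSON, `json.loads` raises an error, so a successful run is itself the confirmation. Note that decoding also turns the escaped `\/` into a plain `/`.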
3) As you can see, this was only the first page, which contains only 10 results. But what if you need all the rest? Let's analyze the request we found earlier:
http://www.chegg.com/_ajax/book/search/books/1?trackid=2aed0035&strackid=1514181a
If we replace the number in bold (the page number 1 in the path) with 2 and open the link in a browser, we see that the response is exactly what we need (page 2) in the already familiar JSON. Now all that remains is to load all of this into A-Parser.
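The page substitution can be sketched in Python as follows. This is only an illustration of the idea, assuming the page number is the last path segment of the request shown above. `page_url` and `page_urls` are hypothetical helper names, and the tracking parameters are copied from that request as-is.

```python
# Request template: the page number replaces the {page} placeholder.
BASE = ("http://www.chegg.com/_ajax/book/search/books/{page}"
        "?trackid=2aed0035&strackid=1514181a")

def page_url(page):
    """Build the request URL for a single results page."""
    return BASE.format(page=page)

def page_urls(first, last):
    """Yield request URLs for pages first..last inclusive."""
    for page in range(first, last + 1):
        yield page_url(page)

for url in page_urls(1, 3):
    print(url)
```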
4) For parsing we will use Net::HTTP. To extract the necessary data from the whole response, we use Parse custom result and regular expressions. By the way, you must be very careful with the regular expression in this example, because each result also contains "relatedItems" (something like similar books), and with them you can get 18 results from the first page instead of 10.
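To illustrate the "relatedItems" pitfall, here is a small Python sketch. The `sample` string is a made-up response whose layout is only an assumption about how related items might be nested. The field names `primaryAuthor`, `title`, and `pdp` and the anchored pattern are modeled on the regular expression stored in the preset at the end of this post, as far as it decodes.

```python
import re

# Hypothetical response: two books, the first with one related item.
sample = (
    '{"results":['
    '{"type":"book","primaryAuthor":"Gilbar, Steven","title":"Good Books",'
    '"pdp":"textbooks\\/good-books-1st-edition",'
    '"relatedItems":[{"title":"Related One","pdp":"textbooks\\/related-one"}]},'
    '{"type":"book","primaryAuthor":"Singer, A.","title":"Atrocious Books",'
    '"pdp":"textbooks\\/atrocious-books-1st-edition","relatedItems":[]}'
    ']}'
)

# Too loose: matches any title/pdp pair, including those inside relatedItems.
loose = re.compile(r'"title":"(.+?)".+?"pdp":"(.+?)"')

# Anchored on the start of a result object, so related items are skipped.
strict = re.compile(
    r'[,\[]\{"type":.+?"primaryAuthor":"(.*?)".+?"title":"(.+?)".+?"pdp":"(.+?)"'
)

loose_hits = loose.findall(sample)
strict_hits = strict.findall(sample)
print(len(loose_hits), len(strict_hits))
```

On this sample the loose pattern also picks up the related item, while the anchored one returns only the two books. With a different response layout the anchor would need to be adjusted.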
Notes:
- The input is the request found earlier, but with a number-iteration macro substituted for the page number. In this example, all pages from the 1st to the 1000th will be requested.
- The result is a file with the following information: Author - Title: Link to the page. You can optionally change the format of the result as you like by changing the regular expression and the data it collects, as well as by changing the Result format itself. Please note that the Result builder is also used here to replace "\/" with "/", because the "/" character is escaped in JSON.
Gilbar, Steven - Good Books: http://www.chegg.com/textbooks/good-books-1st-edition-9780899191270-0899191274?trackid=2aed0035&strackid=1514181a&ii=342
Singer, A. - Atrocious Books: http://www.chegg.com/textbooks/atrocious-books-1st-edition-9780952753797-0952753790?trackid=2aed0035&strackid=1514181a&ii=343
Lear, Edward - Nonsense Books: http://www.chegg.com/textbooks/nonsense-books-1st-edition-9781406741186-1406741183?trackid=2aed0035&strackid=1514181a&ii=344
Van Dyke, Henry - Companionable Books: http://www.chegg.com/textbooks/companionable-books-1st-edition-9781434489807-1434489809?trackid=2aed0035&strackid=1514181a&ii=345
Van Dyke, Henry - Companionable Books: http://www.chegg.com/textbooks/companionable-books-1st-edition-9781434489814-1434489817?trackid=2aed0035&strackid=1514181a&ii=346
Rodgers, Paul - Mind Books: http://www.chegg.com/textbooks/mind-books-1st-edition-9781596820258-159682025x?trackid=2aed0035&strackid=1514181a&ii=347
- Board Books: http://www.chegg.com/textbooks/board-books-1st-edition-9781858301129-1858301122?trackid=2aed0035&strackid=1514181a&ii=348
Ash, Russell - Bizarre Books: http://www.chegg.com/textbooks/bizarre-books-1st-edition-9781862051027-186205102x?trackid=2aed0035&strackid=1514181a&ii=349
Desantis, Christopher - Wicked Books: http://www.chegg.com/textbooks/wicked-books-1st-edition-9780615154800-0615154808?trackid=2aed0035&strackid=1514181a&ii=350
Sherman, William H. - Used Books: http://www.chegg.com/textbooks/used-books-1st-edition-9780812220841-0812220846?trackid=2aed0035&strackid=1514181a&ii=251
Larue, Monique - Between Books: http://www.chegg.com/textbooks/between-books-1st-edition-9780864925343-0864925344?trackid=2aed0035&strackid=1514181a&ii=252
Doyle, Robert P. - Banned Books: http://www.chegg.com/textbooks/banned-books-1st-edition-9780838985472-0838985475?trackid=2aed0035&strackid=1514181a&ii=253
Bierman, Larry - Two Books: http://www.chegg.com/textbooks/two-books-1st-edition-9781453715291-1453715290?trackid=2aed0035&strackid=1514181a&ii=254
Thackeray, William Makepeace - Sketch Books: http://www.chegg.com/textbooks/sketch-books-1st-edition-9781434414373-143441437x?trackid=2aed0035&strackid=1514181a&ii=255
Thackeray, William Makepeace - Sketch Books: http://www.chegg.com/textbooks/sketch-books-1st-edition-9781434414380-1434414388?trackid=2aed0035&strackid=1514181a&ii=256
...
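The result formatting described in the notes above can be sketched in Python like this. It is an illustration under assumptions: `format_book` is a hypothetical helper, and the captured link is assumed to be a path relative to the site root with no leading slash.

```python
def format_book(author, title, link):
    """Format one result as 'Author - Title: Link', unescaping JSON slashes."""
    clean_link = link.replace("\\/", "/")  # JSON escapes "/" as "\/"
    return f"{author} - {title}: http://www.chegg.com/{clean_link}"

# Hypothetical captured values, as a regular expression would return them.
line = format_book("Gilbar, Steven", "Good Books",
                   "textbooks\\/good-books-1st-edition")
print(line)
```

The replacement step is needed only because the values are cut out of raw JSON text with a regular expression. A JSON decoder would unescape the slashes on its own.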
Code:
eyJwcmVzZXQiOiJodHRwOi8vYS1wYXJzZXIuY29tL3RocmVhZHMvMTY5OSIsInZh
bHVlIjp7InByZXNldCI6Imh0dHA6Ly9hLXBhcnNlci5jb20vdGhyZWFkcy8xNjk5
IiwicGFyc2VycyI6W1siTmV0OjpIVFRQIiwiZGVmYXVsdCIseyJ0eXBlIjoib3Zl
cnJpZGUiLCJpZCI6ImZvcm1hdHJlc3VsdCIsInZhbHVlIjoiJGJvb2tzLmZvcm1h
dCgnJGF1dGhvciAtICR0aXRsZTogaHR0cDovL3d3dy5jaGVnZy5jb20vJGxpbmtc
XG4nKSJ9LHsidHlwZSI6ImN1c3RvbVJlc3VsdCIsInJlc3VsdCI6ImRhdGEiLCJy
ZWdleCI6IlssfFxcW117XCJ0eXBlXCI6Lis/XCJwcmltYXJ5QXV0aG9yXCI6XCIo
Lio/KVwiLis/XCJ0aXRsZVwiOlwiKC4rPylcIi4rP1wicGRwXCI6XCIoLis/KVwi
IiwicmVnZXhUeXBlIjoiZyIsInJlc3VsdFR5cGUiOiJhcnJheSIsImFycmF5TmFt
ZSI6ImJvb2tzIiwicmVzdWx0cyI6WyJhdXRob3IiLCJ0aXRsZSIsImxpbmsiXX1d
XSwicmVzdWx0c0Zvcm1hdCI6IiRwMS5wcmVzZXQiLCJyZXN1bHRzU2F2ZVRvIjoi
ZmlsZSIsInJlc3VsdHNGaWxlTmFtZSI6IiRkYXRlZmlsZS5mb3JtYXQoKS50eHQi
LCJhZGRpdGlvbmFsRm9ybWF0cyI6W10sInJlc3VsdHNVbmlxdWUiOiJubyIsInF1
ZXJ5Rm9ybWF0IjpbIiRxdWVyeSJdLCJ1bmlxdWVRdWVyaWVzIjpmYWxzZSwic2F2
ZUZhaWxlZFF1ZXJpZXMiOmZhbHNlLCJpdGVyYXRvck9wdGlvbnMiOnsib25BbGxM
ZXZlbHMiOmZhbHNlLCJxdWVyeUJ1aWxkZXJzQWZ0ZXJJdGVyYXRvciI6ZmFsc2V9
LCJyZXN1bHRzT3B0aW9ucyI6eyJvdmVyd3JpdGUiOmZhbHNlfSwiZG9Mb2ciOiJu
byIsImtlZXBVbmlxdWUiOiJObyIsIm1vcmVPcHRpb25zIjpmYWxzZSwicmVzdWx0
c1ByZXBlbmQiOiIiLCJyZXN1bHRzQXBwZW5kIjoiIiwicXVlcnlCdWlsZGVycyI6
W10sInJlc3VsdHNCdWlsZGVycyI6W3sic291cmNlIjpbMCxbImJvb2tzIiwibGlu
ayJdXSwidHlwZSI6InN0cmluZ1JlcGxhY2UiLCJhcnJheSI6ImJvb2tzIiwic2Vh
cmNoIjoiXFwvIiwicmVwbGFjZSI6Ii8iLCJ0byI6ImxpbmsifV0sImNvbmZpZ092
ZXJyaWRlcyI6W119fQ==