Character Encoding Problem?

scrapefun

I think I'm having a character encoding problem, but I can't tell whether it is with how A-Parser is set up or with something on my server.

I have a custom Net::HTTP parser that queries Google and then saves each results page as an HTML file. It works great, but for some reason a few queries will not save as a file.

They appear to work fine in the parser test, and the queries aren't failing; they simply aren't being saved. All the affected queries are non-English words or phrases.

I've included the parser code below. The one way I was able to get it to work was to change this line for the results file name:
serp_raw/[% IF p1.info.success == 1 %][% USE Math; "test_4"_ Math.int(query.num / 2500) _"/"_ query _".html" %][% END %]

To this:
serp_raw/[% IF p1.info.success == 1 %][% USE Math; "test_4"_ Math.int(query.num / 2500) _"/test.html" %][% END %]

With the updated line I can run one query at a time and the file is generated, but of course the file naming is no longer dynamic.

And I can then manually re-name the files with the correct query name with no problems.
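
For readers unfamiliar with the template syntax, here is a rough Python sketch of what the first file name line appears to compute. It assumes that `_` is string concatenation (as in Template Toolkit) and that `query.num` is the query's sequence number; `result_path` and `bucket_size` are hypothetical names used only for illustration, and the success check is left out.

```python
def result_path(query: str, num: int, bucket_size: int = 2500) -> str:
    """Mirror the template: serp_raw/test_4<bucket>/<query>.html"""
    # Math.int(query.num / 2500): every 2500 queries go into a new folder.
    bucket = int(num / bucket_size)
    # "test_4" is joined directly to the bucket number, so bucket 0 is "test_40".
    return "serp_raw/" + "test_4" + str(bucket) + "/" + query + ".html"


if __name__ == "__main__":
    print(result_path("example", 0))
    print(result_path("example", 2500))
```

Because the query text becomes part of the path, any character in the query has to survive the trip through the file system, which is where the non-English queries run into trouble.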


Here is the parser code, which includes the failing queries:
Code:
eyJwcmVzZXQiOiJUZXN0IC0gUkFXIEhUTUwiLCJ2YWx1ZSI6eyJwcmVzZXQiOiJU
ZXN0IC0gUkFXIEhUTUwiLCJwYXJzZXJzIjpbWyJOZXQ6OkhUVFAiLCJkZWZhdWx0
Iix7InR5cGUiOiJvdmVycmlkZSIsImlkIjoidXNlci1hZ2VudCIsInZhbHVlIjoi
TW96aWxsYS81LjAgKFdpbmRvd3MgTlQgNi4xOyBXT1c2NCkgQXBwbGVXZWJLaXQv
NTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgQ2hyb21lLzQ3LjAuMjUyNi4xMTEg
U2FmYXJpLzUzNy4zNiJ9LHsidHlwZSI6Im92ZXJyaWRlIiwiaWQiOiJmb3JtYXRy
ZXN1bHQiLCJ2YWx1ZSI6IlslIElGIGluZm8uc3VjY2VzcyA9PSAxICVdJHBhZ2Vz
LmZvcm1hdCgnJGRhdGFcXG4nKVslIEVORCAlXSJ9LHsidHlwZSI6Im92ZXJyaWRl
IiwiaWQiOiJwcm94eXJldHJpZXMiLCJ2YWx1ZSI6IjIwMCJ9LHsidHlwZSI6Im92
ZXJyaWRlIiwiaWQiOiJ1c2Vwcm94eSIsInZhbHVlIjp0cnVlfSx7InR5cGUiOiJv
dmVycmlkZSIsImlkIjoiZ29vZENvZGUiLCJ2YWx1ZSI6WzIwMF19LHsidHlwZSI6
Im92ZXJyaWRlIiwiaWQiOiJwcm94eWJhbm5lZGNsZWFudXAiLCJ2YWx1ZSI6IjAi
fSx7InR5cGUiOiJvdmVycmlkZSIsImlkIjoicXVlcnlmb3JtYXQiLCJ2YWx1ZSI6
Imh0dHBzOi8vd3d3Lmdvb2dsZS5jb20vc2VhcmNoP3E9JHF1ZXJ5JnB3cz0wJnV1
bGU9dytDQUlRSUNJTlZXNXBkR1ZrSUZOMFlYUmxjdyJ9LHsidHlwZSI6Im92ZXJy
aWRlIiwiaWQiOiJyZXF1ZXN0ZGVsYXkiLCJ2YWx1ZSI6IjAifSx7InR5cGUiOiJv
dmVycmlkZSIsImlkIjoidGltZW91dCIsInZhbHVlIjoiMzAifV1dLCJyZXN1bHRz
Rm9ybWF0IjoiJHAxLnByZXNldCIsInJlc3VsdHNTYXZlVG8iOiJmaWxlIiwicmVz
dWx0c0ZpbGVOYW1lIjoic2VycF9yYXcvWyUgSUYgcDEuaW5mby5zdWNjZXNzID09
IDEgJV1bJSBVU0UgTWF0aDsgXCJ0ZXN0XzRcIl8gTWF0aC5pbnQocXVlcnkubnVt
IC8gMjUwMCkgX1wiL1wiXyBxdWVyeSBfXCIuaHRtbFwiICVdWyUgRU5EICVdIiwi
YWRkaXRpb25hbEZvcm1hdHMiOltbImZhaWxlZC9mYWlsZWQudHh0IiwiWyUgSUYg
cDEuaW5mby5zdWNjZXNzID09IDAgJV0kcXVlcnlcXG5bJSBFTkQgJV0iXV0sInJl
c3VsdHNVbmlxdWUiOiJubyIsInF1ZXJpZXNGcm9tIjoidGV4dCIsInF1ZXJ5Rm9y
bWF0IjpbIiRxdWVyeSJdLCJ1bmlxdWVRdWVyaWVzIjp0cnVlLCJzYXZlRmFpbGVk
UXVlcmllcyI6ZmFsc2UsIml0ZXJhdG9yT3B0aW9ucyI6eyJvbkFsbExldmVscyI6
ZmFsc2UsInF1ZXJ5QnVpbGRlcnNBZnRlckl0ZXJhdG9yIjpmYWxzZSwicXVlcnlC
dWlsZGVyc09uQWxsTGV2ZWxzIjpmYWxzZX0sInJlc3VsdHNPcHRpb25zIjp7Im92
ZXJ3cml0ZSI6ZmFsc2V9LCJkb0xvZyI6ImRiIiwia2VlcFVuaXF1ZSI6Ik5vIiwi
bW9yZU9wdGlvbnMiOmZhbHNlLCJyZXN1bHRzUHJlcGVuZCI6IiIsInJlc3VsdHNB
cHBlbmQiOiIiLCJxdWVyeUJ1aWxkZXJzIjpbXSwicmVzdWx0c0J1aWxkZXJzIjpb
XSwiY29uZmlnT3ZlcnJpZGVzIjpbXSwicXVlcmllcyI6Ilx1YmU0NVx1YmM0NW1v
bnN0ZXJcdWI0ZTNcdWFlMzBcblx1MDQzNFx1MDQ0ZFx1MDQ0MyBcdTA0NDJcdTA0
MzhcdTA0M2FcdTA0M2UgXHUwNDNlXHUwNDQyXHUwNDM3XHUwNDRiXHUwNDMyXHUw
NDRiXG5nXHUyNjZkIG1ham9yXG5kciBwZXJvIHZyXHUwMTdlb2dpXHUwMTA3In19
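
The export above looks like base64-encoded JSON, so it can be inspected outside A-Parser. The following is a minimal Python sketch under that assumption; `decode_chunk` and `decode_export` are hypothetical helper names, not part of A-Parser.

```python
import base64
import json

# First line of the export above, copied verbatim (64 characters = 48 bytes).
FIRST_LINE = "eyJwcmVzZXQiOiJUZXN0IC0gUkFXIEhUTUwiLCJ2YWx1ZSI6eyJwcmVzZXQiOiJU"


def decode_chunk(chunk: str) -> str:
    """Decode one base64 chunk whose length is a multiple of 4."""
    return base64.b64decode(chunk).decode("utf-8")


def decode_export(text: str) -> dict:
    """Join the wrapped lines, decode them, and parse the result as JSON."""
    compact = "".join(text.split())  # drop line breaks and spaces
    return json.loads(base64.b64decode(compact).decode("utf-8"))


if __name__ == "__main__":
    print(decode_chunk(FIRST_LINE))
```

Passing the whole export to `decode_export` should return the task settings as a dictionary, assuming the export was copied completely.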
 
I'm also having trouble with queries like:

The "&" "+" and "#" characters don't seem to be passed/encoded properly.

Also, when saving the file, if a query has a "%" it won't be used in the filename even though that is an acceptable character for Windows filenames.

Even if I encode the query myself, it still doesn't work properly. I need to keep the queries in their original, unencoded form in my query list anyway; I just wanted to see what would happen with an already encoded query.
 
And I can then manually re-name the files with the correct query name with no problems.
I'm working on this issue; a new version will be released soon.

The "&" "+" and "#" characters don't seem to be passed/encoded properly.
You have to apply the escape filter:

lcpyj.png


Also, when saving the file, if a query has a "%" it won't be used in the filename even though that is an acceptable character for Windows filenames.
This will be fixed as well.
 
Thanks for the help!

For the escape filter, would I need to create a filter for each character I am having issues with, or is there a way to specify multiple characters in a single filter?

Also, I'm not clear on where to put this in my task settings. The screenshot providing the example is for the Google parser, but I am using the Net::HTTP parser.

Looking forward to the update for the other issues. Fantastic support as always!
 
For the escape filter, would I need to create a filter for each character I am having issues with, or is there a way to specify multiple characters in a single filter?

Also, I'm not clear on where to put this in my task settings. The screenshot providing the example is for the Google parser, but I am using the Net::HTTP parser.

Just replace $query with [% query | uri %] in your Query format.
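
To illustrate why the escaping matters, here is a small Python sketch, independent of A-Parser, of how an unescaped "&" or "#" changes what a server reads from a URL. The base URL is a simplified stand-in for the Query format, and `build_url` and `q_as_parsed` are hypothetical helper names. Python's `quote` is used here only as an analogue of the `uri` filter; the two are not guaranteed to behave identically.

```python
from urllib.parse import parse_qs, quote, urlsplit

BASE = "https://www.google.com/search?q="  # simplified stand-in for the Query format


def build_url(query: str, escape: bool) -> str:
    """Append the query to the base URL, percent-encoding it if requested."""
    value = quote(query, safe="") if escape else query
    return BASE + value


def q_as_parsed(url: str) -> str:
    """Return the value of the q parameter as a URL parser would see it."""
    query_string = urlsplit(url).query  # a "#" starts the fragment and is cut off
    return parse_qs(query_string, keep_blank_values=True).get("q", [""])[0]


if __name__ == "__main__":
    for text in ("at&t", "c# tutorial"):
        raw = q_as_parsed(build_url(text, escape=False))
        escaped = q_as_parsed(build_url(text, escape=True))
        print(repr(text), "raw ->", repr(raw), "| escaped ->", repr(escaped))
```

Without escaping, "&" starts a new parameter and "#" starts the URL fragment, so the search term is cut short; with escaping, the full text comes back intact.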
 
haha...couldn't be much easier than that :-)

Will you post here when the update is ready or should I just check the RU forum for the latest updates?

Thanks for the help!
 
I tested with the latest update. The file naming issues I was having seem to be fixed, and most characters are passed correctly after applying the escape filter, but I'm still having problems with some queries.

Mainly those containing "+", "&", "<", and ">" characters.

Here are some examples with the original phrase on the left of the "=" and what is returned in the Google result page I'm downloading on the right:



Granted, some of these are pretty much nonsense for testing purposes, but I need to be able to submit these characters properly. It could very well be that I'm doing something wrong on my end, of course :)
 
<a href="url">click here to view pdf file</a> = &lt;a href="url"&gt;click here to view pdf file&lt;/a&gt;

Those aren't valid symbols for HTML text: you can't use < and > (and several other symbols) directly in HTML. That is why you get &lt; &gt; &amp; etc. These are called "HTML entities".

You will get exactly the same from Google in a browser:

2cb8e.png
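
The same conversion can be reproduced outside the browser. This is a minimal Python sketch using the standard `html` module; it only illustrates how HTML entities work and is not what A-Parser or Google run internally.

```python
import html

# The example query from this thread.
original = '<a href="url">click here to view pdf file</a>'

# Escape &, < and > the way a page must when it shows this text literally.
# quote=False leaves the double quotes alone, as in the saved results page.
escaped = html.escape(original, quote=False)

# A browser reverses the escaping when it renders the page.
restored = html.unescape(escaped)

if __name__ == "__main__":
    print(escaped)
    print(restored == original)
```

So the saved file holds the escaped form, and a text editor shows it as stored, while a browser shows the restored text.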
 
Thanks for the explanation on that. I was just checking in my text editor instead of browser so didn't see those rendered correctly.

This explains what I was seeing with all the characters except the "+".

When I view files for those queries in either my text editor or browser the "+" are not there. It's like they have been left off.

I see this for all the queries containing a "+" . They don't seem to be there.
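
One possible explanation, not confirmed in this thread: in a URL query string, a literal "+" is conventionally decoded as a space, so it only survives if it is sent as %2B. The Python sketch below shows that convention with the standard library; `as_form_value` is a hypothetical name, and whether the requests in this task actually send a literal "+" is an assumption.

```python
from urllib.parse import quote, unquote_plus


def as_form_value(raw: str) -> str:
    """Decode a query-string value the way form decoding does: '+' becomes a space."""
    return unquote_plus(raw)


if __name__ == "__main__":
    literal = "c++"                  # "+" sent as-is
    encoded = quote("c++", safe="")  # "+" sent percent-encoded
    print(repr(literal), "->", repr(as_form_value(literal)))
    print(repr(encoded), "->", repr(as_form_value(encoded)))
```

If the downloaded pages show spaces where the "+" characters were, checking whether the request URL contains %2B would be a reasonable first step.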

Thanks for all the help and patience.
 