A Scam Email with the Subject "Confirm Bio - Mem No. 512-564677" was received in one of Scamdex's honeypot email accounts on Fri, 23 May 2014 11:04:17 -0700 and has been classified as an Advance Fee Fraud/419 Scam. The sender was Updated Registry Info <Ethan__Cole@recyclerye.com>, although it may have been spoofed.
Dear Fckyou,

** Notice: Your nomination has been accepted into the 2014 Business Who's Who Registry.

Please take a moment and confirm your acceptance:
http://www.recyclerye.com/Registry/Acceptance/Membership/accepted.index

Member No. 97-4652

You have demonstrated leadership and earned a membership. Please confirm the info we have for your Bio. We look forward to working with you.

Regards,
Membership Admissions

/\/\/\ notification settings w/Alert System Notifications \/\/\ can be altered -by- writing1107 Valeria Dr.Marion.TX.78124 http://www.recyclerye.com/wi34h/fke3.ng4ot

True. And it works with PHP's built-in XPath and XSLTProcessor classes, which are great for extracting content. porneL Nov 27 '08 at 13:28

7 For really mangled HTML, you can always run it through htmltidy before handing it off to DOM. Whenever I need to scrape data from HTML, I always use DOM, or at least simplexml. Frank Farmer Oct 13 '09 at 0:41

4 I've been re-researching this, and discovered that the problem I was having with DomDocument's loadXML method was due to an older linked version of libxml. I've been working on more up-to-date systems and DomDocument::loadHTML works like a charm. Alan Storm Nov 21 '09 at 18:04

7 Another thing with loading malformed HTML is that it might be wise to call libxml_use_internal_errors(true) to prevent warnings that will stop parsing. Husky May 24 '10 at 17:51

5 Well, just a comment about your "real-world consideration" standpoint. Sure, there ARE useful situations for Regex when parsing HTML. And there are also useful situations for using GOTO. And there are useful situations for variable-variables. So no particular implementation is definitively code-rot for using it. But it is a VERY strong warning sign. And the average developer isn't likely to be nuanced enough to tell the difference. So as a general rule, Regex, GOTO and Variable-Variables are all evil. There are non-evil uses, but those are the exceptions (and rare at that)...
(IMHO) ircmaxell Sep 7 '10 at 12:11

@mario: Actually, HTML can be properly parsed using regexes, although usually it takes several of them to do a fair job at it. It's just a royal pain in the general case. In specific cases with well-defined input, it verges on trivial. Those are the cases that people should be using regexes on. Big old hungry heavy parsers are really what you need for general cases, though it isn't always clear to the casual user where to draw that line. Whichever code is simpler and easier, wins

I have used DOMDocument to parse about 1000 html sources (in various languages encoded with different charsets) without any issues. You might run into encoding issues with this, but they aren't insurmountable. You need to know 3 things: 1) loadHTML uses meta tag's charset to determine encoding 2) #1 can lead to incorrect encoding detection if the html content doesn't include this information 3) bad UTF-8 characters can trip the parser. In such cases, use a combination of mb_detect_encoding() and Simplepie RSS Parser's encoding / converting / stripping bad UTF-8 characters code for workarounds. Vasu Sep 19 '10 at 6:58

Yes, but DOMDocument does not support CSS a4095yh459yj4956hynd XPATH queries, just getElementById or getElementsByTagName? umpirsky Nov 16 '10 at 9:22

My problem with loadHTML is the extra nodes it inserts, which are presumably there to "fix" the HTML but aren't actually required by the DOM spec. As such, the result of a loadHTML call is ill defined. Would have been much better to have this sort of thing happen on saveHTML. CurtainDog Mar 3 '11 at

DOM does actually support XPath, take a look a

Vincent That is not what the docs mean by "safe" in this context. It is safe to raise SyntaxError or ValueError (which the calling code can catch and handle appropriately if necessary), rather than going ahead and evaling "import os; do_evil_stuff.." or whatever other string was passed in...
wim 2 days ago

But that doesn't make it any "safer" than using int("31") or float("545.2222"). The only advantage that I can see is that you don't have to know beforehand what type of mathematical expression you've got (which can be useful under certain circumstances, but is not what the OP was asking)

1e3 is a number in python, but a string according to your code. Cees Timmerman Oct 4 '12 at 13:24

It's good to have a decent, peer-reviewed roll-your-own version next to a good recommendation for a standard library. Sometimes I don't want to pull in another library for that one place where I need to parse urlencoded strings, and sometimes I might even have that library already in my dependency list. That both alternatives are listed as top answers is once again a great testimony to the SO community. Hanno Fietz May 19 '11 at 10:38

1 @Hanno Fietz you mean you trust these alternatives? I know they are buggy. I know pointing out the bugs I see will only encourage people to adopt 'fixed' versions, rather than themselves look for the bugs I've overlooked. Will May 19 '11 at 10:57

1 @Will - well, I would never just trust copy-and-paste snippets I got from any website, and no one should. But here, these snippets are rather well reviewed and commented on and thus are really helpful, actually. Simply seeing some suggestions on what might be wrong with the code is already a great help in thinking for myself. And mind you, I didn't mean to say "roll your own is better", but rather that it's great to have good material for an informed decision in my own code. Hanno Fietz May 23 '11 at 10:55

Anyway, assuming you are using UTF-8 or some other multi-byte character encoding, now that you've decoded one encoded byte you have to set it aside until you capture the next byte. You need all the encoded bytes that are together because you can't url-decode properly one byte at a time.
Set aside all the bytes that are together then decode them all at once to reconstruct your character

Plus it gets more fun if you want to be lenient and account for user-agents that mangle urls. For example, some webmail clients double-encode things. Or double up the ?&= chars (for example: . If you want to try to gracefully deal with this, you will need to add more logic to your

I imagine parse returns a list so that it maintains positional ordering and more easily allows duplicate entries

Aside from that, it's almost 5 times as fast as a nested try, except! Using lambda instead of def also saves 5% execution time. Tested with 32-bit Python 3.2 on 64-bit Windows 7. Cees Timmerman Oct 4 '12 at 13:55

Good point, Cees. Thanks. I appreciate benchmarking too :) How about a modified version of parseStr using regular expressions? It will probably hurt performance but someone might find it useful. The new parseStr function: parseStr = lambda x: x.isalpha() and x or x.isdigit() and int(x) or re.match('(?i)^-?(\d+\.?e\d+|\d+\.\d*|\.\d+)$',x) and float(x) or x krzym Oct 9 '12 at 11:20

Using re is almost twice as slow as the try, except method, even with the 3% faster version that uses only match. Tested using time.time() and range(1000000) on a quadcore Intel Xeon 2.93 GHz. Cees Timmerman Oct 9 '12 at 12:17

I ran a few tests using: parseStrRE = lambda x: x.isalpha() and x or x.isdigit() and int(x) or re.match('(?i)^-?(\d+\.?e\d+|\d+\.\d*|\.\d+)$', x) and float(x) or x and the try/except method modified to return strings if both int and float raise ValueError for the following test cases: ['1e3', '1.e3', '123', '-1234.12', 'e', 'ee', '1e', 'e2', '3hc1']. The execution times are in the ratio 2.7 (try/except) : 1.25 (parseStrRE) : 0.85 (original parseStr). The short-circuit expressions I employed speed things up, since the result might actually be returned by evaluating only a part of the expression
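
For reference, the try/except method that these comments benchmark against is not shown in the text. A minimal sketch of what it might look like follows, assuming it tries int first, then float, and falls back to the original string when both raise ValueError. The name parse_str is hypothetical, and this is not the commenters' actual code.

```python
def parse_str(text):
    """Return an int if possible, else a float, else the original string.

    Hypothetical helper: a sketch of the "try/except method modified to
    return strings" mentioned in the comments, not the original code.
    """
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            # This conversion failed; try the next one.
            continue
    # Neither int() nor float() accepted the input, so keep it as a string.
    return text


# The test cases listed in the comment above.
cases = ['1e3', '1.e3', '123', '-1234.12', 'e', 'ee', '1e', 'e2', '3hc1']
results = [parse_str(case) for case in cases]
```

Unlike the lambda versions, this sketch accepts everything int() and float() accept, so inputs such as 'nan', 'inf' or ' 12 ' are also converted to numbers. Whether that is desirable depends on the input you expect.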