Wikipedia talk:WikiProject Wiki Syntax/archive01

Bad syntax

Every page says that "Fix those 5 problems with bad syntax". I'd rather use good or appropriate one :-) - Skysmith 07:56, 10 Nov 2004 (UTC)

Definitely ambiguous! I'll get it changed for next time to read "Fix those 5 syntax problems". All the best, Nickj 23:05, 10 Nov 2004 (UTC)

nowiki

Just note - if the sample text contains '<nowiki>' tags itself, they're not escaped (for example see (now deleted) second entry on 'Phylogenetic tree' on square-brackets-018.txt page) - this should be fixed (eg. replacing them with '&lt;nowiki&gt;') before next run, as it renders incorrectly and may be confusing. JohnyDog 13:54, 10 Nov 2004 (UTC)

Good point, thank you for that. For the next run I'll first replace all greater-than and less-than symbols with their HTML codes (as you suggest), and then surround it with nowiki tags (as per usual), which should prevent this. All the best, Nickj 23:05, 10 Nov 2004 (UTC)
- I suggest you also replace occurrences of & with &amp; so other entities don't get translated. Eric119 07:00, 18 Nov 2004 (UTC)
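The escaping order discussed above can be sketched as follows. This is only an illustration in Python, not the project's actual code; the function name `escape_sample` is made up for the example. It relies on the standard library's `html.escape`, which replaces `&` before `<` and `>`, so existing entities in the sample are shown literally instead of being translated.

```python
import html


def escape_sample(text: str) -> str:
    """Escape &, < and > in sample text, then wrap it in nowiki tags.

    Hypothetical helper for illustration only.
    """
    # quote=False leaves ' and " alone; only &, < and > are replaced,
    # and & is handled first so "&sup2;" becomes "&amp;sup2;".
    escaped = html.escape(text, quote=False)
    return "<nowiki>" + escaped + "</nowiki>"


print(escape_sample("see <nowiki>[[x]</nowiki> and &sup2;"))
```

Escaping the ampersand first matters: if it were done last, the freshly produced `&lt;` and `&gt;` would themselves be turned into `&amp;lt;` and `&amp;gt;`.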

This is a bit silly. nowiki tags should be used for when some markup should not render, not to help out bots. One can simply add articles with intentionally misplaced brackets to an exclusion list, in future. Dysprosia 03:21, 13 Dec 2004 (UTC)

I feel your pain, but if you take a step back, you might see things differently. The meaning of the <nowiki> tag is really that anything inside should be displayed literally rather than interpreted as wiki-markup, which is a reasonable abstraction of what the project is partly intended to address. On the gripping hand, simply marking a whole article as "don't touch this" is bound to cause problems if other errors come to be introduced further down the line—I don't think any of us are fooling ourselves that no contributor is ever going to bork up the mark-up in a fixed article ever again. HTH HAND --Phil | Talk 15:26, Dec 13, 2004 (UTC)

Erm - I don't really understand what's "a bit silly"? The original subject (from JohnyDog) was that tags that appear in the sample text on the WikiSyntax pages should be escaped, so they don't make the sample text all wonky. This doesn't seem to have anything to do with intentionally misplaced brackets or anything on article pages. It seems like you're objecting to the suggestion to put nowiki tags around wikisyntax that should not be rendered as wikisyntax on a page, which I think Phil Boswell answered. But I'm not sure if that's it. Er. JesseW 17:42, 13 Dec 2004 (UTC)

Semantics - nowiki means don't parse as markup. There are instances where intentional misplacement can occur and markup be used, which makes the wikitext a mess. Dysprosia 23:01, 13 Dec 2004 (UTC)

Perhaps instead of <nowiki></nowiki>, one could use <tt>, <code>, <pre> or leading-space markup for exclusion? (What cases can arise where intentional misplacement occurs outside these tags?) Dysprosia 23:01, 13 Dec 2004 (UTC)

Because:

This line has a leading space, but links and formatting still work, so the Wiki syntax of this line still matters.

This line is in tt tags, but links and formatting still work, so the Wiki syntax of this line still matters.

And code tags are already treated as special (they get treated identically to nowiki tags) - so if people would rather use <code> instead of <nowiki>, then that's no problem - from the point of view of this project it's all the same. All the best, -- Nickj 23:25, 13 Dec 2004 (UTC)

So leading-space markup is excluded or isn't it? I say this because most cases where intentionally misplaced bracketing is used, is used with leading-space markup. Dysprosia 06:25, 17 Dec 2004 (UTC)

If it has a leading space, it is still being checked. You haven't told us specifically what it is that you're concerned about, but reading between the lines, is it that you're mostly concerned about source code examples? Things like:

c = x[2];
b = y[4][2];

If so, they're OK (because the brackets are balanced). It's only where the brackets aren't balanced, like:

z = x[2]]

or are split across lines :

x = y [
        2 ];

that we'll notice it. In which case, it can be surrounded by nowiki or code tags. Note also for square brackets that a complete pass was made of all articles in November (full lists made, and all problems fixed), so if the articles you're thinking about are older than November, then they've probably already been done. Basically in all types of checks (with the single exception of standard parentheses), we've now done a first pass, and for some things (e.g. redirects, double quotes) we've now done 2 passes, and for some things we've even done third passes (triple quotes, braces-tables, headings). Each successive pass gets smaller and smaller, until eventually we're only fixing the recently added stuff that was malformed (things seem to reach this stage after completing 3 passes). Hope that helps. If not, can you maybe indicate which article/articles and which part of them you're concerned about? (At the moment we're both talking in the abstract). All the best, -- Nickj 07:10, 17 Dec 2004 (UTC)
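A rough sketch of the per-line check described above, in Python. It is an assumption about how such a check could work, not the project's real code: bracket tokens are read longest-first (so "[[" is one token, not two), matched pairs cancel on a stack, and whatever is left over on the line is reported. The names `TOKEN`, `PAIRS` and `residue` are invented for the example.

```python
import re

# Longest alternatives first, so "[[" and "]]" are single tokens.
TOKEN = re.compile(r"\[\[|\]\]|\[|\]")
# Each closing token and the opening token it cancels.
PAIRS = {"]": "[", "]]": "[["}


def residue(line: str) -> list[str]:
    """Return the bracket tokens left on one line after matched pairs cancel."""
    stack: list[str] = []
    for tok in TOKEN.findall(line):
        if tok in PAIRS and stack and stack[-1] == PAIRS[tok]:
            stack.pop()        # a closer cancels the opener just before it
        else:
            stack.append(tok)  # openers, and closers with nothing to cancel
    return stack


for line in ["c = x[2];", "b = y[4][2];", "z = x[2]]", "x = y [", "        2 ];"]:
    print(repr(line), "->", " and ".join(residue(line)) or "balanced")
```

Because each line is checked on its own, a bracket opened on one line and closed on the next is reported twice, once per line, which matches the behaviour described for brackets split across lines.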

I noticed this originally on Eicar test file. But what I am more concerned about is Objective-C code, for one, eg

[[SomeObject something:[someInstance somethingElse] method:x] method3]

is legal Objective-C. Clearly the braces aren't double-bracket balanced, but having to wrap these in nowiki tags is not helpful. I've noticed people changing these to contain nowikis before, so I'm not sure if bracket balancing (not double-bracket balancing) was added later. Dysprosia 21:53, 17 Dec 2004 (UTC)

Well, the current situation for the Objective-C line shown above is that it would be listed in the Square Brackets category, as having an unbalanced "[[ and ] and ]", as the two middle square brackets cancel each other out, and the double square brackets are different to two single square brackets (so they don't cancel each other out). When someone went to "fix" this, they would probably either do this:

<nowiki>[[SomeObject something:[someInstance somethingElse] method:x] method3]</nowiki>

Or possibly even this:

<nowiki>[[</nowiki>SomeObject something:[someInstance somethingElse] method:x<nowiki>]</nowiki> method3<nowiki>]</nowiki>

Err ... Which is, of course, extremely easy to read! ;-)

OK, so that's the current situation, but how about this: At the moment we don't treat <pre> tags as special. However testing them has shown that they actually are special - they're basically a leading space plus a nowiki tag, all in one easy short tag. For example:

On leading space lines, links work (i.e. we should check the wiki syntax).

Whereas in <pre> tags, [[links]] do not work (i.e. we should not check the wiki syntax).

So what if instead of this (line with leading space):

[[SomeObject something:[someInstance somethingElse] method:x] method3]

You did this (line surrounded by pre):

<pre>[[SomeObject something:[someInstance somethingElse] method:x] method3]</pre>

And I changed the software so that handling of pre tags was improved. That way your source code example is still clean and readable, and we're eliminating false positives. Would that be an acceptable compromise? All the best, -- Nickj 22:59, 17 Dec 2004 (UTC)
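One way the proposed handling could look, as a hedged Python sketch: regions inside nowiki, code and pre tags are removed before any syntax check runs, while leading-space lines are left in place and still checked. The names `EXEMPT` and `strip_exempt` are hypothetical, and the sketch assumes lowercase tags without attributes.

```python
import re

# Regions assumed to be exempt from checking: nowiki, code and pre.
# DOTALL lets a region span several lines; \1 forces the closing tag
# to have the same name as the opening tag.
EXEMPT = re.compile(r"<(nowiki|code|pre)>.*?</\1>", re.DOTALL)


def strip_exempt(wikitext: str) -> str:
    """Remove exempt regions so later checks never see their contents."""
    return EXEMPT.sub("", wikitext)


sample = "Intro\n<pre>[[SomeObject something:[x y] method:z] m3]</pre>\n [[link]] on a leading-space line"
print(strip_exempt(sample))
```

The leading-space line survives the stripping, so its link syntax is still checked, which is the distinction drawn above between leading-space lines and pre tags.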

It's kind of annoying to have to wrap things in pre tags, isn't it? That's why the leading-space markup is so useful (and should be used most of the time in preference to pre because it is blindingly clear in the wikitext, plus you can use markup for highlighting also). I understand you want to trap all the cases where mismatched links can arise, but realistically, when are links actually used in leading-space markup? I can't immediately see any other reasonable way out of this. Dysprosia 11:33, 18 Dec 2004 (UTC)

Perhaps it is possible to use a bit of special logic to exclude the Objective-C cases? For example, you could do something like the following

if possible syntax problem on line
    if brackets mismatched (eg, "[[ ]")
        flag article
    if brackets not mismatched
        if space between opening/closing brackets (eg, "[[ ] ]" or "[ [ ]]")
            flag article
    if single bracket with parenthesis mismatch (eg, "[ )" or "( ]")
        if no comma between [ and ) or ( and ]
            flag article

which may trap the cases for you already, and excludes the Objective-C cases and half-open interval cases automatically (of course, implementing it is another matter ;) Dysprosia 02:01, 21 Dec 2004 (UTC)
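The pseudocode above leaves some details open, so the following Python sketch is just one possible reading of it, not a tested replacement for the project's checker. Assumptions: "mismatched" means the counts of "[" and "]" characters differ; a bracket-with-parenthesis pair that contains a comma is treated as an interval and set aside before counting (otherwise "[a, b)" would already fail the count test); the names `SPACED`, `MIXED` and `should_flag` are invented.

```python
import re

# "[ [" or "] ]": two brackets separated only by whitespace.
SPACED = re.compile(r"\[\s+\[|\]\s+\]")
# "[ ... )" or "( ... ]" with no other bracket or parenthesis in between.
MIXED = re.compile(r"\[([^\[\]()]*)\)|\(([^\[\]()]*)\]")


def should_flag(line: str) -> bool:
    """Return True if the line looks like a genuine bracket problem."""
    for m in MIXED.finditer(line):
        inner = m.group(1) if m.group(1) is not None else m.group(2)
        if "," not in inner:
            return True               # e.g. "[link)" is flagged
    rest = MIXED.sub("", line)        # set interval-like pairs aside
    if rest.count("[") != rest.count("]"):
        return True                   # e.g. "[[ ]" is flagged
    return bool(SPACED.search(rest))  # e.g. "[[ ] ]" or "[ [ ]]"


objc = "[[SomeObject something:[someInstance somethingElse] method:x] method3]"
for line in [objc, "[a, b)", "[[ ]", "[[ ] ]", "[link)"]:
    print(repr(line), "->", "flag" if should_flag(line) else "ok")
```

Under these assumptions the Objective-C line and the half-open interval pass, while the three malformed examples from the pseudocode are flagged. Counting characters is deliberately cruder than token matching, so some real errors with equal counts would slip through.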

May I get some feedback on this please? Dysprosia 08:03, 31 Dec 2004 (UTC)

I've been on Christmas break, hence the slow reply. To be honest, the answer to your question is "no, I don't think pre tags are annoying". You might wish that leading space tags meant "don't apply any wiki syntax", but the fact is, they don't mean that. Pre tags do. Nowiki tags do. But leading space lines don't. Consequently, it is appropriate to check the syntax of those lines. Also, with the checking, this project (unashamedly) is applying a more stringent level of syntax checking than the Wikipedia itself does. This might seem strange, but the facts are that the vast bulk of things that get caught as errors really are wiki syntax errors. Furthermore, being stringent is also useful because nobody can truly guarantee that every future version of the Wikipedia will be feature-for-feature and bug-for-bug rendering compatible with the current version - so things that display fine at the moment with invalid syntax may display wrong in the future. I saw examples of this during the recent 1.4 upgrade on user pages - things that rendered fine in 1.3, now rendered wrong (and I suspect part of the reason I didn't see this on article pages was in part because of this project). Consequently, I think being strict about syntax, and tagging things (as unobtrusively as possible) that look like syntax errors (but are not) is the correct approach. Personally, I don't think your suggestions are the right approach, but if you'd like to implement your suggestions, you can - I'll send you a ZIP of the source code, and you can send me a patch. Just send me a quick email. -- All the best, Nickj (t) 00:09, 2 Jan 2005 (UTC)

I think you've unfortunately misunderstood what I'm getting at, on both counts. I'm not saying that I wish leading space markup meant don't apply wiki syntax, but I was saying that on a practical level, where are links used in leading-space markup anyway? (that is to say, I am guessing that it is highly uncommon for there to be many instances)

However, I recognize this is contrary to your aims, so I suggested a bit of pseudocode to make the search smarter, not less stringent. A smarter search that excludes false positives will ultimately make the work easier for you and participants, while still flagging syntax errors and removing the need to introduce inappropriate tags. That serves both our needs!

Yet, my workload is somewhat high at the moment and I probably won't be able to have a good muck around with your code for some time. What language is it written in? When I get a chance, I might take you up on your offer :) Thanks, and I hope you had a nice Christmas and New Year. Dysprosia 07:26, 2 Jan 2005 (UTC)

Good marketing, dude

I love the [[Wikipedia:WikiProject Wiki Syntax|Please return the favour by clicking here to fix someone else's Wiki Syntax]] innovation, which seems to work well. Mind, I'd like to see your Data Protection registration for the Thank You to Contributors list ... I have a sneaking suspicion that the names will be dragooned into later projects, having shown themselves susceptible to this sort of appeal ;) --Tagishsimon (talk)

Maybe we need to start a [[Wikipedia:Do Not Call]] list? --pjf 02:38, 11 Nov 2004 (UTC)

Well, I think anybody trying to abuse this list will get more bad publicity than help. Besides, Wikipedia is a community project, so everybody with an account here could be viewed as 'susceptible' :) - JohnyDog 15:59, 11 Nov 2004 (UTC)

Dashes

It would also be good to replace all those double hyphens in the Wikipedia articles with proper dashes. I'm not sure where to suggest this or how to start, but this seems a good place. Shantavira 10:52, 11 Nov 2004 (UTC)

Not sure of the current status, but there's been a lot of debate about dashes, including whether the wiki software could "automatically" convert double-hyphens to em-dashes or en-dashes, whether there should be an official proclamation on the One True Dash in the style guide, and so on. I'd avoid making dashes part of this particular project, as it's focused specifically on problems with wiki syntax, not style guide issues (which are inherently a lot more controversial). I recommend you head over to Wikipedia:Manual of Style (dashes) for the latest on this topic. -- Avaragado 21:17, 11 Nov 2004 (UTC)

1911 Britannica

I have come across something whilst working on the double-quotes sections. I have seen the unclosed ''This article incorporates text from the public domain 1911 Encyclopædia Britannica. I think it's better to replace that line with {{1911}} which results in This article incorporates text from the public domain 1911 Encyclopædia Britannica. and also lists the article in the 1911 Britannica category. I have taken it upon myself to add the info to each of the remaining double-quotes pages. --Martin TB 12:05, 11 Nov 2004 (UTC)

Good idea! As part of the third batch I've done a search for articles that look to include that text rather than being in the 1911 category as preferred. There were 52 articles all up, so I've made them into a list, and added the list to the main project page. All the best, --Nickj 22:52, 29 Nov 2004 (UTC)

Please be careful

Please be careful. I've seen many problem edits in several mathematics articles as a result of this page. For example, some editors consider set notations like {1, 2, {3, 4}} to be wrong, or replace half-open interval notations like "[a, b)" with "[a, b]", or insert massive "nowiki" blocks, which are not ideal. Perhaps some serious cautionary wording, as well as examples of "false positives" to avoid, needs to be included on this project page. Paul August 19:10, Nov 11, 2004 (UTC)

Ok having actually read the project page I see now that some of these "issues" are addressed on it. Just not effectively enough apparently. I also see you are recommending the "nowiki" insertions, not sure I agree with this. I think they are confusing at best and problematic at worst. Paul August 19:10, Nov 11, 2004 (UTC)

I tend to agree with you on this matter. I came across those the last time this dump was run and left them as is. The problem that has now arisen is that someone, such as myself, who is trained in math and realizes when they are right and wrong, could have put in the nowiki tags and prevented someone who is untrained from invalidating what used to be good equations. I did ask to have the warnings added (which you found), but I have been thinking that the best thing to do would be to nowiki tag them so as to alert people that the author knew what he was doing when he wrote it that way. Cavebear42 19:50, 11 Nov 2004 (UTC)

Nowiki is unfortunately a long tag, and it can reduce readability. But I agree that if it's not used, there's no way to tell between a legitimate open bracket and a malformed external link. We should add nowiki to math pages and other pages which use brackets and braces. Rhobite 20:04, Nov 11, 2004 (UTC)

Yes. As the main page grows longer and longer, I suggest moving the FAQ to a separate page (maybe rewriting some questions as guidelines, along with sample problems and solutions, as you're suggesting) and linking it at the top of all syntax-* pages (like "Before fixing anything please read the guidelines"). As for the math, most of the brackets should be fixed by now (a huge portion of the problems are now quotes), so it shouldn't be that much of a problem in future - JohnyDog 20:24, 11 Nov 2004 (UTC)

Missing Links

I'm certain that there are missing links in here. I just can't find them, for example, Wikipedia:WikiProject Wiki Syntax/ordinary-brackets-001.txt is a valid page, that I got to by changing the URL, but it is not listed, nor is it commented out. In the last... day or so, this page has been massacared (heh - sp.) and there must be *dozens* of valid links that are not mentioned. Am I just being dumb or have they really been removed for no reason? They don't seem to be complete or anything. Estel 19:18, Nov 12, 2004 (UTC)

If you scroll down to the bottom of the How Does This Work? section you'll see a completed pages bullet which shows the pages done so far. JohnyDog cleaned the page yesterday. I'll bold it to make it clearer. -- Martin TB 20:17, 12 Nov 2004 (UTC)

The ordinary brackets pages were removed by Nickj a few days ago with (Cull all but one ordinary brackets - probably better to fix the other types of problems first). I'm the one who 'massacred' the main page yesterday, but I've double checked that nothing is missing :) - JohnyDog 20:21, 12 Nov 2004 (UTC)

Yes, I removed the links to the ordinary brackets (parentheses) pages. Currently these are all valid pages:

Additionally, I've got files here that would cover the range ordinary-brackets-018 through to ordinary-brackets-134 inclusive (but these aren't in the Wikipedia yet). My thinking was that the other things (i.e. just quotes at the moment) were maybe better to fix before these. Most of the ordinary brackets problems seem to be things that have been scanned-in or created in a text editor, with new lines every 80 characters or so - so most of the time it's a question of just removing the new lines (although maybe about 15 to 25% of the time there really are mismatched parentheses). Also the sheer number of ordinary-brackets pages is pretty huge (and so would probably take ages and ages to get through), and (if you want to be really pedantic) parentheses aren't a wiki syntax (although they're easy to find if you're already looking for the other stuff). One possibility is to add the links to the ordinary brackets lists to the main page, but only after the quotes are done, and then not to hold up the next run if there are still ordinary brackets that aren't fixed (i.e. outstanding unfixed ordinary brackets would not block the next run). Another thing I was unsure about was maybe dropping the searching for mismatched parentheses entirely from the next run, but I'm really not sure. One other complication is that intentional mismatched parentheses shouldn't really be put in nowiki tags, unlike the other kinds of syntax (since they're not a wiki syntax) - which means that if there are valid uses of mismatched parentheses, then those things are likely to turn up over and over in successive runs. What do you folks reckon? Should ordinary brackets be part of the Wiki Syntax project or not (or should they only be done for a few runs, and then dropped)? If yes, should other types of problems take priority? Should a new run be blocked if there are outstanding unfixed ordinary brackets? All the best, Nickj 23:36, 12 Nov 2004 (UTC)

Should the case where the (ordinary) brackets are separated by a newline really be considered an error? From Wikipedia POV it's perfectly valid syntax. In most cases (as you said) it's more a problem of the whole article - removing one newline really won't help anything, and fixing all newlines in an article is IMHO far beyond the scope of this project. As we cannot exclude valid uses from showing up again and again, I'd say let's drop it for now. Also, does the script list pages with problems like "[link)" in both ordinary-brackets and square-brackets? If yes, it's another reason for at least postponing it for the next run. - JohnyDog 01:56, 13 Nov 2004 (UTC)

Re: "From Wikipedia POV it's perfectly valid syntax" - Didn't know that, and that's a very good reason to not fiddle with them.

Re: "Also, does the script list pages with problems like "[link)" in both ordinary-brackets and square-brackets?" - No - the categories and exactly what they included are:

"ordinary-brackets", array("(", ")")
"double-quotes", array("''")
"triple-quotes", array("'''")
"square-brackets", array("]", "[", "[[", "]]", "[ and ]]", "[[ and ]", "[ and )", "( and ]", "( and ]]", "] and )", "]] and ''", "( and [", ") and ]", "( and [[", "]] and )", "'' and ]]", "] and [")
"headings", array("==", "====", "===", "== and ===", "==== and ==", "== and ====", "=== and ==", "==== and ===", "=== and ====")
"miscellaneous", array(") and (", "'' and )", "( and ''", "'' and (", ") and ''", "]] and [[", "'' and ]]", "] and ==", "'' and ] and )", "'' and [[", "''' and [[", "]] and '''", "'' and ]", "[[ and )", ") and ]]", "( and ==", "[ and (", "<!--", "( and [[ and ]", "[ and ''", "== and )", "[[ and (", ") and [", "== and ]]", "[ and ( and ]]", "[[ and ''", "''' and )", "[[ and ] and [", "] and (", "''' and ==", "[[ and ] and )", "]] and [[ and ]", "] and ]]", ") and [[", "'' and ) and (", "] and [ and )", ") and ==== and ==", "== and (", "[[ and '''", "] and [ and (", ") and ( and [", "'' and [", "] and ''", "]] and (", "( and [ and ]]", "] and [ and ]]", "( and '''", "== and [[ and [", "[ and ]] and )", "]] and ] and ==", "[[ and ==")
"braces-tables", array("{{", "|}", "{|", "}}")

(i.e. the half-open [x,y) interval notation stuff would have come under square-brackets, so they've been done.)
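For illustration, the category lists above amount to a lookup from a problem "signature" (the leftover tokens joined with " and ") to a category name. A minimal Python sketch follows; it copies only a few of the signatures listed above, so it is deliberately incomplete, and the names `CATEGORIES`, `SIGNATURE_TO_CATEGORY` and `categorise` are made up for the example.

```python
# Abbreviated copy of the lists above; the real lists are much longer.
CATEGORIES = {
    "ordinary-brackets": ["(", ")"],
    "double-quotes": ["''"],
    "triple-quotes": ["'''"],
    "square-brackets": ["]", "[", "[[", "]]", "[ and ]]", "[[ and ]", "[ and )", "( and ]"],
    "headings": ["==", "====", "==="],
    "braces-tables": ["{{", "|}", "{|", "}}"],
}

# Invert the table once so each signature can be looked up directly.
SIGNATURE_TO_CATEGORY = {
    signature: category
    for category, signatures in CATEGORIES.items()
    for signature in signatures
}


def categorise(signature: str):
    """Return the category for a signature, or None if it is not in this subset."""
    return SIGNATURE_TO_CATEGORY.get(signature)


print(categorise("[ and )"))  # the half-open interval case
print(categorise("("))
```

With this layout a signature such as "[ and )" belongs to exactly one category, which is why a "[link)" problem is listed under square-brackets only.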

Also the approximate total size of each category for this run was:

ordinary-brackets: 16085
double-quotes: 8546
square-brackets: roughly 3500
triple-quotes: 1637
miscellaneous: roughly 300
headings: roughly 100
braces-tables: roughly 100

So, if we can eliminate parentheses, then we cut the amount of stuff we need to do by more than half. All the best, Nickj 02:45, 13 Nov 2004 (UTC)
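The "more than half" figure can be checked directly from the approximate sizes above; the short Python snippet below does the sum. The counts are copied from the list, including the rough ones, so the result is only approximate.

```python
# Approximate category sizes for this run, as listed above.
counts = {
    "ordinary-brackets": 16085,
    "double-quotes": 8546,
    "square-brackets": 3500,
    "triple-quotes": 1637,
    "miscellaneous": 300,
    "headings": 100,
    "braces-tables": 100,
}

total = sum(counts.values())
share = counts["ordinary-brackets"] / total
print(f"total entries: {total}")
print(f"ordinary-brackets share: {share:.1%}")
```

Ordinary brackets come to a little over 53% of roughly 30,000 entries.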

Some more things to check

When I edited Varda, Greece for unclosed bold text, I found an unclosed <div> block that narrowed the rest of the page, and character entities without the final semicolons, which are also found on many pages for Japanese towns. Can these be searched for? Susvolans 12:00, 15 Nov 2004 (UTC)

Hi. This was the subject of a bot request by me. See the discussion at Wikipedia:Bot_requests#Simple_regex_bot.3F_2, which led to this offline report about the &sup2 malformed entity: User:Topbanana/Reports/This article contains a malformed HTML entity. I don't know if you want to go fix this by hand, I would think it's something a bot would be suited for. Hope this helps. --ChrisRuvolo 10:55, 18 Nov 2004 (UTC)

Good stuff - That works out really well, because I've now added something to report on unclosed or unopened div tags (these problems will be listed in the upcoming third batch). So between this and Topbanana's reports, this should catch both of these types of problems. All the best, Nickj 21:49, 22 Nov 2004 (UTC)

Cool. You should note that this iteration of the report only shows malformed sup2 entities. lt, gt, amp, nbsp, mdash, sup1, sup3, and numeric entities (eg. &#913;) should also be checked (I expect those to be the most common). Checking for a missing leading ampersand might also be useful. Hope this has been helpful. Thanks. --ChrisRuvolo 22:38, 24 Nov 2004 (UTC)
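A hedged sketch of how the entities named above could be searched for when the final semicolon is missing, written in Python with a regular expression. It is not the report's actual query; `NAMES`, `UNTERMINATED` and `unterminated_entities` are hypothetical names, and the check for a missing leading ampersand is not attempted.

```python
import re

# The named entities mentioned above, plus decimal numeric entities.
NAMES = "lt|gt|amp|nbsp|mdash|sup1|sup2|sup3"
# The lookahead rejects a following semicolon (entity is fine) and a
# following letter or digit (so "&#913;" is not misread as "&#91").
UNTERMINATED = re.compile(r"&(?:(?:%s)|#\d+)(?![0-9A-Za-z;])" % NAMES)


def unterminated_entities(text: str) -> list[str]:
    """Return entity-like strings that lack their final semicolon."""
    return UNTERMINATED.findall(text)


print(unterminated_entities("Area: 25 km&sup2 (well-formed: 25 km&sup2;)"))
```

The pattern is conservative: an entity name run straight into following letters is not reported, which avoids false hits at the cost of missing a few real errors.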

Wikicode comments

I've noticed a lot of changes recently closing italicization marks ('') that I use for endline comments in pseudocode. I actually left them unclosed on purpose, letting the end-of-line close them — should I avoid doing this? Deco 20:36, 15 Nov 2004 (UTC)

I agree that using a single '' at the beginning of a line or paragraph to italicise the whole of it, which a lot of the double quote issues are, is not entirely obviously wrong. Susvolans 09:26, 16 Nov 2004 (UTC)

But Wilmot N. Hess contained a line like that and it was wrong. Susvolans 11:05, 17 Nov 2004 (UTC)

The problem with this is when someone later appends text which shouldn't be italicised - it's mainly a problem for filmographies - names of films on separate lines with someone later adding year, comment, link etc to the same line, without closing the italicisation. JohnyDog 15:45, 17 Nov 2004 (UTC)

But in end-line comments, the comment always extends to the end of the line. Any added text would be part of the comment, and so should be italicized. This doesn't apply here. Deco 18:33, 17 Nov 2004 (UTC)

Yes, unfortunately we cannot automatically distinguish between cases where it's left on purpose and where not (afaik that is). Any ideas how to solve this ? - JohnyDog 20:21, 17 Nov 2004 (UTC)

I suppose not; in the case of standard wikicode, it suffices to exclude unclosed '' marks which are immediately followed by //. But you already got most of them anyway. Deco 03:36, 18 Nov 2004 (UTC)
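The exclusion suggested above could look roughly like this in Python. It is a sketch under assumptions: only exact double-apostrophe marks are counted (triple quotes are ignored), and a line is excused only when its last mark is immediately followed by "//". The names `ITALIC` and `unclosed_italic_needs_fixing` are invented.

```python
import re

# Exactly two apostrophes, not part of a longer run such as '''.
ITALIC = re.compile(r"(?<!')''(?!')")


def unclosed_italic_needs_fixing(line: str) -> bool:
    """True if the line has an unclosed '' that is not an end-of-line comment."""
    marks = list(ITALIC.finditer(line))
    if len(marks) % 2 == 0:
        return False  # every '' is paired on this line
    # An odd count means the last mark is the unclosed one.
    return not line.startswith("//", marks[-1].end())


print(unclosed_italic_needs_fixing(" x := x + 1  ''// increment the counter"))
print(unclosed_italic_needs_fixing("''Casablanca (1942)"))
```

This only covers the pseudocode comment convention; an unclosed mark at the start of a filmography line is still reported.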

Keep up the momentum

Would it be possible to make a fresh search before the existing to-do list is finished, to keep people busy? Susvolans 10:54, 17 Nov 2004 (UTC)

That's up to Nickj. The problem is that we'll have to wait for the next database dump, after the last page is fixed (should occur in a week). Of course we could run a search on the current dump (2 days ago) for the problems that were fixed by that time (iirc everything except double quotes), however I would rather see it all done in sync. - JohnyDog 20:30, 17 Nov 2004 (UTC)

Hmm, I think it would be a bad idea. One of the motivations of a hefty task like this is the feeling when you cross the finishing line. Take that away and it can feel like a never-ending slog - hamsters on a wheel. Let's finish this, bask in the glow, then see what's next. -- Avaragado 20:36, 17 Nov 2004 (UTC)

But something has to be done. With all the links that have been made to this page, people are going to continue to see the project for the first time, right up until the project is completed. It is important that there be something for the new people to do. --Ben Brockert 02:13, Nov 19, 2004 (UTC)

I think one of the long-term aims of the project could hopefully be to make itself obsolete, by fixing the outstanding problems. I'm sure that there will be plenty of stuff in the next run (either things that accidentally weren't fixed, or that have been introduced since the last run, or that we didn't look for previously but now do), but I'm thinking and hoping that the number of things found will go down. So this first huge run was like a big marathon, but the next one might be more like a one-mile run, and eventually it'll hopefully just be like a 100-metre sprint (with a short list that gets done quickly, and with long periods in-between where there's nothing that needs to be done). Nevertheless, I was thinking that in the gap between this run finishing and the next one starting, I would add a message encouraging people to a) take a break ; b) add people who helped out to the list of contributors ; c) tidy up the main page, such as with examples or Q&A or any other cleanups they want to make ; d) If they don't feel like any of the above, then I could add a few more undone ordinary-brackets files left over from this run. -- Nickj 06:02, 19 Nov 2004 (UTC)

inappropriate closing of ")"

I think your algorithm is finding false positives in those situations where a user chooses to itemize arguments or details. For example, at Bee learning and communication, someone wrote "The primary lines of evidence used by the odor plume advocates are 1) clinical experiments with odorless sugar sources which show that worker bees are unable to recruit to those sources and 2) logical difficulties of a small-scale dance..." Wrp103 followed your instructions and changed those to "(1)..." and "(2)...". This may be a small point, but it is stylistically incorrect. Surrounding the number on both sides indicates a footnote, not a segmented argument.

I imagine that this is a rare problem. In many situations, the segmented arguments can be displayed as either a bulleted or numbered list. However, there are some articles where that layout just does not make sense. Please do not arbitrarily close the parentheses unless it really is a grammatical mistake. Thanks. Rossami (talk) 12:18, 19 Nov 2004 (UTC)

Thank you for the feedback, and point taken: I've updated the directions for closing parentheses to clarify that this type of use is grammatically OK, and to encourage people to err on the side of caution if in doubt. All the best, Nickj 05:49, 20 Nov 2004 (UTC)

False positives

It appears that the string ]]) is causing the software to pick up an unclosed "(". For instance, the string and John (who was a [[pilot]]) gained fame will complain of an unclosed opening parenthesis. grendel|khan 07:14, 2004 Nov 20 (UTC)

No worries, I'll certainly look into it, but can I please get the names of one or more example articles which are exhibiting this problem? All the best, Nickj 10:24, 20 Nov 2004 (UTC)

Categories and templates

Will the next run search the Category: and Template: namespaces? Susvolans 13:17, 22 Nov 2004 (UTC)

No. The problem with templates is that they can (validly) contain unclosed or unopened tables or wiki code. An example is Template:Taxobox begin (which is part of the Tree of Life, a new and improved template for taxonomy infoboxes). Taken in isolation, it's invalid (starts a table but doesn't close it) - but when taken in context (such as being followed by Template:Taxobox end), then it's fine. So validating templates requires knowledge of how those templates are used, which is probably beyond the present scope of this project. With categories, I don't know much about them, but looking at "Category:1911 Britannica", it looks like most of the page is auto-generated, with only a very small bit written by people, so I don't know if there would be much benefit. All the best, -- Nickj 01:11, 25 Nov 2004 (UTC)

Category pages are just like normal article pages, except that below the normal text is an autogenerated list of subcategories and articles in the category. It would be useful to check for syntax errors in normal text as much as anywhere else. The only code changes I'm aware of would be that you would have to put an extra : in front of the link when making up the Wiki Syntax pages. If it's not a big coding job, Nickj, please include them. JesseW 02:24, 25 Nov 2004 (UTC)
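The extra colon mentioned above can be sketched as a small Python helper. It assumes the list pages are built from plain "[[Title]]" links and that only titles starting with "Category:" need the prefix; `list_link` is a hypothetical name.

```python
def list_link(title: str) -> str:
    """Build a wiki link to a page for use on a Wiki Syntax list page."""
    # A leading colon makes [[:Category:X]] a visible link to the category
    # page instead of placing the list page itself in that category.
    prefix = ":" if title.startswith("Category:") else ""
    return "[[" + prefix + title + "]]"


print(list_link("Category:1911 Britannica"))
print(list_link("Varda, Greece"))
```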

Third batch commencing

The third batch has commenced, using the database dump from yesterday. Currently there are only redirect problems listed, namely:

1 page of redirects that appear to have non-standard redirect syntax.
14 pages of redirects that appear to either be double-redirects, or redirects to non-existent pages.

These two categories are new to this run. Lists for the other categories are being generated now, and will be added once they're finished - if all goes smoothly, it should take around 34 hours for this to happen. All the best, Nickj 06:06, 28 Nov 2004 (UTC)

The next time you do this, could you make it so that redirects aren't automatically followed? It's harder to fix mistakes as it is now. (If you don't know how, linking to http://en.wikipedia.org/w/wiki.phtml?title=PageName&redirect=no is what is needed.) Eric119 06:38, 28 Nov 2004 (UTC)
- Good point - this is something I had clean forgot to do until I read your message (Doh!) - Thankfully there's no need to wait until next time though - the files are so new that they haven't been edited yet, so I've converted all the new files over to this format, with redirects disabled, which should make fixing easier for this batch. All the best, Nickj 07:27, 28 Nov 2004 (UTC)
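For illustration, the link format quoted above can be generated like this in Python. The base URL and the `redirect=no` parameter come from the message above; the helper name `no_redirect_url` and the underscore handling are assumptions, and titles with non-ASCII characters are not dealt with here.

```python
from urllib.parse import urlencode

# Script URL as quoted in the discussion above.
BASE = "http://en.wikipedia.org/w/wiki.phtml"


def no_redirect_url(title: str) -> str:
    """Return a URL that shows the redirect page itself instead of following it."""
    query = urlencode({"title": title.replace(" ", "_"), "redirect": "no"})
    return BASE + "?" + query


print(no_redirect_url("PageName"))
```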

Folks, the remainder of the third batch has now been added. By the way, if you're wondering whether we're having an effect, the answer is an emphatic "Yes!". Consider the number of problems found in this batch:

double-quotes - 1377 entries.
triple-quotes - 276 entries.
square-brackets - 2080 entries.
miscellaneous - 116 entries.
headings - 20 entries.
braces-tables - 55 entries (these have already been completed).

That gives a total of 3924 entries. The second run found 15000 entries in the same categories. 3924 / 15000 is equal to 26% - I.e. we have eliminated 74% of these problems! Pretty damn impressive! All the best, -- Nickj 05:43, 1 Dec 2004 (UTC)
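The arithmetic above can be reproduced with a few lines of Python; the per-category counts and the second-run figure of 15000 are taken from the message, and the variable names are only for the example.

```python
# Entries found in the third batch, as listed above.
third_batch = {
    "double-quotes": 1377,
    "triple-quotes": 276,
    "square-brackets": 2080,
    "miscellaneous": 116,
    "headings": 20,
    "braces-tables": 55,
}
second_run_total = 15000  # same categories, previous run

third_total = sum(third_batch.values())
remaining = third_total / second_run_total
print(f"third batch total: {third_total}")
print(f"remaining: {remaining:.0%}, eliminated: {1 - remaining:.0%}")
```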

Unified Validation Project

In response to Nick's call: I will produce a more extensive list of html errors with my Wikipedia to TomeRaider conversion script, with some documentation, and think about the feasibility of a stripped down version of the script for validation purposes only.

It would be nice if a tool and documentation set could easily be applied to other Wikipedias as well. Compare the bot for finding interwiki links, which runs on many Wikipedias now.

Is there a method/procedure to flag warnings as 'false positives', so that they do not reappear in consecutive runs? Erik Zachte 10:27, 1 Dec 2004 (UTC)

Hi Erik, That sounds great - the first list of your HTML table attribute problems was completed within 5 days, and was almost all done by Diberri. Also, I agree it would be good to have tools that apply to other wikis - TB's scripts are readily available, whilst the scripts I'm using are still in flux (combines about 3 projects into one), so they're definitely not ready for general release yet. For false positives, we don't currently have a systematic way of flagging these. To date I've been lucky in that the problems that are being detected are either a) real, and can be fixed in the article; b) not real, but arguably incorrect anyway, and can be surrounded with nowiki tags to prevent future detection; c) due to bugs in the software, in which case the software is fixed; or d) due to parentheses, which I think I'll not check in future runs. All the best, -- Nickj 05:09, 10 Dec 2004 (UTC)

Redirects and URL escapes

Redirects to titles with URL escapes should be detected and fixed, for these redirects fail to work properly. See [[Cimarr%E3o]] for an example: It should redirect to [[Chimarr%E3o]], but doesn't. [[User:Poccil|Peter O. (Talk, automation script)]] 20:49, Dec 1, 2004 (UTC)
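One possible way to detect such redirect targets, sketched in Python. The helper names are hypothetical, and decoding as ISO-8859-1 is an assumption based on the remark elsewhere on this page that the English Wikipedia had not yet been converted to Unicode.

```python
import re
from urllib.parse import unquote

# A percent sign followed by two hex digits, e.g. "%E3".
ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")

def has_url_escape(target):
    """Return True if a redirect target contains a URL escape."""
    return ESCAPE.search(target) is not None

def decode_target(target, encoding="latin-1"):
    """Decode URL escapes in a target (encoding is an assumption)."""
    return unquote(target, encoding=encoding)

print(has_url_escape("Chimarr%E3o"), decode_target("Chimarr%E3o"))
```

A checker could list every redirect whose target passes `has_url_escape`, with the decoded title offered as the suggested replacement.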

Unbalanced Div tags

Moved from the main article, and therefore unsigned

Possible wrinkle: some editors use an empty <div /> to add an "id" anchor for an internal link. If you don't want to put the work in replacing these with more orthodox footnotes, leave them in place and make a note for next time around.

Could these anchor divs be fixed by adding a </div> straight after them? (The W3C spec indicates that divs require both a start and an end tag, so this way the anchor stays, and it's valid).

The correct syntax for these id tags is <div id="xxx"></div>

You sure about that? Wikipedia seems to be tagged as XHTML, so the abbreviated syntax should be allowable

Yes, these tags could be "fixed" by closing the DIV tag in such a manner, but there are two objections:

We have a perfectly good policy on footnotes, and this is a good time to change these unorthodox practices
The current syntax is actually correct XHTML, so we shouldn't actually be detecting them at all

--Phil | Talk 16:28, Dec 17, 2004 (UTC)

I don't currently know anything about Wiki footnotes, so I'm not qualified to comment on them. On XHTML (and I could definitely be wrong here), but aren't div tags supposed to be closed in XHTML though? This bit of the spec says that for non-empty elements, end tags are required, and in the XHTML DTD it defines div tags as a block-level element (like paragraphs, tables, etc, all of which also have to be closed). So doesn't that mean divs must be closed? I could easily be wrong though, and please correct me if I am. The reason I'm trying to clarify this is that if they really are valid when unclosed, then as Phil says it's a bad idea to keep detecting them as malformed syntax. All the best, -- Nickj 21:43, 17 Dec 2004 (UTC)

According to the same spec, <div id="xxx" /> is both the opening and closing tag, because it is a shorthand for <div id="xxx"></div>. – AB CD 17:14, 18 Dec 2004 (UTC)

In the same way that <p/> and (more appositely) <br clear="all"/> are legitimate. HTH HAND --Phil | Talk 09:23, Dec 20, 2004 (UTC)

Actually W3C specifies a blank before the slash: <p /> and <br clear="all" />. See [1]. Erik Zachte 12:19, 20 Dec 2004 (UTC)

Ah, OK, I understand now - thank you all for clarifying that. -- All the best, Nickj (t) 22:59, 20 Dec 2004 (UTC)
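Following the conclusion of this thread, a div checker would need to treat `<div ... />` as already closed. A minimal Python sketch under that assumption follows; the regex and the name `div_balance` are illustrative only, and attribute values containing `>` are not handled.

```python
import re

# Matches <div ...>, <div ... /> and </div>; group 1 is "/" for closing tags.
DIV_TAG = re.compile(r"<(/?)div\b([^>]*)>", re.IGNORECASE)

def div_balance(text):
    """Return (unclosed_opens, stray_closes) for the div tags in text."""
    open_count = 0
    stray_closes = 0
    for match in DIV_TAG.finditer(text):
        closing, attrs = match.group(1), match.group(2)
        if closing:
            if open_count:
                open_count -= 1
            else:
                stray_closes += 1
        elif attrs.rstrip().endswith("/"):
            continue  # self-closing form: counts as both open and close
        else:
            open_count += 1
    return open_count, stray_closes

print(div_balance('<div id="xxx" /> text <div>unclosed'))
```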

Fixing non ISO-8859-1 characters?

How about a project to replace non ISO-8859-1 characters with their correct equivalents? For example, € becomes &euro;. These invalid characters are bad because they tend to get replaced by ?, automatically by the browser when someone edits. --Dbenbenn 08:25, 19 Dec 2004 (UTC)

This is work for a bot; Guanabot has done this in the past. Or, of course, the English Wikipedia could be converted to Unicode like the others, and the characters could be typed safely. Susvolans (pigs can fly) 13:24, 20 Dec 2004 (UTC)

Wouldn't this rather depend upon whether a particular user's browser was performing correctly? IMNSHO it would be better to always tend towards caution and replace anything which might break with something which won't. --Phil | Talk 09:45, Jan 24, 2005 (UTC)
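A rough Python sketch of the replacement being proposed. It assumes the article text was decoded as ISO-8859-1, so stray Windows-1252 characters appear as code points 0x80 to 0x9F; the function name is hypothetical, and this is not how Guanabot is known to work.

```python
from html.entities import codepoint2name

def fix_c1_characters(text):
    """Replace Windows-1252 characters in the 0x80-0x9F range with HTML entities."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                real = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                out.append(ch)  # byte is undefined in Windows-1252; leave it alone
                continue
            name = codepoint2name.get(ord(real))
            # Prefer a named entity, fall back to a numeric one.
            out.append(f"&{name};" if name else f"&#{ord(real)};")
        else:
            out.append(ch)
    return "".join(out)

print(fix_c1_characters("Price: 5\x80"))
```

Characters that are valid ISO-8859-1, such as accented letters, pass through untouched, which matches the cautious approach suggested above.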

URLs with Section specified?

The square bracket pages often included the section name as part of the URL, so that when you clicked on the link, it would position you to the section with the problem. Since starting on the parens section, I noticed that this doesn't do that. It was a great help, and if you could add that to this section, it would make life easier. ;^) wrp103 (Bill Pringle) - Talk 05:12, 22 Dec 2004 (UTC)

I know what you mean ... the section markers were added for the third & fourth runs - the trouble is that the parentheses lists were generated as a one-off between the second and third runs, before the Wiki Syntax Project knew about section markers, so I just don't have section information for those lists, otherwise I would add it :( Sorry. -- All the best, Nickj (t) 05:53, 22 Dec 2004 (UTC)

As an aside...

After doing 120+ pages of bracket fixing, I hereby declare the term "parenthesis" to be a new form of mental illness... --Plek 03:14, 12 Jan 2005 (UTC)

Me: Doctor, I think I'm suffering from parenthesis.

Doctor: Ah, yes, that happens a lot to people whose colon has come to a full stop. Why don't you take this prescription and dash off to the pharmacy. In the meantime, I know this great quote that might help you get through this grave and difficult period...

Me: AAAAARGHHH!!!!

I know exactly how you feel! Although doing this work I've edited pages I would never otherwise have known about (e.g. List of people on stamps of Denmark), and learned more about certain subjects than I ever imagined - I never knew there were so many characters in Thomas the Tank Engine and Friends, that there was a Marquess of Northampton or a Manitoba Cooperative Commonwealth Federation! --Thryduulf 15:36, 12 Jan 2005 (UTC)

The Wiki Syntax Bar

The last of the parentheses is slain! Huzzah! Free drinks for everybody (rings bell)! --Plek 23:09, 12 Jan 2005 (UTC)

Well done everyone! Thanks for the offer of a drink, Plek - I'll have a Bitter please. Thryduulf 23:24, 12 Jan 2005 (UTC)

Plek pours Thryduulf a pint and happily puts a bowl of roasted parentheses on the counter

Brilliant! I'll have a pint of Guinness, please! - UtherSRG 00:12, Jan 13, 2005 (UTC)

Doctor forbade Guinness, have to take root beer instead. - Skysmith 09:39, 13 Jan 2005 (UTC)

Nice job folks, next <small> run should attest to your efforts. — Davenbelle 00:37, Jan 13, 2005 (UTC)

I arrived late and contributed just one set, but I'm always ready to lift a Guinness with UtherSRG - Eustace Tilley 23:10, 2005 Jan 17 (UTC)

I'll have an exclamation pint. 68.88.234.52 21:53, 22 Jan 2005 (UTC)

I'll have a small Single malt Scotch, although I only did a little. Henry Troup 00:01, 2 Feb 2005 (UTC)

Source code

What queries were used to make this? r3m0t 18:37, 12 Feb 2005 (UTC)

It's not using a database query (other than to fetch the source text of the article). Rather, it's going through the source text from start to finish, and as it does so using a stack whereby any wiki syntax gets pushed onto or popped from the stack. If you get to the end of a line (for wiki links, italics, bolds, etc), or the end of the article (for everything else), and the stack is not empty, then you know that the syntax is malformed (i.e. not closed or opened properly). There's also separate checking for redirects (using a regular expression), and comparing whether any cur_is_redirect = '1' entries don't match the redirect regex - that's a bit of special case though, and in all other regards it's using a stack. That doesn't mean it can't be done as queries (in fact, it's possible it could be better to do so, because then the list of problems could probably be generated more quickly), but that's not how it's done at the moment. Hope that helps. -- All the best, Nickj (t) 00:16, 13 Feb 2005 (UTC)
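The stack idea described above can be sketched compactly in Python. This is a simplified illustration, not the project's script: it covers only a handful of tokens, and the names `check_article`, `TOGGLES`, `CLOSER_FOR` and `SINGLE_LINE` are made up for the example.

```python
import re

TOKEN = re.compile(r"'''|''|\[\[|\]\]|\{\||\|\}")
CLOSER_FOR = {"]]": "[[", "|}": "{|"}   # closing token -> opening token
TOGGLES = {"'''", "''"}                  # the same token opens and closes
SINGLE_LINE = {"'''", "''", "[["}        # must be closed before the line ends

def _pop_last(stack, token):
    """Remove the most recent occurrence of token; return False if absent."""
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] == token:
            del stack[i]
            return True
    return False

def check_article(lines):
    """Return a list of (line_number or None, token) problems."""
    problems, stack = [], []
    for number, line in enumerate(lines, start=1):
        for token in TOKEN.findall(line):
            if token in TOGGLES:
                if not _pop_last(stack, token):
                    stack.append(token)
            elif token in CLOSER_FOR:
                if not _pop_last(stack, CLOSER_FOR[token]):
                    problems.append((number, token))  # close without an open
            else:
                stack.append(token)
        # End of line: anything single-line still open is malformed.
        problems += [(number, t) for t in stack if t in SINGLE_LINE]
        stack = [t for t in stack if t not in SINGLE_LINE]
    # End of article: multi-line syntax still open is malformed.
    problems += [(None, t) for t in stack]
    return problems

print(check_article(["'''bold''' and [[broken link", "{| table never closed"]))
```

Single-line problems are reported with their line number, while multi-line ones (here, the table) are reported once at the end of the article with `None` in place of a line number.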

That's useful, and I could program something like that myself, but only in PHP. PHP is slow. Very slow. In fact, exceedingly slow. Do you have anything faster? r3m0t 13:14, Feb 21, 2005 (UTC)

It's written in PHP currently. IMHO, PHP is fast enough. It does take a while to run (around 60 hours), but it's doing 3 different things at once in that time, to every "proper" (namespace = 0) article in the Wikipedia, namely:

I took care of a few double redirects, and noticed how mindlessly repetitive it was. So mindlessly repetitive that there's no reason they couldn't all be fixed with a script. 21:33, 4 May 2005 (UTC)

The slowest of these is the suggesting wiki links, since it involves checking whether every word (and some word combinations) in every article has a matching article or redirect of the same name. Given this, I don't think 60 hours is unreasonable, and I'm not sure that rewriting it in another language would make it significantly faster (I could definitely be wrong though!). -- All the best, Nickj (t) 22:11, 21 Feb 2005 (UTC)

Brion (I think) once said on wikitech-l that a port of the MediaWiki diff code produced a certain diff in 0.5 secs. PHP made the diff in 45.5 seconds. (This was a special case with almost every line changed.)
Spellchecking took 3.72 seconds in this benchmark - about 3 times slower than Perl or Python, and far slower than compiled C (or C++).
Word frequency count took 6.01 seconds; Perl 1.02; Python 1.36; C 0.36.

I've picked out the benchmarks most obviously involved in string manipulation. Well, I guess I'll reimplement it, for my own entertainment. So the (opening) tokens are: " ( { [ [[ '' ''' {| " <math> <tt> &" and their closing tokens are " ) } ] ]] '' ''' |} " </math> </tt> ;" correct? r3m0t 07:35, Feb 23, 2005 (UTC)

Those are some quite big speed differences! And if you're willing to implement a syntax checker, that's great because the more the merrier as far as I'm concerned ;-) With the wiki tokens, there are some multi-line tokens, and some single line ones. I've copied and pasted the code I'm using below, and tried to remove any stuff that's irrelevant to the area of syntax checking:

<?php

// Purpose: Wiki Syntax functions
// License: GNU Public License (v2 or later)
// Author:  Nickj

// -------- format handling ----------------

/*
** @desc: handles the stack for the formatting
*/
function formatHandler($string, &$formatStack, $reset = false) {
    static $in_nowiki, $in_comment, $in_math, $in_code;
    
    if (!isset($in_nowiki) || $reset) {
        $in_nowiki = false;
        $in_comment = false;
        $in_math = false;
        $in_code = false;
    }
    
    // don't bother processing an empty string.
    $string = trim($string);
    if ($string == "") return;
    
    $pattern      = "%(''')|('')|"            // Wiki quotes
                  . "(\[\[)|(\[)|(]])|(])|"   // Wiki square brackets
                  . "(\{\|)|(\|\}\})|(\|\})|" // Wiki table open & Close + infobox close.
                  . "(\{\{)|(\}\})|"          // Transclude open and close
                  . "(<!--)|(-->)|"           // Comment open and close
                  . "(====)|(===)|(==)|"      // Wiki headings
                  . "(<math>)|(</math>)|"     // Math tags
                  . "(<nowiki>)|(</nowiki>)|" // Nowiki tags
                  . "(<code>)|(</code>)|"     // Code tags
                  . "(<div)|(</div>)%i";      // div tags

    $matches = preg_split ($pattern, strtolower($string), -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);


    foreach ($matches as $format) {

        if ($format == "<nowiki>") {
            if ($in_nowiki == false) addRemoveFromStack($format, $format, false, $formatStack, $string);
            $in_nowiki = true;
        }
        else if ($format == "</nowiki>") {
            if ($in_nowiki == true) addRemoveFromStack($format, "<nowiki>", false, $formatStack, $string);
            $in_nowiki = false;
        }
        else if ($format == "<math>") {
            if ($in_math == false) addRemoveFromStack($format, $format, false, $formatStack, $string);
            $in_math = true;
        }
        else if ($format == "</math>") {
            if ($in_math == true) addRemoveFromStack($format, "<math>", false, $formatStack, $string);
            $in_math = false;
        }
        else if ($format == "<!--") {
            if ($in_comment == false) addRemoveFromStack($format, $format, false, $formatStack, $string);
            $in_comment = true;
        }
        else if ($format == "-->") {
             if ($in_comment == true)  addRemoveFromStack($format, "<!--", false, $formatStack, $string);
             $in_comment = false;
        }
        else if ($format == "<code>") {
            if ($in_code == false) addRemoveFromStack($format, $format, false, $formatStack, $string);
            $in_code = true;
        }
        else if ($format == "</code>") {
            if ($in_code == true)  addRemoveFromStack($format, "<code>", false, $formatStack, $string);
            $in_code = false;
        }
        
        else if (!$in_math && !$in_nowiki && !$in_comment && !$in_code) {
            
            if ($format == "'''") {
                addRemoveFromStack($format, $format, true, $formatStack, $string);
            }
            else if ($format == "''") {
                addRemoveFromStack($format, $format, true, $formatStack, $string);
            }
            else if ($format == "[[") {
                addRemoveFromStack($format, $format, false, $formatStack, $string);
            }
            else if ($format == "[") {
                addRemoveFromStack($format, $format, false, $formatStack, $string);
            }
            else if ($format == "]]") {
                addRemoveFromStack($format, "[[", false, $formatStack, $string);
            }
            else if ($format == "]") {
                addRemoveFromStack($format, "[", false, $formatStack, $string);
            }
            else if ($format == "{|") {
                addRemoveFromStack($format, $format, false, $formatStack, $string);
            }
            else if ($format == "|}") {
                addRemoveFromStack($format, "{|", false, $formatStack, $string);
            }
            else if ($format == "====") {
                addRemoveFromStack($format, $format, true, $formatStack, $string);
            }
            else if ($format == "===") {
                addRemoveFromStack($format, $format, true, $formatStack, $string);
            }
            else if ($format == "==") {
                addRemoveFromStack($format, $format, true, $formatStack, $string);
            }
            else if ($format == "{{") {
                addRemoveFromStack($format, $format, false, $formatStack, $string);
            }
            else if ($format == "}}") {
                addRemoveFromStack($format, "{{", false, $formatStack, $string);
            }
            else if ($format == "|}}") {
                addRemoveFromStack($format, "{{", false, $formatStack, $string);
            }
            else if ($format == "<div") {
                addRemoveFromStack("<div>", "<div>", false, $formatStack, $string);
            }
            else if ($format == "</div>") {
                addRemoveFromStack($format, "<div>", false, $formatStack, $string);
            }
        }
    }
}


/*
** @desc: Given a type of formatting, this adds it to, or removes it from, the stack (as appropriate).
*/
function addRemoveFromStack($format, $start_format, $same_start_and_end, &$stack, $string) {
    // if it is there, remove it from the stack, as long as it is not the start of the format
    if (isset($stack[$start_format]) && ($same_start_and_end || $format != $start_format)) {
        array_pop($stack[$start_format]);
        if (empty($stack[$start_format])) unset($stack[$start_format]);
    }
    // otherwise, add it, and the string responsible for it.
    else {
        $stack[$format][] = $string;
    }
}


/*
** @desc: returns whether a format is a multi-line or a single line format.
*/
function is_single_line_format($format) {
    if ($format == "'''"  || $format == "''"  ||
        $format == "[["   || $format == "]]"  || 
        $format == "["    || $format == "]"   || 
        $format == "====" || $format == "===" || $format == "==" ||
        $format == "("    || $format == ")"  ) {
            return true;
    }
    return false;
}


/*
** @desc: takes a wiki string, and removes the newlines, &'s, >'s, and <'s.
*/
function neuterWikiString($string) {
    // remove newline chars, and escape '<' and '>' and '&' (note that & needs to come first)
    return str_replace( array ("\n", "&", "<", ">"), array(" ", "&amp;", "&lt;", "&gt;"), $string);
}


/*
** @desc: checks the formatting of a line, and logs any errors found.
*/
function checkLineFormatting($page_title, $full_line, &$formatting_stack) {

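    // the temp array for storing the section heading parsing output
    $section_array = array();

    // Defaults for non-heading lines. The excerpt left these undefined; the
    // full script presumably tracks the current section elsewhere.
    $section = "";
    $heading_line = false;

    // If this is a section heading, then store this.
    if (preg_match("/^={2,4}([^=]+)={2,4}$/", trim($full_line), $section_array)) {
        $section = trim($section_array[1]);
        $heading_line = true;
    }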
    
    
    // if we are still formatting
    if (!empty($formatting_stack)) {
  
        // don't report any heading problems if we're not in a heading line.
        if (!$heading_line) {
            if (isset($formatting_stack["=="]))   unset($formatting_stack["=="]);
            if (isset($formatting_stack["==="]))  unset($formatting_stack["==="]);
            if (isset($formatting_stack["===="])) unset($formatting_stack["===="]);
        }
        
        $format_string = "";
        // for each misplaced bit of formatting
        foreach (array_keys($formatting_stack) as $format) {
            
            // only consider single-line formatting at this point
            if (is_single_line_format($format)) {
               
                // save this format string.
                if ($format_string != "") {
                    $format_string .= " and ";
                }
                
                $format_string .= "$format";
                
                // remove it from the stack
                unset($formatting_stack[$format]);
            }
        }
        
        // if there were any formatting problems, then save those now.
        if ($format_string != "") {
            // save the formatting problem to the DB.
            dbSaveMalformedPage(addslashes($page_title), addslashes($format_string), addslashes(neuterWikiString($full_line)), addslashes($section));
        }
    }
}


// --------------------------------------------------------


/*
Then the usage is like this:

// for each article in the wikipedia, set $page_title

    $formatting_stack = array();
    // reset the static vars in the format handler
    formatHandler("", $formatting_stack, true);

    // for each $line in the article text of the $page_title article
    	   formatHandler($line, $formatting_stack);
    	   checkLineFormatting($page_title, $line, $formatting_stack);
    // end for
    
    // then save any full-page formatting problems.
    foreach (array_keys($formatting_stack) as $format) {
        dbSaveMalformedPage(addslashes($page_title), addslashes($format), "", "");
    }

// end for   

*/


?>

Here's everything that I'm currently aware of that's wrong in the above code, or potentially missing from it:

Need to add <pre> tags to the list of tags to check.
Add <tt> tags?
Add a special case for "[[image:" and "]]" to allow multi-line syntax, because image tags can run across lines ?
Improve handling of div tags for XHTML compliance - <div id="xxx" /> is valid as both the opening and closing tag, because it is a shorthand for <div id="xxx"></div>
Add a special case for ''''', which combines both the '' and ''' cases? Otherwise cases like ''''''85-'86''''' get handled wrong (the first 6 quotes get treated as a bold open and close, whereas the Wikipedia treats it as a bold open, then an italics open, then a single quote).
For nowiki, comment, math, and code tags, doubled up opening tags are not detected as an error, when they should be. For example <code><code></code> should be listed as an error, but is not.
Actually, I'm not sure this is always the case. For example <nowiki><nowiki></nowiki> is a valid way of generating the text '<nowiki>'. --HappyDog 02:27, 30 Mar 2005 (UTC) (PS - just take a look at the source to see what I had to type to generate that first string!)
The easiest way to generate those strings would have been &lt;nowiki>&lt;nowiki>&lt;/nowiki> and &lt;nowiki> – AB CD 02:58, 30 Mar 2005 (UTC)
Dash it all, you're right. I had wiki-markup on the brain! --HappyDog 03:32, 30 Mar 2005 (UTC)

Hope that helps! -- All the best, Nickj (t) 23:17, 24 Feb 2005 (UTC)

please use the edit summary

can you please add a summary of the change instead of just saying "fixed wiki syntax"? say "[[test] --> [[test]] Fix wikilink syntax blah blah blah". Then we don't have to go to each article to search every small thing you changed to find bad "fixes". - Omegatron 04:55, Mar 12, 2005 (UTC)

Can you please be more specific about what we did that was bad? For example, is there a particular error that we're misdetecting? If so, please let me know. Please realise that we're not perfect, but we're honestly not trying to introduce problems.

With the current batch, there's just two types of errors listed at the moment, namely:

Redirects that had slightly bad syntax (e.g. "#REDIRECT([[Blah)", instead of "#REDIRECT [[Blah]]"); for these the suggested summary is "Fix Redirect Syntax".
Redirects that were double redirects (e.g. A → B → C, which gets changed to A → C); for these the suggested summary is "Fix Double Redirect".

Which one of these was wrong? They're both fairly straightforward transformations, and hopefully neither should introduce new errors (but for example in the double-redirect case, if it was wrong for A to redirect to B, and we then change A to redirect to C, then the source of the error was that A redirected to B, not that we changed A to redirect to C). -- All the best, Nickj (t) 07:42, 12 Mar 2005 (UTC)
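The double-redirect transformation (A → B → C becomes A → C) can be sketched in Python as below. The input is assumed to be a plain mapping of redirect titles to targets; the function name and the loop guard are illustrative, not taken from the project's script.

```python
def collapse_redirects(redirects):
    """Map each double redirect's source to its final target.

    redirects: dict of source title -> target title.
    Redirects caught in a loop are skipped rather than "fixed".
    """
    fixes = {}
    for source, target in redirects.items():
        seen = {source}
        final = target
        # Follow the chain until it leaves the redirect table or loops.
        while final in redirects and final not in seen:
            seen.add(final)
            final = redirects[final]
        if final != target and final not in seen:
            fixes[source] = final
    return fixes

print(collapse_redirects({"A": "B", "B": "C"}))
```

As the message above notes, this only shortens the chain: if A should never have pointed at B in the first place, the sketch faithfully carries that mistake through to C.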

Long image descriptions -- OK to remove?

I've hit a wall on Wikipedia:WikiProject Wiki Syntax/square-brackets-001.txt, regarding pages containing several multi-line image descriptions, such as Apollodotus I, Apollodotus II and Apollophanes. I doubt that squashing these descriptions into a single line would be an acceptable solution. Would it therefore be alright to consider these pages "fixed", and rip them out accordingly? Fbriere 20:03, 23 Mar 2005 (UTC)

Yes, please do remove these from the list. Fixing this is on the to-do list. Basically the Wikipedia wants normal links to start and end on the same line, but image tags are allowed to run over multiple lines. A special case needs to be added to detect "[[image:" tags so that they are treated differently from "[[" tags (currently they're both treated the same), but this hasn't been added yet. Until this is added multi-line "[[image:" tags will be listed as malformed, even though they're OK. -- All the best, Nickj (t) 23:13, 23 Mar 2005 (UTC)

As long as this is corrected before the next run; otherwise, we'll be removing them from the list over and over... (Especially since they end up appearing twice; I had to get rid of something like 15-20 occurrences on 001.) Fbriere 00:38, 24 Mar 2005 (UTC)
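One way the "[[image:" special case might look, sketched in Python. The token names and helpers are hypothetical; the idea is only that an image opener is classified separately from an ordinary link opener, so it can be left off the single-line list.

```python
import re

# "[[" optionally followed by "image:" (case-insensitive).
LINK_OPEN = re.compile(r"\[\[(image:)?", re.IGNORECASE)

# Only ordinary links must close on the line where they open.
SINGLE_LINE = {"[["}

def opening_tokens(line):
    """Classify each link opener on a line as '[[' or '[[image:'."""
    return ["[[image:" if m.group(1) else "[[" for m in LINK_OPEN.finditer(line)]

def must_close_on_same_line(token):
    return token in SINGLE_LINE

for token in opening_tokens("[[Image:Coin.jpg|thumb|Coin of [[Apollodotus I]]"):
    print(token, must_close_on_same_line(token))
```

A full checker would still need "]]" to close whichever of the two openers is most recent, since image captions can contain ordinary links.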

<!-->

This seems to be valid syntax for opening and closing a comment, i.e.

<!-->comment<!-->

and should probably be ignored, as they're valid and a complete waste of time to "fix". --Jshadias 23:07, 23 Mar 2005 (UTC)

:-) Interesting! That's a somewhat tricky one to parse correctly (at least with the current KISS approach, which will detect it as two open comment tags, rather than an open and then a close). It is valid HTML. However, it's definitely quicker to change a handful of articles than it is to change the parser (at least from my perspective), given that the total number of articles in the Wikipedia that use this construct must be quite small (e.g. less than 10). Do you have the titles of the articles that use this, and I'll get them changed? -- All the best, Nickj (t) 23:31, 23 Mar 2005 (UTC)

eh, I'll just change them when I come across them. As long as it's uncommon it's not a huge deal. --Jshadias 14:36, 24 Mar 2005 (UTC)

Victoria Cross recipients

The following appears on several articles:

recipient of the [[Victoria Cross]], the highest and most prestigious award for gallantry in the face of the enemy that can be awarded to [[United Kingdom|British] and [[Commonwealth]] forces.

Would it be possible to create a bot to take care of these? (Though I notice Google only shows 20 such articles. If this is accurate, I guess manual work would still be cheaper...)

Probably not worth adding a bot for this by a long shot. Getting permission to run and operate a bot is a political process (see Wikipedia:Bots). You need two levels of permission / non-objection (one from en, one from meta). The burden of proof that the bot is non-harmful is on the author. Many more regulations on bots have been discussed and may be added at some point. For 20 articles, it's (IMHO) really really really not even remotely worth the grief, hassle, and red tape. -- All the best, Nickj (t) 01:26, 24 Mar 2005 (UTC)

QC instead of ship-and-fix?

This is a very cool project, but...

It's commonly accepted in software development that it's a lot cheaper to fix a bug before the product is released than to release it and then have to go back and fix problems. It seems we would do well to do this sort of syntax checking right on the edit page (make it part of the "Show preview" function) instead of finding them in batch mode later. --RoySmith 01:02, 26 Mar 2005 (UTC)

I agree. The functions that provide the Wiki Syntax checking are in the #Source code section above (just added licensing to indicate that they're under the GPL), and I plan to release a very slightly updated version of those functions soon. I would like to see these incorporated into MediaWiki in some way (e.g. either a "Check Wiki Syntax" link in the "toolbox" section, or as part of "Show preview"), as the GPL licensing would allow this, and I encourage the MediaWiki developers to add this. Even with this though, there are still going to be errors that need to be cleaned up in batch mode, but it would be an improvement. -- All the best, Nickj (t) 04:27, 26 Mar 2005 (UTC)

msg: links

The {{msg:}} syntax for templates is deprecated as of 1.5 where {{msg:foo}} will simply transclude Template:Msg:foo instead of Template:Foo, here's a list of pages from the 2005-03-09 dump that still use the syntax:

—Ævar Arnfjörð Bjarmason 02:05, 2005 May 15 (UTC)

In that list there are a lot of user pages, talk pages, and pages that have nowiki around the msg template. Should we just delete those out of your list or what? --Kenyon 05:08, May 16, 2005 (UTC)

Looks like a bot has been written / is being written to resolve these (i.e. it now has an SEP field around it) -- All the best, Nickj (t) 02:55, 3 Jun 2005 (UTC)
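For illustration only, the rewrite such a bot would perform might look like this in Python. The regex and function name are assumptions, and the sketch does not skip text inside nowiki tags, which Kenyon's comment suggests a real run would need to do.

```python
import re

# "{{msg:" with optional whitespace, in any letter case.
MSG_PREFIX = re.compile(r"\{\{\s*msg:\s*", re.IGNORECASE)

def drop_msg_prefix(wikitext):
    """Rewrite {{msg:foo}} as {{foo}}."""
    return MSG_PREFIX.sub("{{", wikitext)

print(drop_msg_prefix("Intro {{msg:stub}} and {{Msg:foo}}"))
```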

one left on div-tags-000

The one left on Wikipedia:WikiProject Wiki Syntax/div-tags-000.txt is Main Page/French, and it looks too hard to do by hand. So if anybody has a good HTML fixing program to use on that, go ahead. Or I suppose we could just forget about that page, since it seems to be dead (last edit was Dec 14, 2004). --Kenyon 04:15, May 16, 2005 (UTC)

I had a go at it, hope it's fixed now, but if it's not I guess we'll know because it'll turn up in the next batch. ;-) -- All the best, Nickj (t) 02:55, 3 Jun 2005 (UTC)

Project page "Completed pages" table

Is this at all necessary? Moving the links of the entries to a completely different table and also striking them out. I could see just keeping them in the one table and striking them out, or maybe moving them to a separate table, but not both. I'd like to join the two tables and keep the strike-outs. Anyone have an opinion on the matter? – Quoth 09:59, 20 May 2005 (UTC)

I don't feel too strongly about it, but by having a separate table, plus striking out, it makes sure that the uncompleted stuff is quite visible and all grouped together, and by also striking completed pages out it makes it doubly-clear that those things are already done (in case people are just skim-reading). (In other words: Yes, it is redundant, but maybe the redundancy is sometimes helpful). -- All the best, Nickj (t) 02:55, 3 Jun 2005 (UTC)

New double redirect cleanup project

Hello,

I've generated another list of double redirects as of the 20050516 database dump at User:triddle/double_redirect/20050516. I did not want to edit the project page since I'm not a member of this project. Perhaps someone else knows the best way to try to integrate my list with this project? Thanks. Triddle 21:18, Jun 23, 2005 (UTC)

Hi Triddle, Go for it! Please feel free to edit away to add your double redirect list. I'm not precious about what people list (as long as it's relevant, which this clearly is), and I tend to be pretty slack about running the current script that finds the syntax problems and generates the lists (typically I run it once every 2 or 3 months), so any help with producing up-to-date lists is more than welcome. In terms of integrating them, maybe just edit the pages (example) and add the updated lists (if you're happy to do this), and update the main page accordingly. Also, please be sure to add yourself to the "credits - software" section of the page. And if you're feeling like doing some other extra stuff, two related things that may interest you are the listing of malformatted redirects, and broken redirects (i.e. redirects that point to something which isn't there). (e.g. to illustrate, here's an old example that's already been fixed up). The trick with the malformed ones is just applying a regex to all redirects and listing what doesn't match, and the main trick with the second is that people sometimes include a # in their redirect targets, so everything after the # has to be ignored. All the best, Nickj (t) 02:21, 27 Jun 2005 (UTC)
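The two tricks mentioned above (a regex for well-formed redirects, and ignoring everything after a # in the target) could be sketched in Python like this. The exact regex used by the project's script is not shown on this page, so the pattern here is an assumption, as are the function names.

```python
import re

# Assumed shape of a well-formed redirect: "#REDIRECT [[Target]]".
REDIRECT = re.compile(r"^#REDIRECT\s*\[\[([^\[\]|]+)\]\]", re.IGNORECASE)

def redirect_target(text):
    """Return the target title, minus any #section, or None if malformed."""
    match = REDIRECT.match(text.strip())
    if not match:
        return None
    return match.group(1).split("#", 1)[0].strip()

def classify(text, existing_titles):
    """Label a redirect page as 'malformed', 'broken' or 'ok'."""
    target = redirect_target(text)
    if target is None:
        return "malformed"
    if target not in existing_titles:
        return "broken"
    return "ok"

titles = {"Blah", "Foo"}
for page in ["#REDIRECT [[Blah]]", "#REDIRECT([[Blah)", "#redirect [[Foo#History]]", "#REDIRECT [[Missing]]"]:
    print(page, "->", classify(page, titles))
```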

Working together?

Hello,

I've been getting pretty good at analyzing the dump files with perl and getting useful stuff done. I am curious if I could help work with this project? How are you preparing your lists? If you are having problems beating really hard on SQL databases then I might be able to help by having it done through analysis of the dump files. Let me know if you think I can help. Triddle 06:42, Jun 26, 2005 (UTC)

Absolutely you can help! The current problem finding script is a serious mess, as it integrates three different projects into the one script (suggester.php), so it lacks the clean separation that it really should have (and I tend not to have the time to do anything with it for extended periods, including add the separation that it really requires). It's also got quite slow (probably as a result of doing too much at once, plus I think I've maxed out the memory on the box I'm using, so it's starting to thrash to disk), taking now around 7 or 8 days to do a complete run. The current code is in PHP, but a cleaner and quicker reimplementation (in any language, such as perl) would be a very good thing. You can get the current source code here. This includes the source code for preparing the lists (in output_malformed_pages.php). All the best, Nickj (t) 02:21, 27 Jun 2005 (UTC)

Standards

I have some AlMac observations about possible similar interests between this project and the usability project. AlMac

For example, for the usability project, I suggested that there might be value in adding to the Tool box.

I just edited this page, please run some standard software to identify common typing errors, that I could fix right now. AlMac 4 July 2005 18:56 (UTC)

The edit suggestor bot

What was the script used to generate the vast lists of edit links for this project? It's needed desperately at WikiProject Disambiguation. --Smack (talk) 00:03, 24 July 2005 (UTC)

Smack - read this page. JesseW 16:22, 24 July 2005 (UTC)

Thanks. I didn't figure to look there :) --Smack (talk) 02:55, 25 July 2005 (UTC)

You want either of the output_*.php files from the ZIP file. They both generate a series of text files that can then be copied and pasted straight into the Wikipedia as lists of things that need doing. (I never got around to writing a bot to upload the files without human intervention). If you're suggesting disambiguations then probably the most similar file will be output_malformed_pages.php - you probably want the "outputToFile" function, and the global defines, and then delete the rest of the file, and then go from there to add the stuff that's specific to disambiguations. Hope that helps. All the best -- Nickj (t) 06:49, 30 July 2005 (UTC)