Parsing the WTA Ranking PDFs


Cineast
Apr 13th, 2011, 08:32 PM
The last time I downloaded a ranking PDF from the WTA was during the Australian Open, and I see they've changed the format again. They've replaced 0's with blank fields, which looks like a giant problem for me.

Consider Wozniacki's line in the current WTA rankings:

1 (1) WOZNIACKI, CAROLINE DEN 9930 23 470 200 280 200

The various fields are, in order:
This week's ranking,
Last week's ranking,
Name,
Nationality,
Total ranking points,
Events played,
Points earned last week,
Points coming off,
16th event, and
17th event

The problem is that players don't have values in every field, especially the "Points earned last week" field; see Vera Zvonareva's line:

3 (3) ZVONAREVA, VERA RUS 7815 20 320 125 60

On the other hand, there are people like Shahar Peer:

11 (11) PEER, SHAHAR ISR 3030 22 60 60 60

Which "60" goes in which field?

Is there any good way to parse the new WTA PDFs? I used to use a Perl script, but now I can't build a list for each player just by splitting on spaces. Yes, I know names contain varying numbers of spaces, but one could get around that by reversing the list when necessary, to get at the last items in the list. Now, however, thanks to the blank fields, it's no longer obvious what the last item in the list is. :(
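
For reference, the split-on-spaces approach described above might look roughly like the following in Python rather than Perl. This is only a sketch, not the original script; the function name split_full_line is made up here. It counts fields from the end of the line so the number of words in the name does not matter, and it only works when every numeric field is present, as in Wozniacki's line.

```python
def split_full_line(line):
    """Split a ranking line that has ALL ten fields present.

    Assumption: the last seven tokens are nationality, total points,
    events played, points earned, points coming off, 16th and 17th event.
    Everything between the two rankings and those seven tokens is the name.
    """
    tokens = line.split()
    if len(tokens) < 10:
        raise ValueError("too few fields: " + line)
    nat, total, events, earned, off, e16, e17 = tokens[-7:]
    # Sanity check: a nationality code is assumed to be three capital letters.
    # If a numeric field is blank, a piece of the name lands here instead.
    if not (len(nat) == 3 and nat.isalpha() and nat.isupper()):
        raise ValueError("blank field suspected: " + line)
    name = " ".join(tokens[2:-7])
    return [tokens[0], tokens[1].strip("()"), name, nat,
            total, events, earned, off, e16, e17]


full = "1 (1) WOZNIACKI, CAROLINE DEN 9930 23 470 200 280 200"
print("\t".join(split_full_line(full)))

try:
    split_full_line("3 (3) ZVONAREVA, VERA RUS 7815 20 320 125 60")
except ValueError as err:
    print("cannot parse:", err)
```

The sanity check is crude. It catches Zvonareva's and Peer's lines only because "VERA" and "SHAHAR" are not three letters long, so a short given name could slip through unnoticed.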

stangtennis
Apr 13th, 2011, 08:43 PM
Try asking TheBoiledEgg in this topic; I think he extracts all the ranking numbers weekly from the WTA PDF rankings: http://www.tennisforum.com/showthread.php?t=402496

But what do you use it for?

Cineast
Apr 13th, 2011, 09:00 PM
I convert each line of the rankings to a tab-delimited list so that I can open it in a spreadsheet and work out the new rankings. I usually only do it during the Slams and a few other weeks of the year, such as the weeks that determine the seeds for the French Open and US Open. I didn't get around to doing it during Miami this year, and during the Australian Open the rankings PDF was still in the old format.

(No offense to TBE, but I don't need a ranking list with 9843735876980543798673205472 colors. Frankly, I preferred the late 90s when the WTA actually printed its rankings as a text file in a fixed-width font. PDFs are, despite their name, much less portable.)

Meelis
Apr 13th, 2011, 09:38 PM
A PDF to Excel converter should do the job, if your desired result is supposed to look like this:

http://i55.tinypic.com/fdxchs.jpg

perseus2006
Apr 14th, 2011, 12:57 AM
Meelis,

Where does one get the "PDF to Excel converter"?

networthy
Apr 14th, 2011, 04:16 AM
Meelis,

Where does one get the "PDF to Excel converter"?

Have you tried Googling "PDF to Excel converter"?

perseus2006
Apr 14th, 2011, 07:03 AM
Yeah, there's lots of them! I tried two and neither worked correctly...

Thought someone would have a "tried and true" suggestion.

Marlene
Apr 14th, 2011, 08:01 AM
You can modify your Perl script to count the number of elements on each line - 1, (1), Wozniacki, etc. - and then insert "0" if an element is missing. It's only "points last week" and "points coming off" that can be 0, right?

Marlene
Apr 14th, 2011, 08:24 AM
Hm, OK, but I suppose you can use the number of tournaments to figure out if 16th and/or 17th are blank.
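
The two suggestions above (count the elements, then use the number of events to decide whether the 16th and 17th columns are filled) can be sketched in Python as below. The names LINE_RE and parse_ranking_line are made up for this example. The sketch assumes that a player with 16 or more events always has a non-blank 16th-event value, and likewise for the 17th, which the thread does not confirm. When exactly one of "points earned" and "points coming off" is blank, the leftover number is reported as unassigned instead of being guessed.

```python
import re

# rank, (last week's rank), name, 3-letter nationality, then 2 to 6 numbers.
LINE_RE = re.compile(
    r"^(?P<rank>\d+)\s+\((?P<prev>\d+)\)\s+"
    r"(?P<name>.+?)\s+(?P<nat>[A-Z]{3})\s+"
    r"(?P<nums>\d+(?:\s+\d+)*)\s*$"
)


def parse_ranking_line(line):
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    nums = [int(n) for n in m.group("nums").split()]
    if len(nums) < 2 or len(nums) > 6:
        return None
    total, events, rest = nums[0], nums[1], nums[2:]
    # Assumption: events >= 16 means the 16th column is filled,
    # events >= 17 means the 17th column is filled as well.
    tail_n = min(max(events - 15, 0), 2, len(rest))
    head = rest[:len(rest) - tail_n]   # earned / coming off candidates
    tail = rest[len(rest) - tail_n:]   # 16th / 17th event
    tail += [0] * (2 - len(tail))
    row = {"rank": int(m.group("rank")), "prev": int(m.group("prev")),
           "name": m.group("name"), "nat": m.group("nat"),
           "total": total, "events": events,
           "e16": tail[0], "e17": tail[1], "unassigned": []}
    if len(head) == 2:
        row["earned"], row["off"] = head
    elif len(head) == 0:
        row["earned"] = row["off"] = 0
    elif len(head) == 1:
        # One value, two possible columns: cannot be decided from the text.
        row["earned"] = row["off"] = None
        row["unassigned"] = head
    else:
        return None
    return row


for text in ("1 (1) WOZNIACKI, CAROLINE DEN 9930 23 470 200 280 200",
             "3 (3) ZVONAREVA, VERA RUS 7815 20 320 125 60",
             "11 (11) PEER, SHAHAR ISR 3030 22 60 60 60"):
    print(parse_ranking_line(text))
```

Under that assumption the 16th and 17th columns of Zvonareva's and Peer's lines can be placed, but the remaining single value (320 and 60 respectively) still cannot be assigned to "earned" or "coming off" from the space-separated text alone.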

pov
Apr 14th, 2011, 02:43 PM
Save the PDF as a text file. Open the text file in MS Excel using Data -> Get External Data and run through the wizard.

Cineast
Apr 14th, 2011, 09:43 PM
Except that if you look at players like Zvonareva and Peer, how do you know which column is the blank one? That's the problem. When the WTA used to have actual 0's (well, it was 0.00 which was an artifact from the days they had fractions of points) it was easy to parse the file. Not now.

The one PDF to Excel converter I found only converts one page and wanted me to go through a registration process every time I tried to convert that one page.

Meelis
Apr 14th, 2011, 10:26 PM
PDF saved as a txt file > opened with Excel (everything in one column) > Text to Columns (fixed width) also worked fine for me.
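
The same fixed-width idea can be scripted instead of done by hand in Excel. The Python sketch below slices each line at fixed character positions, so a blank column simply comes out as 0 and there is no ambiguity. The offsets in COLUMNS are hypothetical; the real ones would have to be measured from the text file saved from the PDF, and this only works if that file keeps the columns aligned. The sample line is built by the helper make_line (also made up here) from Peer's numbers, with the blank placed in the "earned" column purely for illustration, since the thread does not say which of her columns is really blank.

```python
# Hypothetical layout: (field name, start offset, end offset).
COLUMNS = [
    ("rank", 0, 5), ("prev", 5, 11), ("name", 11, 41), ("nat", 41, 45),
    ("total", 45, 52), ("events", 52, 56), ("earned", 56, 62),
    ("off", 62, 68), ("e16", 68, 74), ("e17", 74, 80),
]
TEXT_FIELDS = {"name", "nat"}


def parse_fixed_width(line):
    """Slice one line by character position; blank numeric cells become 0."""
    row = {}
    for name, start, end in COLUMNS:
        cell = line[start:end].strip()
        if name in TEXT_FIELDS:
            row[name] = cell
        else:
            # "(11)" -> 11 for last week's ranking; "" -> 0 for a blank cell.
            row[name] = int(cell.strip("()")) if cell else 0
    return row


def make_line(cells):
    """Build an aligned sample line for the hypothetical layout above."""
    return "".join(str(cells.get(name, "")).ljust(end - start)
                   for name, start, end in COLUMNS)


# Illustrative only: Peer's values with "earned" left blank by assumption.
sample = make_line({"rank": 11, "prev": "(11)", "name": "PEER, SHAHAR",
                    "nat": "ISR", "total": 3030, "events": 22,
                    "off": 60, "e16": 60, "e17": 60})
row = parse_fixed_width(sample)
tab_line = "\t".join(str(row[name]) for name, _, _ in COLUMNS)
print(tab_line)
```

The last line produces the tab-delimited form that opens directly in a spreadsheet.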