Using Perl and Regular Expressions to Process HTML Files - Part 3
The script we looked at in Part 2 (script1.pl - repeated below for convenience) has one major drawback that makes it impractical for real use: the name of the web page (HTML file) that the script processes is hard-coded into the script itself. For the script to be useful, we need to be able to run it on any web page. Changing the script so that it can do this is fairly straightforward.
Below, I've given two scripts: script1.pl, which was our original script from Part 2, and script2.pl, which is a new script that will process a list of files.
Note: Due to display considerations, in the example code shown in this article, square brackets '[..]' are used in HTML/script tags instead of angle brackets '<..>'. So, in a real script, [IN] would be written <IN> and [h1] would be written <h1>.
script1.pl
1 open (IN, "file1.htm");
2 open (OUT, ">new_file1.htm");
3 while ($line = [IN]) {
4 $line =~ s/[h1]/[h1 class="big"]/;
5 (print OUT $line);
6 }
7 close (IN);
8 close (OUT);
script2.pl
1 foreach $file (@ARGV) {
2 rename $file, "$file.bak";
3 open (IN, "<$file.bak");
4 open (OUT, ">$file");
5 while ($line = [IN]) {
6 $line =~ s/[h1]/[h1 class="big"]/;
7 (print OUT $line);
8 }
9 close IN;
10 close OUT;
11 }
Before looking at each line of the script in detail, let's quickly establish what script2.pl does. It processes one or more files named at the command line prompt (for example, the MS-DOS prompt). For each file, the script first makes a backup and then changes the first occurrence of [h1] on each line to [h1 class="big"].
A few quick definitions:
Variable: A temporary storage place for a value. In the above script, $file is a variable. The filename file1.htm, which will be entered at the command line prompt, is a value that will be temporarily stored in that variable when the script is run.
Array: A storage place for a list of values. In the above script, @ARGV is an array. It holds the list of filenames entered at the command line prompt.
Let's take a look at each line of script2.pl.
Line 1
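This line enables one or more files to be entered at the command line and processed by the script. It starts a 'foreach' loop over @ARGV, the array of filenames entered at the command line, and stores each filename in $file in turn. We only have one file, 'file1.htm', so when we run the script we'll only enter one file to be processed.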
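As a small illustration of how the loop over @ARGV behaves, here is a minimal sketch that only reports each filename rather than processing it. The sub name list_files is hypothetical and exists only so the loop can be tried with a sample list.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one message per filename it is given,
# looping over its arguments in the same way script2.pl loops over @ARGV.
sub list_files {
    my @files = @_;
    my @messages;
    foreach my $file (@files) {
        push @messages, "Would process: $file";
    }
    return @messages;
}

# @ARGV holds whatever was typed after the script name on the command line.
print "$_\n" for list_files(@ARGV);
```

If no filenames are entered, @ARGV is empty and the loop body never runs.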
Line 2
This line makes a backup of each file before processing it. It does this by renaming the file, adding '.bak' to the end of its name. So, for 'file1.htm', the backup file would be 'file1.htm.bak'.
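The script does not check whether the rename worked. Perl's rename returns false on failure and puts the reason in $!, so a more defensive version might look something like the sketch below. The sub name backup_file is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: rename a file to "<name>.bak" and return the backup name.
# rename returns false on failure and sets $! to the reason.
sub backup_file {
    my ($file) = @_;
    my $backup = "$file.bak";
    rename $file, $backup
        or die "Cannot rename $file to $backup: $!\n";
    return $backup;
}
```

Stopping here matters because, if the rename fails, there is no backup to fall back on.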
Line 3
This line opens a filehandle for the file being processed. Part 2 of this series of articles gives more information about filehandles.
Line 4
This line opens another filehandle, but this time for the output from the script.
Note: file1.htm.bak will contain the contents of the file from before the script is run. file1.htm will contain the updated contents, that's to say, the output from the script.
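For comparison, here is a sketch of how the same two filehandles might be opened in a more defensive style, using the three-argument form of open, lexical filehandles and a check on each call. The sub name open_pair is hypothetical, and this is one possible approach rather than the method used in this series.

```perl
use strict;
use warnings;

# Hypothetical helper: open the backup for reading and the original
# filename for writing, stopping with a message if either open fails.
sub open_pair {
    my ($file) = @_;
    open(my $in,  '<', "$file.bak") or die "Cannot read $file.bak: $!\n";
    open(my $out, '>', $file)       or die "Cannot write $file: $!\n";
    return ($in, $out);
}
```

Here '<' means open for reading and '>' means open for writing, replacing any existing contents.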
Line 5
This line sets up a loop in which each line in the input file (the file being processed) will be examined individually.
Line 6
This is the regular expression. It searches for the first occurrence of [h1] on each line of the input file and, if it finds one, changes it to [h1 class="big"]. Any further occurrences on the same line are left unchanged.
See Part 2 for a full description of the actual regular expression.
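The short sketch below shows the difference between changing the first match on a line and changing every match. It is written with real angle brackets rather than the square brackets used in the listings above, and the sample line is made up for illustration.

```perl
use strict;
use warnings;

# A made-up sample line containing two <h1> tags.
my $line = '<h1>One</h1> <h1>Two</h1>';

# Without /g, only the first <h1> on the line is changed.
(my $first_only = $line) =~ s/<h1>/<h1 class="big">/;

# With /g, every <h1> on the line is changed.
(my $all = $line) =~ s/<h1>/<h1 class="big">/g;

print "$first_only\n";
print "$all\n";
```

Most pages have at most one [h1] per line, so the script's behaviour is usually the same either way.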
Line 7
This line takes the contents of the $line variable and, via the OUT filehandle, writes the line to the output file.
Line 8
This line closes the 'while' loop. The loop is repeated until all the lines in the file currently being processed have been examined.
Lines 9 and 10
These two lines close the two filehandles that have been used in the script.
Line 11
This line closes the 'foreach' loop. The loop is repeated until all the files entered at the command line prompt have been processed.
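Putting the steps together, the following is a sketch of how the whole script might look with 'use strict', lexical filehandles and error checks added. It uses real angle brackets, and the sub name process_file is hypothetical. Treat it as one possible variation, not as a replacement for script2.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: back up one file, then write an updated
# copy of it under the original filename.
sub process_file {
    my ($file) = @_;
    my $backup = "$file.bak";

    rename $file, $backup       or die "Cannot rename $file: $!\n";
    open(my $in,  '<', $backup) or die "Cannot read $backup: $!\n";
    open(my $out, '>', $file)   or die "Cannot write $file: $!\n";

    while (my $line = <$in>) {
        # Change the first <h1> on each line only, as script2.pl does.
        $line =~ s/<h1>/<h1 class="big">/;
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $file: $!\n";
}

# Process every filename entered at the command line.
process_file($_) for @ARGV;
```

Putting the work in a sub keeps the loop over @ARGV to a single line and makes the per-file steps easier to test on their own.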
Running the script
To run the script, at the command line type:
C:\>perl script2.pl file1.htm
If the script executes successfully, a new file called file1.htm.bak should be created, which is a backup of the original file (i.e. as it was before processing). A new version of file1.htm should also have been produced, containing the modified [h1] tag.
In Part 4 we'll look at an alternative way of inputting/selecting files for processing.
About the Author: John Dixon is a web developer and technical author. These days, John spends most of his time developing dynamic database-driven websites using PHP and MySQL.
Go to http://www.computernostalgia.net to view one of John's sites. This site contains articles and photos relating to the history of the computer.
To find out more about John's work, go to http://www.dixondevelopment.co.uk.