Caching in PHP using the filesystem, APC and Memcached

Posted on 2012-03-28 · 7274 views

Original article: http://www.rooftopsolutions.nl/article/107

Caching is very important and really pays off in big internet applications. When you cache the data you're fetching from the database, in a lot of cases the load on your servers can be reduced enormously.

One way of caching is simply storing the results of your database queries in files. Opening a file and unserializing its contents is often a lot faster than doing an expensive SELECT query with multiple joins.

Here's a simple file-based caching engine.

    <?php
    // Our class
    class FileCache {
      // This is the function you store information with
      function store($key,$data,$ttl) {
        // Opening the file
        $h = fopen($this->getFileName($key),'w');
        if (!$h) throw new Exception('Could not write to cache');
        // Serializing along with the TTL
        $data = serialize(array(time()+$ttl,$data));
        if (fwrite($h,$data)===false) {
          throw new Exception('Could not write to cache');
        }
        fclose($h);
      }

      // General function to find the filename for a certain key
      private function getFileName($key) {
        return '/tmp/s_cache' . md5($key);
      }

      // The function to fetch data returns false on failure
      function fetch($key) {
        $filename = $this->getFileName($key);
        if (!file_exists($filename) || !is_readable($filename)) return false;
        $data = file_get_contents($filename);
        $data = @unserialize($data);
        if (!$data) {
          // Unlinking the file when unserializing failed
          unlink($filename);
          return false;
        }
        // checking if the data was expired
        if (time() > $data[0]) {
          // Unlinking
          unlink($filename);
          return false;
        }
        return $data[1];
      }
    }
    ?>

Key strategies

All the data is identified by a key. Your keys have to be unique system-wide; it is therefore a good idea to namespace your keys. My personal preference is to name the key after the class that's storing the data, combined with, for example, an id.

Example

Your user-management class is called My_Auth, and all users are identified by an id. A sample key for cached user data would then be "My_Auth:users:1234", where '1234' is the user id.
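
As a small illustration of this naming scheme, the key can be built by a helper. The function name cacheKey() is made up for this sketch and is not part of the article's classes:

```php
<?php
// Hypothetical helper: joins the class name, a data type and an id into one key.
function cacheKey($class, $type, $id) {
    return implode(':', array($class, $type, $id));
}

echo cacheKey('My_Auth', 'users', 1234), "\n"; // prints "My_Auth:users:1234"
```

Building every key through one function like this makes it harder for two parts of an application to pick the same key by accident.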

Some reasoning behind this code

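An earlier version of fetch() read the file in chunks of 4096 bytes. I chose that size because it is often the default block size on Linux filesystems, and this or a multiple of it is generally the fastest. Much later I found out that file_get_contents is actually faster, so that is what the code above uses.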

Lots of file-based caching engines don't specify the TTL (the time before a cache entry expires) when storing data in the cache, but when fetching it. This has one big advantage: you can check whether a file is still valid before actually opening it, using its last-modified time (filemtime()).
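
A minimal sketch of that alternative, which is not the approach this article takes. The class name MtimeFileCache and the m_cache file prefix are made up for illustration, and passing the directory to the constructor is an assumption of this sketch:

```php
<?php
// Sketch: the TTL is passed to fetch() instead of store().
class MtimeFileCache {
    private $dir;

    function __construct($dir) {
        $this->dir = $dir;
    }

    private function getFileName($key) {
        return $this->dir . '/m_cache' . md5($key);
    }

    function store($key, $data) {
        // No TTL is written; the file's modification time marks its age
        return file_put_contents($this->getFileName($key), serialize($data), LOCK_EX) !== false;
    }

    function fetch($key, $ttl) {
        $filename = $this->getFileName($key);
        if (!is_file($filename)) return false;
        // The freshness check happens before the file is opened
        if (time() - filemtime($filename) > $ttl) {
            unlink($filename);
            return false;
        }
        return unserialize(file_get_contents($filename));
    }
}

$cache = new MtimeFileCache(sys_get_temp_dir());
$cache->store('demo', array('a' => 1));
var_dump($cache->fetch('demo', 600));
```

The trade-off is that every caller of fetch() has to know the right TTL for the key it asks for.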

The reason I did not go with this approach is that most non-file-based cache systems specify the TTL when storing the data, and as you will see later in the article, we want to keep things compatible. Another advantage of storing the TTL in the data is that we can later create a cleanup script that deletes expired cache files.
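
Such a cleanup script could look roughly like the sketch below. It assumes the file format used in this article (a serialized array of expiry timestamp and data) and the s_cache file prefix; the function name cleanExpiredCache() and the demo directory are hypothetical:

```php
<?php
// Sketch: delete cache files that are expired or cannot be unserialized.
// Returns the number of files removed.
function cleanExpiredCache($dir, $prefix = 's_cache') {
    $removed = 0;
    $files = glob($dir . '/' . $prefix . '*');
    if ($files === false) return 0;
    foreach ($files as $filename) {
        if (!is_file($filename)) continue;
        $data = @unserialize(file_get_contents($filename));
        // Entries are stored as array(expiry timestamp, data)
        if (!is_array($data) || !isset($data[0]) || time() > $data[0]) {
            if (@unlink($filename)) $removed++;
        }
    }
    return $removed;
}

// Demo in a throwaway directory: one stale entry, one fresh entry
$dir = sys_get_temp_dir() . '/cache_cleanup_demo_' . uniqid();
mkdir($dir);
$staleFile = $dir . '/s_cache' . md5('old');
$freshFile = $dir . '/s_cache' . md5('new');
file_put_contents($staleFile, serialize(array(time() - 10, 'stale')));
file_put_contents($freshFile, serialize(array(time() + 600, 'fresh')));

$removedCount = cleanExpiredCache($dir);
echo $removedCount, "\n"; // prints 1
```

A script like this would typically be run periodically, for example from cron.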

Usage of this class

The number one place in web applications where caching is a good idea is database queries. MySQL and others usually have a built-in cache, but it is far from optimal, mainly because they have no awareness of the logic of your application (and they shouldn't have), and because the cache is usually flushed whenever there's an update on a table. Here is a sample function that fetches user data and caches the result for 10 minutes.

    <?php
    // constructing our cache engine
    $cache = new FileCache();
    function getUsers() {
        global $cache;
        // A somewhat unique key
        $key = 'getUsers:selectAll';
        // check if the data is not in the cache already
        if (!$data = $cache->fetch($key)) {
           // there was no cache version, we are fetching fresh data
           // assuming there is a database connection
           $result = mysql_query("SELECT * FROM users");
           $data = array();
           // fetching all the data and putting it in an array
           while($row = mysql_fetch_assoc($result)) { $data[] = $row; }
           // Storing the data in the cache for 10 minutes
           $cache->store($key,$data,600);
        }
        return $data;
    }
    $users = getUsers();
    ?>
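
The fetch-or-rebuild pattern in getUsers() can be wrapped in a small helper. This is only a sketch: cached() and ArrayCacheStub are hypothetical names, the stub exists just so the example runs without a database, and the closure syntax assumes PHP 5.3 or later. Note the strict === false check; the !$data test in getUsers() treats an empty result set as a cache miss and would query the database again on every call.

```php
<?php
// Hypothetical helper: returns the cached value, or computes and stores it on a miss.
// $cache is any object with the fetch() and store() methods shown in this article.
function cached($cache, $key, $ttl, $loader) {
    $data = $cache->fetch($key);
    // Strict comparison, so an empty array or 0 still counts as a cache hit
    if ($data === false) {
        $data = call_user_func($loader);
        $cache->store($key, $data, $ttl);
    }
    return $data;
}

// Minimal in-memory stand-in (ignores the TTL) so the sketch runs on its own
class ArrayCacheStub {
    private $items = array();
    function fetch($key) {
        return isset($this->items[$key]) ? $this->items[$key] : false;
    }
    function store($key, $data, $ttl) {
        $this->items[$key] = $data;
    }
}

$calls = 0;
$loadUsers = function () use (&$calls) {
    $calls++;        // counts how often the "database" is hit
    return array();  // pretend the users table is empty
};

$cache = new ArrayCacheStub();
cached($cache, 'getUsers:selectAll', 600, $loadUsers);
cached($cache, 'getUsers:selectAll', 600, $loadUsers);
echo $calls, "\n"; // prints 1: the second call is served from the cache
```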



The reason I picked the mysql_ set of functions here is that most readers will probably know these. Personally I prefer PDO or another abstraction library. This example assumes there's a database connection and a users table, and leaves other issues aside.

Problems with the library

The first problem is simple: the library will only work on Linux, because it uses the /tmp folder. Luckily we can use the php.ini setting 'session.save_path' instead.

    <?php
      private function getFileName($key) {
          return ini_get('session.save_path') . '/s_cache' . md5($key);
      }
    ?>
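
One caveat: session.save_path can be empty, or can carry a prefix such as "5;/tmp". The sketch below handles those cases and falls back to PHP's sys_get_temp_dir(). The helper name cacheDirectory() is hypothetical and not part of the class above:

```php
<?php
// Sketch: pick a writable cache directory in a portable way.
function cacheDirectory() {
    $dir = ini_get('session.save_path');
    // The setting may use the "N;/path" or "N;MODE;/path" form; keep only the path
    if ($dir !== false && strpos($dir, ';') !== false) {
        $dir = substr($dir, strrpos($dir, ';') + 1);
    }
    // Fall back to the system temp directory when the setting is unusable
    if (!$dir || !is_dir($dir) || !is_writable($dir)) {
        $dir = sys_get_temp_dir();
    }
    return rtrim($dir, '/\\');
}

echo cacheDirectory() . '/s_cache' . md5('My_Auth:users:1234'), "\n";
```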

The next problem is a little more complex. When one of our cache files is being read while another process is writing it at the same time, you can get really unusual results. Caching bugs can be hard to find because they only occur in very specific circumstances. You might never see this issue happen yourself, but somewhere out there one of your users will.

PHP can lock files with flock(). Flock operates on an open file handle (opened by fopen) and either locks a file for reading (shared lock, everybody can read the file) or writing (exclusive lock, everybody waits till the writing is done and the lock is released). Because file_get_contents is the most efficient, and we can only use flock on filehandles, we'll use a combination of both.

The updated store and fetch methods will look like this:

    <?php
      // This is the function you store information with
      function store($key,$data,$ttl) {
        // Opening the file in read/write mode
        $h = fopen($this->getFileName($key),'a+');
        if (!$h) throw new Exception('Could not write to cache');
        flock($h,LOCK_EX); // exclusive lock, will get released when the file is closed
        fseek($h,0); // go to the beginning of the file
        // truncate the file
        ftruncate($h,0);
        // Serializing along with the TTL
        $data = serialize(array(time()+$ttl,$data));
        if (fwrite($h,$data)===false) {
          throw new Exception('Could not write to cache');
        }
        fclose($h);
      }

      function fetch($key) {
        $filename = $this->getFileName($key);
        if (!file_exists($filename)) return false;
        $h = fopen($filename,'r');
        if (!$h) return false;
        // Getting a shared lock
        flock($h,LOCK_SH);
        $data = file_get_contents($filename);
        fclose($h);
        $data = @unserialize($data);
        if (!$data) {
          // If unserializing somehow didn't work out, we'll delete the file
          unlink($filename);
          return false;
        }
        if (time() > $data[0]) {
          // Unlinking when the file was expired
          unlink($filename);
          return false;
        }
        return $data[1];
      }
    ?>

Well, that actually wasn't too hard: only 3 new lines. The next issue we're facing is updates of data. When somebody updates, say, a page in the CMS, they usually expect the respective page to update instantly. In those cases you can update the data using store(), but in some cases it is simply more convenient to flush the cache. So we need a delete method.

    <?php
        function delete( $key ) {
            $filename = $this->getFileName($key);
            if (file_exists($filename)) {
                return unlink($filename);
            } else {
                return false;
            }
        }
    ?>

Abstracting the code

This cache class is pretty straightforward. The only methods in there are delete, store and fetch, so we can easily abstract that into the following base class. I'm also giving it a proper prefix (I tend to prefix everything with Sabre; name yours whatever you want). A good reason to prefix all your classes is that they will never collide with other class names if you need to include other code. The PEAR project made a stupid mistake by naming one of their classes 'Date'; by doing this and refusing to change it, they actually prevented an internal PHP date class from being named Date.

    <?php
        abstract class Sabre_Cache_Abstract {
            abstract function fetch($key);
            abstract function store($key,$data,$ttl);
            abstract function delete($key);
        }
    ?>

The resulting FileCache (which I'll rename to Filesystem) is:

    <?php
    class Sabre_Cache_Filesystem extends Sabre_Cache_Abstract {
      // This is the function you store information with
      function store($key,$data,$ttl) {
        // Opening the file in read/write mode
        $h = fopen($this->getFileName($key),'a+');
        if (!$h) throw new Exception('Could not write to cache');
        flock($h,LOCK_EX); // exclusive lock, will get released when the file is closed
        fseek($h,0); // go to the start of the file
        // truncate the file
        ftruncate($h,0);
        // Serializing along with the TTL
        $data = serialize(array(time()+$ttl,$data));
        if (fwrite($h,$data)===false) {
          throw new Exception('Could not write to cache');
        }
        fclose($h);
      }

      // The function to fetch data returns false on failure
      function fetch($key) {
        $filename = $this->getFileName($key);
        if (!file_exists($filename)) return false;
        $h = fopen($filename,'r');
        if (!$h) return false;
        // Getting a shared lock
        flock($h,LOCK_SH);
        $data = file_get_contents($filename);
        fclose($h);
        $data = @unserialize($data);
        if (!$data) {
          // If unserializing somehow didn't work out, we'll delete the file
          unlink($filename);
          return false;
        }
        if (time() > $data[0]) {
          // Unlinking when the file was expired
          unlink($filename);
          return false;
        }
        return $data[1];
      }

      function delete( $key ) {
        $filename = $this->getFileName($key);
        if (file_exists($filename)) {
          return unlink($filename);
        } else {
          return false;
        }
      }

      private function getFileName($key) {
        return ini_get('session.save_path') . '/s_cache' . md5($key);
      }
    }
    ?>

There you go: a complete, proper OOP, file-based caching class. I hope I explained things well.
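
A different technique from the flock() approach used in this class is to write the new contents to a temporary file and then rename() it over the cache file, so readers see either the old file or the new one. The sketch below is an alternative under the assumption that rename() replaces the target atomically, which generally holds on Unix-like systems when both files are on the same filesystem. The function name atomicStore() is hypothetical:

```php
<?php
// Sketch: store a cache entry by writing a temporary file and renaming it into place.
function atomicStore($filename, $data, $ttl) {
    // The temporary file lives in the same directory so rename() stays on one filesystem
    $tmp = tempnam(dirname($filename), 'tmp');
    if ($tmp === false) throw new Exception('Could not write to cache');
    // Same format as the article: array(expiry timestamp, data)
    $payload = serialize(array(time() + $ttl, $data));
    if (file_put_contents($tmp, $payload) === false || !rename($tmp, $filename)) {
        @unlink($tmp);
        throw new Exception('Could not write to cache');
    }
}

$file = sys_get_temp_dir() . '/s_cache' . md5('atomic_demo');
atomicStore($file, 'foobar', 600);
$entry = unserialize(file_get_contents($file));
echo $entry[1], "\n"; // prints "foobar"
```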

Memory based caching through APC

If files aren't fast enough for you and you have enough memory to spare, memory-based caching might be the solution. Obviously, storing and retrieving data from memory is a lot faster. The APC extension not only provides an opcode cache (which speeds up your PHP scripts by caching the parsed PHP script), but also a simple mechanism to store data in shared memory.

Using shared memory in APC is extremely simple. I'm not even going to explain it; the code should tell enough.

    <?php
        class Sabre_Cache_APC extends Sabre_Cache_Abstract {
            function fetch($key) {
                return apc_fetch($key);
            }
            function store($key,$data,$ttl) {
                return apc_store($key,$data,$ttl);
            }
            function delete($key) {
                return apc_delete($key);
            }
        }
    ?>

My personal problem with APC was that it tended to break my code, so if you want to use it, give it a test run first. I have to admit that I haven't checked it again since they fixed 'my' bug. With that bug fixed, APC is amazing for single-server applications and for data that is used really often.

Memcached

Problems start when you are dealing with more than one webserver. Since there is no shared cache between the servers, situations can occur where data is updated on one server and it takes a while before the other server is up to date. It can be really useful to have a really high TTL on your data and simply replace or delete the cache whenever there is an actual update. When you are dealing with multiple webservers, this scheme is simply not possible with the previous caching methods.

Introducing memcached. Memcached is a cache server originally developed by the LiveJournal people and now being used by sites like Digg, Facebook, Slashdot and Wikipedia.

How it works

    * Memcached consists of a server and a client part. The server is a standalone program that runs on your servers, and the client is in this case a PHP extension.
    * If you have 3 webservers which all run Memcached, all webservers connect to all 3 memcached servers. The 3 memcached servers are all in the same 'pool'.
    * Each cache server only contains part of the cache, meaning the cache is not replicated between the memcached servers.
    * To find the server where an item is stored (or should be stored), a so-called hashing algorithm is used. This way the 'right' server is always picked.
    * Every memcached server has a memory limit. It will never consume more memory than the limit. If the limit is exceeded, older cache entries are automatically thrown out (whether or not their TTL has expired).
    * This means it cannot be used as a place to simply store data. The database does that part. Don't confuse the purpose of the two!
    * Memcached runs fastest (like many other applications) on a Linux 2.6 kernel.
    * By default, memcached is completely open. Be sure to have a firewall in place to lock out outside IPs, because this can be a huge security risk.
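
The hashing idea in the list above can be pictured as follows. This is a deliberately simplified sketch, not the algorithm the Memcache extension actually uses; pickServer() is a hypothetical function and crc32() is just one possible hash:

```php
<?php
// Sketch: map a key to one server of the pool, identically on every webserver.
function pickServer($key, array $servers) {
    // crc32() gives the same number for the same key everywhere;
    // the mask keeps the value non-negative on 32-bit builds
    $hash = crc32($key) & 0x7fffffff;
    return $servers[$hash % count($servers)];
}

$servers = array('www1', 'www2', 'www3');
echo pickServer('My_Auth:users:1234', $servers), "\n";
```

Because the result depends only on the key and the server list, every webserver that uses the same list in the same order looks for a key on the same memcached server.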

Installing

When you are on Debian/Ubuntu, installing is easy:

    apt-get install memcached

You are stuck with the packaged version though; Debian tends to be slow with updates. Other distributions might also have a pre-built package for you. Otherwise you might need to download Memcached from the site and compile it with the usual:

    ./configure
    make
    make install

There's probably a README in the package with better instructions.

After installation, you need the PECL extension. All you (usually) need to do for that is:

    pecl install Memcache

You also need the zlib development library. For Debian, you can get this by entering:

    apt-get install zlib1g-dev

However, 99% of the time the automatic PECL installation fails for me. Here are the alternative installation instructions:

    pecl download Memcache
    tar xfvz Memcache-2.1.0.tgz # version might have changed
    cd Memcache-2.1.0
    phpize
    ./configure
    make
    make install

Don't forget to enable the extension in php.ini by adding the line extension=memcache.so and restarting the webserver.

The good stuff

After the Memcached server is installed and running, and you have PHP running with the Memcache extension, you're off. Here's the Memcached class.

    <?php
        class Sabre_Cache_MemCache extends Sabre_Cache_Abstract {
            // Memcache object
            public $connection;
            function __construct() {
                $this->connection = new MemCache;
            }
            function store($key, $data, $ttl) {
                return $this->connection->set($key,$data,0,$ttl);
            }
            function fetch($key) {
                return $this->connection->get($key);
            }
            function delete($key) {
                return $this->connection->delete($key);
            }
            function addServer($host,$port = 11211, $weight = 10) {
                $this->connection->addServer($host,$port,true,$weight);
            }
        }
    ?>


Now, the only thing you have to do in order to use this class is add servers. Add servers consistently! Every webserver should add the exact same memcached servers, so the keys will be distributed in the same way from every webserver.

If a server has double the memory available for memcached, you can double the weight. The chance that data will be stored on that specific server will also be doubled.
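
One simple way to picture weighting is a list of buckets in which each server appears as many times as its weight. This is an illustration only; the extension's internals may work differently, and buildBuckets() is a hypothetical function:

```php
<?php
// Sketch: a server with weight 20 gets twice as many buckets as one with weight 10.
function buildBuckets(array $weights) {
    $buckets = array();
    foreach ($weights as $host => $weight) {
        for ($i = 0; $i < $weight; $i++) {
            $buckets[] = $host;
        }
    }
    return $buckets;
}

$buckets = buildBuckets(array('www1' => 10, 'www2' => 20, 'www3' => 10));
$counts = array_count_values($buckets);

// Share of all buckets that belong to www2
echo $counts['www2'] / count($buckets), "\n"; // prints 0.5

// A key is mapped to a bucket, and through the bucket to a server
$bucket = (crc32('my_key') & 0x7fffffff) % count($buckets);
echo $buckets[$bucket], "\n";
```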

Example

    <?php
        $cache = new Sabre_Cache_MemCache();
        $cache->addServer('www1');
        $cache->addServer('www2',11211,20); // this server has double the memory, and gets double the weight
        $cache->addServer('www3',11211);
        // Store some data in the cache for 10 minutes
        $cache->store('my_key','foobar',600);

        // Get it out of the cache again
        echo($cache->fetch('my_key'));
    ?>

Some final tips

    * Be sure to check out the docs for Memcache and APC, and try to determine what's right for you.
    * Caching can help everywhere SQL queries are done. You'd be surprised how big the difference in speed can be.
    * In some cases you might want the cross-server abilities of memcached, but you don't want to use up your memory or have your items automatically flushed out. Wikipedia came across this problem and traded in fast memory caching for virtually infinite-size file-based caching by creating a memcached-compatible engine called Tugela Cache. You can still use the PECL Memcache client with it, so switching should be pretty easy. I don't have experience with it and don't know how stable it is.
    * If you have different requirements for different parts of your cache, you can always consider using the different types alongside each other.
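
As a sketch of using different cache types alongside each other, the wrapper below checks a fast engine first and falls back to a slower, shared one. Demo_Cache_Array and Demo_Cache_Layered are hypothetical classes written for this illustration; in practice the two engines could be, for example, the APC and Memcache classes from this article:

```php
<?php
// Repeated from the article so the sketch runs on its own
abstract class Sabre_Cache_Abstract {
    abstract function fetch($key);
    abstract function store($key,$data,$ttl);
    abstract function delete($key);
}

// Stand-in engine that keeps entries in a PHP array (lives for one request only)
class Demo_Cache_Array extends Sabre_Cache_Abstract {
    private $items = array();
    function fetch($key) {
        if (!isset($this->items[$key])) return false;
        list($expires, $data) = $this->items[$key];
        if (time() > $expires) {
            unset($this->items[$key]);
            return false;
        }
        return $data;
    }
    function store($key,$data,$ttl) {
        $this->items[$key] = array(time()+$ttl, $data);
        return true;
    }
    function delete($key) {
        unset($this->items[$key]);
        return true;
    }
}

// Wrapper: tries the fast engine first, then the slower shared one
class Demo_Cache_Layered extends Sabre_Cache_Abstract {
    private $fast;
    private $slow;
    private $fastTtl;
    function __construct(Sabre_Cache_Abstract $fast, Sabre_Cache_Abstract $slow, $fastTtl = 60) {
        $this->fast = $fast;
        $this->slow = $slow;
        $this->fastTtl = $fastTtl;
    }
    function fetch($key) {
        $data = $this->fast->fetch($key);
        if ($data !== false) return $data;
        $data = $this->slow->fetch($key);
        // Copy hits into the fast engine, but only for a short time
        if ($data !== false) $this->fast->store($key, $data, $this->fastTtl);
        return $data;
    }
    function store($key,$data,$ttl) {
        $this->fast->store($key, $data, min($ttl, $this->fastTtl));
        return $this->slow->store($key, $data, $ttl);
    }
    function delete($key) {
        $this->fast->delete($key);
        return $this->slow->delete($key);
    }
}

$fast = new Demo_Cache_Array();
$slow = new Demo_Cache_Array();
$cache = new Demo_Cache_Layered($fast, $slow);
$slow->store('my_key', 'foobar', 600);
echo $cache->fetch('my_key'), "\n"; // found in $slow, then copied into $fast
```

The short TTL on the fast layer limits how long a webserver can keep serving a value that has since been replaced in the shared cache.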

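5 replies
  • #2 kent 2012-03-29

    I used APC four or five years ago.

  • #3 丫丫个呸 2012-03-29

    @kent  I'm completely behind the times!

  • #4 kent 2012-03-29
    @丫丫个呸: @kent  I'm completely behind the times!

    There aren't many use cases for it; ordinary applications don't need it, so don't take it to heart. It only makes sense under high load.

  • #5 丫丫个呸 2012-03-29

    @kent  Yeah, I feel I know far too little. I really have to study hard.

  • #6 kent 2012-03-29
    @丫丫个呸: @kent  Yeah, I feel I know far too little. I really have to study hard.

    Study hard and make progress every day.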

Topic author: 丫丫个呸